Unsupervised learning with K-means

# This gives a perspective into the density and separation of the formed clusters
silhouette_avg = silhouette_score(data_transformed, cluster_labels)
print("For n_clusters =", n_clusters,
      "The average silhouette_score is :", silhouette_avg)

# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(data_transformed, cluster_labels)

y_lower = 10
for i in range(n_clusters):
    # Aggregate the silhouette scores for samples belonging to
    # cluster i, and sort them
    ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / n_clusters)
    ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values,
                      facecolor=color, edgecolor=color, alpha=0.7)

    # Label the silhouette plots with their cluster numbers at the middle
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

    # Compute the new y_lower for next plot
    y_lower = y_upper + 10  # 10 for the 0 samples

ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")

# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

ax1.set_yticks([])  # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

# 2nd Plot showing the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
ax2.scatter(data_transformed[:, 0], data_transformed[:, 1], marker='.', s=30, lw=0,
            alpha=0.7, c=colors, edgecolor='k')

# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(centers[:, 0], centers[:, 1], marker='o', c="white", alpha=1, s=200,
            edgecolor='k')

# Draw each cluster's number inside its white circle
for i, c in enumerate(centers):
    ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')
ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")

plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
              "with n_clusters = %d" % n_clusters),
             fontsize=14, fontweight='bold')

plt.show()

The output of this code is shown below. There is one pair of graphs for each number of clusters: the graph on the left shows the silhouette plot and the graph on the right shows the clusters.

For n_clusters = 2 The average silhouette_score is : 0.24604273339845253
For n_clusters = 3 The average silhouette_score is : 0.20670607133321856
For n_clusters = 4 The average silhouette_score is : 0.188444914764597
For n_clusters = 5 The average silhouette_score is : 0.19090629903451375
For n_clusters = 6 The average silhouette_score is : 0.18047082769472864

As you can see, the silhouette score is low for every number of clusters we plotted: the best average, about 0.25, is for two clusters. The best thing to do here is to rebuild the dataset with other features. Deleting a few of them might also improve the results.

Conclusion

Unsupervised learning with cluster analysis can be very easy to set up and, at the same time, very useful in many fields. A company can use it to find out which customer segments it has, so that it can advertise more effectively or create better products for each segment. In biology, researchers can cluster animals, cells, or other objects according to their characteristics.

It's important to remember that K-means is not the only technique for clustering data. There are other useful methods, such as hierarchical clustering, which is another easy-to-learn algorithm, and expectation-maximization, a method widely used by data scientists.

References

Methods for determining the optimal number of clusters
How to determine the optimal number of clusters for K-means
Plot silhouette analysis
Silhouette Analysis
Dataset used here. More details
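
For readers who want to see what the silhouette scores reported above actually measure, here is a minimal pure-Python sketch of the coefficient. For each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is s = (b - a) / max(a, b). This is only an illustration, not scikit-learn's implementation: the helper names (euclidean, silhouette_values, silhouette_average) and the toy data are made up for this example, and it assumes Euclidean distance.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_values(points, labels):
    """Return the silhouette coefficient s = (b - a) / max(a, b) per sample."""
    # Group sample indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention assumed here: a sample alone in its cluster scores 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_average(points, labels):
    """Mean silhouette coefficient over all samples."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Hypothetical toy data: two tight, well-separated groups of two points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]  # matches the two groups
bad_labels = [0, 1, 0, 1]   # mixes the two groups

print("good labelling:", silhouette_average(toy_points, good_labels))
print("bad labelling:", silhouette_average(toy_points, bad_labels))
```

On this toy data the labelling that matches the two groups gets an average close to 1, while the mixed labelling gets a negative average. Seen on that scale, averages around 0.2, like the ones in this article, mean that samples are on average only slightly closer to their own cluster than to the nearest other one.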
