10 Tips for Choosing the Optimal Number of Clusters

The clValid package can be used to compare multiple clustering algorithms simultaneously, in order to identify the best clustering approach and the optimal number of clusters.

We will compare k-means, hierarchical and PAM clustering.
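Before running the comparison, a quick note on setup: the code in this section leans on several packages. A minimal sketch of the libraries it calls (some may already be loaded from earlier in the article):

```r
# Packages used in this section; install any that are missing first
library(clValid)      # clValid(), connectivity(), dunn()
library(tidyverse)    # dplyr verbs, ggplot2, tibble::rownames_to_column()
library(knitr)        # kable()
library(kableExtra)   # kable_styling()
library(factoextra)   # fviz_cluster()
library(ggiraphExtra) # ggRadar()
library(GGally)       # ggpairs(), ggparcoord(), wrap()
library(ggExtra)      # ggMarginal()
```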

```r
# Compare hierarchical, k-means and PAM clustering across k = 2 to 24
# using internal validation measures
intern <- clValid(mammals_scaled, nClust = 2:24,
                  clMethods = c("hierarchical", "kmeans", "pam"),
                  validation = "internal")

# Summary
summary(intern) %>% kable() %>% kable_styling()
```

```
Clustering Methods:
 hierarchical kmeans pam

Cluster sizes:
 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Validation Measures (k = 2 to 12):

                                  2       3       4       5       6       7       8       9      10      11      12
hierarchical Connectivity    4.1829 10.5746 13.2579 20.1579 22.8508 25.8258 32.6270 35.3032 38.2905 39.2405 41.2405
             Dunn            0.3595  0.3086  0.3282  0.2978  0.3430  0.3430  0.4390  0.4390  0.5804  0.5938  0.5938
             Silhouette      0.5098  0.5091  0.4592  0.4077  0.4077  0.3664  0.3484  0.4060  0.3801  0.3749  0.3322
kmeans       Connectivity    7.2385 10.5746 15.8159 20.1579 22.8508 25.8258 33.5198 35.3032 38.2905 39.2405 41.2405
             Dunn            0.2070  0.3086  0.2884  0.2978  0.3430  0.3430  0.3861  0.4390  0.5804  0.5938  0.5938
             Silhouette      0.5122  0.5091  0.4260  0.4077  0.4077  0.3664  0.3676  0.4060  0.3801  0.3749  0.3322
pam          Connectivity    7.2385 14.1385 17.4746 24.0024 26.6857 32.0413 33.8913 36.0579 38.6607 40.6607 42.7869
             Dunn            0.2070  0.1462  0.2180  0.2180  0.2978  0.2980  0.4390  0.4390  0.4390  0.4390  0.4390
             Silhouette      0.5122  0.3716  0.4250  0.3581  0.3587  0.3318  0.3606  0.3592  0.3664  0.3237  0.3665

Validation Measures (k = 13 to 24):

                                 13      14      15      16      17      18      19      20      21      22      23      24
hierarchical Connectivity   45.7742 47.2742 50.6075 52.6075 55.8575 58.7242 60.7242 63.2242 65.2242 67.2242 69.2242 71.2242
             Dunn            0.8497  0.8497  0.5848  0.5848  0.4926  0.9138  0.9138  0.8892  0.9049  0.9335  1.0558  2.1253
             Silhouette      0.3646  0.3418  0.2650  0.2317  0.2166  0.2469  0.2213  0.1659  0.1207  0.1050  0.0832  0.0691
kmeans       Connectivity   45.7742 47.2742 51.8909 53.8909 57.1409 58.7242 60.7242 63.2242 65.2242 67.2242 69.2242 71.2242
             Dunn            0.8497  0.8497  0.5866  0.5866  0.5725  0.9138  0.9138  0.8892  0.9049  0.9335  1.0558  2.1253
             Silhouette      0.3646  0.3418  0.2811  0.2478  0.2402  0.2469  0.2213  0.1659  0.1207  0.1050  0.0832  0.0691
pam          Connectivity   45.7742 47.2742 51.7242 53.7242 56.9742 58.7242 60.7242 62.7242 64.7242 66.7242 69.2242 71.2242
             Dunn            0.8497  0.8497  0.5314  0.5314  0.4782  0.9138  0.9138  0.8333  0.8189  0.7937  1.0558  2.1253
             Silhouette      0.3646  0.3418  0.2830  0.2497  0.2389  0.2469  0.2213  0.1758  0.1598  0.1380  0.0832  0.0691

Optimal Scores:

             Score  Method       Clusters
Connectivity 4.1829 hierarchical  2
Dunn         2.1253 hierarchical 24
Silhouette   0.5122 kmeans        2
```
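The winning method and number of clusters for each measure can also be pulled out programmatically, and plotting the clValid object shows how each measure evolves with k. A short sketch using the `intern` object from above:

```r
# Extract only the optimal score, method and number of clusters per measure
optimalScores(intern)

# Plot each validation measure against the number of clusters
plot(intern)
```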

Connectivity measures the extent to which observations are placed in the same cluster as their nearest neighbours (lower is better), the Silhouette width measures how confidently each observation has been assigned to its cluster (higher is better), and the Dunn Index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance (higher is better).
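To make these definitions concrete, here is a minimal sketch that computes the three measures by hand for a single partition (hierarchical clustering with k = 2). It assumes the clValid and cluster packages are loaded, and that clValid's default of average linkage applies; exact agreement with the table above depends on those defaults in your clValid version:

```r
# Internal validation measures for one partition, computed directly
d    <- dist(mammals_scaled, method = "euclidean")
grp2 <- cutree(hclust(d, method = "average"), k = 2)  # average linkage assumed as clValid's default

connectivity(distance = d, clusters = grp2)   # lower is better
dunn(distance = d, clusters = grp2)           # higher is better
mean(silhouette(grp2, d)[, "sil_width"])      # average silhouette width; higher is better
```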

Extracting Features of Clusters

We would like to answer questions like "what makes this cluster unique from the others?" and "which clusters are similar to one another?" As mentioned earlier, it is difficult to assess the quality of clustering results because we have no true labels, so clustering is best used as an EDA starting point for exploring the differences between clusters in greater detail.

Let's select five clusters and interrogate their features.

```r
# Compute dissimilarity matrix with euclidean distances
d <- dist(mammals_scaled, method = "euclidean")

# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2")

# Cut tree into 5 groups
grp <- cutree(res.hc, k = 5)

# Visualize the dendrogram
plot(res.hc, cex = 0.6) # plot tree
rect.hclust(res.hc, k = 5, border = 2:5) # add rectangles around the 5 clusters

# Execution of k-means with k = 5
final <- kmeans(mammals_scaled, 5, nstart = 30)
fviz_cluster(final, data = mammals_scaled) + theme_minimal() + ggtitle("k = 5")
```

Let's extract the clusters and add them back to our initial data to do some descriptive statistics at the cluster level:

```r
# Mean of each attribute by cluster
as.data.frame(mammals_scaled) %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean") %>%
  kable() %>%
  kable_styling()
```

We see that cluster 2, composed solely of the Rabbit, has a high ash content. Group 3, composed of the seal and dolphin, is high in fat, which makes sense given the harsh demands of such a cold climate, while group 4 has a large lactose content.
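A quick way to double-check these composition claims is to list which mammals fall in each cluster by splitting the row names on the k-means assignments (note that k-means cluster numbering is arbitrary, so the labels may differ between runs):

```r
# List the mammals belonging to each k-means cluster
split(rownames(mammals_scaled), final$cluster)
```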

```r
mammals_df <- as.data.frame(mammals_scaled) %>% rownames_to_column()

cluster_pos <- as.data.frame(final$cluster) %>% rownames_to_column()
colnames(cluster_pos) <- c("rowname", "cluster")
mammals_final <- inner_join(cluster_pos, mammals_df)

# Radar chart of the attribute profile of each cluster
ggRadar(mammals_final[-1], aes(group = cluster),
        rescale = FALSE, legend.position = "none",
        size = 1, interactive = FALSE, use.label = TRUE) +
  facet_wrap(~cluster) +
  scale_y_discrete(breaks = NULL) + # don't show ticks
  theme(axis.text.x = element_text(size = 10)) +
  scale_fill_manual(values = rep("#1c6193", nrow(mammals_final))) +
  scale_color_manual(values = rep("#1c6193", nrow(mammals_final))) +
  ggtitle("Mammals Milk Attributes")

mammals_df <- as.data.frame(mammals_scaled)
mammals_df$cluster <- final$cluster
mammals_df$cluster <- as.character(mammals_df$cluster)

# Pairwise scatterplot matrix coloured by cluster
ggpairs(mammals_df, 1:5,
        mapping = ggplot2::aes(color = cluster, alpha = 0.5),
        diag = list(continuous = wrap("densityDiag")),
        lower = list(continuous = wrap("points", alpha = 0.9)))

# Plot specific graphs from the previous matrix as scatterplots
g <- ggplot(mammals_df, aes(x = water, y = lactose, color = cluster)) +
  geom_point() +
  theme(legend.position = "bottom")
ggExtra::ggMarginal(g, type = "histogram", bins = 20, color = "grey", fill = "blue")

b <- ggplot(mammals_df, aes(x = protein, y = fat, color = cluster)) +
  geom_point() +
  theme(legend.position = "bottom")
ggExtra::ggMarginal(b, type = "histogram", bins = 20, color = "grey", fill = "blue")

# Boxplots of each attribute by cluster
ggplot(mammals_df, aes(x = cluster, y = protein)) + geom_boxplot(aes(fill = cluster))
ggplot(mammals_df, aes(x = cluster, y = fat)) + geom_boxplot(aes(fill = cluster))
ggplot(mammals_df, aes(x = cluster, y = lactose)) + geom_boxplot(aes(fill = cluster))
ggplot(mammals_df, aes(x = cluster, y = ash)) + geom_boxplot(aes(fill = cluster))
ggplot(mammals_df, aes(x = cluster, y = water)) + geom_boxplot(aes(fill = cluster))

# Parallel coordinate plots put each feature in a separate column,
# with lines connecting the values for each observation
ggparcoord(data = mammals_df, columns = 1:5, groupColumn = 6, alphaLines = 0.4,
           title = "Parallel Coordinate Plot for the Mammals Milk Data",
           scale = "globalminmax", showPoints = TRUE) +
  theme(legend.position = "bottom")
```

If you find this article useful, feel free to share it with others or recommend it! As always, if you have any questions or comments, feel free to leave your feedback below, or you can always reach me on LinkedIn.

Till then, see you in the next post!
