Hierarchical Clustering on Categorical Data in RAnastasia ReusovaBlockedUnblockFollowFollowingApr 1, 2018This was my first attempt to perform customer clustering on real-life data, and it’s been a valuable experience.

While articles and blog posts about clustering using numerical variables on the net are abundant, it took me some time to find solutions for categorical data, which is, indeed, less straightforward if you think of it.

Methods for categorical data clustering are still being developed — I will try one or the other in a different post.

On the other hand, I have come across opinions that clustering categorical data might not produce a sensible result — and partially, this is true (there’s an amazing discussion at CrossValidated).

At a certain point, I thought “What am doing, why not just break it all down in cohorts.

” But cohort analysis is not always sensible as well, especially in case you get more categorical variables with higher number of levels — you can easily skimming through 5–7 cohorts might be easy, but what if you have 22 variables with 5 levels each (say, it’s a customer survey with discrete scores of 1,2,3,4,5), and you need to see what are the distinctive groups of customers you have — you would have 22×5 cohorts.

Nobody wants to do this.

Clustering could appear to be useful.

So this post is all about sharing what I wish I bumped into when I started with clustering.

The clustering process itself contains 3 distinctive steps:Calculating dissimilarity matrix — is arguably the most important decision in clustering, and all your further steps are going to be based on the dissimilarity matrix you’ve made.

Choosing the clustering methodAssessing clustersThis post is going to be sort of beginner level, covering the basics and implementation in R.

Dissimilarity MatrixArguably, this is the backbone of your clustering.

Dissimilarity matrix is a mathematical expression of how different, or distant, the points in a data set are from each other, so you can later group the closest ones together or separate the furthest ones — which is a core idea of clustering.

This is the step where data types differences are important as dissimilarity matrix is based on distances between individual data points.

While it is quite easy to imagine distances between numerical data points (remember Eucledian distances, as an example?), categorical data (factors in R) does not seem as obvious.

In order to calculate a dissimilarity matrix in this case, you would go for something called Gower distance.

I won’t get into the math of it, but I am providing a links here and here.

To do that I prefer to use daisy() with metric = c("gower") from the cluster package.

#—– Dummy Data —–## the data will be sterile clean in order to not get distracted with other issues that might arise, but I will also write about some difficulties I had, outside the codelibrary(dplyr)# ensuring reproducibility for samplingset.

seed(40)# generating random variable set# specifying ordered factors, strings will be converted to factors when using data.

frame()# customer ids come first, we will generate 200 customer ids from 1 to 200id.

s <- c(1:200) %>% factor()budget.

s <- sample(c("small", "med", "large"), 200, replace = T) %>% factor(levels=c("small", "med", "large"), ordered = TRUE)origins.

s <- sample(c("x", "y", "z"), 200, replace = T, prob = c(0.

7, 0.

15, 0.

15))area.

s <- sample(c("area1", "area2", "area3", "area4"), 200, replace = T, prob = c(0.

3, 0.

1, 0.

5, 0.

2))source.

s <- sample(c("facebook", "email", "link", "app"), 200, replace = T, prob = c(0.

1,0.

2, 0.

3, 0.

4))## day of week – probabilities are mocking the demand curvedow.

s <- sample(c("mon", "tue", "wed", "thu", "fri", "sat", "sun"), 200, replace = T, prob = c(0.

1, 0.

1, 0.

2, 0.

2, 0.

1, 0.

1, 0.

2)) %>% factor(levels=c("mon", "tue", "wed", "thu", "fri", "sat", "sun"), ordered = TRUE)# dish dish.

s <- sample(c("delicious", "the one you don't like", "pizza"), 200, replace = T) # by default, data.

frame() will convert all the strings to factorssynthetic.

customers <- data.

frame(id.

s, budget.

s, origins.

s, area.

s, source.

s, dow.

s, dish.

s)#—– Dissimilarity Matrix —–#library(cluster) # to perform different types of hierarchical clustering# package functions used: daisy(), diana(), clusplot()gower.

dist <- daisy(synthetic.

customers[ ,2:7], metric = c("gower"))# class(gower.

dist) ## dissimilarity , distDone with a dissimilarity matrix.

That’s very fast on 200 observations, but can be very computationally expensive in case you have a large data set.

In reality, it is quite likely that you will have to clean the dataset first, perform the necessary transformations from strings to factors and keep an eye on missing values.

In my own case, the dataset contained rows of missing values, which nicely clustered together every time, leading me to assume that I found a treasure until I had a look at the values (meh!).

Clustering Algorithms You could have heard that there is k-means and hierarchical clustering.

In this post, I focus on the latter as it is a more exploratory type, and it can be approached differently: you could choose to follow either agglomerative (bottom-up) or divisive (top-down) way of clustering.

Credits: UC Business Analytics R Programming GuideAgglomerative clustering will start with n clusters, where n is the number of observations, assuming that each of them is its own separate cluster.

Then the algorithm will try to find most similar data points and group them, so they start forming clusters.

In contrast, divisive clustering will go the other way around — assuming all your n data points are one big cluster and dividing most dissimilar ones into separate groups.

If you are thinking which one of them to use, it is always worth trying all the options, but in general, agglomerative clustering is better in discovering small clusters, and is used by most software; divisive clustering — in discovering larger clusters.

I personally like having a look at dendrograms — graphical representation of clustering first to decide which method I will stick to.

As you will see below, some of dendrograms will be pretty balanced , while others will look like a mess.

# The main input for the code below is dissimilarity (distance matrix)# After dissimilarity matrix was calculated, the further steps will be the same for all data types# I prefer to look at the dendrogram and fine the most appealing one first – in this case, I was looking for a more balanced one – to further continue with assessment#———— DIVISIVE CLUSTERING ————#divisive.

clust <- diana(as.

matrix(gower.

dist), diss = TRUE, keep.

diss = TRUE)plot(divisive.

clust, main = "Divisive")#———— AGGLOMERATIVE CLUSTERING ————## I am looking for the most balanced approach# Complete linkages is the approach that best fits this demand – I will leave only this one here, don't want to get it cluttered# completeaggl.

clust.

c <- hclust(gower.

dist, method = "complete")plot(aggl.

clust.

c, main = "Agglomerative, complete linkages")Assessing clusters Here, you will decide between different clustering algorithms and a different number of clusters.

As it often happens with assessment, there is more than one way possible, complemented by your own judgement.

It’s bold and in italics because your own judgement is important — the number of clusters should make practical sense and they way data is divided into groups should make sense too.

Working with categorical variables, you might end up with non-sense clusters because the combination of their values is limited — they are discrete, so is the number of their combinations.

Possibly, you don’t want to have a very small number of clusters either — they are likely to be too general.

In the end, it all comes to your goal and what you do your analysis for.

Conceptually, when clusters are created, you are interested in distinctive groups of data points, such that the distance between them within clusters (or compactness) is minimal while the distance between groups (separation) is as large as possible.

This is intuitively easy to understand: distance between points is a measure of their dissimilarity derived from dissimilarity matrix.

Hence, the assessment of clustering is built around evaluation of compactness and separation.

I will go for 2 approaches here and show that one of them might produce nonsense results:Elbow method: start with it when the compactness of clusters, or similarities within groups are most important for your analysis.

Silhouette method: as a measure of data consistency, the silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.

In practice, they are very likely to provide different results that might be confusing at a certain point— different number of clusters will correspond to the most compact / most distinctively separated clusters, so judgement and understanding what your data is actually about will be a significant part of making the final decision.

There are also a bunch of measurements that you can analyze for your own case.

I am adding them to the code itself.

# Cluster stats comes out as list while it is more convenient to look at it as a table# This code below will produce a dataframe with observations in columns and variables in row# Not quite tidy data, which will require a tweak for plotting, but I prefer this view as an output here as I find it more comprehensive library(fpc)cstats.

table <- function(dist, tree, k) {clust.

assess <- c("cluster.

number","n","within.

cluster.

ss","average.

within","average.

between", "wb.

ratio","dunn2","avg.

silwidth")clust.

size <- c("cluster.

size")stats.

names <- c()row.

clust <- c()output.

stats <- matrix(ncol = k, nrow = length(clust.

assess))cluster.

sizes <- matrix(ncol = k, nrow = k)for(i in c(1:k)){ row.

clust[i] <- paste("Cluster-", i, " size")}for(i in c(2:k)){ stats.

names[i] <- paste("Test", i-1) for(j in seq_along(clust.

assess)){ output.

stats[j, i] <- unlist(cluster.

stats(d = dist, clustering = cutree(tree, k = i))[clust.

assess])[j] } for(d in 1:k) { cluster.

sizes[d, i] <- unlist(cluster.

stats(d = dist, clustering = cutree(tree, k = i))[clust.

size])[d] dim(cluster.

sizes[d, i]) <- c(length(cluster.

sizes[i]), 1) cluster.

sizes[d, i] }}output.

stats.

df <- data.

frame(output.

stats)cluster.

sizes <- data.

frame(cluster.

sizes)cluster.

sizes[is.

na(cluster.

sizes)] <- 0rows.

all <- c(clust.

assess, row.

clust)# rownames(output.

stats.

df) <- clust.

assessoutput <- rbind(output.

stats.

df, cluster.

sizes)[ ,-1]colnames(output) <- stats.

names[2:k]rownames(output) <- rows.

allis.

num <- sapply(output, is.

numeric)output[is.

num] <- lapply(output[is.

num], round, 2)output}# I am capping the maximum amout of clusters by 7# I want to choose a reasonable number, based on which I will be able to see basic differences between customer groups as a resultstats.

df.

divisive <- cstats.

table(gower.

dist, divisive.

clust, 7)stats.

df.

divisiveLook, average.

within, which is an average distance among observations within clusters, is shrinking, so does within cluster SS.

Average silhouette width is a bit less straightforward, but the reverse relationship is nevertheless there.

See how disproportional the size of clusters is.

I wouldn’t rush into working with incomparable number of observations within clusters.

One of the reasons, the dataset can be imbalanced, and some group of observations will outweigh all the rest in the analysis — this is not good and is likely to lead to biases.

stats.

df.

aggl <-cstats.

table(gower.

dist, aggl.

clust.

c, 7) #complete linkages looks like the most balanced approachstats.

df.

agglNotice how more balanced agglomerative complete linkages hierarchical clustering is compared on the number of observations per group.

# ——— Choosing the number of clusters ———## Using "Elbow" and "Silhouette" methods to identify the best number of clusters# to better picture the trend, I will go for more than 7 clusters.

library(ggplot2)# Elbow# Divisive clusteringggplot(data = data.

frame(t(cstats.

table(gower.

dist, divisive.

clust, 15))), aes(x=cluster.

number, y=within.

cluster.

ss)) + geom_point()+ geom_line()+ ggtitle("Divisive clustering") + labs(x = "Num.

of clusters", y = "Within clusters sum of squares (SS)") + theme(plot.

title = element_text(hjust = 0.

5))So, we’ve produced the “elbow” graph.

It shows how the within sum of squares — as a measure of closeness of observations : the lower it is the closer the observations within the clusters are — changes for the different number of clusters.

Ideally, we should see a distinctive “bend” in the elbow where splitting clusters further gives only minor decrease in the SS.

In the case of a graph below, I would go for something around 7.

Although in this case, one of a clusters will consist of only 2 observations, let’s see what happens with agglomerative clustering.

# Agglomerative clustering,provides a more ambiguous pictureggplot(data = data.

frame(t(cstats.

table(gower.

dist, aggl.

clust.

c, 15))), aes(x=cluster.

number, y=within.

cluster.

ss)) + geom_point()+ geom_line()+ ggtitle("Agglomerative clustering") + labs(x = "Num.

of clusters", y = "Within clusters sum of squares (SS)") + theme(plot.

title = element_text(hjust = 0.

5))Agglomerative “elbow” looks similar to that of divisive, except that agglomerative one looks smoother — with “bends” being less abrupt.

Similarly to divisive clustering, I would go for 7 clusters, but choosing between the two methods, I like the size of the clusters produced by the agglomerative method more — I want something comparable in size.

# Silhouetteggplot(data = data.

frame(t(cstats.

table(gower.

dist, divisive.

clust, 15))), aes(x=cluster.

number, y=avg.

silwidth)) + geom_point()+ geom_line()+ ggtitle("Divisive clustering") + labs(x = "Num.

of clusters", y = "Average silhouette width") + theme(plot.

title = element_text(hjust = 0.

5))When it comes to silhouette assessment, the rule is you should choose the number that maximizes the silhouette coefficient because you want clusters that are distinctive (far) enough to be considered separate.

The silhouette coefficient ranges between -1 and 1, with 1 indicating good consistency within clusters, -1 — not so good.

From the plot above, you would not go for 5 clusters — you would rather prefer 9.

As a comparison, for the “easy” case, the silhouette plot is likely to look like the graph below.

We are not quite, but almost there.

Credits: Data sailorsggplot(data = data.

frame(t(cstats.

table(gower.

dist, aggl.

clust.

c, 15))), aes(x=cluster.

number, y=avg.

silwidth)) + geom_point()+ geom_line()+ ggtitle("Agglomerative clustering") + labs(x = "Num.

of clusters", y = "Average silhouette width") + theme(plot.

title = element_text(hjust = 0.

5))What the silhouette width graph above is saying is “the more you break the dataset, the more distinctive the clusters become”.

Ultimately, you will end up with individual data points — and you don’t want that, and if you try a larger k for the number of clusters — you will see it.

E.

g.

, at k=30, I got the following graph:So-so: the more you split, the better it gets, but we can’t be splitting till individual data points (remember that we have 30 clusters here in the graph above, and only 200 data points).

Summing it all up, agglomerative clustering in this case looks way more balanced to me — the cluster sizes are more or less comparable (look at that cluster with just 2 observations in the divisive section!), and I would go for 7 clusters obtained by this method.

Let’s see how they look and check what’s inside.

The dataset consists of 6 variables which need to be visualized in 2D or 3D, so it’s time for a challenge! The nature of categorical data poses some limitations too, so using some pre-defined solutions might get tricky.

What I want to a) see how observations are clustered, b) know is how observations are distributed across categories, thus I created a) a colored dendrogram, b) a heatmap of observations count per variable within each cluster.

library("ggplot2")library("reshape2")library("purrr")library("dplyr")# let's start with a dendrogramlibrary("dendextend")dendro <- as.

dendrogram(aggl.

clust.

c)dendro.

col <- dendro %>% set("branches_k_color", k = 7, value = c("darkslategray", "darkslategray4", "darkslategray3", "gold3", "darkcyan", "cyan3", "gold3")) %>% set("branches_lwd", 0.

6) %>% set("labels_colors", value = c("darkslategray")) %>% set("labels_cex", 0.

5)ggd1 <- as.

ggdend(dendro.

col)ggplot(ggd1, theme = theme_minimal()) + labs(x = "Num.

observations", y = "Height", title = "Dendrogram, k = 7")# Radial plot looks less cluttered (and cooler)ggplot(ggd1, labels = T) + scale_y_reverse(expand = c(0.

2, 0)) + coord_polar(theta="x")# Time for the heatmap# the 1st step here is to have 1 variable per row# factors have to be converted to characters in order not to be droppedclust.

num <- cutree(aggl.

clust.

c, k = 7)synthetic.

customers.

cl <- cbind(synthetic.

customers, clust.

num)cust.

long <- melt(data.

frame(lapply(synthetic.

customers.

cl, as.

character), stringsAsFactors=FALSE), id = c("id.

s", "clust.

num"), factorsAsStrings=T)cust.

long.

q <- cust.

long %>% group_by(clust.

num, variable, value) %>% mutate(count = n_distinct(id.

s)) %>% distinct(clust.

num, variable, value, count)# heatmap.

c will be suitable in case you want to go for absolute counts – but it doesn't tell much to my tasteheatmap.

c <- ggplot(cust.

long.

q, aes(x = clust.

num, y = factor(value, levels = c("x","y","z", "mon", "tue", "wed", "thu", "fri","sat","sun", "delicious", "the one you don't like", "pizza", "facebook", "email", "link", "app", "area1", "area2", "area3", "area4", "small", "med", "large"), ordered = T))) + geom_tile(aes(fill = count))+ scale_fill_gradient2(low = "darkslategray1", mid = "yellow", high = "turquoise4")# calculating the percent of each factor level in the absolute count of cluster memberscust.

long.

p <- cust.

long.

q %>% group_by(clust.

num, variable) %>% mutate(perc = count / sum(count)) %>% arrange(clust.

num)heatmap.

p <- ggplot(cust.

long.

p, aes(x = clust.

num, y = factor(value, levels = c("x","y","z", "mon", "tue", "wed", "thu", "fri","sat", "sun", "delicious", "the one you don't like", "pizza", "facebook", "email", "link", "app", "area1", "area2", "area3", "area4", "small", "med", "large"), ordered = T))) + geom_tile(aes(fill = perc), alpha = 0.

85)+ labs(title = "Distribution of characteristics across clusters", x = "Cluster number", y = NULL) + geom_hline(yintercept = 3.

5) + geom_hline(yintercept = 10.

5) + geom_hline(yintercept = 13.

5) + geom_hline(yintercept = 17.

5) + geom_hline(yintercept = 21.

5) + scale_fill_gradient2(low = "darkslategray1", mid = "yellow", high = "turquoise4")heatmap.

pHaving a heatmap, you see the how many observations fall into each factor level within initial factors (variables we’ve started with).

The deeper blue corresponds to a higher relative number of observations within a cluster.

In this one, you can also see that day of the week / basket size have almost the same amount of customers in each bin— it might mean those are not definitive for the analysis and might be omitted.

Closing noteIn this post, we’ve gone through dissimilarity matrix calculations, trying out agglomerative and divisive hierarchical clustering methods and had a look at cluster assessment with the “elbow” and “silhouette” methods.

In this post, we’ve gone through dissimilarity matrix calculations, trying out agglomerative and divisive hierarchical clustering methods and had a look at cluster assessment.

Divisive and agglomerative hierarchical clustering are a good place to start exploring, but don’t stop there if your goal is to be a cluster master — there are much more methods and techniques popping up out there.

In comparison with numerical data clustering, the main difference is hidden in the dissimilarity matrix calculation.

From the point of assessment, not all the standard clustering assessment methods will produce reliable and sensible results — the silhouette method will likely to be off.

Lastly, it’s been a while since I did this exercise.

By now, I see some drawbacks behind my approach, and I welcome any feedback.

One general pitfall of my analysis lies beyond clustering itself — my dataset was imbalanced in many ways, which remained unaddressed.

I could see the effect it had on clustering: having 70% of customers belonging to one factor level (nationality in this case), this group was dominating most of the resulting clusters, making it hard to figure out differences within other factor levels.

Balancing the dataset and comparing cluster results is what I’m going to try next, and will write a separate post to cover that.

Finally, if you’re up for cloning, here’s the github link: https://github.

com/khunreus/cluster-categoricalAnd hope you enjoyed it!Here are the sources that I found useful:Hierarchical clustering tutorial (data preparation, clustering, visualization), overall, this blog might be useful for someone interested in business analytics with R: http://uc-r.

github.

io/hc_clustering and https://uc-r.

github.

io/kmeans_clusteringCluster validation: http://www.

sthda.

com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/An example of document categorization (hierarchical and k-means): https://eight2late.

wordpress.

com/2015/07/22/a-gentle-introduction-to-cluster-analysis-using-r/denextend package is rather interesting, allows to compare cluster structure between different methods: https://cran.

r-project.

org/web/packages/dendextend/vignettes/introduction.

html#the-set-functionThere are not only dendrograms, but also clustergrams: https://www.

r-statistics.

com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/Combining clustering heatmap and a dendrogram: https://jcoliver.

github.

io/learn-r/008-ggplot-dendrograms-and-heatmaps.

htmlI personally would be interested in trying out the method introduced in https://www.

ncbi.

nlm.

nih.

gov/pmc/articles/PMC5025633/, their GitHub repository: https://github.

com/khunreus/EnsCat.. More details