Clustering — Unsupervised Machine Learning


Unsupervised machine learning, and more specifically K-Means, works by clustering similar data points together in (potentially high-dimensional) space.

To understand it more clearly, consider the plot below:

[Figure: Schematic representation of K-Means]

As we can see, on the left, the data points are initially scattered.

Assume, without loss of generality, that we do not know how the data points are related.

In other words, just by looking at the graph we cannot tell that certain points are similar merely because they are located near each other (also, imagine that the data points are high-dimensional, i.e., have more than 3 dimensions).

What clustering does is group the data points that are close to each other into one cluster, irrespective of the number of dimensions, thereby indicating that the data points within a single cluster belong to a particular class.

This simple idea has the potential to solve many of the problems which our society faces:

- Market Segmentation: the process of dividing a market of potential customers into groups, or segments, based on different characteristics. The segments created are composed of consumers who will respond similarly to marketing strategies and who share traits such as similar interests, needs, or locations.
- Social Network Analysis: the process of analysing users of a social media platform who have similar “tastes.” After identifying the users who possess similar tastes, running targeted advertisements becomes much easier.
- Astronomical Data Analysis: the process of analysing unlabelled astronomical data to figure out hidden patterns.

Note: You might be surprised to hear it, but the amount of labelled data in today’s world is only a fraction of the colossal amount of data available. This fact further bolsters the case for learning how to do unsupervised machine learning.

K-Means Algorithm

1. Choose K, the number of clusters.

Although we are talking about unsupervised machine learning, we must not be under the illusion that the algorithm magically clusters the input dataset into some number of clusters; we need to specify how many clusters we want. Based on domain knowledge, one can often specify the required number of clusters. Nevertheless, even if you do not know how many clusters are present, there is a technique for choosing ‘K’, covered later in this article.

2. From the set of all available data points, randomly choose K data points and call them “cluster centroids.”

3. Cluster assignment step.

Go through the entire dataset, and assign each data point x(i) to the cluster centroid it is closest to. How do we determine this “closeness”? Well, we do it by calculating the Euclidean distance between the point and each centroid. Now we will have clusters formed. We denote by c(i) the index of the cluster centroid closest to x(i); naturally, c(i) ranges from 1 to K, where K is the number of clusters we want.

4. Move centroid step. Move each cluster centroid to a new location determined by the average of the points in its cluster (i.e., the mean of the locations of all points assigned to that cluster).

5. Repeat steps 3 and 4 until the move centroid step no longer makes any significant progress.

Note: If you think that the explanation is a bit hazy or extremely simplified, hang on! It will become clear once you see the code.
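As a first taste, here is a minimal from-scratch sketch of the steps above in NumPy. This is my own illustrative sketch, not the notebook code referenced below; X is assumed to be an (m, n) array of data points and K the desired number of clusters.

```python
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick K data points as the initial cluster centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 3 (cluster assignment): c[i] is the index of the centroid
        # closest to x(i), measured by Euclidean distance
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = distances.argmin(axis=1)
        # Step 4 (move centroid): each centroid moves to the mean of its points
        # (a production implementation would also re-seed empty clusters)
        new_centroids = np.array([X[c == k].mean(axis=0) for k in range(K)])
        # Stop once the move centroid step makes no significant progress
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return c, centroids
```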

Implementing K-Means

We will be using the following dataset on cars to perform clustering (downloaded from Kaggle). In order to give you a full picture of the dataset, let us view the seaborn pairplot:

[Figure: seaborn pairplot of the cars dataset]

The entire code base for running K-Means (along with the above dataset) is available in my Github repository as an IPython notebook.

For the sake of clarity, you can refer to the code snippet below. The cost function which we use in K-Means to determine how good the clustering is, is called the distortion cost function. Essentially, it is the average squared distance of each data point x(i) to the cluster centroid to which it is assigned.

In order to visualise the clusters, take two of the available columns from the cars.csv file. The visualisations below use the columns “hp” and “mpg” (however, you are free to choose any pair of columns):

1. K = 2
2. K = 4
3. K = 8
4. K = 16

[Figures: clustering results for K = 2, 4, 8 and 16]

In all the above figures, the first plot shows the dataset along with the final positions of the cluster centroids (denoted as triangles). The next plot shows the resultant clusters.
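Plots like these can be reproduced with a few lines of matplotlib. The sketch below reuses the hp/mpg array X from the earlier snippet; the colours and markers are my own choices.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

K = 4
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)            # X: the two-column hp/mpg array from earlier
centroids = kmeans.cluster_centers_

# Colour each point by its cluster and mark the centroids as triangles
plt.scatter(X[:, 0], X[:, 1], c=labels, s=30)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="^", c="red", s=200,
            label="cluster centroids")
plt.xlabel("hp")
plt.ylabel("mpg")
plt.legend()
plt.title(f"K-Means clustering with K = {K}")
plt.show()
```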

As we can see, the dataset appears to have somewhere around 2–4 clusters.

Why only 2–4 clusters, and not 8 or 16? Well, by looking at the plots we can easily recognise that K = 8 and K = 16 redundantly split up data points that are already close together. This observation is intuitive. But what if our dataset is high-dimensional? What if we cannot plot it on a 2D plane and visually check whether our choice of ‘K’ is right or wrong? The next section addresses this concern.

Choosing the ‘K’ in K-Means

The way to choose K, without relying on domain knowledge or visualisation, is to follow the elbow method. We run K-Means several times with different values of ‘K’ (i.e., first with only one cluster centroid, then two, and so on). For each run, we collect the value of the cost function and plot it against ‘K’. As ‘K’ increases, we observe that the cost function J also decreases. But after some point, say from K = 3 or 4 onwards, J starts decreasing more slowly, and you get a graph which looks like an elbow:

[Figure: cost J plotted against K, showing the elbow]

Empirically, it has been found that the elbow point corresponds to the optimal value of K.
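Here is a minimal sketch of the elbow method, again reusing the two-column array X from earlier; scikit-learn's inertia_ (the summed squared distances) serves as the cost J.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
costs = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(km.inertia_)   # cost J for this value of K

# The "elbow" is the K beyond which the cost stops dropping sharply
plt.plot(list(ks), costs, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("cost J")
plt.show()
```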

Image compression using K-Means

It is time to test our knowledge of K-Means and apply it to solve a real-life problem. We will be using K-Means to perform image compression. The entire code base for image compression is available in my Github repository, as an IPython notebook as well as Octave code.

[Figure: the leftmost image is the original; the middle image is compressed but retains a fair amount of resolution; the rightmost image is highly compressed and low resolution. Compression has been done using K-Means!]

Consider that you have an image of size 128 X 128 X 3.

If you vectorise the image, you will have a numpy array of size 16384 X 3.

We can treat this image as numeric data points, i.e., we ignore the fact that the data represents an image. More concretely, you can think of it as any other numpy array of size 16384 X 3, where the total number of examples is m = 16384 and the total number of features is n = 3.

If we apply K-Means to this dataset, choosing, say, K = 16, we will be picking the 16 most important data points (which are nothing but the cluster centroids). If we now think of the array as an image again, the only difference is that we need just 4 bits (since 2⁴ = 16 = K) to represent the colour of each pixel.

Total size of the new image is: 128 X 128 X 4 = 65536 bits.

We still, however, require some storage overhead.

We used only 4 bits to represent 16 colours.

However, each colour (if we assume RGB format) requires 8 bits per channel.

In other words, R + G + B = 8 + 8 + 8 = 24 bits to represent one colour.

Since we chose K = 16, corresponding to 16 colours, we require 24 X 16 = 384 bits extra.

Therefore, total number of bits to represent the new image: 65536 + 384 = 65920 bits.

Compare this with the original image, which had 128 X 128 pixels, and each pixel is a 24 bit colour — which results in 128 X 128 X 24 = 393216 bits.

Clearly, we have compressed the image by a factor of roughly 6! Amazing! This same principle is used in the project hosted on my Github repository (linked previously).

Feel free to check it out and try compressing new images with varying values of K.
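Here is a minimal sketch of the same idea; the file name image.png is a placeholder, and Pillow and scikit-learn are assumed to be installed.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load a 128 x 128 RGB image and flatten it into 16384 pixels with 3 features each
img = np.asarray(Image.open("image.png").convert("RGB"))
pixels = img.reshape(-1, 3).astype(float)

# K = 16 colours, so each pixel's colour index fits in 4 bits
K = 16
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with its cluster centroid: only K distinct colours remain
compressed = km.cluster_centers_[km.labels_].reshape(img.shape)
Image.fromarray(compressed.astype(np.uint8)).save("compressed.png")
```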

Remember, a high value of K means that you will not compress the image by a large margin, i.e., you will retain a lot of the resolution. However, if you choose a small value of K, the image will be highly compressed and, as a result, have low resolution.

Congrats on completing the entire article! If you loved reading it, hit thumbs up! It means a lot.

Share it with your friends if you find it interesting.

Moreover, if there are any loopholes in my explanation or in my code, please feel free to point them out.

Let us learn and grow together!

In my next article, I will talk about PCA (Principal Component Analysis) and how it helps speed up neural network training.
