Unsupervised Learning: Clustering

That’s where unsupervised learning comes in.

So what is unsupervised learning? There are three types: clustering (what we’re going to focus on), dimensionality reduction, and autoencoding.

Dimensionality reduction (aka data compression) does exactly what it sounds like it does.

It finds ways to shrink and encode your data so that it’s easier, faster, and cheaper to run through a model.

It’s commonly used for images in order to break them down but retain most of the information.

Data compression rests on the assumption that most data is somewhat redundant and can be re-encoded to carry the same information more efficiently.

There are two main techniques: principal component analysis (PCA), which finds linear combinations of variables that capture most of the variance, and singular-value decomposition (SVD), which factorizes a dataset into three smaller matrices.
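As a rough illustration of the first of these, here is a minimal scikit-learn sketch (the digits dataset and the choice of two components are stand-ins for the example, not anything from the original post):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data            # 1,797 digit images, each flattened to 64 pixel features
pca = PCA(n_components=2)         # keep the two directions that explain the most variance
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (1797, 64) -> (1797, 2)
print(pca.explained_variance_ratio_)    # share of the variance each component retains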

An autoencoder is very similar to data compression in the sense that it shrinks the data, but it does this through deep learning: the data is fed into a neural network whose weights are trained to mold an output that is the best possible representation of the input.
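A minimal sketch of that idea, assuming TensorFlow/Keras and random stand-in data (the layer sizes here are arbitrary choices for illustration):

import numpy as np
from tensorflow.keras import layers, models

X = np.random.rand(1000, 64).astype("float32")  # stand-in data: 1,000 samples with 64 features

autoencoder = models.Sequential([
    layers.Input(shape=(64,)),
    layers.Dense(8, activation="relu"),      # encoder: squeeze 64 features into 8
    layers.Dense(64, activation="sigmoid"),  # decoder: rebuild the original 64 features
])

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)  # the training target is the input itself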

Clustering also does its name justice.

It takes the unlabeled data and organizes it into similar groups.

There are three ways it can do this.

First, there is k-means clustering, which creates k mutually exclusive groups.

It does this by assigning k random centroids to the data and assigning the observations to the centroid they’re closest to.

The centroid is centered within these observations and the process repeats until the centroids effectively stop moving.
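That loop is simple enough to sketch directly (a toy NumPy version, assuming X is a 2-D array of observations; the scikit-learn snippet further below is what you’d actually use):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the center of the observations assigned to it
        # (a real implementation would also handle clusters that end up empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids have effectively stopped moving
            break
        centroids = new_centroids
    return labels, centroids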

The difficult part is that the process of choosing a proper k can be complex.

A larger k means smaller groups and therefore more granularity, but you might want the groups to be clustered more broadly.

Below is a source with an interactive visualization that clearly explains k-means clustering further.

Visualizing K-Means Clustering (www.naftaliharris.com)

from sklearn.cluster import KMeans

k = 10
kmeans = KMeans(n_clusters=k).fit(X)  # assign every observation in X to one of k clusters
kmeans.labels_                        # the cluster label for each observation

There’s also hierarchical clustering, which begins with n clusters, one for each observation.

From there, it combines the closest two clusters into a larger cluster and repeats this until all observations are in a single cluster.

This is called agglomerative clustering and its reverse (one group splitting into many) is called divisive clustering.

You can cut the resulting dendrogram at the level that gives you the desired number of clusters.
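The scikit-learn snippet below hands you flat labels directly; if you want to build the dendrogram yourself and cut it, SciPy’s hierarchy module is one way to do it (a sketch under that assumption, with X as your feature matrix):

from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X, method="ward")                     # bottom-up merging of observations into a tree
labels = fcluster(Z, t=10, criterion="maxclust")  # cut the tree so at most 10 flat clusters remain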

Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering

clusters = AgglomerativeClustering(n_clusters=10).fit(X)  # merge from the bottom up until 10 clusters remain
clusters.labels_                                          # the cluster label for each observation

Lastly, there is probabilistic clustering, a softer form of clustering that, instead of assigning each observation to a single group, assigns a probability of belonging to each group.

This is helpful if you want to know how similar an observation is to each group rather than just which group it is most similar to.
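The post doesn’t tie this to a specific algorithm; a Gaussian mixture model is one common way to get those per-group probabilities, sketched here with scikit-learn (X is assumed to be the same feature matrix as above):

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=10, random_state=0).fit(X)
probs = gmm.predict_proba(X)  # one row per observation, one column per cluster: membership probabilities
labels = gmm.predict(X)       # a hard assignment is still available if you want one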

There are two main challenges of unsupervised learning.

First, specifically with clustering, the resulting clusters require exploration.

The algorithm will split the data, but it will not tell you how it did so or what the similarities within each cluster are, and finding those similarities may have been the whole point of running it.

Second, it’s difficult to know if it worked properly.

Unlike with supervised learning, there is no ground truth to compare against, so there is no accuracy metric you can use to evaluate it.

Back to the Pokemon!

Using a dataset of 800 Pokemon with features including HP (Hit Points), Attack and Defense, primary and secondary type, and generation, I decided to see how an algorithm would separate the Pokemon into clusters.
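The actual notebook is linked at the end of the post; the sketch below is only a rough guess at what that kind of pipeline looks like (the file name, column names, and use of scaling are assumptions, not the author’s exact code):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

pokemon = pd.read_csv("Pokemon.csv")  # hypothetical path to the 800-Pokemon dataset

# Cluster on the numeric battle stats only; the Legendary flag is deliberately left out
stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
X = StandardScaler().fit_transform(pokemon[stats])

pokemon["cluster"] = KMeans(n_clusters=10, random_state=0).fit_predict(X)
print(pokemon.groupby("cluster")[stats].mean())  # what does a typical member of each cluster look like?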

There were a few predictions I had, first of which was separating into generations.

I based this on the assumption that, over the past 21 years, they must have subconsciously made the newer Pokemon objectively better.

(They didn’t.) What I found was groups very similar to the ones I have a sense most people unknowingly make while they play.

There are some Pokemon worth catching, training, and battling with, and there are some that you catch for your Pokedex and leave in your PC (a concept I still don’t understand from the many years ago I played; I mean, how do you store a Pokemon you physically catch in a computer?).

My main takeaway? There is an objective difference between legendary and nonlegendary Pokemon.

This makes sense, but it’s proven by the fact that, out of the 10 clusters made, two are exclusively legendary and a third has only one nonlegendary that snuck in, even though the legendary flag was never included in the data given to the machine.

Not only that, but it split the top attackers, top defenders, and a few balanced between.

In terms of nonlegendary Pokemon, it did something peculiar as well.

It created a group of strong Dragon Pokemon, a group of older but strong water/normal type Pokemon, a group of poison/fighting type Pokemon with high HP and attack but low defense, a group of nature (rock/bug/water) type Pokemon with high attack and defense but low HP, a group with newer average Pokemon (mostly starters and low evolutions), a group with older average Pokemon, and a group of fairy and ghost type Pokemon with decent HP but low attack and defense.

So… essentially it created these clusters of “you want these,” “you could work with these,” and “don’t waste your time on these.” There are obviously some questionable decisions here and there; for instance, Magikarp, an infamously useless Pokemon pre-evolution, was placed with the older but strong water/normal types when it realistically should have been in the older and average group.

However, its evolution Gyarados happens to be in the same group, and this trend spans most groups: evolutions are all placed in the same cluster regardless of whether one is significantly better than the other.

All in all, there appears to be some method to the madness.

Conclusion

Clustering, and unsupervised learning in general, can be a very useful tool if you want to understand the types of data you have.

However, the insight you gain from it depends on you and your understanding of the data.

It’s ultimately up to you to decide what your clusters mean and how to use them.

Also, maybe we should give Magikarp a chance.

Magikarp’s Confusing Evolution
https://github.com/taylorfogarty/launch/blob/master/neural_net_pokemon.ipynb
