Outlier-Aware Clustering: Beyond K-Means

Can data science help you choose your next job? First of all, that's a terrible idea.

[Photo from IRTI.]

But hey, let's see what happens.

The approach

What we propose is a combined approach: dimensionality reduction feeding an outlier-aware clustering step. Yes, I'm sure some of you will still stick to the ol' PCA/K-Means after reading this article, but I hope you'll get a new tool in your toolbox that's just as quick.

The approach uses two techniques that are quickly gaining in popularity: UMAP for dimensionality reduction, and HDBSCAN for clustering. We've had a lot of success with this combination across multiple projects, in human resources for behavioral archetype definitions and in recommendation systems for customer segmentation.

Finding a dataset to play with

Let's use a labor market analysis as an example for this test. With all of the shifting and changing of jobs, roles, and responsibilities, which industry should you work in? The ones that either already have high demand, or those that are growing.

A labor market analysis (LMA) is the standard approach for determining job market trends. It will give us an indication of job demand over time. Then, we will group each job into a specific category. These categories will be a proxy for each major industry that we might be interested in.

To determine the sector categories from the dataset alone, we will use the UMAP+HDBSCAN approach.

Our action plan

We'll collect a sample dataset of job postings containing the job title, job description, and posting year. From there, we will group the postings together based on the similarity of their words, then track the ebbs and flows of each group by counting the job postings in that cluster, per year.

Instead of starting off with an assumption about which clusters should be selected, we will let the clustering algorithms make those decisions for us. This is always the tricky part when performing an LMA. Does a Social Media Director fall under management, marketing, or IT? What about a medical robotics service technician, or an autonomous vehicle food delivery supervisor?

By looking at the content of the titles and the descriptions, we will let the categories dictate themselves.

The data

Kaggle's Armenian job postings dataset already has a sanitized view of job postings between 2004 and mid-2015. This is a nice dataset, because:

- It's fairly representative of most modern economies;
- It's small enough to play with easily;
- It contains a wide range of jobs, not just specific industries.

It was also affected by the 2008–2009 recession, just like everybody else:

[Figure: Number of job postings in the dataset, per year.]
[Figure: GDP of Armenia vs. Georgia and Albania, showing the Great Recession.]

This means that we are ready to organize this data to see how each job sector has been performing.

Performing word embedding on each of the job titles, descriptions, and job posts

Let's load all of the relevant libraries for this project. Before we generate a vector for each text element, we'll remove all of the generic English stop words ("if", "I", "because"...) and other generic words for each category. We do this by looking at the most common words for each entry type, generating a custom dictionary, and removing the selected words.

After removing the first list of stop words and printing the remaining words by count, we notice that many words repeat across all job postings.
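The full preprocessing code lives in the repo linked at the end of the article. As a rough sketch of this counting step (the CSV file name and the column names below are assumptions; check the actual Kaggle download), it can be as simple as:

```python
# Minimal sketch: count the most common words across job descriptions,
# after dropping generic English stop words.
# Assumptions: the Kaggle CSV is saved as "online_job_postings.csv" and has
# "Title" and "JobDescription" columns (verify against the actual file).
import re
from collections import Counter

import pandas as pd
from nltk.corpus import stopwords  # run nltk.download("stopwords") once

df = pd.read_csv("online_job_postings.csv")
stop_words = set(stopwords.words("english"))

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop generic English stop words."""
    tokens = re.findall(r"[a-z]+", str(text).lower())
    return [t for t in tokens if t not in stop_words]

# Build the word counts that reveal the dataset-specific "boring" words.
word_counts = Counter()
for description in df["JobDescription"].dropna():
    word_counts.update(tokenize(description))

print(word_counts.most_common(20))
```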
This outputs a clear list of "boring" words:

[('responsible', 4711), ('looking', 4290), ('incumbent', 3840), ('position', 3717), ('development', 3538), ('seeking', 3015)…

This would've been a really good opportunity to use TF-IDF to remove words based on how often they appear, but this dataset was a bit too small for it to be worthwhile, and the edits can be performed manually. (Whatever gets the job done faster is what we're fans of.) Let's remove a select few so that the following word embedding has fewer words shared across the whole dataset.

Now we're at the beginning of the fun stuff: word embeddings. Although not recommended for long statements or documents (the Yoon Kim CNN works best with a per-word vector when classifying statements, for example), we know that removing low-signal, generic words will give us better separation between the vectors.

Frankenstein lives: UMAP + HDBSCAN

On typical numerical or categorical data, K-Means makes a lot of sense for creating clusters. It also works well for separating simple word embeddings (1 to 4 words), but it loses signal when combining vectors built from longer strings, where the cosine similarities between word embeddings end up much closer to one another. A typical approach to sorting clusters with K-Means involves the following:

1. Choose a varying number of clusters (usually a slider from 1 to 20).
2. Using the elbow method, find the optimal number of clusters.
3. To confirm that the elbow method assists in deciding the number of clusters, use the silhouette value to see how well each point fits in its target cluster.

However, there are some potential issues with this approach, and they are why we want HDBSCAN rather than K-Means to take care of these points:

- Not all points can or should be clustered.
- Clusters may not be round or spherical (cigar-shaped, half-moon, etc.).

Why isn't K-Means ideal for clustering large word embeddings? Because it tries to find the center of a spherical cluster. If a cluster is oblong, its extremities get cut off and its center is exposed to outliers.

Instead of tweaking the K-Means parameters until the cows come home, let's merge UMAP and HDBSCAN. This involves:

- Redefining the space in which the vectors exist by building a new embedding space on a Riemannian manifold using UMAP;
- Using HDBSCAN to group close, but not necessarily spherical, clusters together while ignoring outliers.

What is UMAP?

Uniform Manifold Approximation and Projection (UMAP) is a recent technique for dimensionality reduction.

In PCA, dimensionality reduction is performed by creating vectors that can adequately recreate the location of every point. If my data exists in ℝ³⁰⁰ (hello, word embeddings), I can define 10 (or 2, or 20) eigenvectors that reproduce as many of the points as possible. There is always a loss of information, but PCA can usually retain 70%-80% of the original variance.

In UMAP, the approach is to create a Riemannian manifold: an arbitrary shape that is curved, smooth, and connects back to itself. I now have a shape that I can untwist and flatten, on which all of my points sit fairly close to the surface, especially if I have a long enough strand. Think of a ball of twine where both ends meet; this would be a Riemannian manifold in ℝ³. All I need to do at this point is unravel the string, flatten out the cylinder, and scale the data. Boom! You now have a 2D projection.

Another way to think about it is putting the Earth on a map. Mercator and polar projections will have different skews, and those skews can be ignored (or not) depending on the data we are looking at. The UMAP GitHub page explains the technique in far more rigorous terms.

But why use a Riemannian manifold? Fairly simple. 3D space is defined by rules (gravity, time, distance, etc.). If I say, "Let's imagine a world where distance and angles are not an issue", you say, "Great! Now a cube is a sphere is a prism, because distances and angles don't matter."

Now that we are living in ElastiSpace™, we can solve certain problems much more easily. Distance still doesn't matter, but you quickly realize that the sharp edges on cubes and prisms mess up your equations. So I create ElastiSpaceV2™, where spheres and ellipsoids are the same thing, but cubes now have their own category.

But what about all of the math you learned in high school? Shouldn't trigonometry still work after all the time you spent practicing those exercises? ElastiSpaceV2™ can't help you, but Mr. Riemann's got chu, fam. This is where spheres and ellipses are different, and all manner of fancy math still works, so you are able to perform all of the calculations you want.

(This section was paraphrased from this Quora post.)

Now back to job postings.

Reduce and cluster the vectors

Here's the code to reduce the dimensionality and start clustering the assembled vectors.
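(The full notebook is in the repo linked at the end of the article. What follows is a condensed sketch of the step rather than the author's exact code: spaCy's "en_core_web_md" vectors stand in for the word embeddings, the "Title" column and the UMAP/HDBSCAN parameters are illustrative assumptions, and `df` and `tokenize` come from the earlier snippet.)

```python
# Sketch: embed the cleaned titles, reduce with UMAP, cluster with HDBSCAN.
# The embedding model, column name, and parameters are illustrative choices.
import numpy as np
import spacy
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan

nlp = spacy.load("en_core_web_md")  # 300-dimensional word vectors

def embed(text):
    """Average the word vectors of the remaining (non-boring) tokens."""
    return nlp(" ".join(tokenize(text))).vector

# One vector per job title; the same can be done with the descriptions.
title_vectors = np.vstack([embed(title) for title in df["Title"].fillna("")])

# Step 1: UMAP re-embeds the high-dimensional vectors on a low-dimensional
# manifold. Cosine distance suits averaged word vectors.
reducer = umap.UMAP(n_neighbors=15, n_components=2, metric="cosine",
                    random_state=42)
reduced = reducer.fit_transform(title_vectors)

# Step 2: HDBSCAN groups nearby points into clusters of arbitrary shape and
# labels outliers as -1 instead of forcing them into a cluster.
clusterer = hdbscan.HDBSCAN(min_cluster_size=20)
labels = clusterer.fit_predict(reduced)

print(labels)
print("Number of clusters found:", labels.max() + 1)
```

Here, min_cluster_size is the main knob: it sets how many postings a group needs before HDBSCAN treats it as a real cluster rather than noise.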
There are two things to notice here:

- The number of clusters was determined by HDBSCAN, not by us. If you re-run it, you will get some volatility, so save the clusters once you have something that works.
- There are some unclustered values (indicated by "-1"), about one-third of them. This means that clustering is only performed on the remaining values.

[21 -1 -1 -1 4 -1 -1 25 25 -1 7 12 19 -1 25 -1 25 4 25 8 25 8 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 -1 9 -1 -1 9 1 25 -1 20 -1 -1 -1 -1 -1 -1 -1 12 -1 -1 24 -1 2 25 -1 1 -1 7 12 -1 -1 14 -1 11 25 -1 -1 21 -1 -1 -1 -1 2 9 -1 11 12 9 12 25 11 7 -1 12 25 25 7 -1 -1 -1 -1 -1 -1 25 7 -1 21 -1 -1 19 -1 14 -1 -1 25 25 -1 25 8 25 25 -1 -1 -1 -1 -1 -1 25 12 -1 -1 25 8 -1 25 -1 -1 -1 -1 7 -1 25 -1 22 -1 25 16 9 -1 12 9 -1 7 12 -1 -1 8 -1 -1 8 24 24 7 25 -1 -1 7 7 7 25 7 1 -1 24 12 -1 -1 25 -1 7 -1 -1 8 13 -1 8 8 1 -1 -1 13 13 13 13 16 16 18 0 -1 16 25 -1 -1 -1 4 -1 -1 -1 15 25 -1 -1 6 7 -1 8 -1 25 -1 -1 -1 4 -1 -1 -1 -1 18 25 25 -1 25 20 25 7 7 7 25 -1 -1 25 25 -1 -1 -1 -1 27 -1 11 25 -1 25 25 18 -1 8 25 25 25 2 -1 -1 25 0 -1 -1 25 -1 12 25 -1 -1 -1 -1 7 -1 -1 -1 7 -1 27 25 -1 27 -1 -1 16 -1 -1 -1 -1 -1 -1 25 -1 -1 25 -1 25 14 -1 1 -1 -1 8 -1 1 25 25 7 7 -1 7 25 -1 25 -1 -1 -1 7 -1 7 -1 25 -1 25 12 -1 25 4 7 7 7 7 6 -1 7 -1 6 -1 11 16 19 -1 -1 -1 25 24 -1 16 -1 -1 -1 -1 -1 -1 -1 -1 -1 2 -1 4 7 25 -1 25 4 7 -1 25 25 1 -1 25 -1 -1 4 -1 7 -1 9 25 -1 -1 7 8 8 25 14 -1 -1 4 25 25 -1 8 25 19 -1 -1 25 15 16 -1 -1 15 21 -1 -1 25 -1 7 7 -1 -1 7 7 8 -1 18 2 16 4 -1 -1 9 9 -1 -1 -1 -1 -1 25 -1 -1 12 -1 7 20 7 -1 -1 25 -1 25 25 -1 25 6 25 7 17 -1 7 16 17 7 7 -1 4 15 -1 7 16 25 -1 25 25 -1 -1 -1 1 25 -1 25 -1 16 25 25 24 -1 12 -1 -1 -1 6 -1 25 1 11 -1 7 6 25 25 -1 25 25 -1 -1 25 1 -1 7 1 -1 -1 4 12 14]

So what do these clusters represent? Let's count the top words for each cluster to determine what they contain.
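A rough sketch of how those per-cluster counts can be produced, again reusing the hypothetical `df`, `labels`, and `tokenize` from the snippets above:

```python
# Sketch: top words per cluster, skipping the -1 outlier bucket.
from collections import Counter

df["cluster"] = labels

for cluster_id, group in df[df["cluster"] != -1].groupby("cluster"):
    counts = Counter()
    for title in group["Title"].fillna(""):
        counts.update(tokenize(title))
    print("Cluster:", cluster_id)
    print(counts.most_common(5))
    print("—")
```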
Look at how well it broke down!

Cluster: 0
[('lawyer', 125), ('advocate', 3), ('attorney', 1)]
—
Cluster: 1
[('accountant', 259), ('auditor', 22), ('outsourcing', 4), ('cheef', 1), ('bookkeeper', 1)]
—
Cluster: 2
[('software', 130), ('developer', 125), ('developers', 4), ('programmer', 4), ('programmers', 1)]
—
Cluster: 3
[('developer', 166), ('android', 70), ('ios', 64), ('senior', 48), ('application', 32)]
—
Cluster: 4
[('accountant', 233), ('chief', 225), ('assistant', 29), ('deputy', 26), ('engineer', 11)]
—
…
—
Cluster: 25
[('specialist', 227), ('expert', 226), ('development', 217), ('coordinator', 157), ('program', 147)]
—
Cluster: 26
[('service', 78), ('manager', 55), ('customer', 49), ('department', 17), ('corporate', 15)]
—

We can even get a more compact view:

Cluster 0: lawyer / advocate / attorney
Cluster 1: accountant / auditor / outsourcing / cheef
Cluster 2: software / developer / developers / programmer
Cluster 3: developer / android / ios / senior
Cluster 4: accountant / chief / assistant / deputy
Cluster 5: medical / representative / yerevan / armenia
Cluster 6: developer / java / php / senior
Cluster 7: developer / senior / software / web
Cluster 8: preseller / merchandiser / methodologist / 3d
Cluster 9: language / translator / english / interpreter
Cluster 10: hr / manager / assistant / generalist
Cluster 11: manager / project / construction / senior
Cluster 12: administrative / assistant / office / manager
Cluster 13: quality / assurance / engineer / tester
Cluster 14: assistant / executive / director / ceo
Cluster 15: administrator / system / network / systems
Cluster 16: engineer / software / senior / support
Cluster 17: qa / engineer / senior / manager
Cluster 18: analyst / financial / business / senior
Cluster 19: marketing / specialist / pr / digital
Cluster 20: legal / lawyer / senior / consultant
Cluster 21: finance / officer / chief / financial
Cluster 22: manager / branch / operator / teller
Cluster 23: manager / sales / assistant / retail
Cluster 24: agent / sales / manager / international
Cluster 25: specialist / expert / development / coordinator
Cluster 26: service / manager / customer / department

And here's a pretty graph, split by cluster. We now have a pretty graph showing us the demand for each job sector, over time. Yay!

Overall, this approach is almost as quick to implement as the standard PCA/K-Means. However, with the intermediate visualization steps, I find it much easier to tell if we are on the right track or not in terms of clustering.

You can find all of the code here: https://github.com/elmathioso/lma_jobs

Happy clustering!

-Matt.
matt@lemay.ai
https://www.linkedin.com/in/mnlemay/
Lemay.ai
1 (855) LEMAY-AI

Other articles you may enjoy:

- Rorschach Tests for Deep Learning Image Classifiers
- Hubris, laziness and playing with toys: the winning attitudes of a startup engineer
- AI Prototypes for Clients as a Marketing Strategy

Other articles you may enjoy from Daniel Shapiro, my CTO:

- Artificial Intelligence and Bad Data
- Artificial Intelligence: Hyperparameters
- Artificial Intelligence: Get your users to label your data
