Market Segmentation with R (PCA & K-means Clustering) — Part 1

A human brain simply can’t operate with that much information in a short period of time.

At least my brain can’t for sure.


This is where PCA can step in and do the task for you.

By performing PCA on our data, we can transform the 24 correlated variables into a smaller number of uncorrelated variables called principal components.

With the smaller, compressed set of variables, we can perform further computation with ease and investigate hidden patterns within the data that were hard to discover at first.

While there are plenty of articles, videos, and books out there that provide thorough explanations of PCA, I hope to present a few high-level points for people who find those materials too technical:

Variability makes data useful.

Imagine a dataset with 10,000 uniform values.

It does not tell you much, and it’s boring.

Again, PCA’s function is to create a smaller set of variables (principal components) that captures the variability within the original, much larger dataset.

Each principal component is a linear combination of the initial variables.

The principal components are orthogonal to each other, which means they are uncorrelated.

The first principal component (PC1) captures the most variability within the data.

The second principal component (PC2) captures the second most.

The third principal component (PC3) captures the third most…and so on.

In addition, here are a couple of terms you should know if you are planning to run PCA for your project:

Loading describes the relationship between the original variables and the new principal component.

Specifically, it describes the weight given to an original variable when calculating a new principal component.

Score describes the relationship between the original data and the newly generated axis.

In other words, score is the new value for a data row in the principal component space.

Proportion of Variance indicates the share of the total data variability each principal component accounts for.

It is often used with Cumulative Proportion to evaluate the usefulness of a principal component.

Cumulative Proportion represents the cumulative proportion of variance explained by consecutive principal components.

The cumulative proportion explained by all principal components equals 1 (100% of the data variability is explained).
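To make these terms concrete, here is a minimal sketch that maps each of them to prcomp() output. The toy data frame toy_df and its columns are made up purely for illustration:

```r
# Toy illustration (toy_df is hypothetical): mapping PCA terms to prcomp() output
set.seed(42)
toy_df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
p <- prcomp(toy_df, center = TRUE, scale. = TRUE)

p$rotation                        # loadings: weight of each original variable in each PC
head(p$x)                         # scores: each row's position in principal component space
p$sdev^2 / sum(p$sdev^2)          # proportion of variance explained by each PC
cumsum(p$sdev^2 / sum(p$sdev^2))  # cumulative proportion (reaches 1 at the last PC)
round(cor(p$x), 2)                # scores are uncorrelated: off-diagonal entries are ~0
```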

Running PCA in R

Before you run a PCA, you should take a look at your data correlation.

If your data is not highly correlated, you might not need a PCA at all!

```r
# Creating a correlation plot
library(ggcorrplot)
cormat <- round(cor(raw), 2)
ggcorrplot(cormat, hc.order = TRUE, type = "lower", outline.color = "white")
```

[Figure: Correlation plot]

As the graph shows, our variables are quite correlated.

We can proceed to PCA happily ✌️.

```r
# PCA
pr_out <- prcomp(raw, center = TRUE, scale = TRUE) # Scaling data before PCA is usually advisable!
summary(pr_out)
```

[Output: PCA summary]

There are 24 new principal components because we had 24 variables in the first place.

The first principal component accounts for 28% of the data variance.

The second principal component accounts for 8.8%.

The third accounts for 7.6%…

We can use a scree plot to visualize this:

```r
# Scree plot
pr_var <- pr_out$sdev ^ 2
pve <- pr_var / sum(pr_var)
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
```

[Figure: Scree plot]

The x-axis shows each principal component, and the y-axis shows the proportion of variance explained (PVE) by each.

The variance explained drastically decreases after PC2.

This spot is often called an elbow point, indicating the number of PCs that should be used for the analysis.

```r
# Cumulative PVE plot
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
```

[Figure: Cumulative proportion of variance explained]

If we choose only 2 principal components, they will explain less than 40% of the total variance in the data.

This number is perhaps not enough.
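If you prefer a numeric cutoff to eyeballing the plot, here is a small sketch; the 80% threshold is an arbitrary assumption for illustration, not a rule from this analysis:

```r
# How many PCs are needed to reach a chosen share of explained variance?
threshold <- 0.80  # illustrative assumption, not a universal rule
which(cumsum(pve) >= threshold)[1]
```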

Another rule for choosing the number of PCs is to keep those with eigenvalues higher than 1.

This is called the Kaiser rule, and it is controversial.

You can find many debates on this topic online.
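If you want to check the rule against our data, here is a minimal sketch: for a PCA on standardized data, the eigenvalues are the squared standard deviations that prcomp() reports.

```r
# Kaiser rule sketch: keep components with eigenvalues greater than 1
eigenvalues <- pr_out$sdev^2
which(eigenvalues > 1)  # which PCs pass the rule
sum(eigenvalues > 1)    # how many PCs the rule would keep
```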

Basically, there isn’t a single best way to decide the number of PCs.

People use PCA for different purposes, and it is always important to think about what you want to get out of your PCA analysis before making the decision.

In our case, since we are using PCA to determine meaningful and actionable market segments, one criterion we should definitely consider is whether the PCs we decide on make sense in real-world business settings.

Interpreting Results

Let’s pick the first 5 PCs for now, since 5 components are not too hard to work with and the choice follows the Kaiser rule.

Next, we want to make sense of these PCs.

Remember that loadings describe the weights given to each raw variable in calculating a new principal component? They are key to interpreting the PCA results.

Since working directly with the PCA loadings can be tricky and confusing, we can rotate these loadings to make interpretation easier.

There are multiple rotation methods out there, and we will use a method called “varimax”.

(Note, this step of rotation is NOT a part of the PCA.

It simply helps to interpret our results.

Here is a good thread on the topic.)

```r
# Rotate loadings
rot_loading <- varimax(pr_out$rotation[, 1:5])
rot_loading
```

[Output: Varimax-rotated loadings up to Q12]

Here’s an incomplete portion of the varimax-rotated loadings, covering questions up to Q12.

The numbers in the table correspond to the relationships between our questions (raw variables) and the selected components.

If the number is positive, the variable positively contributes to the component.

If it’s negative, then they are negatively related.

The larger the number, the stronger the relationship.

With these loadings, we can refer back to our questionnaire to get some ideas about what each PC is about.
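To make that inspection easier, here is a small sketch, assuming the rot_loading object created above, that sorts one component’s loadings by magnitude so the strongest contributors stand out:

```r
# Sort PC1's rotated loadings by absolute value (largest first)
pc1 <- rot_loading$loadings[, 1]
pc1[order(abs(pc1), decreasing = TRUE)]
```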

Let’s look at PC1, for example.

I noticed that Q10, Q3 & Q7 negatively contribute to PC1.

On the other hand, I see that Q8 & Q11 positively contribute to PC1.

Checking the questionnaire, I realized that Q10, Q3 & Q7 are questions related to the style of the charger, while Q8 & Q11 focus on the functionality of the product.

Therefore, we can make a temporary conclusion that PC1 describes people’s preference for the product’s functionality.

It makes sense that people who value functionality more might not care too much about style.

Then, you can move on to PC2 and follow the same procedure to interpret each PC.

I will not go through the complete process here, but I hope you get the idea.

Once you go through all the PCs and feel that each describes unique, logically coherent traits that make business sense, you’re ready for the next step.

However, if you feel like some information is missing or is repetitive within the PCs, you can consider going back and including more PCs, or you can eliminate some.

You might have to go through several iterations until you get a satisfying result.

We’re done!! Just kidding.

But you are halfway there.

Using PCA, you’ve walked through the process of compressing a large dataset into a smaller set of variables that can help you identify different customer groups.

In the next post, I will introduce how to segment our customers based on the PCs we obtained using a clustering method.

Lastly, #HappyInternationalWomensDay to all the amazing superwomen out there! Thanks for reading! If you enjoyed it, give it a clap.

Feel free to connect with me on LinkedIn!
