Principal Component Analysis (PCA) 101, using R

PCA can reduce dimensionality, but it won’t reduce the number of features / variables in your data.

What this means is that you might discover that you can explain 99% of the variance in your 1000-feature dataset using just 3 principal components, but you still need those 1000 features to construct those 3 principal components. This also means that when predicting on future data, you still need those same 1000 features on your new observations to construct the corresponding principal components.
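To make that concrete, here is a minimal sketch of projecting new observations onto components that were fitted earlier. It uses R’s built-in mtcars data as a stand-in (an assumption, it is not the tumor data used later in this article), and the split into "training" and "new" rows is purely illustrative.

```r
# Stand-in data (assumption): the built-in mtcars set with 11 numeric features.
train   <- mtcars[1:25, ]
new_obs <- mtcars[26:32, ]

# Fit PCA on the training rows only.
pr <- prcomp(train, center = TRUE, scale. = TRUE)

# predict() re-applies the training centering, scaling and rotation,
# so new_obs must contain every original feature column.
new_scores <- predict(pr, newdata = new_obs)

# Only now do we drop down to the first two components.
new_scores_2d <- new_scores[, 1:2]
dim(new_scores_2d)

# Removing a feature from the new data makes the projection fail.
res <- try(predict(pr, newdata = new_obs[, -1]), silent = TRUE)
inherits(res, "try-error")
```

The point of the last two lines is that the reduction happens after the projection: the full set of original features is still required as input.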

Right, right, enough of that. How does it work?

Since this is purely introductory I’ll skip the math and give you a quick rundown of the workings of PCA:

1. Standardize the data (center and scale).

2. Calculate the eigenvectors and eigenvalues from the covariance matrix or correlation matrix (one could also use Singular Value Decomposition).

3. Sort the eigenvalues in descending order and choose the K eigenvectors that correspond to the K largest eigenvalues (where K is the desired number of dimensions of the new feature subspace, K ≤ d, and d is the number of original dimensions).

4. Construct the projection matrix W from the selected K eigenvectors.

5. Transform the original dataset X via W to obtain a K-dimensional feature subspace Y.
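The steps above can be sketched by hand in a few lines of base R. This is only an illustration under stated assumptions: it uses the built-in mtcars data as a stand-in for X, picks K = 2 arbitrarily, and compares the result with prcomp() at the end.

```r
# Stand-in data (assumption): built-in mtcars, not the tumor data.
X <- as.matrix(mtcars)

# Step 1: standardize, i.e. center each column and scale it to unit variance.
Z <- scale(X, center = TRUE, scale = TRUE)

# Step 2: eigendecomposition of the covariance matrix of the standardized
# data (this equals the correlation matrix of X).
eig <- eigen(cov(Z))

# Step 3: eigen() returns the eigenvalues of a symmetric matrix in
# descending order, so the first K eigenvectors are the ones we want.
K <- 2

# Step 4: projection matrix W with the K eigenvectors as columns.
W <- eig$vectors[, 1:K]

# Step 5: project the standardized data onto the K-dimensional subspace.
Y <- Z %*% W

# Compare with prcomp(). The sign of a component is arbitrary,
# so the scores are compared in absolute value.
pr <- prcomp(X, center = TRUE, scale. = TRUE)
all.equal(abs(unname(Y)), abs(unname(pr$x[, 1:K])))
all.equal(eig$values, pr$sdev^2)
```

Note that prcomp() itself works through a singular value decomposition rather than eigen(), which is why the comparison uses all.equal() instead of exact equality.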

This might sound a bit complicated if you haven’t had a few courses in algebra, but the gist of it is to transform our data from its initial state X to a subspace Y with K dimensions, where K is, more often than not, less than the original dimensions of X.

Thankfully this is easily done using R!

PCA on our tumor data

So now we understand a bit about how PCA works and that should be enough for now.

Let’s actually try it out:

wdbc.pr <- prcomp(wdbc[c(3:32)], center = TRUE, scale. = TRUE)
summary(wdbc.pr)

This is pretty self-explanatory: the ‘prcomp’ function runs PCA on the data we supply it, in our case ‘wdbc[c(3:32)]’, which is our data excluding the ID and diagnosis variables. We then tell R to center and scale our data (thus standardizing it).

Finally we call for a summary:

[Figure: The values of the first 10 principal components]

Recall that a property of PCA is that our components are sorted from largest to smallest with regard to their standard deviation (the square root of their eigenvalues).

So let’s make sense of these:

Standard deviation: This is the square root of each component’s eigenvalue. Since the data has been centered and scaled (standardized), squaring these values gives the eigenvalues of the correlation matrix.

Proportion of Variance: This is the amount of variance the component accounts for in the data, i.e. PC1 alone accounts for >44% of the total variance in the data!

Cumulative Proportion: This is simply the accumulated amount of explained variance, i.e. if we used the first 10 components we would be able to account for >95% of the total variance in the data.
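The three rows of the summary can be rebuilt by hand from the sdev element of the prcomp object, which may help to see how they relate. A minimal sketch, again using the built-in mtcars data as a stand-in (an assumption) rather than the tumor data:

```r
# Stand-in data (assumption): built-in mtcars.
pr <- prcomp(mtcars, center = TRUE, scale. = TRUE)

std_dev     <- pr$sdev                        # "Standard deviation" row
eigenvalues <- std_dev^2                      # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues) # "Proportion of Variance" row
cum_prop    <- cumsum(prop_var)               # "Cumulative Proportion" row

manual <- rbind("Standard deviation"     = std_dev,
                "Proportion of Variance" = prop_var,
                "Cumulative Proportion"  = cum_prop)
colnames(manual) <- colnames(pr$x)

# Hand-built table next to the one that summary() reports.
round(manual, 5)
summary(pr)$importance
```

The two tables should agree up to the rounding that summary() applies to the proportions.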

Right, so how many components do we want?

We obviously want to be able to explain as much variance as possible, but to do that we would need all 30 components. At the same time we want to reduce the number of dimensions, so we definitely want fewer than 30!

Since we standardized our data and we now have the corresponding eigenvalues of each PC, we can actually use these to draw a boundary for us.

Since an eigenvalue <1 would mean that the component actually explains less than a single explanatory variable, we would like to discard those components.

If our data is well suited for PCA we should be able to discard these components while retaining at least 70–80% of cumulative variance.
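As a rough sketch of that rule in code: with standardized data every original variable contributes a variance of 1, so the eigenvalues sum to the number of variables and an eigenvalue of 1 is a natural cut-off. The example below uses the built-in mtcars data as a stand-in (an assumption), and the variable names are made up for illustration.

```r
# Stand-in data (assumption): built-in mtcars with 11 variables.
pr <- prcomp(mtcars, center = TRUE, scale. = TRUE)
eigenvalues <- pr$sdev^2

# For standardized data the eigenvalues add up to the number of variables.
sum(eigenvalues)

# Indices of the components whose eigenvalue exceeds 1.
keep <- which(eigenvalues > 1)
keep

# Share of the total variance retained by the kept components.
retained <- sum(eigenvalues[keep]) / sum(eigenvalues)
retained
```

If `retained` comes out well below the 70–80% mentioned above, that is a hint that the eigenvalue cut-off alone is too aggressive for the data at hand.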

Let’s plot and see:

screeplot(wdbc.pr, type = "l", npcs = 15, main = "Screeplot of the first 15 PCs")
abline(h = 1, col = "red", lty = 5)
legend("topright", legend = c("Eigenvalue = 1"), col = c("red"), lty = 5, cex = 0.6)

cumpro <- cumsum(wdbc.pr$sdev^2 / sum(wdbc.pr$sdev^2))
plot(cumpro[1:15], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 6, col = "blue", lty = 5)
abline(h = 0.88759, col = "blue", lty = 5)
legend("topleft", legend = c("Cut-off @ PC6"), col = c("blue"), lty = 5, cex = 0.6)

[Figure: Screeplot of the eigenvalues of the first 15 PCs (left) & cumulative variance plot (right)]

We notice that the first 6 components each have an eigenvalue >1 and together explain almost 90% of the variance. This is great! We can effectively reduce dimensionality from 30 to 6 while only “losing” about 10% of the variance!

We also notice that we can actually explain more than 60% of the variance with just the first two components.

Let’s try plotting these:

plot(wdbc.pr$x[, 1], wdbc.pr$x[, 2], xlab = "PC1 (44.3%)", ylab = "PC2 (19%)", main = "PC1 / PC2 - plot")

Alright, this isn’t really too telling, but consider for a moment that this represents 60%+ of the variance in a 30-dimensional dataset.

But what do we see from this? There’s some clustering going on in the upper/middle-right.

Let’s also consider for a moment what the goal of this analysis actually is.

We want to explain the difference between malignant and benign tumors.

Let’s actually add the response variable (diagnosis) to the plot and see if we can make better sense of it:

library("factoextra")
fviz_pca_ind(wdbc.pr, geom.ind = "point", pointshape = 21, pointsize = 2,
  fill.ind = wdbc$diagnosis, col.ind = "black", palette = "jco",
  addEllipses = TRUE, label = "var", col.var = "black", repel = TRUE,
  legend.title = "Diagnosis") +
  ggtitle("2D PCA-plot from 30 feature dataset") +
  theme(plot.title = element_text(hjust = 0.5))

This is essentially the exact same plot with some fancy ellipses and colors corresponding to the diagnosis of the subject, and now we see the beauty of PCA.

With just the first two components we can clearly see some separation between the benign and malignant tumors.

This is a clear indication that the data is well-suited for some kind of classification model (like discriminant analysis).
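If factoextra is not available, the same idea (run PCA on the features only, then colour the scores by the class label) can be sketched in base R. The example below is an illustration under an assumption: it uses the built-in iris data, with Species standing in for the diagnosis column.

```r
# Stand-in data (assumption): iris, with Species in the role of diagnosis.
features <- iris[, 1:4]    # numeric predictors only
response <- iris$Species   # class label, kept out of the PCA

pr <- prcomp(features, center = TRUE, scale. = TRUE)

# Scatter plot of the first two components, one colour per class.
plot(pr$x[, 1], pr$x[, 2],
     col = as.integer(response), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PC1 / PC2 by class")
legend("topright", legend = levels(response),
       col = seq_along(levels(response)), pch = 19, cex = 0.6)

# A rough numeric look at the separation: mean PC1 score per class.
pc1_means <- tapply(pr$x[, 1], response, mean)
pc1_means
```

Class means that sit far apart on PC1 tell the same story as well-separated clouds in the plot, although neither is a substitute for fitting and validating an actual classifier.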

What’s next?

Our next immediate goal is to construct some kind of model using the first 6 principal components to predict whether a tumor is benign or malignant and then compare it to a model using the original 30 variables.

We’ll take a look at this in the next article (which should be up tomorrow or the day after)!

Thanks for reading!