# Principal Component Analysis Deciphered

Handling the curse of dimensionality

Authors: Vaishnavi Malhotra and Neda Zolaktaf (Mar 14)

source: https://www.freepik.com/free-photos-vectors/background

In machine learning, we often have to deal with high-dimensional data.

But not all of the features that we use in our model may in fact be related to the response variable.

Adding many features in the hope that our model will learn better and give more accurate results often leads to a problem generally referred to as ‘the Curse of Dimensionality’, which states:

As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.

To overcome this problem we need to identify the most important features in our dataset.

One such method to identify the principal features from the dataset, thereby reducing the dimensionality of the dataset, is Principal Component Analysis (PCA).

source: https://www.

com/watch?v=f1fXCRtSUWU

In the video above, consider how a larger picture of the dog is repeatedly shredded and re-attached to form four smaller pictures of the same dog.

Intuitively, selecting the right features would result in a lower-dimensional form without losing much information.

PCA emphasizes the variation in a dataset and brings out its dominant patterns.

## What exactly is PCA?

PCA takes in a large set of variables and uses the dependencies between them to represent the data in a more manageable, lower-dimensional form, without losing too much information.

PCA serves as a good tool for data exploration and is often done as part of exploratory data analysis (EDA).

Suppose we have n observations and d variables in our dataset and we wish to study the relationship between different variables as part of EDA.

For a large value of d, say 60, we get d(d-1)/2 two-dimensional scatter plots.

Such a huge number of plots (1770, in this case) makes it difficult to identify the relationships between features.

Further, these 2D plots contain only a fraction of the total information present in the dataset.
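The plot count quoted above is simply the number of unordered pairs of variables. A minimal sketch of that arithmetic, using only the Python standard library (the variable names are illustrative):

```python
from itertools import combinations
from math import comb

d = 60  # number of variables, as in the example above

# Each unordered pair of variables gives one 2D scatter plot.
n_plots = comb(d, 2)

# The same count, obtained by listing the pairs explicitly.
pairs = list(combinations(range(d), 2))

print(n_plots, len(pairs))
```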

This is when PCA comes into the picture.

PCA is a technique for feature extraction: it combines the input variables in a specific way, then gets rid of the “least important” combinations while still retaining the most valuable parts (the principal components) of all of the variables.

## Principal Components you say?

A principal component is a normalized linear combination of the original features in the dataset.

Suppose we start with d-dimensional vectors and want to summarize them by projecting down into a k-dimensional subspace such that the axes of the new subspace point into the directions of the highest variance of the data.

Our final result is the projection of the original vectors onto k directions, termed the principal components (PCs).

Fig. 1: Plot between Ad Spending (in 1000s) and Population (in 10,000s) taken from a subset of the advertising data (ISLR) for 100 cities. The blue dot denotes the mean (μ).

As evident from the plot (Fig. 1), the first principal component direction (the green solid line) has the maximum data variance, and it also defines the line that is closest to all n of the observations.

The first principal component captures most of the information contained in the features: the larger the variability captured by the first PC, the more information that component carries.

Fig. 2: First and second principal components in a subset of the advertising data (ISLR).

The direction of the second principal component is given by the blue dotted line (Fig. 2).

It is also a linear combination of the original features, chosen to capture as much of the remaining variance as possible while being uncorrelated with the first principal component; as a result, their directions are orthogonal (perpendicular) to each other.
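The orthogonality and zero-correlation properties described above can be checked numerically. The following is a minimal sketch on synthetic two-feature data (not the advertising data from the figures), using scikit-learn's `PCA`; the seed, sample size, and coefficients are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, correlated 2D data (illustrative only, not the ISLR data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(scale=0.3, size=200)
X = np.column_stack([x, y])

pca = PCA(n_components=2).fit(X)
pc1, pc2 = pca.components_      # each row is a unit-length direction
scores = pca.transform(X)       # coordinates of the data along PC1 and PC2

# Dot product of the two directions: close to 0 if they are orthogonal
dot = float(pc1 @ pc2)

# Correlation between the two sets of scores: close to 0 as well
corr = float(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])

print(dot, corr, pca.explained_variance_ratio_)
```

The first entry of `explained_variance_ratio_` is the share of variance along the first principal component, which is never smaller than the second entry.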

Similarly, for d features in our dataset, we can construct up to d distinct principal components.

## But how many principal components do we need?

Choosing the right number of principal components is essential to ensure that PCA is effective.

A dataset containing n observations and d features has at most min(n − 1, d) distinct principal components.

But we are only interested in the first few important components that are enough to explain a good amount of variation in the dataset.

One way to determine this is to look at the cumulative explained variance ratio as a function of the number of components.

A scree plot depicts the proportion of variance explained by each of the principal components.

An elbow in the plot, where additional components stop adding much explained variance, suggests a suitable number of principal components.

Fig. 3: Cumulative explained variance ratio after PCA on LFW face recognition dataset.

The curve shown in Fig. 3 quantifies how much of the total 200-dimensional variance is contained within the first n components.

For example, with the faces, the first 40 components contain more than 80% of the variance, while we need around 150 components to describe close to 100% of the variance.
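A curve like this can be computed directly from a fitted model. The sketch below is an illustration rather than the code behind Fig. 3: it assumes scikit-learn's bundled digits dataset (chosen because it needs no download) and a hypothetical 95% target, and it picks the smallest number of components that reaches that target.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data          # 1797 samples, 64 features
pca = PCA().fit(X)              # keep all components

# Running total of the variance explained by the first 1, 2, ... components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
threshold = 0.95                # hypothetical target, chosen for illustration
k = int(np.searchsorted(cumulative, threshold) + 1)

print(k, cumulative[k - 1])
```

Plotting `cumulative` against the component index gives the kind of curve shown in Fig. 3. scikit-learn's `PCA` also accepts a fraction between 0 and 1 as `n_components` to select the number of components by explained variance.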

## Where would you use PCA?

PCA has been widely used in many domains, such as computer vision and image compression.

It is mainly used for the following applications:

- Data visualization: PCA allows you to visualize high-dimensional objects in a lower-dimensional space.
- Partial least squares: PCA features can be used as the basis for a linear model in partial least squares.
- Dimensionality reduction: PCA reduces the dimensionality of the features, losing only a small amount of information.
- Outlier detection (improving data quality): PCA projects a set of variables onto fewer dimensions and highlights extraneous values.

## How is PCA formulated though?

Given a matrix X, which corresponds to n observations with d features, and an input k, the main objective of PCA is to decompose X into two smaller matrices, Z and W, such that X ≈ ZW, where Z has dimensions n × k and W has dimensions k × d (see Fig. 4).

Each row of W is called a principal component.

Fig. 4: PCA decomposes matrix X into two smaller matrices, Z and W.
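One way to obtain such a Z and W is a singular value decomposition of the data. The following is a minimal NumPy sketch under stated assumptions: the data is synthetic, and it is centered first (the usual convention, though not spelled out in the text above). It builds W from the top k right singular vectors, computes Z from W, and measures the squared error of the approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 5, 2

# Synthetic correlated data (illustrative only)
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
Xc = X - X.mean(axis=0)          # assumption: center each feature first

# Rows of Vt are orthonormal directions, sorted by decreasing singular value
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

W = Vt[:k]                       # k x d: each row is a principal component
Z = Xc @ W.T                     # n x k: coordinates of each observation

X_hat = Z @ W                    # rank-k approximation of the centered data
error = np.sum((X_hat - Xc) ** 2)

print(Z.shape, W.shape, error)
```

Because the rows of W are orthonormal, the squared error equals the sum of the squared singular values that were left out, which is why keeping the largest ones gives the best rank-k approximation.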

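In PCA, we minimize the squared error between X and ZW, that is, the following objective function:

$$f(W, Z) = \lVert ZW - X \rVert_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{d} \left( \langle w^j, z_i \rangle - x_{ij} \right)^2$$

where $z_i$ is row i of Z and $w^j$ is column j of W.

There are three common approaches to solve PCA, which we describe below.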

1. Singular Value Decomposition (SVD): This approach first uses the Singular Value Decomposition (SVD) algorithm to find an orthogonal W. Then it uses the orthogonal W to compute Z as $Z = XW^\top$.

2. Alternating Minimization: This is an iterative approach that alternates between:
   - Fixing Z, and finding optimal values for W
   - Fixing W, and finding optimal values for Z

3. Stochastic Gradient Descent: This is an iterative approach, for when the matrix X is very big. On each iteration, it picks a random example i and feature j and updates W and Z with a gradient step on the squared error of the entry x_ij.

## PCA in action: Feature Reduction

We already know that, by definition, PCA eliminates the less important features and helps produce visual representations of the data.

Let’s see how this really applies to a feature reduction problem in practice.

For this example, we will use the Iris dataset.

The data contains four attributes (sepal length, sepal width, petal length, petal width) across three species, namely Setosa, Versicolor, and Virginica.

After applying PCA, 95% of the variance is captured by 2 principal components.
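The post does not show code for this step, so here is a minimal sketch of how it might look with scikit-learn. It assumes the features are standardized before PCA (an assumption; the preprocessing used for the figure above is not stated) and uses the copy of Iris bundled with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Assumption: scale each feature to zero mean and unit variance first
X = StandardScaler().fit_transform(iris.data)   # 150 x 4

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                         # 150 x 2

# Share of the total variance captured by each of the two components
ratio = pca.explained_variance_ratio_
print(ratio, ratio.sum())
```

The two columns of `Z` can then be plotted against each other, colored by `iris.target`, to view the three species in two dimensions.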

## PCA in action: Feature Extraction

In an earlier example, we saw how PCA can be a useful tool for visualization and feature reduction.

In this example, we will explore PCA as a feature extraction technique.

For this, we will use the LFW facial recognition dataset.

Images contain a large amount of information, and processing all features extracted from such images often requires a huge amount of computational resources.

We address this issue by identifying a combination of the most significant features that accurately describe the dataset.

The dataset is loaded with scikit-learn's `datasets.fetch_lfw_people`.

The dataset consists of 1867 images, each having a 62×47 resolution.

```python
import warnings

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_lfw_people

# Load the faces of people with at least 40 images (downloaded on first use)
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    faces = fetch_lfw_people(min_faces_per_person=40)

# Plot the first 30 images
fig, axes = plt.subplots(3, 10, figsize=(12, 4),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(faces.data[i].reshape(62, 47), cmap='bone')
```

### Applying PCA on the dataset

To produce a quick demo, we simply use scikit-learn’s PCA module to perform dimension reduction on the face dataset and keep the 150 components (eigenfaces) that capture the most variance in the dataset.

```python
from sklearn.decomposition import PCA

# Fit PCA with 150 components on the face images loaded earlier
faces_pca = PCA(n_components=150, svd_solver='randomized').fit(faces.data)

# Plot the first 30 principal components (eigenfaces)
fig, axes = plt.subplots(3, 10, figsize=(12, 4),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(faces_pca.components_[i].reshape(62, 47), cmap='bone')
```

Now we will use the principal components to form projected images of the faces and compare them with the original dataset.

```python
# Project the faces onto the 150 components, then map back to pixel space
components = faces_pca.transform(faces.data)
projected = faces_pca.inverse_transform(components)

# Plot the results: originals on the top row, projections on the bottom row
fig, ax = plt.subplots(2, 15, figsize=(15, 2.5),
                       subplot_kw={'xticks': [], 'yticks': []},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(15):
    ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
    ax[1, i].imshow(projected[i].reshape(62, 47), cmap='binary_r')

ax[0, 0].set_ylabel('complete.resolution')
ax[1, 0].set_ylabel('150-D.projections')
```

As we can see, the principal features extracted using PCA capture most of the variance in the dataset, and thus the projections formed by these 150 principal components are quite close to the images in the original dataset.
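To put a number on how close the projections are, one option is to compare the squared reconstruction error with the variance the model reports as explained. The sketch below is an illustration under stated assumptions: it swaps in scikit-learn's small bundled digits dataset (so it runs without downloading LFW) and uses an arbitrary 20 components with the exact `svd_solver="full"`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                        # 1797 images of 8x8 pixels
pca = PCA(n_components=20, svd_solver="full").fit(X)

# Same round trip as above: project down, then map back to pixel space
projected = pca.inverse_transform(pca.transform(X))

# Fraction of the centered variation lost by the round trip
Xc = X - X.mean(axis=0)
lost = np.sum((X - projected) ** 2) / np.sum(Xc ** 2)

# Fraction of variance the kept components explain
kept = pca.explained_variance_ratio_.sum()

print(lost, 1 - kept)
```

With an exact solver the two printed numbers agree: whatever variance the kept components do not explain is exactly what the reconstruction loses.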

## Things to Remember

Here are some important points one should remember while doing PCA:

- Before doing PCA, data should first be normalized. This is important as different variables in the dataset may be measured in different units. On an un-normalized dataset, the variable with the largest variance dominates the first principal component and receives a disproportionately large loading.
- PCA can be applied only to numerical data. Thus, if the data has categorical variables too, they must be converted to numerical values. Such variables can be represented using a 1-of-N coding scheme without imposing an artificial ordering. However, PCA should NOT be conducted when most of the independent features are categorical. CATPCA can instead be used to convert categories into numeric values through optimal scaling.
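The effect of skipping normalization is easy to see on data whose features use very different units. As a minimal sketch, the example below assumes scikit-learn's bundled wine dataset (not a dataset used elsewhere in this post) and compares PCA on the raw features with PCA after standardization.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data                    # 178 x 13, features in very different units

raw = PCA(n_components=2).fit(X)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Feature with the largest absolute loading on the first PC of the raw data
dominant = wine.feature_names[int(np.argmax(np.abs(raw.components_[0])))]

print(dominant)
print(raw.explained_variance_ratio_[0], scaled.explained_variance_ratio_[0])
```

On the raw data a single large-valued feature accounts for nearly all of the first component, while after standardization the variance is spread across several components.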

## What did we learn?

We started with the curse of dimensionality and discussed how principal component analysis is effective in dimensionality reduction, data visualization in EDA, and feature extraction.

If implemented properly, it can be effective in a wide variety of disciplines.

But PCA also has limitations that must be considered: patterns that are highly correlated may be unresolved because all principal components are uncorrelated, the structure of the data must be linear, and PCA tends to be influenced by outliers in the data.

Other variants of PCA can be explored to tackle these limitations, but let’s leave that for a later time.

## Reads and References

Most of this post is based on the book An Introduction to Statistical Learning. The book also includes code snippets for implementing PCA in R.

Here’s a very good blog: A One-Stop Shop for Principal Component Analysis by Matt Brems.

- https://www.cs.ubc.ca/~schmidtm/Courses/340-F15/
- https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
- http://setosa.io/ev/principal-component-analysis/
- https://en.wikipedia.org/wiki/Principal_component_analysis
- https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

Code reference: https://scikit-learn.org/stable/auto_examples/index.html

Fig 1 and Fig 2 source: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf