Visualizing Principal Component Analysis with Matrix Transformations

A guide to understanding eigenvalues, eigenvectors, and principal components

Andrew Kruger · Jan 20

Principal Component Analysis (PCA) is a method of decomposing data into uncorrelated components by identifying eigenvalues and eigenvectors.

The following is meant to help visualize what these different values represent and how they’re calculated.

First I’ll show how matrices can be used to transform data, then how those matrices are used in PCA.

Matrix Transformations

For each of the following, I will apply matrix transformations to a circle and grid. Let's use this as our "data" image to help visualize what happens with each transformation.

Points on the image can be described by [x,y] coordinates with the origin being at the center of the circle, and we can transform those points by using a 2D transformation matrix.

For each example, I’ll show the transformed data image in blue, with the original data image in green.

Scaling Matrix

A scaling matrix is a diagonal matrix, with all off-diagonal elements being zero:

Sᵥ = [[vₓ, 0], [0, vᵧ]]

If a diagonal element is less than one, it makes the data image smaller in that direction.

If it’s larger than one, it makes the data image larger in that direction.

For example, if we set vₓ=1.2 and vᵧ=0.6, the image will get wider (vₓ>1) and shorter (vᵧ<1).
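As a minimal sketch (assuming NumPy, with hypothetical sample points on a unit circle standing in for the data image), applying this scaling matrix looks like:

```python
import numpy as np

# Scaling matrix with vx = 1.2 and vy = 0.6 (the values from the text).
S = np.array([[1.2, 0.0],
              [0.0, 0.6]])

# Sample points on a unit circle as a stand-in for the "data" image.
angles = np.linspace(0, 2 * np.pi, 100)
circle = np.vstack([np.cos(angles), np.sin(angles)])  # shape (2, 100)

scaled = S @ circle  # wider in x (vx > 1), shorter in y (vy < 1)
```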

Rotation Matrix

A rotation matrix will rotate the data around the origin by an angle θ without changing its shape, and follows:

R(θ) = [[cos θ, −sin θ], [sin θ, cos θ]]

Here I rotate the image by a positive 20°.

Note that it rotates counter-clockwise.
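A quick NumPy sketch of this rotation, using the same 20° angle as above:

```python
import numpy as np

def rotation_matrix(theta):
    """Counter-clockwise rotation by theta (radians)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

R = rotation_matrix(np.deg2rad(20))
p = R @ np.array([1.0, 0.0])  # rotate the point (1, 0) by +20 degrees
```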

Shear Matrix

A shear matrix tilts an axis by having a non-zero off-diagonal element λ.

The larger the λ, the greater the shear.

Here the x-values are shifted to create the shear.

Here the y-values are shifted.
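The two shear directions can be sketched in NumPy (the λ value here is a hypothetical choice):

```python
import numpy as np

lam = 0.5  # hypothetical shear strength; a larger lambda gives a greater shear

# Shear along x: x' = x + lam*y, y' = y (the x-values are shifted)
shear_x = np.array([[1.0, lam],
                    [0.0, 1.0]])

# Shear along y: x' = x, y' = y + lam*x (the y-values are shifted)
shear_y = np.array([[1.0, 0.0],
                    [lam, 1.0]])
```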

Symmetric Matrix

A symmetric matrix will essentially rotate the x- and y-axes in opposite directions.

Being a symmetric matrix only requires that each non-diagonal element i,j is the same as the element j,i.

The sign of the non-diagonal elements determines the direction of the skew.

Also, the diagonal elements are independent of each other, and the matrix is still symmetric regardless of their values.

Matrix Decomposition

A property of symmetric matrices is that they can be broken into three matrices with the relationship:

A = Q D Qᵀ

where Q is an orthogonal matrix (Q⁻¹ = Qᵀ) and D is a diagonal matrix.

Notice that the rotation matrix is orthogonal (R(θ)⁻¹ = R(θ)ᵀ = R(−θ)) and the scaling matrix is diagonal.

This means our symmetric matrix can actually be replaced by a combination of rotation and scaling matrices:

A = R(θ) Sᵥ R(−θ)

So a symmetric transformation of the data image is the same thing as rotating, scaling along the x- and y-axes, then rotating back.

Here I’ll use three transformations (rotate, scale, de-rotate) to make the same final transformation as the symmetric example above.
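We can check this equivalence numerically; here the 20° angle and the scaling values are hypothetical choices:

```python
import numpy as np

theta = np.deg2rad(20)  # hypothetical rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([1.2, 0.6])  # scaling along the rotated axes

# Rotate by -theta (R.T), scale, then rotate back: the result is symmetric.
A = R @ S @ R.T
```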

Another way to think of it: the symmetric matrix A is the same as the scaling matrix Sᵥ, except the scaling is applied at an angle θ relative to the x- and y-axes. The only reason A isn't a diagonal matrix is that it measures the scaling relative to the x- and y-axes instead of the axes it actually scales along.

If we create a new set of axes that are rotated an angle θ (as shown below), and make a scaling matrix that’s measured relative to those axes instead, it would be a diagonal matrix.

These axes that A is scaling along are the principal component axes.

In the diagonal scaling matrix that’s equivalent to A, the diagonal elements are the amount the data extends along the principal component axes.

They describe the shape of the data, telling us if the data is longer or shorter in those different directions.

Those diagonal elements are the eigenvalues.

The rotation matrices contain a set of vectors that give the rotations of the principal component axes.

Those vectors are the eigenvectors.

A single eigenvalue and its corresponding eigenvector give the extent and direction of a principal component.

Example with Data

Now let's find the principal components of a set of random data points.

Let's make some data that is centered at the origin (which is necessary for the matrix transformations to be correct) and tilted 30° from the x-axis (so the slope is Δy/Δx = tan 30°).
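A sketch of generating such data (the standard deviations of 3 and 1 along the tilted axes are my own assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Start with axis-aligned noise: wide along x, narrow along y.
raw = np.vstack([rng.normal(0, 3, n),
                 rng.normal(0, 1, n)])

# Rotate the cloud 30 degrees from the x-axis; it stays centered at the origin.
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
data = R @ raw  # shape (2, n)
```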

A covariance matrix shows the covariance of two vector elements in a dataset.

If the two vector elements vary together, they will have a higher covariance.

If a change in one element is completely independent of another, their covariance goes to zero.

The slope in the data means the x- and y-values are not independent, so the covariance matrix will have non-zero off-diagonal values.

Let’s look at the covariance matrix for the data.

Notice this matrix is symmetric.

This is because the covariance of the i and j elements is the same as the j and i elements (they’re the covariance of the same two elements).

The covariance matrix C can thus be decomposed:

C = V D Vᵀ

where V is a matrix whose columns are the eigenvectors, and D is the diagonal matrix of eigenvalues.

Since we know the angle that we rotated the data, we can calculate the values we would expect to get for the eigenvectors (the columns of V): they should match the columns of the 30° rotation matrix. Next, let's calculate the eigenvalues.
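As a sketch, we can build a covariance matrix with known structure (the variances of 9 and 1 along the tilted axes are hypothetical) and check that NumPy's eigendecomposition recovers the columns of the 30° rotation matrix:

```python
import numpy as np

theta = np.deg2rad(30)
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # expected eigenvectors (columns)
D = np.diag([9.0, 1.0])  # hypothetical variances along the principal axes

C = V @ D @ V.T  # a covariance matrix with known decomposition

# eigh is for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
```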

The eigenvalues will be the variance of the data along the principal component axes.

We can measure these values by de-rotating the data with the eigenvectors then finding the variance in the x- and y-directions.

After de-rotating the data, we can measure the variance in the x- and y-directions. The variance is greatest in the x-direction, so most of the information in the data is in that component.
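A sketch of the de-rotation step, with the data generated under the same assumed standard deviations of 3 and 1:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
data = R @ np.vstack([rng.normal(0, 3, 2000),
                      rng.normal(0, 1, 2000)])

# De-rotate with the transpose of the eigenvector matrix (here R itself),
# then measure the variance along the now-aligned x- and y-axes.
rotated_data = R.T @ data
var_x, var_y = rotated_data.var(axis=1)
```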

This is the first principal component.

Further principal components are ordered based on their variance from greatest to least.

Now that we know what to expect, let’s use scikit-learn’s PCA module and compare the results to ours.

First, let’s print out the principal components.

(The module returns the eigenvectors in rows, so I'll print out the transpose to put them in columns like above.) Notice this is similar to the rotation matrix we calculated.

The columns are the eigenvectors in order of importance, showing the directions of the first component, second component, etc.

Next, let's print out the explained variance. These are consistent with the variances we calculated.

Again, these components are ordered from greatest to least.
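A sketch of the scikit-learn comparison (the data generation mirrors the assumptions above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
data = R @ np.vstack([rng.normal(0, 3, 2000),
                      rng.normal(0, 1, 2000)])

pca = PCA()
pca.fit(data.T)  # scikit-learn expects samples in rows

print(pca.components_.T)        # eigenvectors as columns, ordered by importance
print(pca.explained_variance_)  # eigenvalues, greatest first
```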

Viewing the Components

To get a clearer understanding of the separate eigenvectors and how they rotate the data, let's use them to create images to visually compare the principal components to the data.

In general, we can multiply an eigenvector by any number, and it will give us the x- and y-values for where that point would be along the principal component axis.

Let's use three standard deviations of the noise as that number, which can be calculated from the variance by the relationship σ = √(variance). Using points at ±3σ would be a good indicator of the relative noise of the different principal components.

Let's create an array with the values [−3σ, 3σ] for our first principal component. Then we can multiply that array by the first eigenvector, which will give us the x- and y-components for these two values along the first principal component axis.

Similarly this can be done for other principal components by changing the index used.

By plotting a line between the ±3σ points for each principal component (in matplotlib, simply plt.plot(x_comp, y_comp)), we can view the extent of their noise.

In short, the two red lines show the directions of the two principal components.

Their rotation angles were calculated by the eigenvectors, and their lengths were determined by the eigenvalues to show the 3σ noise range along the axes.
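A sketch of building those ±3σ endpoints (the eigenvalues of 9 and 1 and the 30° angle are the hypothetical values used earlier):

```python
import numpy as np

eigvals = np.array([9.0, 1.0])  # hypothetical variances along the PC axes
theta = np.deg2rad(30)
eigvecs = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])

for i in range(2):
    sigma = np.sqrt(eigvals[i])                 # sigma = sqrt(variance)
    extent = np.array([-3 * sigma, 3 * sigma])  # the [-3*sigma, 3*sigma] array
    x_comp, y_comp = np.outer(eigvecs[:, i], extent)
    # plt.plot(x_comp, y_comp) would draw this principal-component line
```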

Dimensionality Reduction

For a better understanding of the importance of eigenvalues, let's use them in an example of dimensionality reduction.

If you want to see how much of the total variance is explained by the different components, you can divide each eigenvalue by the total sum of the eigenvalues.

This means 94.6% of the explained variance is in the first component.

However, there's a faster way to get the ratio of explained variance: scikit-learn's explained_variance_ratio_ attribute. The second principal component accounts for 5.4% of the variance in the data, and could be mainly noise.
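The ratio calculation, sketched with hypothetical eigenvalues:

```python
import numpy as np

eigvals = np.array([9.0, 1.0])  # hypothetical eigenvalues
ratio = eigvals / eigvals.sum()

# With scikit-learn, pca.explained_variance_ratio_ returns this directly.
```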

If we want to do dimensionality reduction, we can remove the second principal component and keep only the information along the first principal component.

First I’ll show how this is done with eigenvectors, then how it’s easily done with scikit-learn.

If we start with the rotated data (rotated_data above), where the principal components are aligned with the x- and y-axes, the x-components of the points are actually the first principal component.

We can rotate the principal component back into the original direction by using our first eigenvector, the same as how we plotted the ±3σ lines above.

Here’s first_component plotted along with the rotated data.

Here’s the first_component_xy plotted along with the original data.

The data points now lie along the first principal component axis.
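A sketch of the manual version (the rotated_data sample here is a hypothetical stand-in):

```python
import numpy as np

theta = np.deg2rad(30)
eigvec1 = np.array([np.cos(theta), np.sin(theta)])  # first eigenvector

# Hypothetical de-rotated data: PCs aligned with the x- and y-axes.
rotated_data = np.array([[3.0, -2.0, 1.0],    # first principal component
                         [0.2, -0.1, 0.3]])   # second component (mostly noise)

first_component = rotated_data[0]                        # keep only PC 1
first_component_xy = np.outer(eigvec1, first_component)  # back in x,y coordinates
```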

To get the first principal component with scikit-learn, simply set the number of components to 1 and transform the data.

The arrays first_component and first_component_xy will be the same as shown above.
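The scikit-learn version, sketched under the same data assumptions as earlier:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
data = (R @ np.vstack([rng.normal(0, 3, 500),
                       rng.normal(0, 1, 500)])).T  # samples in rows

pca = PCA(n_components=1)
first_component = pca.fit_transform(data)                    # shape (500, 1)
first_component_xy = pca.inverse_transform(first_component)  # back in x,y
```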

Summary

Principal Component Analysis is easier to understand when compared to basic matrix transformations.

We can consider the covariance matrix as being decomposed into rotation and scaling matrices.

Rotation Matrices: These matrices rotate data without altering its shape.

Similarly, eigenvectors are used to “rotate” the data into a new coordinate system so the correlated features are aligned with the new axes (the principal component axes).

Scaling Matrices: These diagonal matrices scale the data along the different coordinate axes.

Similarly, a diagonal matrix of eigenvalues gives a measure of the data variance (their scale) along the different principal component axes.

Finally, dimensionality reduction is the same as first rotating the data with the eigenvectors to align it with the principal component axes, then keeping only the components with the greatest eigenvalues.