The number of input variables or features for a dataset is referred to as its dimensionality.
Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.
More input features often make a predictive modeling task more challenging, a problem generally referred to as the curse of dimensionality.
Although dimensionality reduction techniques from high-dimensionality statistics are often used for data visualization, they can also be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.
In this post, you will discover a gentle introduction to dimensionality reduction for machine learning.
Let’s get started.
A Gentle Introduction to Dimensionality Reduction for Machine Learning
Photo by Kevin Jarrett, some rights reserved.
This tutorial is divided into three parts.
The performance of machine learning algorithms can degrade with too many input variables.
If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input to a model to predict the target variable.
Input variables are also called features.
We can consider the columns of data as representing the dimensions of an n-dimensional feature space and the rows of data as points in that space.
This is a useful geometric interpretation of a dataset.
Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, the points that we have in that space (rows of data) often represent a small and non-representative sample.
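To make the sparsity argument concrete, here is a minimal sketch, assuming points drawn uniformly at random from a unit hypercube (an illustrative setup, not a dataset from this post). The helper name `distance_contrast` is hypothetical. It measures how the nearest and farthest points from a query point become almost equally far away as the number of dimensions grows.

```python
import numpy as np


def distance_contrast(n_points, n_dims, seed=0):
    """Ratio of nearest to farthest distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # rows of data in the unit hypercube
    query = rng.random(n_dims)               # a single new point
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()


# Same number of rows, increasing number of columns (dimensions).
contrasts = {d: distance_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, ratio in contrasts.items():
    print(f"{d:>5} dims: nearest/farthest distance ratio = {ratio:.3f}")
```

A ratio close to 1 means every point is roughly as far from the query as every other point, which is one way the same number of rows becomes a less informative sample in a larger space.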
This can dramatically impact the performance of machine learning algorithms fit on data with many input features, generally referred to as the “curse of dimensionality.”
Therefore, it is often desirable to reduce the number of input features.
This reduces the number of dimensions of the feature space, hence the name “dimensionality reduction.”
Dimensionality reduction refers to techniques for reducing the number of input variables in training data.
When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data.
This is called dimensionality reduction.
— Page 11, Machine Learning: A Probabilistic Perspective, 2012.
High-dimensionality might mean hundreds, thousands, or even millions of input variables.
Fewer input dimensions often mean correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom.
A model with too many degrees of freedom is likely to overfit the training dataset and therefore may not perform well on new data.
It is desirable to have simple models that generalize well, and in turn, input data with few input variables.
This is particularly true for linear models where the number of inputs and the degrees of freedom of the model are often closely related.
The fundamental reason for the curse of dimensionality is that high-dimensional functions have the potential to be much more complicated than low-dimensional ones, and that those complications are harder to discern.
The only way to beat the curse is to incorporate knowledge about the data that is correct.
— Page 15, Pattern Classification, 2000.
Dimensionality reduction is a data preparation technique performed on data prior to modeling.
It might be performed after data cleaning and data scaling and before training a predictive model.
… dimensionality reduction yields a more compact, more easily interpretable representation of the target concept, focusing the user’s attention on the most relevant variables.
— Page 289, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
As such, any dimensionality reduction performed on training data must also be performed on new data, such as a test dataset, validation dataset, and data when making a prediction with the final model.
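The requirement above can be sketched as follows, assuming scikit-learn is available and using a synthetic dataset and PCA purely as stand-ins (neither is prescribed here). The reduction is learned from the training data only, and the same fitted transform is then applied to the held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Synthetic dataset, an assumption for illustration only.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit the reduction on the training data only.
reducer = PCA(n_components=5)
X_train_reduced = reducer.fit_transform(X_train)

# Reuse the already-fitted reduction on new data; do not refit it.
X_test_reduced = reducer.transform(X_test)

print(X_train.shape, "->", X_train_reduced.shape)
print(X_test.shape, "->", X_test_reduced.shape)
```

Refitting the reduction on the test data would produce a different mapping, so the model would see inputs that do not match what it was trained on.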
There are many techniques that can be used for dimensionality reduction.
In this section, we will review the main techniques.
Perhaps the most common are so-called feature selection techniques that use scoring or statistical methods to select which features to keep and which features to delete.
… perform feature selection, to remove “irrelevant” features that do not help much with the classification problem.
— Page 86, Machine Learning: A Probabilistic Perspective, 2012.
Two main classes of feature selection techniques include wrapper methods and filter methods.
Wrapper methods, as the name suggests, wrap a machine learning model, fitting and evaluating the model with different subsets of input features and selecting the subset that results in the best model performance.
Recursive feature elimination (RFE) is an example of a wrapper feature selection method.
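As a minimal sketch of a wrapper method, the example below uses scikit-learn's `RFE` class around a logistic regression model. The synthetic dataset, the choice of wrapped model, and the number of features to keep are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with 3 informative columns out of 10 (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=1
)

# RFE repeatedly fits the wrapped model and drops the weakest feature.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_selected = rfe.fit_transform(X, y)

print("Kept columns:", rfe.support_)
print(X.shape, "->", X_selected.shape)
```

Because the selected columns are original columns, the result keeps its column names and stays easy to interpret.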
Filter methods use scoring methods, like correlation between the feature and the target variable, to select a subset of input features that are most predictive.
Examples include Pearson’s correlation and Chi-Squared test.
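A filter method can be sketched with nothing more than NumPy. The example below scores each column by the absolute value of its Pearson correlation with the target and keeps the highest-scoring columns. The synthetic regression dataset and the choice of `k = 3` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# Synthetic regression dataset (illustrative only).
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=0.1, random_state=1
)

# Score each input column by |Pearson correlation| with the target.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Keep the k highest-scoring columns, in their original order.
k = 3
top_k = np.sort(np.argsort(scores)[-k:])
X_selected = X[:, top_k]

print("Scores:", np.round(scores, 3))
print("Kept column indexes:", top_k)
```

Unlike a wrapper method, no model is fit here; each feature is scored on its own against the target.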
Techniques from linear algebra can be used for dimensionality reduction.
Specifically, matrix factorization methods can be used to reduce a dataset matrix into its constituent parts.
Examples include the eigendecomposition and singular value decomposition.
The parts can then be ranked, and the subset of parts that best captures the salient structure of the matrix can be selected and used to represent the dataset.
The most common method for ranking the components is principal components analysis, or PCA for short.
The most common approach to dimensionality reduction is called principal components analysis or PCA.
— Page 11, Machine Learning: A Probabilistic Perspective, 2012.
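The factorize, rank, and select steps can be sketched with NumPy's singular value decomposition. This is a minimal illustration of the idea behind PCA under the assumption of a small random dataset with correlated columns, not a replacement for a library implementation.

```python
import numpy as np

# Random data with correlated columns (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

# 1. Center the columns, then factorize the data matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 2. Rank the components: singular values are returned largest first.
explained_variance_ratio = S**2 / np.sum(S**2)

# 3. Select the top k components and project the data onto them.
k = 2
X_reduced = X_centered @ Vt[:k].T

print("Explained variance ratio:", np.round(explained_variance_ratio, 3))
print(X.shape, "->", X_reduced.shape)
```

Each new column is a linear combination of all of the original columns, which is why the projected features no longer correspond to named input variables.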
Techniques from high-dimensionality statistics can also be used for dimensionality reduction.
In mathematics, a projection is a kind of function or mapping that transforms data in some way.
— Page 304, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
These techniques are sometimes referred to as “manifold learning” and are used to create a low-dimensional projection of high-dimensional data, often for the purposes of data visualization.
The projection is designed to both create a low-dimensional representation of the dataset whilst best preserving the salient structure or relationships in the data.
The features in the projection often have little relationship with the original columns, e.g. they do not have column names, which can be confusing to beginners.
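As a minimal sketch of a manifold learning projection, the example below uses Isomap from scikit-learn on a synthetic "swiss roll" dataset. Both the technique and the dataset are choices made here for illustration, and the neighborhood size is an untuned assumption.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 3D points lying on a rolled-up 2D sheet (illustrative only).
X, position = make_swiss_roll(n_samples=300, random_state=1)

# Learn a 2D projection that tries to preserve distances along the sheet.
embedding = Isomap(n_neighbors=10, n_components=2)
X_2d = embedding.fit_transform(X)

print(X.shape, "->", X_2d.shape)
```

The two output columns are coordinates in the learned projection; they are not any of the three original columns, which is the interpretability trade-off described above.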
Deep learning neural networks can be constructed to perform dimensionality reduction.
A popular approach is called autoencoders.
This involves framing a self-supervised learning problem where a model must reproduce the input correctly.
A network model is used that seeks to compress the data flow to a bottleneck layer with far fewer dimensions than the original input data.
The part of the model prior to and including the bottleneck is referred to as the encoder, and the part of the model that reads the bottleneck output and reconstructs the input is called the decoder.
An auto-encoder is a kind of unsupervised neural network that is used for dimensionality reduction and feature discovery.
More precisely, an auto-encoder is a feedforward neural network that is trained to predict the input itself.
— Page 1000, Machine Learning: A Probabilistic Perspective, 2012.
After training, the decoder is discarded and the output from the bottleneck is used directly as the reduced-dimensionality representation of the input.
Inputs transformed by this encoder can then be fed into another model, not necessarily a neural network model.
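The encoder, bottleneck, and decoder can be sketched without a deep learning library. The example below is a small stand-in, assuming scikit-learn's `MLPRegressor` trained to reproduce its own input. The layer sizes, the `tanh` activation, and the hypothetical `encode` helper are illustrative choices, and a real autoencoder would more typically be built with a dedicated neural network library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic inputs, standardized before training (illustrative only).
X, _ = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=1)
X = StandardScaler().fit_transform(X)

# Layers: 10 inputs -> 8 -> 3 (bottleneck) -> 8 -> 10 outputs.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(8, 3, 8), activation="tanh", max_iter=2000, random_state=1
)
autoencoder.fit(X, X)  # the target is the input itself


def encode(model, data):
    """Run only the encoder half: input -> first hidden layer -> bottleneck."""
    activation = data
    for weights, bias in zip(model.coefs_[:2], model.intercepts_[:2]):
        activation = np.tanh(activation @ weights + bias)
    return activation


# The decoder layers are ignored; the bottleneck output is the reduced input.
X_encoded = encode(autoencoder, X)
print(X.shape, "->", X_encoded.shape)
```

The array `X_encoded` can then be passed to any other model in place of the original ten columns.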
Deep autoencoders are an effective framework for nonlinear dimensionality reduction.
Once such a network has been built, the top-most layer of the encoder, the code layer hc, can be input to a supervised classification procedure.
— Page 448, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
The output of the encoder is a type of projection, and like other projection methods, there is no direct relationship from the bottleneck output back to the original input variables, making the encoded features challenging to interpret.
There is no best technique for dimensionality reduction and no mapping of techniques to problems.
Instead, the best approach is to use systematic controlled experiments to discover what dimensionality reduction techniques, when paired with your model of choice, result in the best performance on your dataset.
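Such an experiment can be sketched as a loop over candidate reductions, each paired with the same model and evaluated with the same cross-validation procedure. The dataset, the candidates, the model, and the number of retained features below are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic dataset (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=5, random_state=1
)

# Candidate reductions, including a baseline with no reduction at all.
candidates = {
    "none": "passthrough",
    "pca": PCA(n_components=5),
    "select_k_best": SelectKBest(score_func=f_classif, k=5),
}

results = {}
for name, reducer in candidates.items():
    # The reduction sits inside the pipeline so it is refit on each training fold.
    pipeline = Pipeline([("reduce", reducer), ("model", LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=5)
    results[name] = scores.mean()
    print(f"{name:>14}: mean accuracy = {results[name]:.3f}")
```

Keeping the reduction inside the pipeline means it never sees the held-out fold, so the comparison between candidates is fair.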
Typically, linear algebra and manifold learning methods assume that all input features have the same scale or distribution.
This suggests that it is good practice to either normalize or standardize data prior to using these methods if the input variables have differing scales or units.
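The effect of differing scales can be sketched with PCA on a small random dataset in which one column is deliberately expressed in much larger units. The data and the scale factor are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Three independent columns; the first is in much larger units.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 0] *= 1000.0

# Without scaling, the first component is dominated by the large-unit column.
raw_pca = PCA(n_components=1).fit(X)
raw_loading = np.abs(raw_pca.components_[0])

# Standardizing first puts every column on the same footing.
scaled = make_pipeline(StandardScaler(), PCA(n_components=1)).fit(X)
scaled_loading = np.abs(scaled.named_steps["pca"].components_[0])

print("Loadings without scaling:", np.round(raw_loading, 3))
print("Loadings with scaling:   ", np.round(scaled_loading, 3))
```

Placing the scaler and the reduction in one pipeline also ensures both are fit on the training data and applied unchanged to new data.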
In this post, you discovered a gentle introduction to dimensionality reduction for machine learning.