Machine learning datasets are often structured or tabular data comprised of rows and columns.

The columns that are fed as input to a model are called predictors or “p” and the rows are samples “n“.

Most machine learning algorithms assume that there are many more samples than there are predictors, denoted as p << n.

Sometimes, this is not the case, and there are many more predictors than samples in the dataset, referred to as “big-p, little-n” and denoted as p >> n.

These problems often require specialized data preparation and modeling algorithms to address them correctly.

In this tutorial, you will discover the challenge of big-p, little n or p >> n machine learning problems.

After completing this tutorial, you will know:Let’s get started.

How to Handle Big-p, Little-n (p >> n) in Machine LearningPhoto by Phil Dolby, some rights reserved.

This tutorial is divided into three parts; they are:Consider a predictive modeling problem, such as classification or regression.

The dataset is structured data or tabular data, like what you might see in an Excel spreadsheet.

There are columns and rows.

Most of the columns would be used as inputs to a model and one column would represent the output or variable to be predicted.

The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables.

The output variable—in this case, sales—is often called the response or dependent variable, and is typically denoted using the symbol Y.

— Page 15, An Introduction to Statistical Learning with Applications in R, 2017.

Each column represents a variable or one aspect of a sample.

The columns that represent the inputs to the model are called predictors.

Each row represents one sample with values across each of the columns or features.

It is common to describe a training dataset in machine learning in terms of the predictors and samples.

The number of predictors in a dataset is described using the term “p” and the number of samples in a dataset is described using the term “n” or sometimes “N“.

To make this concrete, let’s take a look at the iris flowers classification problem.

Below is a sample of the first five rows of this dataset.

This dataset has five columns and 150 rows.

The first four columns are inputs and the fifth column is the output, meaning that there are four predictors.

We would describe the iris flowers dataset as:It is almost always the case that the number of predictors (p) will be smaller than the number of samples (n).

Often much smaller.

We can summarize this expectation as p << n, where “<<” is a mathematical inequality that means “much less than.

”To demonstrate this, let’s look at a few more standard machine learning datasets:Most machine learning algorithms operate based on the assumption that there are many more samples than predictors.

One way to think about predictors and samples is to take a geometrical perspective.

Consider a hypercube where the number of predictors (p) defines the number of dimensions of the hypercube.

The volume of this hypercube is the scope of possible samples that could be drawn from the domain.

The number of samples (n) are the actual samples drawn from the domain that you must use to model your predictive modeling problem.

This is a rationale for the axiom “get as much data as possible” in applied machine learning.

It is a desire to gather a sufficiently representative sample of the p-dimensional problem domain.

As the number of dimensions (p) increases, the volume of the domain increases exponentially.

This, in turn, requires more samples (n) from the domain to provide effective coverage of the domain for a learning algorithm.

We don’t need full coverage of the domain, just what is likely to be observable.

This challenge of effectively sampling high-dimensional spaces is generally referred to as the curse of dimensionality.

Machine learning algorithms overcome the curse of dimensionality by making assumptions about the data and structure of the mapping function from inputs to outputs.

They add a bias.

The fundamental reason for the curse of dimensionality is that high-dimensional functions have the potential to be much more complicated than low-dimensional ones, and that those complications are harder to discern.

The only way to beat the curse is to incorporate knowledge about the data that is correct.

— Page 15, Pattern Classification, 2000.

Many machine learning algorithms that use distance measures and other local models (in feature space) often degrade in performance as the number of predictors is increased.

When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made.

This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large.

— Page 168, An Introduction to Statistical Learning with Applications in R, 2017.

It is not always the case that the number of predictors is less than the number of samples.

Some predictive modeling problems have more predictors than samples by definition.

Often many more predictors than samples.

This is often described as “big-p, little-n,” “large-p, small-n,” or more compactly as “p >> n”, where the “>>” is a mathematical inequality operator that means “much more than.

”… prediction problems in which the number of features p is much larger than the number of observations N, often written p >> N.

— Page 649, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.

Consider this from a geometrical perspective.

Now, instead of having a domain with tens of dimensions (or fewer), the domain has many thousands of dimensions and only a few tens of samples from this space.

We cannot expect to have anything like a representative sample of the domain.

Many examples of p >> n problems come from the field of medicine, where there is a small patient population and a large number of descriptive characteristics.

At the same time, applications have emerged in which the number of experimental units is comparatively small but the underlying dimension is massive; illustrative examples might include image analysis, microarray analysis, document classification, astronomy and atmospheric science.

— Statistical challenges of high-dimensional data, 2009.

A common example of a p >> n problem is gene expression arrays, where there may be thousands of genes (predictors) and only tens of samples.

Gene expression arrays typically have 50 to 100 samples and 5,000 to 20,000 variables (genes).

— Expression Arrays and the p >> n Problem, 2003.

Given that most machine learning algorithms assume many more samples than predictors, this introduces a challenge when modeling.

Specifically, the assumptions made by standard machine learning models may cause the models to behave unexpectedly, provide misleading results, or fail completely.

… models cannot be used “out of the box”, since the standard fitting algorithms all require p<n; in fact the usual rule of thumb is that there be five or ten times as many samples as variables.

— Expression Arrays and the p >> n Problem, 2003.

A major problem with p >> n problems when using machine learning models is overfitting the training dataset.

Given the lack of samples, most models are unable to generalize and instead learn the statistical noise in the training data.

This makes the model perform well on the training dataset but perform poorly on new examples from the problem domain.

This is also a hard problem to diagnose, as the lack of samples does not allow for a test or validation dataset by which model overfitting can be evaluated.

As such, it is common to use leave-one-out style cross-validation (LOOCV) when evaluating models on p >> n problems.

There are many ways to approach a p >> n type classification or regression problem.

Some examples include:One approach is to ignore the p and n relationship and evaluate standard machine learning models.

This might be considered the baseline method by which any other more specialized interventions may be compared.

Feature selection involves selecting a subset of predictors to use as input to predictive models.

Common techniques include filter methods that select features based on their statistical relationship to the target variable (e.

g.

correlation), and wrapper methods that select features based on their contribution to a model when predicting the target variable (e.

g.

RFE).

A suite of feature selection methods could be evaluated and compared, perhaps applied in an aggressive manner to dramatically reduce the number of input features to those determined to be most critical.

Feature selection is an important scientific requirement for a classifier when p is large.

— Page 658, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.

For more on feature selection see the tutorial:Projection methods create a lower-dimensionality representation of samples that preserves the relationships observed in the data.

They are often used for visualization, although the dimensionality reduction nature of the techniques may also make them useful as a data transform to reduce the number of predictors.

This might include techniques from linear algebra, such as SVD and PCA.

When p > N, the computations can be carried out in an N-dimensional space, rather than p, via the singular value decomposition …— Page 659, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.

It might also include manifold learning algorithms often used for visualization such as t-SNE.

Standard machine learning algorithms may be adapted to use regularization during the training process.

This will penalize models based on the number of features used or weighting of features, encouraging the model to perform well and minimize the number of predictors used in the model.

This can act as a type of automatic feature selection during training and may involve augmenting existing models (e.

g regularized linear regression and regularized logistic regression) or the use of specialized methods such as LARS and LASSO.

There is no best method and it is recommended to use controlled experiments to test a suite of different methods.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the challenge of big-p, little n or p >> n machine learning problems.

Specifically, you learned:Do you have any questions? Ask your questions in the comments below and I will do my best to answer.

with just arithmetic and simple examplesDiscover how in my new Ebook: Master Machine Learning AlgorithmsIt covers explanations and examples of 10 top algorithms, like: Linear Regression, k-Nearest Neighbors, Support Vector Machines and much more.

Skip the Academics.

Just Results.

.