How Statistical Norms Improve Modeling

Madeline Schiappa

May 31

Introduction

A regularizer is commonly used in machine learning to constrain a model’s capacity to certain bounds, based either on a statistical norm or on prior hypotheses.

This adds a preference for one solution over another in the model’s hypothesis space, that is, the set of functions that the learning algorithm is allowed to select as the solution [1].

The primary aim of this method is to improve the generalizability of a model, or to improve a model’s performance on previously unseen data.

Using a regularizer improves generalizability because it reduces overfitting of the model to the training data.

The most common practice is to add a norm penalty on the objective function during the learning process.

The regularized objective function adds a penalty to the original objective function [1]:

$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \, \Omega(\theta)$$

The original objective function, J, is a function of the parameters θ, the true label y, and the input X.

The regularizer consists of the norm penalty function Ω and a penalty coefficient α ≥ 0 that weights the contribution of Ω relative to J.
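
As a rough illustration of how the pieces fit together, the sketch below evaluates a regularized objective in NumPy. The choice of a linear model with mean squared error for J, the squared L² penalty for Ω, and the function names `mse_loss`, `l2_penalty`, and `regularized_objective` are all assumptions made for this example, not something specified in [1].

```python
import numpy as np

def mse_loss(theta, X, y):
    """Original objective J(theta; X, y): half the mean squared error of a linear model."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

def l2_penalty(theta):
    """Omega(theta): half the squared L2 norm of the parameters (an assumed choice)."""
    return 0.5 * np.sum(theta ** 2)

def regularized_objective(theta, X, y, alpha, omega=l2_penalty):
    """J~(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)."""
    return mse_loss(theta, X, y) + alpha * omega(theta)

# Toy data that theta fits exactly, so J is zero and only the penalty remains.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
theta = np.array([1.0, 2.0])

unpenalized = regularized_objective(theta, X, y, alpha=0.0)
penalized = regularized_objective(theta, X, y, alpha=0.1)
print(unpenalized, penalized)
```

With α = 0 the regularized objective reduces to J, and increasing α makes parameters with a large norm more costly even when they fit the training data perfectly.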

The next section will provide an introduction to some penalty norms that are commonly used.

Commonly used Statistical Norms

Norms are a method of measuring the length or magnitude of vectors.

A vector norm is calculated using some measure that can summarize the distance of the vector from the origin.

These different measures are most often the L¹ norm and the L² norm.

The L¹ norm is calculated as the sum of the absolute values of the vector’s components and is often referred to as the Manhattan norm: ||x||₁ = |x₁| + |x₂| + … + |xₙ|, where |·| is the absolute value of a given variable.

While this is the vector norm, the calculation changes slightly when it is applied to matrices.

The entrywise matrix L¹ norm, for example, sums the absolute values of all entries of an m × n matrix A: ||A||₁ = |a₁₁| + |a₁₂| + … + |aₘₙ|.

The L² norm is also commonly referred to as the Euclidean norm.

This norm measures the distance from the origin to the point x.

The Euclidean norm for a vector is:

$$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$$

The L∞ norm, or max-norm, measures the length of the vector as its largest absolute component: ||x||∞ = max(|x₁|, |x₂|, …, |xₙ|).

These norms and their variations can all be expressed through Lᵖ, or the p-norm.

The p-norm for vectors (from Wikipedia) is:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

The entrywise p-norm for matrices (from Wikipedia) is:

$$\|A\|_p = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^p \right)^{1/p}$$

When p = 1, we get the L¹ norm, and when p = 2, we get the L² norm.

As p approaches infinity, we get the L∞ norm.
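
The relationship between these norms can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper `p_norm` and the example vector `x` are made up for illustration, and the results are cross-checked against `numpy.linalg.norm`.

```python
import numpy as np

def p_norm(x, p):
    """Vector p-norm: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 1.0])

l1 = p_norm(x, 1)            # |3| + |-4| + |1|
l2 = p_norm(x, 2)            # sqrt(9 + 16 + 1)
l_inf = np.max(np.abs(x))    # largest absolute component

# Cross-check against NumPy's built-in vector norms.
assert np.isclose(l1, np.linalg.norm(x, ord=1))
assert np.isclose(l2, np.linalg.norm(x, ord=2))
assert np.isclose(l_inf, np.linalg.norm(x, ord=np.inf))

# As p grows, the p-norm approaches the max-norm.
for p in (1, 2, 4, 16, 64):
    print(p, p_norm(x, p))
```

The printed values decrease toward the largest absolute component of `x`, which is the behavior described for p approaching infinity.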

How Norms are Used in Regularization

Weight decay is a method that uses the L² norm as the penalty: it prefers weights with a smaller L² norm, driving the weights closer to the origin (see Figure 1).

The result is that the learning rule multiplicatively shrinks the weights by a constant factor at each step before performing a gradient update [1].

In other words, it constrains the weights to lie in a region limited by the L² norm.

Figure 1. A toy visual example of weight decay.
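
The multiplicative shrinkage can be seen by writing out one gradient step. The sketch below assumes a linear least-squares objective for J and the penalty (α/2)·||w||₂²; the function names, learning rate, and synthetic data are hypothetical choices for illustration. It checks that a gradient step on the penalized objective equals shrinking the weights by the factor (1 − lr·α) and then taking a plain gradient step on J.

```python
import numpy as np

def grad_J(w, X, y):
    """Gradient of the unregularized objective J = 0.5 * mean((Xw - y)^2)."""
    return X.T @ (X @ w - y) / len(y)

def step_penalized(w, X, y, lr, alpha):
    """One gradient step on J + (alpha / 2) * ||w||_2^2."""
    return w - lr * (grad_J(w, X, y) + alpha * w)

def step_weight_decay(w, X, y, lr, alpha):
    """Shrink w by the constant factor (1 - lr * alpha), then step along -grad J."""
    return (1.0 - lr * alpha) * w - lr * grad_J(w, X, y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5])   # noiseless synthetic targets
w0 = np.ones(3)

# The two formulations produce the same update.
same_update = np.allclose(
    step_penalized(w0, X, y, lr=0.1, alpha=1.0),
    step_weight_decay(w0, X, y, lr=0.1, alpha=1.0),
)

def train(alpha, steps=500, lr=0.1):
    """Run gradient descent with weight decay from a zero initialization."""
    w = np.zeros(3)
    for _ in range(steps):
        w = step_weight_decay(w, X, y, lr, alpha)
    return w

norm_plain = np.linalg.norm(train(alpha=0.0))
norm_decay = np.linalg.norm(train(alpha=1.0))
print(same_update, norm_plain, norm_decay)
```

On this toy problem the weights trained with decay end up with a smaller L² norm than those trained without it, which is the pull toward the origin that Figure 1 depicts.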

Different choices of the norm used for Ω can result in different solutions preferred (see Figure 2).

A well-known difference between the L¹ and L² penalty norms is that L¹ results in sparser solutions, meaning that some parameters have an optimal value of exactly 0.

This is commonly used for feature selection, in which features whose parameters are optimally 0 are removed.

Figure 2. A toy visual example of other choices for Ω.
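
To make the sparsity difference concrete, the sketch below uses a deliberately simplified setting that is an assumption of this example, not of the article: a separable quadratic objective 0.5·(w − a)² per parameter, where `a` is a hypothetical unregularized optimum. In that setting the L¹-penalized minimizer is the soft-thresholding operator and the L²-penalized minimizer is a uniform rescaling.

```python
import numpy as np

def l1_solution(a, alpha):
    """Elementwise minimizer of 0.5 * (w - a)^2 + alpha * |w| (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - alpha, 0.0)

def l2_solution(a, alpha):
    """Elementwise minimizer of 0.5 * (w - a)^2 + 0.5 * alpha * w^2."""
    return a / (1.0 + alpha)

a = np.array([3.0, -0.2, 0.05, -1.5])   # hypothetical unregularized optimum
alpha = 0.5

w_l1 = l1_solution(a, alpha)   # small entries are set exactly to zero
w_l2 = l2_solution(a, alpha)   # every entry shrinks, none becomes zero

# Feature selection: keep only the features with nonzero L1 weights.
selected = np.flatnonzero(w_l1)
print(w_l1, w_l2, selected)
```

Entries of `a` whose magnitude is below α are zeroed out by the L¹ penalty, while the L² penalty only scales them down, so only the L¹ solution identifies features that can be dropped.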

Multitask learning is a learning problem in which several similar tasks are learned simultaneously.

For example, tasks could be different classes in a multi-class learning problem.

For each task, a different set of parameters is learned.

The idea is that there is a sharing of information across the tasks from which they can benefit.

In other words, “among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks” [1].

The goal of this methodology is to improve generalizability overall.

Figure 3. Toy visualization of multi-task learning.

It is common to use prior knowledge on how the tasks relate to each other to constrain the different weight vectors for each task (again see Figure 2).

These constraints can be the same as those mentioned above, e.g. the L¹ norm.

This is commonly done by applying the norm over columns of the matrix.

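An example is a combination of the L¹ and L² norms in which the L² norm is applied to each column and the L¹ norm is then applied across the resulting column norms. The penalty function, where R is equivalent to Ω, is:

$$R(W) = \sum_{j} \|w_j\|_2 = \sum_{j} \sqrt{\sum_{i} w_{ij}^2}$$

where w_j denotes the j-th column of the weight matrix W.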
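
A minimal NumPy sketch of this combined penalty follows. The function name `l21_penalty` and the example matrix `W` are hypothetical; the computation takes the L² norm of each column and then sums those norms, which is the L¹ norm of the vector of non-negative column norms.

```python
import numpy as np

def l21_penalty(W):
    """Sum over columns j of the L2 norm of column j."""
    column_norms = np.linalg.norm(W, ord=2, axis=0)   # one L2 norm per column
    return np.sum(column_norms)                       # L1 norm over the column norms

W = np.array([[3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0]])

mixed = l21_penalty(W)          # column norms are 5, 0, and 1
entrywise_l1 = np.sum(np.abs(W))
print(mixed, entrywise_l1)
```

Because the outer norm is L¹, this penalty tends to push entire columns toward zero together, whereas the entrywise L¹ norm treats every entry independently.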

Conclusion

There are other ways to regularize that do not involve statistical norms, such as adding noise, early stopping of the learning algorithm, and data augmentation.

However, this article focuses on the use of statistical norms to add constraints to the learning algorithm as a means to improve generalizability of a model.

References

[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.