How to Improve Neural Network Stability and Modeling Performance With Data Scaling

Neural Nets FAQIf your problem is a regression problem, then the output will be a real value.

This is best modeled with a linear activation function.

If the distribution of the value is normal, then you can standardize the output variable.

Otherwise, the output variable can be normalized.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-CourseThere are two types of scaling of your data that you may want to consider: normalization and standardization.

These can both be achieved using the scikit-learn library.

Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.

Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values.

You may be able to estimate these values from your available data.

A value is normalized as follows:Where the minimum and maximum values pertain to the value x being normalized.

For example, for a dataset, we could guesstimate the min and max observable values as 30 and -10.

We can then normalize any value, like 18.

8, as follows:You can see that if an x value is provided that is outside the bounds of the minimum and maximum values, the resulting value will not be in the range of 0 and 1.

You could check for these observations prior to making predictions and either remove them from the dataset or limit them to the pre-defined maximum or minimum values.

You can normalize your dataset using the scikit-learn object MinMaxScaler.

Good practice usage with the MinMaxScaler and other scaling techniques is as follows:The default scale for the MinMaxScaler is to rescale variables into the range [0,1], although a preferred scale can be specified via the “feature_range” argument and specify a tuple including the min and the max for all variables.

If needed, the transform can be inverted.

This is useful for converting predictions back into their original scale for reporting or plotting.

This can be done by calling the inverse_transform() function.

The example below provides a general demonstration for using the MinMaxScaler to normalize data.

You can also perform the fit and transform in a single step using the fit_transform() function; for example:Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1.

It is sometimes referred to as “whitening.

”This can be thought of as subtracting the mean value or centering the data.

Like normalization, standardization can be useful, and even required in some machine learning algorithms when your data has input values with differing scales.

Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well behaved mean and standard deviation.

You can still standardize your data if this expectation is not met, but you may not get reliable results.

Standardization requires that you know or are able to accurately estimate the mean and standard deviation of observable values.

You may be able to estimate these values from your training data.

A value is standardized as follows:Where the mean is calculated as:And the standard_deviation is calculated as:We can guesstimate a mean of 10 and a standard deviation of about 5.

Using these values, we can standardize the first value of 20.

7 as follows:The mean and standard deviation estimates of a dataset can be more robust to new data than the minimum and maximum.

You can standardize your dataset using the scikit-learn object StandardScaler.

You can also perform the fit and transform in a single step using the fit_transform() function; for example:A regression predictive modeling problem involves predicting a real-valued quantity.

We can use a standard regression problem generator provided by the scikit-learn library in the make_regression() function.

This function will generate examples from a simple regression problem with a given number of input variables, statistical noise, and other properties.

We will use this function to define a problem that has 20 input features; 10 of the features will be meaningful and 10 will not be relevant.

A total of 1,000 examples will be randomly generated.

The pseudorandom number generator will be fixed to ensure that we get the same 1,000 examples each time the code is run.

Each input variable has a Gaussian distribution, as does the target variable.

We can demonstrate this by creating histograms of some of the input variables and the output variable.

Running the example creates two figures.

The first shows histograms of the first two of the twenty input variables, showing that each has a Gaussian data distribution.

Histograms of Two of the Twenty Input Variables for the Regression ProblemThe second figure shows a histogram of the target variable, showing a much larger range for the variable as compared to the input variables and, again, a Gaussian data distribution.

Histogram of the Target Variable for the Regression ProblemNow that we have a regression problem that we can use as the basis for the investigation, we can develop a model to address it.

We can develop a Multilayer Perceptron (MLP) model for the regression problem.

A model will be demonstrated on the raw data, without any scaling of the input or output variables.

We expect that model performance will be generally poor.

The first step is to split the data into train and test sets so that we can fit and evaluate a model.

We will generate 1,000 examples from the domain and split the dataset in half, using 500 examples for the train and test datasets.

Next, we can define an MLP model.

The model will expect 20 inputs in the 20 input variables in the problem.

A single hidden layer will be used with 25 nodes and a rectified linear activation function.

The output layer has one node for the single target variable and a linear activation function to predict real values directly.

The mean squared error loss function will be used to optimize the model and the stochastic gradient descent optimization algorithm will be used with the sensible default configuration of a learning rate of 0.

01 and a momentum of 0.

9.

The model will be fit for 100 training epochs and the test set will be used as a validation set, evaluated at the end of each training epoch.

The mean squared error is calculated on the train and test datasets at the end of training to get an idea of how well the model learned the problem.

Finally, learning curves of mean squared error on the train and test sets at the end of each training epoch are graphed using line plots, providing learning curves to get an idea of the dynamics of the model while learning the problem.

Tying these elements together, the complete example is listed below.

Running the example fits the model and calculates the mean squared error on the train and test sets.

In this case, the model is unable to learn the problem, resulting in predictions of NaN values.

The model weights exploded during training given the very large errors and, in turn, error gradients calculated for weight updates.

This demonstrates that, at the very least, some data scaling is required for the target variable.

A line plot of training history is created but does not show anything as the model almost immediately results in a NaN mean squared error.

The MLP model can be updated to scale the target variable.

Reducing the scale of the target variable will, in turn, reduce the size of the gradient used to update the weights and result in a more stable model and training process.

Given the Gaussian distribution of the target variable, a natural method for rescaling the variable would be to standardize the variable.

This requires estimating the mean and standard deviation of the variable and using these estimates to perform the rescaling.

It is best practice is to estimate the mean and standard deviation of the training dataset and use these variables to scale the train and test dataset.

This is to avoid any data leakage during the model evaluation process.

The scikit-learn transformers expect input data to be matrices of rows and columns, therefore the 1D arrays for the target variable will have to be reshaped into 2D arrays prior to the transforms.

We can then create and apply the StandardScaler to rescale the target variable.

Rescaling the target variable means that estimating the performance of the model and plotting the learning curves will calculate an MSE in squared units of the scaled variable rather than squared units of the original scale.

This can make interpreting the error within the context of the domain challenging.

In practice, it may be helpful to estimate the performance of the model by first inverting the transform on the test dataset target variable and on the model predictions and estimating model performance using the root mean squared error on the unscaled data.

This is left as an exercise to the reader.

The complete example of standardizing the target variable for the MLP on the regression problem is listed below.

Running the example fits the model and calculates the mean squared error on the train and test sets.

Your specific results may vary given the stochastic nature of the learning algorithm.

Try running the example a few times.

In this case, the model does appear to learn the problem and achieves near-zero mean squared error, at least to three decimal places.

A line plot of the mean squared error on the train (blue) and test (orange) dataset over each training epoch is created.

In this case, we can see that the model rapidly learns to effectively map inputs to outputs for the regression problem and achieves good performance on both datasets over the course of the run, neither overfitting or underfitting the training dataset.

Line Plot of Mean Squared Error on the Train a Test Datasets for Each Training EpochIt may be interesting to repeat this experiment and normalize the target variable instead and compare results.

We have seen that data scaling can stabilize the training process when fitting a model for regression with a target variable that has a wide spread.

It is also possible to improve the stability and performance of the model by scaling the input variables.

In this section, we will design an experiment to compare the performance of different scaling methods for the input variables.

The input variables also have a Gaussian data distribution, like the target variable, therefore we would expect that standardizing the data would be the best approach.

This is not always the case.

We can compare the performance of the unscaled input variables to models fit with either standardized and normalized input variables.

The first step is to define a function to create the same 1,000 data samples, split them into train and test sets, and apply the data scaling methods specified via input arguments.

The get_dataset() function below implements this, requiring the scaler to be provided for the input and target variables and returns the train and test datasets split into input and output components ready to train and evaluate a model.

Next, we can define a function to fit an MLP model on a given dataset and return the mean squared error for the fit model on the test dataset.

The evaluate_model() function below implements this behavior.

Neural networks are trained using a stochastic learning algorithm.

This means that the same model fit on the same data may result in a different performance.

We can address this in our experiment by repeating the evaluation of each model configuration, in this case a choice of data scaling, multiple times and report performance as the mean of the error scores across all of the runs.

We will repeat each run 30 times to ensure the mean is statistically robust.

The repeated_evaluation() function below implements this, taking the scaler for input and output variables as arguments, evaluating a model 30 times with those scalers, printing error scores along the way, and returning a list of the calculated error scores from each run.

Finally, we can run the experiment and evaluate the same model on the same dataset three different ways:The mean and standard deviation of the error for each configuration is reported, then box and whisker plots are created to summarize the error scores for each configuration.

Tying these elements together, the complete example is listed below.

Running the example prints the mean squared error for each model run along the way.

After each of the three configurations have been evaluated 30 times each, the mean errors for each are reported.

Your specific results may vary, but the general trend should be the same as is listed below.

In this case, we can see that as we expected, scaling the input variables does result in a model with better performance.

Unexpectedly, better performance is seen using normalized inputs instead of standardized inputs.

This may be related to the choice of the rectified linear activation function in the first hidden layer.

A figure with three box and whisker plots is created summarizing the spread of error scores for each configuration.

The plots show that there was little difference between the distributions of error scores for the unscaled and standardized input variables, and that the normalized input variables result in better performance and more stable or a tighter distribution of error scores.

These results highlight that it is important to actually experiment and confirm the results of data scaling methods rather than assuming that a given data preparation scheme will work best based on the observed distribution of the data.

Box and Whisker Plots of Mean Squared Error With Unscaled, Normalized and Standardized Input Variables for the Regression ProblemThis section lists some ideas for extending the tutorial that you may wish to explore.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to improve neural network stability and modeling performance by scaling data.

Specifically, you learned:Do you have any questions?.Ask your questions in the comments below and I will do my best to answer.

…with just a few lines of python codeDiscover how in my new Ebook: Better Deep LearningIt provides self-study tutorials on topics like: weight decay, batch normalization, dropout, model stacking and much more…Skip the Academics.

Just Results.

Click to learn more.

.

. More details

Leave a Reply