Normalization vs Standardization — Quantitative analysis

Stop using Sklearn's StandardScaler as your default feature scaling method: choosing a better scaler can get you a boost of up to 7% in accuracy, even when your hyperparameters are tuned!

Shay Geller · Apr 4

Image source: https://365datascience.com/standardization/

Every ML practitioner knows that feature scaling is an important issue (read more here).

The two most discussed scaling methods are Normalization and Standardization.

Normalization typically means rescaling the values into a range of [0, 1].

Standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).
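To make the two definitions concrete, here is a minimal sketch (my illustration, not part of the original experiments) that applies Sklearn's MinMaxScaler and StandardScaler to a single toy feature:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy feature with values on an arbitrary scale
X = np.array([[1.0], [5.0], [10.0], [20.0]])

# Normalization: rescale into [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())   # approx. [0.  0.211  0.474  1.]

# Standardization: zero mean, unit variance
z = StandardScaler().fit_transform(X).ravel()
print(z.round(3), z.mean().round(3), z.std().round(3))   # mean ~0.0, std ~1.0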

In this blog post, I conducted a few experiments and hope to answer questions like:

Should we always scale our features?
Is there a single best scaling technique?
How do different scaling techniques affect different classifiers?
Should we consider the scaling technique as an important hyperparameter of our model?

I'll analyze the empirical results of applying different scaling methods to the features in multiple experiment settings.

Table of Contents

0. Why are we here?
1. Out-of-the-box classifiers
2. Classifier + Scaling
3. Classifier + Scaling + PCA
4. Classifier + Scaling + PCA + Hyperparameter Tuning
5. All again on more datasets:
— 5.1 Rain in Australia dataset
— 5.2 Bank Marketing dataset
— 5.3 Sloan Digital Sky Survey DR14 dataset
— 5.4 Income classification dataset
Conclusions

0. Why are we here?

First, I was trying to understand the difference between Normalization and Standardization.

So, I came across this excellent blog post by Sebastian Raschka, which supplies the mathematical background and satisfied my curiosity.

Please take 5 minutes to read this blog if you are not familiar with these concepts.

There is also a great explanation of the need for scaling features when dealing with classifiers that train using gradient descent methods (like neural networks), by the famous Geoffrey Hinton, here.

OK, we grabbed some math. Is that it? Not quite. When I checked the popular ML library Sklearn, I saw that there are lots of different scaling methods.

There is a great visualization of the effect of different scalers on data with outliers.

But it doesn't show how each scaler affects a classification task with different classifiers.

I saw a lot of ML pipelines tutorials that use StandardScaler (usually called Z-score Standardization) or MinMaxScaler (usually called min-max Normalization) to scale features.

Why does no one use the other scaling techniques for classification? Is it possible that StandardScaler or MinMaxScaler really are the best scaling methods? I didn't see any explanation in the tutorials about why or when to use each one of them, so I thought I'd investigate the performance of these techniques by running some experiments.

This is what this notebook is all about.

Project details

Like many data science projects, let's read some data and experiment with several out-of-the-box classifiers.

Dataset

Sonar dataset.

It contains 208 rows and 60 feature columns.

It’s a classification task to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

It’s a balanced dataset:

sonar[60].value_counts()  # 60 is the label column name

M    111
R     97

All the features in this dataset are between 0 and 1, but it is not guaranteed that 1 is the max value or 0 is the min value in each feature.
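A quick way to check that claim (a sketch, assuming the sonar DataFrame above was read with header=None, so the 60 feature columns are labeled 0–59):

feature_cols = sonar.columns[:-1]                                        # every column except the label (60)
print(sonar[feature_cols].min().min(), sonar[feature_cols].max().max())  # global min and max over all features
print((sonar[feature_cols].max() == 1.0).sum())                          # how many features actually reach 1.0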

I chose this dataset because, on the one hand, it is small, so I can experiment quickly.

On the other hand, it's a hard problem, and none of the classifiers achieves anything close to 100% accuracy, so we can compare meaningful results.

We will experiment with more datasets in the last section.

Code

As a preprocessing step, I already calculated all the results (it takes some time).

So we only load the results file and work with it.

The code that produces the results can be found in my GitHub repository: https://github.com/shaygeller/Normalization_vs_Standardization.git

I picked some of the most popular classification models from the Sklearn library, denoted as:

(MLP is a Multi-Layer Perceptron, a neural network)

The scalers and normalizers I used are denoted as:

Experiment details:

The same seed was used whenever needed, for reproducibility.

I randomly split the data into train and test sets of 80% and 20%, respectively.

All results are accuracy scores on 10-fold random cross-validation splits from the train set.
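To show what that setup looks like in code, here is a minimal sketch of evaluating one scaler + classifier combination (the real code lives in the GitHub repo above; the column handling, seed, and model choice here are illustrative assumptions):

import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = sonar.drop(columns=[60]).values   # 60 is the label column
y = sonar[60].values

# 80%-20% train-test split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Putting the scaler and classifier in one pipeline means the scaler is fit
# only on the training part of each CV fold, avoiding leakage
pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# 10-fold random cross-validation on the train set only
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="accuracy")
print(scores.mean().round(3), scores.std().round(3))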

I do not discuss the results on the test set here.

Usually, the test set should be kept hidden, and all of our conclusions about our classifiers should be taken only from the cross-validation scores.

In part 4, I performed nested cross-validation.

One inner cross-validation with 5 random splits for hyperparameter tuning, and another outer CV with 10 random splits to get the model’s score using the best parameters.

Also in this part, all data was taken only from the train set.
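In code, that nested scheme might look like the following sketch (reusing pipe, X_train, and y_train from the previous sketch; the SVM parameter grid is just an illustrative assumption):

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Inner loop: 5-fold CV picks the hyperparameters
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.1, 1]}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")

# Outer loop: 10-fold CV scores the tuned model, still using only the train set
outer_cv = KFold(n_splits=10, shuffle=True, random_state=42)
nested_scores = cross_val_score(search, X_train, y_train, cv=outer_cv, scoring="accuracy")
print(nested_scores.mean().round(3))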

A picture is worth a thousand words: https://sebastianraschka.com/faq/docs/evaluate-a-model.html

Let's read the results file:

import os
import pandas as pd

results_file = "sonar_results.csv"
results_df = pd.read_csv(os.path.join(".", "data", "processed", results_file)).dropna().round(3)
results_df

1. Out-of-the-box classifiers

import operator

results_df.loc[operator.and_(results_df["Classifier_Name"].str.startswith("_"),
                             ~results_df["Classifier_Name"].str.endswith("PCA"))].dropna()

Nice results.

By looking at the CV_mean column, we can see that, at the moment, MLP is leading.

SVM has the worst performance.

Standard deviation is pretty much the same, so we can judge mainly by the mean score.

All the results below will be the mean score of 10-fold cross-validation random splits.

Now, let's see how different scaling methods change the scores for each classifier.

2. Classifiers + Scaling

import operator
import numpy as np

temp = results_df.loc[~results_df["Classifier_Name"].str.endswith("PCA")].dropna()
temp["model"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[1])
temp["scaler"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[0])

def df_style(val):
    return 'font-weight: 800'

pivot_t = pd.pivot_table(temp, values='CV_mean', index=["scaler"], columns=['model'], aggfunc=np.sum)
pivot_t_bold = pivot_t.style.applymap(df_style, subset=pd.IndexSlice[pivot_t["CART"].idxmax(), "CART"])
for col in list(pivot_t):
    pivot_t_bold = pivot_t_bold.applymap(df_style, subset=pd.IndexSlice[pivot_t[col].idxmax(), col])
pivot_t_bold

The first row, the one without an index name, is the algorithm without any scaling method applied.

import operator

cols_max_vals = {}
cols_max_row_names = {}
for col in list(pivot_t):
    row_name = pivot_t[col].idxmax()
    cell_val = pivot_t[col].max()
    cols_max_vals[col] = cell_val
    cols_max_row_names[col] = row_name

sorted_cols_max_vals = sorted(cols_max_vals.items(), key=lambda kv: kv[1], reverse=True)

print("Best classifiers sorted:")
counter = 1
for model, score in sorted_cols_max_vals:
    print(str(counter) + ". " + model + " + " + cols_max_row_names[model] + " : " + str(score))
    counter += 1

Best classifier from each model:

1. SVM + StandardScaler : 0.849
2. MLP + PowerTransformer-Yeo-Johnson : 0.839
3. KNN + MinMaxScaler : 0.813
4. LR + QuantileTransformer-Uniform : 0.808
5. NB + PowerTransformer-Yeo-Johnson : 0.752
6. LDA + PowerTransformer-Yeo-Johnson : 0.747
7. CART + QuantileTransformer-Uniform : 0.74
8. RF + Normalizer : 0.723

Let's analyze the results

There is no single scaling method to rule them all.

We can see that scaling improved the results.

SVM, MLP, KNN, and NB got a significant boost from different scaling methods.

Notice that NB, RF, LDA, CART are unaffected by some of the scaling methods.

This is, of course, related to how each of the classifiers works.

Trees are not affected by scaling because the splitting criterion first orders the values of each feature and then calculates the Gini impurity or entropy of the split. Some scaling methods preserve this order, so there is no change to the accuracy score.

NB is not affected because the model's priors are determined by the counts in each class and not by the actual feature values.

Linear Discriminant Analysis (LDA) finds its coefficients using the variation between the classes (check this), so the scaling doesn't matter either.

Some of the scaling methods, like QuantileTransformer-Uniform, do not preserve the exact order of the values in each feature, hence the change in score even for the classifiers above that were agnostic to the other scaling methods.
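If you want to convince yourself of the tree argument, here is a small sketch (on a different, built-in dataset, purely for illustration): an order-preserving scaler such as MinMaxScaler should leave a decision tree's cross-validation scores unchanged.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

raw_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
scaled_scores = cross_val_score(
    Pipeline([("scaler", MinMaxScaler()),
              ("tree", DecisionTreeClassifier(random_state=0))]),
    X, y, cv=cv)

# MinMaxScaler is a per-feature, order-preserving affine transform, so the tree finds
# the same partitions (only the threshold values change) and the scores should match
print(np.allclose(raw_scores, scaled_scores))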

3. Classifier + Scaling + PCA

We know that some well-known ML methods like PCA can benefit from scaling (blog).

Let’s try adding PCA(n_components=4) to the pipeline and analyze the results.
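For reference, one of these scaler + PCA + classifier pipelines might be assembled like this sketch (reusing X_train, y_train, and the cv splitter from the sonar sketches above; the scaler and classifier shown are arbitrary examples, not the author's exact code):

from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.svm import SVC

pipe_pca = Pipeline([
    ("scaler", QuantileTransformer(n_quantiles=100, output_distribution="uniform")),  # any scaler under test
    ("pca", PCA(n_components=4)),
    ("clf", SVC()),
])
scores = cross_val_score(pipe_pca, X_train, y_train, cv=cv, scoring="accuracy")
print(scores.mean().round(3))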

import operator

temp = results_df.copy()
temp["model"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[1])
temp["scaler"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[0])

def df_style(val):
    return 'font-weight: 800'

pivot_t = pd.pivot_table(temp, values='CV_mean', index=["scaler"], columns=['model'], aggfunc=np.sum)
pivot_t_bold = pivot_t.style.applymap(df_style, subset=pd.IndexSlice[pivot_t["CART"].idxmax(), "CART"])
for col in list(pivot_t):
    pivot_t_bold = pivot_t_bold.applymap(df_style, subset=pd.IndexSlice[pivot_t[col].idxmax(), col])
pivot_t_bold

Let's analyze the results

Most of the time, scaling methods improve models with PCA, but no specific scaling method dominates.

Let's look at "QuantileTransformer-Uniform", the method with the most high scores.

In LDA-PCA it improved the results from 0.704 to 0.783 (an 8% jump in accuracy!), but in RF-PCA it made things worse, going from 0.711 to 0.668 (a 4.35% drop in accuracy!). On the other hand, using RF-PCA with "QuantileTransformer-Normal" improved the accuracy to 0.766 (a 5% jump!).

We can see that PCA only improves LDA and RF, so PCA is not a magic solution.

It’s fine.

We didn't tune the n_components parameter, and even if we had, PCA is not guaranteed to improve predictions.

We can see that StandardScaler and MinMaxScaler achieve the best score in only 4 out of 16 cases. So we should think carefully about which scaling method to choose, even as a default one.

We can conclude that even though PCA is a known component that benefits from scaling, no single scaling method always improved our results, and some of them even caused harm (RF-PCA with StandardScaler).

The dataset is also a great factor here.

To better understand the consequences of scaling methods on PCA, we should experiment with more diverse datasets (class-imbalanced datasets, features on different scales, and datasets with both numerical and categorical features).

I’m doing this analysis in section 5.

4. Classifiers + Scaling + PCA + Hyperparameter tuning

There are big differences in the accuracy score between different scaling methods for a given classifier. One might assume that once the hyperparameters are tuned, the differences between the scaling techniques will be minor and we could simply use StandardScaler or MinMaxScaler, as in many classification pipeline tutorials on the web. Let's check that!

First, NB is not here; that's because NB has no parameters to tune.

We can see that almost all the algorithms benefit from hyperparameter tuning compared to the results from the previous step. An interesting exception is MLP, which got worse results. This is probably because neural networks can easily overfit the data (especially when the number of parameters is much bigger than the number of training samples), and we didn't perform careful early stopping to avoid it, nor did we apply any regularization.

Yet, even when the hyperparameters are tuned, there are still big differences between the results using different scaling methods.

If we compare the other scaling techniques to the broadly used StandardScaler, we can gain up to a 7% improvement in accuracy (KNN column) by experimenting with other techniques.

The main conclusion from this step is that even though the hyperparameters are tuned, changing the scaling method can dramatically affect the results.

So, we should consider the scaling method as a crucial hyperparameter of our model.
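One practical way to treat it that way is to put the scaler itself into the hyperparameter grid, as in this sketch (reusing pipe, inner_cv, X_train, and y_train from the earlier sketches; the candidate lists are illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import (MinMaxScaler, PowerTransformer, QuantileTransformer,
                                   RobustScaler, StandardScaler)

param_grid = {
    # The "scaler" step of the pipeline is searched over just like any other hyperparameter
    "scaler": [StandardScaler(), MinMaxScaler(), RobustScaler(), PowerTransformer(),
               QuantileTransformer(n_quantiles=100, output_distribution="uniform")],
    "clf__C": [0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)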

Part 5 contains a more in-depth analysis of more diverse datasets.

If you don’t want to deep dive into it, feel free to jump to the conclusion section.

5. All again on more datasets

To get a better understanding and to derive more general conclusions, we should experiment with more datasets.

We will apply the Classifier + Scaling + PCA setup from section 3 to several datasets with different characteristics and analyze the results.

All datasets were taken from Kaggle.

For the sake of convenience, I selected only the numerical columns out of each dataset.

In multivariate datasets (numeric and categorical features), there is an ongoing debate about how to scale the features.

I didn't tune any of the classifiers' hyperparameters.
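For example, keeping only the numeric columns of a dataset is a one-liner in pandas (a sketch; the file name is hypothetical):

import pandas as pd

dataset = pd.read_csv("some_kaggle_dataset.csv")      # hypothetical file name
dataset = dataset.select_dtypes(include="number")     # drop the categorical columns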

5.1 Rain in Australia dataset

Link

Classification task: Predict whether it is going to rain.
Metric: Accuracy
Dataset shape: (56420, 18)
Counts for each class:
No  43993
Yes 12427

Here is a sample of 5 rows (we can't show all the columns in one picture).

dataset.describe()

We suspect that scaling will improve classification results due to the different scales of the features (check the min and max values in the table above; it gets even worse for some of the remaining features).

Results

Results analysis

We can see that neither StandardScaler nor MinMaxScaler ever got the highest score.

We can see differences of up to 20% between StandardScaler and other methods (CART-PCA column).

We can see that scaling usually improved the results. Take, for example, SVM, which jumped from 78% to 99%.

5.2 Bank Marketing dataset

Link

Classification task: Predict whether the client has subscribed to a term deposit.
Metric: AUC (the data is imbalanced)
Dataset shape: (41188, 11)
Counts for each class:
no  36548
yes  4640

Here is a sample of 5 rows (we can't show all the columns in one picture).

dataset.describe()

Again, the features are on different scales.

ResultsResults analysisWe can see that in this dataset, even though the features are on different scales, scaling when using PCA doesn’t always improve the results.

However, the second-best score in each PCA column is pretty close to the best score.

It might indicate that tuning the number of PCA components together with scaling would improve the results over not scaling at all.
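A sketch of what tuning the number of components could look like, assuming a pipe_pca-style pipeline (as in the earlier sketch) fit on this dataset's train split, with AUC as the scoring metric:

from sklearn.model_selection import GridSearchCV

param_grid = {"pca__n_components": [2, 4, 6, 8]}   # search over PCA size instead of fixing it
search = GridSearchCV(pipe_pca, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))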

Again, there is no one single scaling method that stood out.

Another interesting result is that in most models, the different scaling methods didn't have much effect (usually a 1%–3% improvement).

Let's remember that this is an imbalanced dataset and we didn't tune the hyperparameters.

Another reason is that the AUC score is already high (~90%), so it’s harder to see major improvements.

5.3 Sloan Digital Sky Survey DR14 dataset

Link

Classification task: Predict whether an object is a galaxy, a star, or a quasar.
Metric: Accuracy (multiclass)
Dataset shape: (10000, 18)
Counts for each class:
GALAXY 4998
STAR   4152
QSO     850

Here is a sample of 5 rows (we can't show all the columns in one picture).

dataset.describe()

Again, the features are on different scales.

Results

Results analysis

We can see that scaling greatly improved the results. We could expect this, because the dataset contains features on different scales.

We can see that RobustScaler almost always wins when we use PCA.

It might be due to the many outliers in this dataset that shift the PCA eigenvectors.

On the other hand, those outliers don't have such an effect when we do not use PCA.

We should do some data exploration to check that.

There is up to a 5% difference in accuracy if we compare StandardScaler to the other scaling methods. So it's another indication of the need to experiment with multiple scaling techniques.

PCA almost always benefits from scaling.

5.4 Income classification dataset

Link

Classification task: Predict whether income is >50K or <=50K.
Metric: AUC (imbalanced dataset)
Dataset shape: (32561, 7)
Counts for each class:
<=50K 24720
>50K   7841

Here is a sample of 5 rows (we can't show all the columns in one picture).

dataset.describe()

Again, the features are on different scales.

Results

Results analysis

Here again we have an imbalanced dataset, but we can see that scaling does a good job of improving the results (by up to 20%!).

This is probably because the AUC score is lower (~80%) compared to the Bank Marketing dataset, so it’s easier to see major improvements.

Even though StandardScaler is not highlighted (I highlighted only the first best score in each column), in many columns it achieves the same result as the best scaler, but not always.

From the running time results (not shown here), I can tell you that StandardScaler runs much faster than many of the other scalers.

So if you are in a rush to get some results, it can be a good starting point.

But if you want to squeeze every last percent out of your model, you might want to experiment with multiple scaling methods.

Again, there is no single best scaling method.

PCA almost always benefited from scaling.

Conclusions

Experimenting with multiple scaling methods can dramatically increase your score on classification tasks, even when your hyperparameters are tuned.

So, you should consider the scaling method as an important hyperparameter of your model.

Scaling methods affect different classifiers differently. Distance-based classifiers like SVM, KNN, and MLP (a neural network) benefit dramatically from scaling. But even trees (CART, RF), which are agnostic to some of the scaling methods, can benefit from other methods.

Knowing the underlying math behind the models and the preprocessing methods is the best way to understand the results.

(For example, how trees work and why some of the scaling methods didn’t affect them).

It can also save you a lot of time if you know, for example, not to bother applying StandardScaler when your model is a Random Forest.

Preprocessing methods like PCA, which are known to benefit from scaling, do benefit from scaling. When they don't, it might be due to a bad setting of PCA's number-of-components parameter, outliers in the data, or a bad choice of scaling method.

If you find some mistakes or have proposals to improve the coverage or the validity of the experiments, please notify me.
