The penalty of missing values in Data Science

No, that would imply under-utilizing our potential.

But again, the rigidity remains, as we are still using a single value — mean/median/mode.

We’ll discuss more about this in the next section shortly.

For now, let’s replace the NaNs with the mean (in c0), the median (in c1) and the mode (in c3).

Before that, let’s deal with the garbage value ‘#$%’ at (‘i2’, ‘c3’).

The respective values are:

We’ll use 3 different methods to replace NaNs.

It looks like we’ll have to drop the c2 column altogether as it contains no data.

Note that in the beginning one row and one column were completely filled with NaNs, but we were only able to deal with the row, not the column.

Dropping c2.
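The hard-imputation steps above can be sketched in pandas roughly as follows. The DataFrame here is made up for illustration (only the labels c0 to c3 and i2 come from the text), so treat it as a minimal sketch rather than the article's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the values are invented, only the labels follow the article.
df = pd.DataFrame(
    {
        "c0": [1.0, np.nan, 3.0, 8.0],
        "c1": [10.0, 20.0, np.nan, 90.0],
        "c2": [np.nan, np.nan, np.nan, np.nan],  # column with no data at all
        "c3": ["a", "a", "#$%", np.nan],
    },
    index=["i0", "i1", "i2", "i3"],
)

# Treat the garbage value as missing before imputing anything.
df.loc["i2", "c3"] = np.nan

# A column that is entirely NaN cannot be imputed from itself, so drop it.
df = df.dropna(axis=1, how="all")

# Hard imputation: one fixed value per column.
df["c0"] = df["c0"].fillna(df["c0"].mean())      # mean
df["c1"] = df["c1"].fillna(df["c1"].median())    # median
df["c3"] = df["c3"].fillna(df["c3"].mode()[0])   # mode (most frequent value)

print(df)
```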

We finally got rid of all the missing values!

Part-II: Random but proportional replacement (RBPR)

Photo by Rakicevic Nenad from Pexels

The above methods, I think, can be described as hard imputation approaches, as they rigidly accept only one value.

Now let’s focus on a “soft” imputation approach.

Soft because it makes use of probabilities.

Here we are not forced to pick a single value.

We’ll replace NaNs randomly in a ratio which is “proportional” to the population without NaNs (the proportion is calculated using probabilities but with a touch of randomness).

An explanation with an example would be better.

Assume a list having 15 elements with one-third of the data missing:

[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, NaN, NaN, NaN, NaN, NaN] (original)

Notice that the original list contains 4 ones, 4 twos, 2 threes, and 5 NaNs.

Thus the ones & twos are in majority while threes are in minority.

Now let's begin by calculating the probabilities and expected values.

prob(1 occurring in NaNs) = (no. of 1s) / (population without NaNs) = 4/10 = 2/5

Expected value/count of 1s = (prob) * (total no. of NaNs) = (2/5) * 5 = 2

Similarly, the expected count of 2s in the NaNs is 2 and that of 3s is 1 (note that 2 + 2 + 1 = 5, which equals the number of NaNs).

Thus our list will now look like this:

[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 1, 1, 2, 2, 3] (replaced_by_proportion)

The ratio of ones, twos, and threes replacing NaNs is thus 2 : 2 : 1.
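The arithmetic above can be checked with a few lines of standard-library Python. The variable names are mine, chosen only for this illustration.

```python
import math
from collections import Counter

original = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3] + [math.nan] * 5

# Population without NaNs, and the number of NaNs to be filled.
observed = [x for x in original if not math.isnan(x)]
n_missing = len(original) - len(observed)

counts = Counter(observed)

# Probability of each value occurring in a NaN slot.
probabilities = {value: count / len(observed) for value, count in counts.items()}

# Expected count of each value among the NaNs: prob * (total no. of NaNs).
expected_counts = {
    value: count * n_missing / len(observed) for value, count in counts.items()
}

print(probabilities)
print(expected_counts)
```

The expected counts add up to the number of NaNs, which is what lets the replacements keep the 2 : 2 : 1 ratio.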

That is, when we have ‘nothing’, it is more likely that ones and twos make up most of it than threes, rather than all of it being a single hard mean/median/mode.

If we simply impute NaNs with the mean (1.8), then our list looks like:

[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 1.8, 1.8, 1.8, 1.8, 1.8] (replaced_by_mean)

Let’s box-plot these three lists and draw conclusions from the same:

Box-plot code (NaN-12.py)

First, the list with proportional replacement has a far better data distribution than the mean-replaced one.

Second, observe how the mean affects the distribution with ‘3’ (a minority): originally not an outlier, it suddenly became one in plot-2 but regained its original status in plot-3.

This shows that plot-3 distribution is less biased.
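The outlier behaviour of ‘3’ can be reproduced without plotting. This is a minimal sketch that assumes the common 1.5 × IQR whisker rule with linearly interpolated quartiles; the helper name outliers is hypothetical.

```python
import numpy as np

original = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]            # NaNs left out
replaced_by_mean = original + [1.8] * 5
replaced_by_proportion = original + [1, 1, 2, 2, 3]


def outliers(values, whis=1.5):
    """Return the points lying beyond whis * IQR from the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return sorted(v for v in values if v < low or v > high)


for name, data in [
    ("original", original),
    ("replaced_by_mean", replaced_by_mean),
    ("replaced_by_proportion", replaced_by_proportion),
]:
    print(name, outliers(data))
```

Under this rule, mean replacement squeezes the inter-quartile range enough to push the threes beyond the upper whisker, while proportional replacement leaves the range as it was.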

Third, this approach is also fairer: it gave ‘3’ (the minority) a “chance” in the missing values, which it would otherwise never have got.

The fourth beauty of this approach is that we have still successfully conserved the mean!

Fifth, because the replacement follows the observed distribution (based on probability), this method should be less likely to over-fit a model than imputing with the hard approach.
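That the mean is conserved is quick to confirm for the example list, as in this small check (the list names are just labels for this sketch):

```python
import math
from statistics import mean

observed = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]             # original list without NaNs
replaced_by_mean = observed + [1.8] * 5
replaced_by_proportion = observed + [1, 1, 2, 2, 3]

means = {
    "observed": mean(observed),
    "replaced_by_mean": mean(replaced_by_mean),
    "replaced_by_proportion": mean(replaced_by_proportion),
}
print(means)

# All three means agree (up to floating-point rounding).
all_equal = all(math.isclose(m, 1.8) for m in means.values())
print(all_equal)
```

In this example the expected counts are whole numbers, so the match is exact; when they are not, the mean would presumably be conserved only approximately.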

Sixth, if NaNs are replaced “randomly”, then with a little logic we can easily calculate that there are 5!/(2!*2!*1!) = 30 different arrangements (permutations) possible:

… 1, 1, 2, 2, 3], … 1, 1, 2, 3, 2], … 1, 1, 3, 2, 2], … 1, 3, 1, 2, 2], … 3, 1, 1, 2, 2] and 25 more!

To make this dynamism even clearer and more intuitive, see this gif with 4 NaNs.
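The count of 30 can be double-checked both by the formula and by brute-force enumeration:

```python
from itertools import permutations
from math import factorial

fill_values = [1, 1, 2, 2, 3]  # the values that replace the 5 NaNs

# Multiset permutation formula: 5! / (2! * 2! * 1!)
by_formula = factorial(5) // (factorial(2) * factorial(2) * factorial(1))

# Enumerate every ordering and keep only the distinct ones.
distinct_arrangements = set(permutations(fill_values))
by_enumeration = len(distinct_arrangements)

print(by_formula, by_enumeration)
```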

Each color represents a different NaN value.

Notice how different arrangements generate different interactions between columns each time we run the code.

Per se, we are not ‘generating’ new data here, as we are only resourcefully utilizing the already available data.

We are only generating newer and newer interactions.

And these fluctuating interactions are the real penalty of NaNs.

Code:

Now let’s code this concept and bound it.

The code for dealing with numerical features can be found here and for categorical features here.

(I am purposely avoiding displaying the code here, as the focus is on the concept and it would needlessly lengthen the article. If you do find the code useful and are [algorithmically] greedy enough to optimize it further, I’ll be glad to hear back from you.)
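For readers who just want the gist of the idea, here is a minimal sketch for a single pandas column. It is not the linked code: the function name fill_nans_proportionally and the rule of flooring the expected counts and drawing any leftover slots at random are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fill_nans_proportionally(series, rng):
    """Hypothetical helper: fill NaNs in proportion to the observed values."""
    mask = series.isna().to_numpy()
    observed = series.dropna()
    n_missing = int(mask.sum())
    if n_missing == 0 or observed.empty:
        return series.copy()

    counts = observed.value_counts()
    values = counts.index.to_numpy()

    # Expected count per value, rounded down so we never overshoot.
    expected = (counts.to_numpy() * n_missing) // len(observed)
    fill = np.repeat(values, expected)

    # Any leftover slots are drawn at random, weighted by observed frequency.
    leftover = n_missing - len(fill)
    if leftover > 0:
        probs = counts.to_numpy() / counts.to_numpy().sum()
        fill = np.concatenate([fill, rng.choice(values, size=leftover, p=probs)])

    # The "random" part: which NaN slot receives which value.
    rng.shuffle(fill)

    filled = series.copy()
    filled.loc[mask] = fill
    return filled


rng = np.random.default_rng(0)
original = pd.Series([1, 1, 1, 1, 2, 2, 2, 2, 3, 3] + [np.nan] * 5)
filled = fill_nans_proportionally(original, rng)
print(filled.tolist())
```

On the 15-element example, the five NaN slots always receive two 1s, two 2s and one 3; only their order changes with the random generator.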

How to use the code?

random.seed = 0
np.random.seed = 0
# important so that results are reproducible

# The df_original is free of impurities (e.g. no '$' or ',' in the price field) and df_original.dtypes are all set.

1. df = df_original.…

2. Call the CountAll() function given in the code

3. categorical list = [all categorical column names in df]

4. numerical list = [all numerical column names in df]

5. run a for loop to fill NaNs through the numerical list, using the Fill_NaNs_Numeric() function

6. run a for loop to fill NaNs through the categorical list, using the Fill_NaNs_Catigorical() function

7. perform a train test split and check the accuracy (do not specify the random_state)

(After step 7 we require a bit of imputation tuning. Ensuring steps 1-7 are in a single cell, repeatedly run it 15-20 times manually to get an idea of the 'range' of accuracies, as it'll keep fluctuating due to randomness. The 7th step helps one estimate the limits of the accuracies and boil down to the "best accuracy".)

8. ("skip" this step if df is extremely huge) run a conditioned while loop again through 1 to 7, this time to directly get our desired (tuned) accuracy.

(One may want to write down and save this 'updated' df for future use, to save oneself from repeating this process.)

Here is a complete example, with all the steps just mentioned, on the famous Iris data-set included in sklearn library.

20% values from each column, including the target, have been randomly deleted.

Then the NaNs in this data-set are imputed using this approach.

By step 7 it’s easily identifiable that after imputation we can tune our recall to at least 0.7 for “each” class of the iris plant, and the same is the condition used in the 8th step.
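A rough sketch of this Iris workflow is given below. It does not use the linked functions: soft_fill and attempt are hypothetical stand-ins, the 30% test split and the DecisionTreeClassifier are my assumptions, and the conditioned loop of step 8 is capped at a fixed number of tries so that it always terminates.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Delete 20% of the values from every column, the target included.
df_missing = load_iris(as_frame=True).frame.astype(float)
n_drop = int(0.2 * len(df_missing))
for col in df_missing.columns:
    rows = rng.choice(df_missing.index.to_numpy(), size=n_drop, replace=False)
    df_missing.loc[rows, col] = np.nan


def soft_fill(frame, rng):
    """Hypothetical stand-in: fill NaNs by sampling from the observed values."""
    filled = frame.copy()
    for col in filled.columns:
        mask = filled[col].isna().to_numpy()
        observed = filled[col].dropna().to_numpy()
        filled.loc[mask, col] = rng.choice(observed, size=int(mask.sum()))
    return filled


def attempt(frame, rng):
    """Steps 1-7 in one go: impute, split, fit, and return per-class recall."""
    df = soft_fill(frame, rng)
    X, y = df.drop(columns="target"), df["target"].astype(int)
    # No random_state here, so the split changes on every call (as in step 7).
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)
    model = DecisionTreeClassifier().fit(X_tr, y_tr)
    recall = recall_score(
        y_te, model.predict(X_te), labels=[0, 1, 2], average=None, zero_division=0
    )
    return df, recall


# Step 7: run several times to see the range of recalls.
recalls = np.array([attempt(df_missing, rng)[1] for _ in range(20)])
print(recalls.min(axis=0), recalls.max(axis=0))

# Step 8: keep trying until every class reaches the target recall (bounded).
target_recall, max_tries = 0.7, 200
tuned_df = None
for _ in range(max_tries):
    candidate, recall = attempt(df_missing, rng)
    if (recall >= target_recall).all():
        tuned_df = candidate  # this is the 'updated' df worth saving
        break
```

Whether the target is reached within the cap depends on the random draws, so tuned_df may stay None; raising max_tries or lowering target_recall are the obvious knobs.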

After running several times, a few reports are as follows:

Soft Imputation on Iris Dataset

Next, for a second confirmation, we plot PR-curves post-tuning, this time with a RandomForestClassifier (n_estimators=100).

[The classes are {0: ‘setosa’, 1: ‘versicolor’, 2: ‘virginica’}.]

Measuring the RBPR’s quality through the area under the curve

These figures look okay.

Now let’s shift our attention to hard imputation.

One of the many classification reports is shown below [observe the 1s (to be discussed shortly) in precision and recall, along with the class imbalance in support]:

              precision    recall  f1-score   support
setosa             1.00      0.52      0.68        25
versicolor         0.45      1.00      0.62         9
virginica          0.67      0.73      0.70        11

The law of large numbers

Now let’s put the law of large numbers to use: with a DecisionTreeClassifier we perform 500 iterations, each with a different randomly removed set of values, over the same Iris dataset without tuning the imputations; that is, we skip the tuning stage to “deliberately” obtain the worst soft scores.

The code is here.
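For a feel of what such an experiment involves, here is a compact sketch. It is not the linked code: the helper names are hypothetical, soft imputation is approximated by sampling from the observed values, hard imputation uses the column mean for features and the mode for the target, and it reports accuracy over 50 iterations rather than precision and recall over 500.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def knock_out(frame, frac, rng):
    """Randomly delete a fraction of the values in every column."""
    out = frame.astype(float)
    n_drop = int(frac * len(out))
    for col in out.columns:
        rows = rng.choice(out.index.to_numpy(), size=n_drop, replace=False)
        out.loc[rows, col] = np.nan
    return out


def hard_impute(frame):
    """Mean for the features, mode for the target."""
    filled = frame.fillna(frame.mean())
    filled["target"] = frame["target"].fillna(frame["target"].mode()[0])
    return filled


def soft_impute(frame, rng):
    """Untuned soft imputation: sample replacements from the observed values."""
    filled = frame.copy()
    for col in filled.columns:
        mask = filled[col].isna().to_numpy()
        observed = filled[col].dropna().to_numpy()
        filled.loc[mask, col] = rng.choice(observed, size=int(mask.sum()))
    return filled


def score(frame, seed):
    X, y = frame.drop(columns="target"), frame["target"].astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


def compare(n_iterations, seed=0):
    rng = np.random.default_rng(seed)
    base = load_iris(as_frame=True).frame
    rows = []
    for i in range(n_iterations):
        damaged = knock_out(base, 0.2, rng)  # a fresh set of deletions each time
        rows.append(
            {
                "hard": score(hard_impute(damaged), i),
                "soft": score(soft_impute(damaged, rng), i),
            }
        )
    return pd.DataFrame(rows).agg(["mean", "std"])


summary = compare(n_iterations=50)
print(summary)
```

Both strategies see the same damaged frame and the same split in each iteration, so the mean and standard deviation columns are directly comparable.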

The final comparisons in terms of precision and recall scores, for both hard and soft imputation, are as follows:

RECALLS

PRECISIONS

Precision and recall come in handy mainly when we observe class imbalance.

Although we initially had a well-balanced target, hard-imputing it with the mode made it imbalanced.

Observe the large number of hard recalls and hard precisions with value = 1.

Here the use of the word “over-fit” would be incorrect, as these are test scores, not train ones.

So the correct way to put it would be: the prophetic hard model already knew what to predict, or the use of the mode caused the two scores to overshoot.

Now observe the soft scores.

Despite no tuning and far fewer values being = 1, the soft scores are still able to catch up/converge with the hard scores (except in two cases, versicolor-recall and setosa-precision, for the obvious reason that a huge number of prophetic 1s forcefully pull up the average).

Also, observe the soft-setosa-recall (despite the large number of 1s in the hard counterpart) and the increased soft-versicolor-precision.

The last thing to note is the overall reduction in variation and standard deviation in the soft approach.

For reference, the f1 scores and accuracy scores are (once again, note the reduced variation and standard deviation in the soft approach):

F1-SCORE

ACCURACY SCORES

Thus we can observe that, in the long run, even without soft imputation tuning we have obtained results which match the performance of the hard imputation strategy.

Thus, after tuning, we can expect even better results.

Conclusion

Why are we doing this? The only reason is to improve our chances of dealing with uncertainty.

We never penalize ourselves for missing values! Whenever we find a missing value we simply anchor our ship in the ‘middle’ of the sea, falsely presuming that our anchor has successfully fathomed the deepest trench of “uncertainties”.

The attempt here is to keep the ship sailing by employing the resources available at hand (the wind speed and direction, the location of stars, the energy of the waves and tides, etc.) to get the best ‘diversified’ catch, for a better return.

Photo by Simon Matzinger from Pexels

(If you identify anything wrong/incorrect, please do respond. Criticism is welcome.)

