Getting Data Ready for Modelling: Feature Engineering, Feature Selection, Dimension Reduction (Part 1)

Encoding: So, What and Why Is Encoding?

Most algorithms we use work with numerical values, whereas more often than not categorical data come in text/string form (male, female) or binned form (0–4, 4–8, etc.). One option is to leave these variables out of the algorithm and use only the numeric data, but in doing so we can lose some critical information. Hence, it is generally a good idea to bring categorical variables into your algorithms by encoding them into numeric values. But first, let's learn a thing or two about our categorical variables.

Types of Variable

Categorical variables come in two broad types: nominal variables, which have no inherent order (e.g. colour), and ordinal variables, which carry a natural ranking (e.g. educational qualification). Correspondingly, there are generally two types of encoding applied to data: Label Encoding and One Hot Encoding (e.g. via pandas.get_dummies).

i) Label Encoding: Each category is given one integer label (e.g. 0, 1, 2, etc.). Label encoding is a handy technique for encoding categorical variables. However, nominal variables encoded this way can end up being misinterpreted as ordinal, so Label Encoding should be applied only to ordinal data (data with some sense of order). That way, even after Label Encoding, the data does not lose its ranking or level of importance.

[Image: example of Label Encoding]

Label Encoding can be performed using sklearn.preprocessing.LabelEncoder.

ii) One Hot Encoding: Label encoding should not be performed on nominal or binary variables, as we cannot rank them by any property; every category is treated equally, so each one gets its own binary column instead. Consider the following two categorical variables and their values as an example:

→ Colour: Blue, Green, Red, Yellow (nominal, no natural order, so One Hot Encoding fits)
→ Educational Qualification: Primary School, Secondary School, Graduate, Post-Graduate, PhD (ordinal, naturally ranked, so Label Encoding fits)

[Image: example of One Hot Encoding]

One Hot Encoding can be performed using pd.get_dummies or sklearn.preprocessing.OneHotEncoder. (A short, illustrative code sketch combining both encoders is included at the end of this part.)

A dataset with more dimensions requires more parameters for the model to learn, and that in turn requires more rows to learn those parameters reliably. The effect of One Hot Encoding is the addition of a number of columns (dimensions). If the number of rows in the dataset is fixed, adding extra dimensions without adding more information for the model to learn from can have a detrimental effect on the eventual model accuracy.

[Image: One Hot Encoding vs Label Encoding]

With that said, Part 1 comes to an end. Do read Part 2, where Feature Extraction and the all-important Dimension Reduction will be discussed.
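Here is the sketch referred to above: a minimal illustration of both encoders using the Colour and Educational Qualification examples. The DataFrame, column names, and rows are hypothetical, made up only to show the two techniques side by side.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative data using the two example variables from the text
# (column names and rows are hypothetical).
df = pd.DataFrame({
    "Colour": ["Blue", "Green", "Red", "Yellow"],      # nominal: no order
    "Qualification": ["Primary School", "Graduate",
                      "Post-Graduate", "PhD"],          # ordinal: ranked
})

# Label Encoding -- suited to the ordinal column.
# Note: LabelEncoder assigns integers in alphabetical order of the labels,
# so for a ranking that matches the real-world order you may prefer an
# explicit mapping or sklearn.preprocessing.OrdinalEncoder with the
# category order spelled out.
le = LabelEncoder()
df["Qualification_encoded"] = le.fit_transform(df["Qualification"])

# One Hot Encoding -- suited to the nominal column.
# Option 1: pandas
colour_dummies = pd.get_dummies(df["Colour"], prefix="Colour")

# Option 2: scikit-learn (use sparse=False on older scikit-learn versions)
ohe = OneHotEncoder(sparse_output=False)
colour_onehot = ohe.fit_transform(df[["Colour"]])

print(df[["Qualification", "Qualification_encoded"]])
print(colour_dummies)   # one 0/1 column per colour
print(colour_onehot)    # the same information as a NumPy array
```

Note how the one-hot step turns a single Colour column into four, which is exactly the growth in dimensions discussed above.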
