Feature Selection Techniques in Machine Learning with Python

Raheel Shaikh, Oct 27, 2018

"With the new day comes new strength and new thoughts" — Eleanor Roosevelt

We all may have faced the problem of identifying the related features in a set of data and removing the irrelevant or less important features which do not contribute much to our target variable, in order to achieve better accuracy for our model.

Feature Selection is one of the core concepts in machine learning which hugely impacts the performance of your model.

The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.

Irrelevant or partially relevant features can negatively impact model performance.

Feature selection and data cleaning should be the first and most important steps of designing your model.

In this post, you will discover feature selection techniques that you can use in Machine Learning.

Feature Selection is the process where you automatically or manually select the features which contribute most to the prediction variable or output you are interested in.

Having irrelevant features in your data can decrease the accuracy of your models and make them learn from features that carry no real signal.

How do you select features, and what are the benefits of performing feature selection before modeling your data? (A small comparison sketch follows this list.)

· Reduces Overfitting: less redundant data means less opportunity to make decisions based on noise.

· Improves Accuracy: less misleading data means modeling accuracy improves.

· Reduces Training Time: fewer data points reduce algorithm complexity and algorithms train faster.
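To make these benefits concrete, here is a minimal comparison sketch. It is my own illustration rather than part of the original experiment: it uses scikit-learn's built-in breast-cancer dataset (so it runs without any download), logistic regression as the model, and keeps 5 features with the SelectKBest helper covered later in this post; all of those choices are assumptions made only for the demonstration.

import time
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#baseline: train on all 30 features
start = time.time()
full = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("all features  :", accuracy_score(y_test, full.predict(X_test)), "in", round(time.time() - start, 2), "s")

#reduced: keep only the 5 features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_classif, k=5).fit(X_train, y_train)
start = time.time()
small = LogisticRegression(max_iter=5000).fit(selector.transform(X_train), y_train)
print("5 best features:", accuracy_score(y_test, small.predict(X_test)), "in", round(time.time() - start, 2), "s")

Your exact numbers will differ, but the reduced model trains faster and can come close to, or even match, the full model.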

I want to share my personal experience with this.

I prepared a model by selecting all the features and got an accuracy of around 65%, which is not great for a predictive model. After doing some feature selection and feature engineering, without making any logical changes to my model code, my accuracy jumped to 81%, which is quite impressive. Now you know why I say feature selection should be the first and most important step of your model design.

Feature Selection Methods:

I will share 3 feature selection techniques that are easy to use and also give good results.

1. Univariate Selection

2. Feature Importance

3. Correlation Matrix with Heatmap

Let's have a look at these techniques one by one with an example. You can download the dataset from here: https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv

Description of variables in the above file:

battery_power: Total energy a battery can store at one time, measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which the microprocessor executes instructions
dual_sim: Has dual SIM support or not
fc: Front camera megapixels
four_g: Has 4G or not
int_memory: Internal memory in gigabytes
m_dep: Mobile depth in cm
mobile_wt: Weight of the mobile phone
n_cores: Number of cores of the processor
pc: Primary camera megapixels
px_height: Pixel resolution height
px_width: Pixel resolution width
ram: Random access memory in megabytes
sc_h: Screen height of mobile in cm
sc_w: Screen width of mobile in cm
talk_time: the longest time that a single battery charge will last when talking
three_g: Has 3G or not
touch_screen: Has a touch screen or not
wifi: Has WiFi or not
price_range: the target variable, with a value of 0 (low cost), 1 (medium cost), 2 (high cost) or 3 (very high cost)
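Before trying the techniques below, it helps to load the file once and take a quick look at it. A minimal sketch, assuming you have saved the downloaded train.csv to the same path used in the code snippets later in this post (adjust the path to wherever you saved it):

import pandas as pd

data = pd.read_csv("D://Blogs//train.csv")     #path used in the snippets below; change to your own location
print(data.shape)                              #number of rows and columns
print(data.head())                             #first few rows of the data
print(data['price_range'].value_counts())      #how many phones fall into each price class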

1. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the chi-squared (chi²) statistical test for non-negative features to select 10 of the best features from the Mobile Price Range Prediction dataset.

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range

#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features

Top 10 Best Features using SelectKBest class
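The snippet above only prints the scores. If you also want to keep just those 10 columns for modeling, the fitted SelectKBest object can filter X for you. A short follow-up sketch, assuming the fit and X objects from the code above; X_new and selected_cols are illustrative names, not from the original post:

X_new = fit.transform(X)                      #numpy array containing only the 10 selected columns
selected_cols = X.columns[fit.get_support()]  #map the selector's boolean mask back to column names
print(selected_cols.tolist())
print(X_new.shape)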

2. Feature Importance

You can get the feature importance of each feature of your dataset by using the feature_importances_ property of the model.

Feature importance gives you a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable.

Feature importance is built into tree-based classifiers; we will be using ExtraTreesClassifier to extract the top 10 features for the dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range

model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_)  #use the inbuilt feature_importances_ attribute of tree based classifiers

#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

Top 10 most important features in data
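Keep in mind that ExtraTreesClassifier is randomized, so the importance scores can change slightly from run to run; passing random_state to the classifier makes them reproducible. To go one step further and keep only the ten highest-scoring columns, something like the following works. A short follow-up sketch, assuming the feat_importances Series and the X DataFrame from the code above; top_features and X_reduced are illustrative names:

top_features = feat_importances.nlargest(10).index.tolist()  #names of the 10 most important columns
X_reduced = X[top_features]                                   #DataFrame restricted to those columns
print(X_reduced.shape)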

3. Correlation Matrix with Heatmap

Correlation states how the features are related to each other or to the target variable.

Correlation can be positive (an increase in one feature's value increases the value of the target variable) or negative (an increase in one feature's value decreases the value of the target variable).

A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range

#get correlations of each feature in the dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))

#plot heat map
g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")
plt.show()

Have a look at the last row, i.e. price_range, and see how it is correlated with the other features: ram is the most highly correlated with price_range, followed by battery_power, px_height and px_width, while m_dep, clock_speed and n_cores appear to be the least correlated with price_range.
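Reading the ranking off a 21-by-21 heatmap can be fiddly, so you can also sort the same correlations numerically. A short follow-up sketch, assuming the corrmat matrix computed in the code above:

#absolute correlation of every feature with the target, strongest first
print(corrmat['price_range'].abs().sort_values(ascending=False))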

In this article we have discovered how to select relevant features from data using the Univariate Selection technique, Feature Importance, and a Correlation Matrix with Heatmap.

If you found this article useful, give it a clap and share it with others.

— Thank You.
