Feature Selection and Dimensionality Reduction

When two features are highly correlated, we keep the one with the higher correlation to the target.

Let’s explore correlations among our features:

# find correlations to target
corr_matrix = train.corr().abs()
print(corr_matrix['target'].sort_values(ascending=False).head(10))

target    1.000000
33        0.373608
65        0.293846
217       0.207215
117       0.197496
91        0.192536
24        0.173096
295       0.170501
73        0.167557
183       0.164146

Here we see the features that are most highly correlated with our target variable.

Feature 33 has the highest correlation to the target, but with a correlation value of only 0.37 it is still only weakly correlated.

We can also check how correlated the features are with one another. Below we visualize the correlation matrix. It looks like none of our features are very highly correlated.
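The original plotting code is not shown in the post; a minimal sketch of how such a heatmap could be produced, assuming seaborn and matplotlib are available, is:

import matplotlib.pyplot as plt
import seaborn as sns

# heatmap of absolute pairwise correlations between all columns of the train set
plt.figure(figsize=(10, 8))
sns.heatmap(train.corr().abs(), cmap='viridis')
plt.title('Correlation Matrix')
plt.show()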

[Figure: Correlation Matrix]

Let’s try to drop features with a pairwise correlation value greater than 0.5:

import numpy as np

# keep only the upper triangle of the correlation matrix so each pair is
# checked once and the self-correlations on the diagonal are ignored
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# find index of feature columns with high correlation
to_drop = [column for column in upper.columns if any(upper[column] > 0.50)]
print('Columns to drop:', len(to_drop))

Columns to drop: 0

We have no columns to drop using highly correlated features.

Let’s continue to explore other strategies.

Univariate feature selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. We can use sklearn’s SelectKBest to keep a specified number of features. This method scores each feature with a statistical test (here the ANOVA F-test via f_classif) and keeps the features with the strongest relationship to the target. Here we will keep the top 100 features.

from sklearn.feature_selection import SelectKBest, f_classif

# feature extraction
k_best = SelectKBest(score_func=f_classif, k=100)
# fit on train set
fit = k_best.fit(X_train, y_train)
# transform train set
univariate_features = fit.transform(X_train)
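If we want to see which columns were kept, the fitted selector exposes a boolean mask. This inspection step is an illustrative addition (it assumes X_train is a pandas DataFrame) and is not part of the original walkthrough:

# boolean mask of the columns kept by SelectKBest
selected_mask = k_best.get_support()
print(X_train.columns[selected_mask][:10])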

Recursive feature elimination

Recursive feature elimination works by eliminating the least important features. It continues recursively until the specified number of features is reached. It can be used with any model that assigns weights to features, either through coef_ or feature_importances_. Here we will use a Random Forest to select the 100 best features:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# define model
rfc = RandomForestClassifier(n_estimators=100)
# feature extraction
rfe = RFE(rfc, n_features_to_select=100)
# fit on train set
fit = rfe.fit(X_train, y_train)
# transform train set
recursive_features = fit.transform(X_train)
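After fitting, the selector records which features survived and how the others were ranked. This quick check is an illustrative addition rather than part of the original post:

# support_ is True for the selected features;
# ranking_ is 1 for selected features, with higher numbers eliminated earlier
print(rfe.support_[:10])
print(rfe.ranking_[:10])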

Feature Selection using SelectFromModel

Like recursive feature elimination, sklearn’s SelectFromModel can be used with any estimator that has a coef_ or feature_importances_ attribute. It removes features whose importance falls below a set threshold.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# define model
rfc = RandomForestClassifier(n_estimators=100)
# feature extraction
select_model = SelectFromModel(rfc)
# fit on train set
fit = select_model.fit(X_train, y_train)
# transform train set
model_features = fit.transform(X_train)
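By default, SelectFromModel keeps the features whose importance is above the mean importance. An explicit threshold can be passed instead; the median-based variant below is an illustrative example, not something used in the original post:

# keep only features whose importance is above the median importance
select_median = SelectFromModel(rfc, threshold='median')
median_features = select_median.fit_transform(X_train, y_train)
print(median_features.shape)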

PCA

PCA (Principal Component Analysis) is a dimensionality reduction technique that projects the data into a lower-dimensional space. While there are many effective dimensionality reduction techniques, PCA is the only example we will explore here. PCA can be useful in many situations, especially those with excessive multicollinearity or where explaining the individual predictors is not a priority. Here we will apply PCA and keep 90% of the variance:

from sklearn.decomposition import PCA

# pca - keep 90% of variance
pca = PCA(0.90)
principal_components = pca.fit_transform(X_train)
principal_df = pd.DataFrame(data=principal_components)
print(principal_df.shape)

(250, 139)

We can see that we are left with 139 components that explain 90% of the variance in our data.
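To see how the retained variance is distributed across the new components, we can inspect the fitted PCA object. This check is an illustrative addition and was not part of the original post:

# per-component share of variance and the cumulative total retained
print(pca.explained_variance_ratio_[:5])
print(pca.explained_variance_ratio_.sum())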

Conclusion

Feature selection is an important part of any machine learning process. Here we explored several methods for feature selection and dimensionality reduction that can aid in improving model performance.
