Identifying the Most Important Features for Student’s Educational Success

For this reason, Linear Regression (both with and without Lasso regularization) was chosen, since these are simple models that allow for quick and easy feature selection.

    import time
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Lasso
    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

    # Sequential forward selection of the top n_features (defined earlier) using Lasso regression
    lasso = SFS(Lasso(alpha=.1), k_features=n_features, forward=True, floating=False,
                verbose=2, scoring='neg_mean_squared_error', cv=3)
    lasso.fit(X_reg_train, y_reg_train)
    time.sleep(1)

    # Refit Lasso on the selected features to inspect their coefficients
    lasso_reg_selected_features = Lasso(alpha=.1)
    lasso_reg_selected_features.fit(X_reg_train[list(lasso.subsets_[n_features]['feature_names'])], y_reg_train)
    features = pd.DataFrame({'Features': lasso.subsets_[n_features]['feature_names'],
                             'Coefs': lasso_reg_selected_features.coef_})
    display(features)

    # Plot performance as features are added, with standard-error bands
    lin_reg_fig = plot_sfs(lasso.get_metric_dict(), kind='std_err')
    plt.show()

Top ten features chosen by Lasso Regression

XGBoost Linear Regression is another variation of linear regression that incrementally builds upon intermediate models to reach a better final model. This gives slightly different results than vanilla linear regression and adds some variance to our selected features. Finally, XGBoost Tree Regression was used to bring a non-linear model into our regression model set. Because it is interpretable and possibly fits the data better than the linear models, XGBoost Tree Regression is a safe final choice for the set. From all of these models, we can easily extract the most important features through feature selection.

    # Fit an XGBoost tree regressor and rank features by importance
    from xgboost import XGBRegressor
    import numpy as np

    xgb = XGBRegressor(booster='gbtree')
    xgb.fit(X_reg_train, y_reg_train)
    feature_importance = pd.DataFrame({'Features': X_reg_train.columns,
                                       'Coefficients': np.array(xgb.feature_importances_, dtype=float)})
    feature_importance.sort_values(by=['Coefficients'], ascending=False, inplace=True)
    # Keep only the n_features most important rows
    feature_importance.drop(feature_importance.tail(len(feature_importance.index) - n_features).index, inplace=True)
    display(feature_importance)

Top ten features chosen by XGBoost Tree Regression

After training all of these models, an ensembling method was used to combine the features selected by each of them. Before running our ensemble, we created a union of all the top ten features selected by each of the models above, then used sequential forward selection with a linear regression model to pick the top 10 features from that union.
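The post does not show how the union itself was assembled, so here is a minimal sketch under the assumption that each model's top-ten feature names are available from the objects above; the names top_feature_lists and union_features are illustrative, not the authors' code:

    # Sketch (assumption): pool the top-ten feature names from each model above
    top_feature_lists = [
        lasso.subsets_[n_features]['feature_names'],   # Lasso SFS selection
        feature_importance['Features'],                # XGBoost tree importances
        # ...plus the vanilla and XGBoost linear regression selections
    ]
    union_features = sorted(set().union(*top_feature_lists))

    # Restrict the regression training data to the union before the final selection
    X_train = X_reg_train[union_features]
    y_train = y_reg_train

Using a set union avoids duplicate columns when the same feature is picked by more than one model.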
    # Final ensemble step: forward selection of 10 features from the union
    from sklearn.linear_model import LinearRegression

    lin_reg = SFS(LinearRegression(), k_features=10, forward=True, floating=False,
                  verbose=2, scoring='neg_mean_squared_error', cv=3)
    lin_reg.fit(X_train, y_train)
    time.sleep(1)
    lin_reg_selected_features = LinearRegression()
    lin_reg_selected_features.fit(X_train[list(lin_reg.subsets_[10]['feature_names'])], y_train)

V. Results

Once we ran our ensemble model, we got the following results:

Top ten features selected by the ensemble

Label description table

As we can see from our results, the three features selected by our model that impacted educational success the most were the student's socio-economic status, whether the school is public or private, and the student's standardized math score. Reflecting on these results, the top features make sense. For example, a student's socio-economic status could affect the amount of help he or she gets outside of school or the opportunity to have up-to-date books.

Two categories emerged from these results: behavior and the number of credits taken in academic classes each appeared multiple times among the top features. Therefore, to make the most substantial impact on student success, we recommend two courses of action. The first should focus on the behavior aspect, which can be seen in features such as X2BEHAVIN and P1PERFORM. Second, the model suggests that students take more academic courses, because features like X3TCREDSCI, X3TCREDHON, and X3TCREDAPSS were present in the top 10 features.

VI. More details
