Building A Logistic Regression in Python, Step by Step

You may have noticed that I over-sampled only on the training data. Because the synthetic observations are created from the training data alone, none of the information in the test data is used to generate them, so no information bleeds from the test data into the model training.
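For context, here is a minimal sketch of that split-then-oversample order, assuming imbalanced-learn's SMOTE and the X, y, os_data_X and os_data_y names used in this post (the exact call made earlier in the post may differ slightly):

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
import pandas as pd

# Split first, so the test rows never reach the oversampler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit SMOTE on the training split only; synthetic rows are built from training rows
os = SMOTE(random_state=0)
os_data_X, os_data_y = os.fit_resample(X_train, y_train.values.ravel())
os_data_X = pd.DataFrame(data=os_data_X, columns=X_train.columns)
os_data_y = pd.DataFrame(data=os_data_y, columns=['y'])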

Recursive Feature Elimination

Recursive Feature Elimination (RFE) is based on the idea of repeatedly constructing a model and choosing either the best or the worst performing feature, setting that feature aside, and then repeating the process with the rest of the features.

This process is applied until all features in the dataset are exhausted.

The goal of RFE is to select features by recursively considering smaller and smaller sets of features.

data_final_vars = data_final.columns.values.tolist()
y = ['y']
X = [i for i in data_final_vars if i not in y]

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=20)
rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)

Figure 16

The RFE has helped us select the following features: “euribor3m”, “job_blue-collar”, “job_housemaid”, “marital_unknown”, “education_illiterate”, “default_no”, “default_unknown”, “contact_cellular”, “contact_telephone”, “month_apr”, “month_aug”, “month_dec”, “month_jul”, “month_jun”, “month_mar”, “month_may”, “month_nov”, “month_oct”, “poutcome_failure”, “poutcome_success”.

cols = ['euribor3m', 'job_blue-collar', 'job_housemaid', 'marital_unknown', 'education_illiterate', 'default_no', 'default_unknown', 'contact_cellular', 'contact_telephone', 'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'poutcome_failure', 'poutcome_success']
X = os_data_X[cols]
y = os_data_y['y']

Implementing the model

import statsmodels.api as sm

logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())

Figure 17

The p-values for most of the variables are smaller than 0.05, except for four ('default_no', 'default_unknown', 'contact_cellular' and 'contact_telephone'), so we will remove them.
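If you want to see exactly which variables fall above the 0.05 threshold, the fitted statsmodels result exposes the p-values directly; a small optional check, using the result object fitted above:

# Variables whose p-values exceed the 0.05 significance level
high_p = result.pvalues[result.pvalues > 0.05]
print(high_p.sort_values(ascending=False))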

cols = ['euribor3m', 'job_blue-collar', 'job_housemaid', 'marital_unknown', 'education_illiterate', 'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'poutcome_failure', 'poutcome_success']
X = os_data_X[cols]
y = os_data_y['y']

logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())

Figure 18

Logistic Regression Model Fitting

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

Figure 19

Predicting the test set results and calculating the accuracy

y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.74

Confusion Matrix

from sklearn.metrics import confusion_matrix

confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)

[[6124 1542]
 [2505 5170]]

The result is telling us that we have 6124 + 5170 correct predictions and 2505 + 1542 incorrect predictions.
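For binary labels 0/1, scikit-learn lays the matrix out as [[tn, fp], [fn, tp]] (rows are actual classes, columns are predicted classes), so the four counts can be unpacked explicitly; a small optional check on the matrix computed above:

# Unpack the 2x2 matrix: tn, fp on the first row; fn, tp on the second
tn, fp, fn, tp = confusion_matrix.ravel()
print('true negatives:', tn)    # correctly predicted non-subscribers
print('false positives:', fp)   # predicted a subscription that did not happen
print('false negatives:', fn)   # missed subscribers
print('true positives:', tp)    # correctly predicted subscribers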

Compute precision, recall, F-measure and support

To quote from Scikit Learn:

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives.

The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.

The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

The F-beta score weights the recall more than the precision by a factor of beta.

beta = 1.0 means recall and precision are equally important.

The support is the number of occurrences of each class in y_test.
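As a sanity check, the same quantities can be computed by hand from the counts unpacked in the sketch above. Note that these are the raw positive-class values, which will not necessarily match the averaged figures printed in the report below:

# Precision, recall and F1 for the positive class, from the raw counts
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # F-beta with beta = 1.0
print('precision: {:.2f}  recall: {:.2f}  f1: {:.2f}'.format(precision, recall, f1))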

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Figure 20

Interpretation: of the term deposits the model promoted across the test set, 74% were deposits the customers actually wanted (precision). Conversely, of the term deposits the customers wanted in the test set, 74% were promoted by the model (recall).

ROC Curve

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Figure 21

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers.

The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
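One detail worth noting: roc_auc_score above is given the hard 0/1 predictions, while the plotted curve is built from predicted probabilities. If you want the area under that full curve instead, you can pass the probabilities to roc_auc_score as well; an optional variation, not what Figure 21 reports:

# AUC computed from predicted probabilities rather than hard class labels
proba_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])
print('ROC AUC from probabilities: {:.2f}'.format(proba_auc))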

The Jupyter notebook used to make this post is available here.

I would be pleased to receive feedback or questions on any of the above.

Reference: Learning Predictive Analytics with Python book.
