Bivariate Logistic Regression Example (Python): Intuitive Understanding and Simple Exercise

Andrew Hershy · Jun 24

Source: Anne Spratt

A logistic regression is a model used to predict the “either-or” of a target variable.
The example we will be working on is:

Target variable: student will pass or fail the exam.
Independent variable: hours spent studying per week.

Logistic models are essentially linear models with an extra step.
In logistic models, the output of a linear regression is run through a “sigmoid function,” which compresses it into a probability between 0 and 1 that can then be thresholded into dichotomous 1’s and 0’s.
If we wanted to predict actual test scores, we would use a linear model.
If we wanted to predict “pass”/ “fail”, we would use a logistic regression model.
Linear (predict numerical test score):

y = b0 + b1x

Logistic (predict “pass”/“fail”):

p = 1 / (1 + e^-(b0 + b1x))

Visualization: in the image below, the straight line is linear, and the “S”-shaped line is logistic.
Logistic regressions have higher accuracy for “either-or” targets because of this shape: logistic curves are “S”-shaped, while linear regressions are straight lines.
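For intuition, the sigmoid in the formula above can be sketched in a few lines of numpy (the `sigmoid` helper name here is my own, not from the article's code):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued linear output b0 + b1*x to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# The linear part can output any real number; the sigmoid squashes it
# into a probability: 0.5 at z = 0, approaching 0 and 1 at the extremes.
print(sigmoid(0))                 # midpoint of the "S" curve
print(sigmoid(-5), sigmoid(5))    # values near 0 and near 1
```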
Understanding the data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_excel(r"C:\Users\xx\Log_test.xlsx")
x = df['W_hours']
y = df['Y']

plt.scatter(x, y)
plt.show()

df.info()
x.plot.hist()
y.plot.hist()
```

There are 23 rows in the dataset.
Below is the distribution of hours studied:

[histogram of W_hours]

Below is the distribution of pass (1)/fail (0):

[histogram of Y]

Data preparation / modeling

Next, we will use the sklearn library to import “LogisticRegression”. Details about the parameters can be found here.
We are converting our single feature into 2 dimensions with the .reshape() function. We define 1 column and leave the number of rows to be inferred from the size of the dataset, so x gets the new shape (23, 1), a vertical array. This is needed to make the sklearn function work properly, since it expects a 2D feature array.
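As a quick illustration of what .reshape(-1, 1) does, here is a throwaway numpy array (not the article's dataframe) of the same length:

```python
import numpy as np

# A 1D array of 23 values, standing in for the W_hours column
hours_1d = np.arange(23)
print(hours_1d.shape)   # a flat 1D array

# reshape(-1, 1): exactly one column; -1 tells numpy to infer
# the row count from the data size
hours_2d = hours_1d.reshape(-1, 1)
print(hours_2d.shape)   # a vertical (rows x 1) array
```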
Use “logreg.fit(x, y)” to fit the regression.

```python
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=1.0, solver='lbfgs', multi_class='ovr')

# Convert the 1D column to a 2D array in numpy
# (.values is needed because newer pandas Series have no .reshape)
x = x.values.reshape(-1, 1)

# Run Logistic Regression
logreg.fit(x, y)
```

Using and visualizing the model

Let’s write a program where we can get the predicted probability of passing and failing by hours studied.
We input the study time in the code below: examples of 12, 16, and 20 hours studied.

```python
print(logreg.predict_proba([[12]]))
print(logreg.predict_proba([[16]]))
print(logreg.predict_proba([[20]]))
```

In each output pair, the value on the left is the probability of failing and the value on the right is the probability of passing.
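Since the article's spreadsheet isn't available here, this minimal sketch uses made-up stand-in numbers (chosen only to mimic the pass/fail shape) to show why the columns come out in that order: predict_proba's columns follow the fitted classes_, which are sorted, so class 0 (fail) is first and class 1 (pass) is second.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: hours studied and pass/fail outcomes
x = np.array([2, 5, 8, 11, 14, 17, 20, 23, 26, 29]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

model = LogisticRegression(C=1.0, solver='lbfgs').fit(x, y)

# Column order matches model.classes_ ([0, 1] here),
# and the two probabilities always sum to 1
print(model.classes_)
probs = model.predict_proba([[16]])[0]
print(probs, probs.sum())
```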
In order to visualize the model, let’s make a loop where we run each half-hour of study time from 0 to 33 through the regression.

```python
hours = np.arange(0, 33, 0.5)
probabilities = []
for i in hours:
    p_fail, p_pass = logreg.predict_proba([[i]])[0]
    probabilities.append(p_pass)

plt.scatter(hours, probabilities)
plt.title("Logistic Regression Model")
plt.xlabel('Hours')
plt.ylabel('Status (1:Pass, 0:Fail)')
plt.show()
```

In this fictional set of data, a student is all but guaranteed to pass by studying more than 20 hours and all but guaranteed to fail by studying fewer than 10; 17 hours is the 50/50 mark.
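The 50/50 mark can also be read directly off the fitted coefficients: p = 0.5 exactly where b0 + b1x = 0, i.e. at x = -b0/b1. A sketch with hypothetical stand-in data (the article's spreadsheet isn't available, so the exact boundary value will differ from 17):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data shaped like the article's: fails at low hours, passes at high
x = np.array([2, 5, 8, 11, 14, 17, 20, 23, 26, 29]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
model = LogisticRegression(solver='lbfgs').fit(x, y)

# Sigmoid output is 0.5 where the linear part is zero: x = -b0/b1
b0 = model.intercept_[0]
b1 = model.coef_[0][0]
boundary = -b0 / b1
print(boundary)
print(model.predict_proba([[boundary]])[0])  # both probabilities at ~0.5
```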
Thanks for reading!

Check out a more detailed logistic regression model predicting cancer here. Find out about linear vs. polynomial regressions here. Find out more about R-squared here.

Please subscribe if you found this helpful.