Moneyball — Linear Regression

As we previously mentioned, scouts at the time relied heavily on Batting Average and, according to DePodesta, undervalued On Base Percentage and Slugging Percentage.

Again, we can use the pandas .corr() function to compute the pairwise correlation between columns.

    podesta = df[['OBP','SLG','BA','RS']]
    podesta.corr(method='pearson')

Note the right-hand column here, which shows RS’s relationship with OBP, SLG, and BA.

We can see that Batting Average is actually the least correlated of the three attributes with Runs Scored.

Slugging Percentage and On Base Percentage are actually correlated more highly, with 0.93 and 0.90, respectively.

This supports DePodesta’s idea that SLG and OBP were undervalued and BA was relatively overvalued.
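The correlation check above can be tried on a small table first. This is only a sketch: the five rows below are invented for illustration and are not taken from the Moneyball dataset, so only the mechanics carry over, not the values.

```python
import pandas as pd

# Hypothetical team-season rows; the numbers are invented for illustration
toy = pd.DataFrame({
    "OBP": [0.310, 0.325, 0.340, 0.355, 0.330],
    "SLG": [0.390, 0.410, 0.440, 0.460, 0.420],
    "BA":  [0.262, 0.255, 0.270, 0.268, 0.259],
    "RS":  [680, 730, 800, 860, 760],
})

# Pairwise Pearson correlation between all columns
corr = toy.corr(method="pearson")

# Correlation of each statistic with RS, strongest first
rs_corr = corr["RS"].drop("RS").sort_values(ascending=False)
print(rs_corr)
```

Pulling out the RS column and sorting it makes it easier to read off which statistic tracks Runs Scored most closely.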

We can apply a bit of machine learning to further verify these claims, first by using univariate selection to pick the features that have the strongest relationship with the output variable (RD in this case).

The scikit-learn library provides the SelectKBest class that allows us to pick a specific number of features.

We will use the chi-squared statistical test for non-negative features to select the best features from our dataset.
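As a side note, because RD is a continuous number rather than a class label, a regression-oriented score function such as scikit-learn's f_regression may also be worth trying with SelectKBest. The sketch below is not the approach used in this post. It runs on synthetic data, and the column names ("strong", "weak", "noise") are hypothetical stand-ins, just to show how the scores come out.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: the target depends mostly on the first column
X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 10.0 * X["strong"] + 3.0 * X["weak"] + rng.normal(scale=1.0, size=n)

# Score each feature on its own against the target and keep the best two
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
```

The scores are computed one feature at a time, so they say nothing about how features behave together in a model.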

First, we need to use moneyball = df.dropna() to remove any null values from our dataset that would interfere with machine learning methods.

Then:

    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2

    # we use RD as the target column
    X = moneyball.iloc[:,6:9]
    y = moneyball.iloc[:,-1]

    # apply SelectKBest class to get best features
    bestfeatures = SelectKBest(score_func=chi2, k=3)
    fit = bestfeatures.fit(X,y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)

    # concat two dataframes for better visualization
    featureScores = pd.concat([dfcolumns,dfscores],axis=1)
    featureScores.columns = ['Feature','Score']
    print(featureScores.nlargest(3,'Score'))

Another method is to use the feature importance that comes built in with tree-based classifiers.

Feature importance gives a score for each feature of the data; the higher the score, the more important or relevant the feature is to the output variable.

    X = moneyball.iloc[:,6:9]   # independent columns
    y = moneyball.iloc[:,-1]    # target column

    from sklearn.ensemble import ExtraTreesClassifier

    model = ExtraTreesClassifier()
    model.fit(X,y)
    print(model.feature_importances_)

    feat_importances = pd.Series(model.feature_importances_, index=X.columns)
    feat_importances.nlargest(3).plot(kind='barh', figsize = (12,8))
    plt.xlabel("Importance", fontsize = 20)
    plt.ylabel("Statistic", fontsize = 20)

Figure: The importance of attributes in determining Run Difference

Model Building

Scikit-learn provides the functionality for us to build our linear regression models.
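Before fitting the real data, it can help to see what intercept_ and coef_ mean on a case where the answer is known. This is a minimal sketch on made-up data: the predictors are random numbers and the target is built from a rule we choose ourselves (2 + 3a - b), so none of it comes from the baseball dataset.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(42)

# Two made-up predictors and a target built from a known linear rule
X = rng.uniform(0.0, 1.0, size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

model = linear_model.LinearRegression()
model.fit(X, y)

print(model.intercept_)  # the constant term, close to 2.0
print(model.coef_)       # one weight per column, close to [3.0, -1.0]
```

The fitted intercept and coefficients are exactly the numbers we will read off to write each model as an equation.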

First of all, we build a model for Runs Scored, predicted using On Base Percentage and Slugging Percentage.

    from sklearn import linear_model

    x = df[['OBP','SLG']].values
    y = df[['RS']].values

    Runs = linear_model.LinearRegression()
    Runs.fit(x,y)

    print(Runs.intercept_)
    print(Runs.coef_)

We can then say that our Runs Scored model takes the form:

RS = -804.627 + (2737.768 × OBP) + (1584.909 × SLG)

Next, we do the same for Runs Allowed, using Opponents’ On Base Percentage and Opponents’ Slugging Percentage.

    x = moneyball[['OOBP','OSLG']].values
    y = moneyball[['RA']].values

    RunsAllowed = linear_model.LinearRegression()
    RunsAllowed.fit(x,y)

    print(RunsAllowed.intercept_)
    print(RunsAllowed.coef_)

We can then say that our Runs Allowed model takes the form:

RA = -775.162 + (3225.004 × OOBP) + (1106.504 × OSLG)

We then need to build a model to predict the number of Wins when given Run Difference.

    x = moneyball[['RD']].values
    y = moneyball[['W']].values

    Wins = linear_model.LinearRegression()
    Wins.fit(x,y)

    print(Wins.intercept_)
    print(Wins.coef_)

We can say that our Wins model takes the form:

W = 84.092 + (0.085 × RD)

Now all we have left to do is get OBP, SLG, OOBP, and OSLG, and simply plug them into the models!

We know which players were transferred in and out after the 2001 season, so we can take 2001 player statistics to build the A’s new 2002 team.

The A’s 2002 team pre-season statistics, taken from 2001:

OBP: 0.339
SLG: 0.430
OOBP: 0.307
OSLG: 0.373

Now let’s create our predictions:

    Runs.predict([[0.339,0.430]])

Our model predicts 805 runs.

    RunsAllowed.predict([[0.307,0.373]])

Our model predicts 628 runs allowed, meaning we get an RD of 177 (805–628), which we can then plug into our Wins model.

    Wins.predict([[177]])

So, in the end, our model predicted 805 Runs Scored, 628 Runs Allowed, and 99 games won, meaning that our model predicted that the A’s would make the playoffs given their team statistics, which they did!

Let’s compare our model to DePodesta’s predictions:

Limitations

There are, of course, some limitations involved in this short project.

For example, because we use each player’s statistics from the previous year, there is no guarantee they will perform at the same level.

Older players’ abilities may regress, while younger players may improve.

It is also important to note that we assume no injuries.
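To get a rough feel for how much these assumptions matter, the three fitted equations can be chained together in plain Python and re-run with slightly different inputs. This is a sketch, not part of the original analysis: the coefficients are the ones reported above, the function names are my own, and the "10 points lower OBP" scenario is purely hypothetical.

```python
def predict_runs_scored(obp, slg):
    # RS = -804.627 + 2737.768*OBP + 1584.909*SLG
    return -804.627 + 2737.768 * obp + 1584.909 * slg

def predict_runs_allowed(oobp, oslg):
    # RA = -775.162 + 3225.004*OOBP + 1106.504*OSLG
    return -775.162 + 3225.004 * oobp + 1106.504 * oslg

def predict_wins(run_diff):
    # W = 84.092 + 0.085*RD
    return 84.092 + 0.085 * run_diff

def season_wins(obp, slg, oobp, oslg):
    # Chain the three models: runs scored minus runs allowed feeds the wins model
    rd = predict_runs_scored(obp, slg) - predict_runs_allowed(oobp, oslg)
    return predict_wins(rd)

# Pre-season 2002 inputs from the post
baseline = season_wins(0.339, 0.430, 0.307, 0.373)

# Hypothetical scenario: team OBP comes in 0.010 lower than in 2001
lower_obp = season_wins(0.329, 0.430, 0.307, 0.373)

print(round(baseline), round(baseline - lower_obp, 2))
```

Under these coefficients, a 0.010 drop in team OBP costs about 27 runs, which works out to a little over two wins, so a few players regressing or getting injured could plausibly shift the prediction by several games.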

Limitations aside, the model’s performance is quite remarkable. It really backs up DePodesta’s theory and explains why the techniques he used were so widely adopted across the entirety of baseball soon after the Oakland A’s ‘Moneyball’ season.

Thank you so much for reading my first post! I intend to keep making posts like these, mainly using data analysis and machine learning.

Any feedback is welcome.
