Basketball Analytics: Predicting Win Shares


But the only way a player can be seen as successful is by winning.

Win shares is a great start to show how much overall success a player brings to their team.

According to Basketball-Reference, win shares is a metric that estimates the number of wins a player produces for his team throughout the season.

To put that into perspective, Kareem Abdul-Jabbar is both the single-season leader in win shares with 25.4 win shares and the all-time career leader in win shares with 273.4 win shares.

So, I’m sure you understand this measure could potentially be a good indicator of how much success a player contributes to their team.

NOTE: All code used for this post can be found on GitHub, organized on this notebook.

Analysis

Objective: Can we predict individual win shares of NBA players using other basketball metrics?

The data used for this analysis is from the 2016–17 and 2017–18 NBA seasons, using Basketball-Reference.

Essentially, I used data from the 2016–2017 NBA season to create our model and stats from the most recent season to predict win shares.

I performed a supervised regression machine learning analysis:

Supervised: The data had win shares and all other basketball metrics included to train and test the models

Regression: Win shares is a continuous variable

Today, there are plenty of new basketball metrics used by fans and analysts worldwide to compare and measure players.

In order to predict win shares, I used a mix of basic and advanced NBA stats:

Exploratory Data Analysis

What did the distribution of win shares originally look like?

Before continuing the analysis, I used basic EDA to see what our data could tell us first hand.

First, I took a quick look at the distribution of win shares:

Right away, we see that the distribution of win shares is skewed to the right.

The majority of NBA players during the 2016–2017 NBA season had win shares of less than 5.

This makes sense since only a selected few, mostly composed of NBA All Stars, will have very high win shares.

For example, the win shares leader of the 2016–2017 NBA season was James Harden, the runner-up for MVP, with 15 win shares.

So, a high win shares total is a sign of elite status.

The odd takeaway from the distribution is the number of players with 0 win shares.

After further analysis, I was able to find that there was a high number of players that barely played at all.

Although these players are on NBA rosters, that doesn’t mean they will have a significant impact on either the team or their own individual stats.

Therefore, I updated the data to only include players that had played at least 30 games and averaged at least 25 minutes of playing time per game.
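The filter described above can be sketched in a few lines of plain Python. This is only an illustration: the rows are made-up placeholders, and the field names `G` (games played) and `MPG` (minutes per game) are assumptions, not necessarily the column names used in the notebook.

```python
# Hypothetical rows; names and numbers are placeholders, not real player data.
players = [
    {"Player": "Player A", "G": 82, "MPG": 34.1, "WS": 9.8},
    {"Player": "Player B", "G": 12, "MPG": 30.0, "WS": 0.4},
    {"Player": "Player C", "G": 70, "MPG": 11.5, "WS": 1.1},
    {"Player": "Player D", "G": 30, "MPG": 25.0, "WS": 3.0},
]

MIN_GAMES = 30               # at least 30 games played
MIN_MINUTES_PER_GAME = 25.0  # at least 25 minutes per game


def is_regular(row):
    """True when a player clears both playing-time cutoffs."""
    return row["G"] >= MIN_GAMES and row["MPG"] >= MIN_MINUTES_PER_GAME


regulars = [row for row in players if is_regular(row)]
print([row["Player"] for row in regulars])
```

Both cutoffs are inclusive here ("at least"), so a player with exactly 30 games and 25 minutes per game stays in the data.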

Here’s what the distribution of win shares looks like after cleaning the data:

There are far fewer players with 0 win shares now! The distribution of win shares is still right-skewed, but it looks a bit closer to a normal distribution.

Were all the features I chose good predictors of win shares?

I analyzed this question by using the Pearson correlation coefficient, which measures the linear correlation between the features and the target (win shares).

It has a value between -1 and +1, where a value close to -1 represents a strong negative relationship and a value close to +1 represents a strong positive relationship:

This discovery was surprising! Of the 13 basketball metrics I originally chose to predict win shares, 4 of them did not have a strong enough correlation (strong = greater than 0.5 or less than -0.…).


Therefore, I excluded those metrics from my model.
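As a rough sketch of this screening step, the Pearson coefficient can be computed by hand and used to keep only the strongly correlated features. The numbers below are toy values, the function names are hypothetical, and treating the cutoff as symmetric (an absolute value above 0.5) is an assumption on my part.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_features(features, target, threshold=0.5):
    """Keep features whose |r| with the target exceeds the threshold."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return {name: r for name, r in scores.items() if abs(r) > threshold}


# Toy values for illustration only (not real NBA data).
win_shares = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "VORP": [0.1, 0.9, 2.1, 2.9, 4.2],      # rises with win shares
    "3P%": [0.36, 0.33, 0.37, 0.34, 0.35],  # roughly flat
}

kept = strong_features(features, win_shares)
print(sorted(kept))
```

In this toy data the 3P% column has a weak negative correlation with the target, so it is dropped while VORP is kept.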

I was stunned to see 3-Point percentage (3P%) have not only a weak correlation, but also a negative correlation.

In today’s NBA, almost every team and player relies heavily on the 3P, so I assumed 3P% would play an important part in the analysis.

Another interesting finding was the weak relationship between games played (G) and win shares.

One would assume that the more games a player is involved in, the higher their win shares would be.

However, just because a player is involved in a game, doesn’t mean they will have success.

Assists (AST) was another metric that I assumed would have a strong relationship with win shares.

Were the remaining features highly correlated with one another?

For the next part of the analysis, I wanted to see whether the features were strongly correlated with one another, which is known as multicollinearity.

Multicollinearity generally occurs when there are high correlations between two or more predictor variables.

Remember, features having a strong correlation with win shares is good, but features being strongly correlated with other features might not be that helpful.

It can even make it tougher to interpret the models we will be creating.

Based on the pair plot above, I found some issues between the features:

VORP vs BPM: These have a very strong positive relationship (correlation of 0.…).


Box Plus Minus (BPM) is a player’s contribution per 100 possessions over the league average when the player was on the court.

Value Over Replacement Player (VORP) takes BPM and translates it into a minutes-based contribution to a team.

In order to calculate VORP, you need to use BPM.

Therefore, they are highly correlated.

Shooting Percentage Metrics: We had three shooting percentage metrics left in the analysis.

Field goal percentage (FG%) is a ratio of field goals made to field goals attempted.

Effective field goal percentage (eFG%) adjusts field goal percentage to account for the fact that three-point field goals count for three points while all other field goals count for only two.

True shooting percentage (TS%) measures a player’s shooting efficiency by taking into account two-point field goals, three-point field goals, and free throws.

I felt that these metrics were closely related (TS% and eFG% had a correlation of 0.…).


Therefore, I decided to only use TS% since it had the highest correlation with win shares.
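To make the overlap between the three shooting metrics concrete, here is a small sketch using the formulas as published in the Basketball-Reference glossary. The function names and the sample stat line are made up for illustration.

```python
def fg_pct(fg, fga):
    """Field goal percentage: makes divided by attempts."""
    return fg / fga


def efg_pct(fg, three_p, fga):
    """Effective FG%: each made three counts as 1.5 made field goals."""
    return (fg + 0.5 * three_p) / fga


def ts_pct(pts, fga, fta):
    """True shooting %: points per shooting possession, free throws included."""
    return pts / (2 * (fga + 0.44 * fta))


# Hypothetical stat line: 8-of-16 from the field (2 threes), 4-of-5 free throws.
fg, three_p, fga, ft, fta = 8, 2, 16, 4, 5
pts = 2 * (fg - three_p) + 3 * three_p + ft  # 22 points

print(round(fg_pct(fg, fga), 3))            # 0.5
print(round(efg_pct(fg, three_p, fga), 3))  # 0.562
print(round(ts_pct(pts, fga, fta), 3))      # 0.604
```

All three are built from the same makes and attempts, which is why they tend to move together and why keeping just one of them is a reasonable choice.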

After exploring the data, I was left with the following basketball metrics as the features that were used for creating the models:

Model Selection and Testing

Since this was a supervised regression machine learning analysis, I created three regression models:

Linear Regression

Support Vector Regression

k-Nearest Neighbors Regression

Using 25% of the data as the test set (and the rest to train the models), here were the results of the models:

And the clear winner of these models: linear regression.

The linear regression model had a lower mean absolute error (lower is better), a lower mean squared error (lower is better), and a higher variance score (higher is better).

This is not to say that the other two models, support vector regression and k-nearest neighbors regression, should be ignored! They still had very impressive results, just not as strong as the linear regression model.
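For readers who want to see what goes into such a comparison, here is a library-free sketch of a 75/25 split and the three scores. It assumes that "variance score" refers to the coefficient of determination (R²), which the post does not state explicitly. The helper names and the numbers are hypothetical.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


train, test = split_rows(list(range(8)))
print(len(train), len(test))  # 6 2

# Toy actual vs. predicted win shares (not results from the post).
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.0]
print(mean_absolute_error(actual, predicted))  # 0.625
print(mean_squared_error(actual, predicted))   # 0.4375
print(r2_score(actual, predicted))
```

Mean absolute error is in the same units as the target, so it reads directly as "win shares off, on average", while mean squared error penalizes large misses more heavily.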

Predictions

As stated above, I used all features of the 2017–2018 NBA season to predict win shares using our models:

Win Share Predictions Using Linear Regression Model

Based on the predictions from the linear regression model, LeBron “LABron” James came out on top! The model predicted him leading the NBA with 14.81 win shares.

However, in reality, LeBron (14 win shares) came in second to 2017–2018 NBA MVP James Harden (15.4 win shares), who was second in the prediction.

Not a bad prediction! Karl-Anthony Towns and Anthony Davis were 3rd and 4th, respectively, in both the linear regression prediction and the recent NBA season.

Win Share Predictions Using Support Vector Regression Model

The support vector regression model had some unusual results.

Based on the predictions from this model, LeBron (15.5 win shares) came out on top again, with Harden a close second (15.4 win shares).

The most exciting outcome of this was that the model accurately predicted Harden’s win shares! He led the 2017–2018 NBA season with 15.4 win shares, the same as the predicted value.

Andre Drummond, who was not among the top 10 players in predicted win shares using the linear regression model, was 3rd in win shares using the support vector regression model.

Towns and Davis dropped down on the prediction list, and Stephen Curry and Kevin Durant didn’t even make the top 10.

Win Share Predictions Using k-Nearest Neighbors Regression Model

The win share predictions using the k-nearest neighbors model were a lot lower than the predictions of the previous models.

For this model, Harden’s predicted win shares were the highest at only 12.3 win shares.
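One plausible reason for the lower ceiling is how k-nearest neighbors regression works: with uniform weights, a prediction is the average of the targets of the k closest training players, so it can never exceed the largest training value, and players at the extreme end can share the same neighbors. The sketch below uses a single made-up feature and k = 3; the value of k actually used in the analysis is not stated in the post.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to the query."""
    by_distance = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = by_distance[:k]
    return sum(train_y[i] for i in nearest) / k


# Toy training data (not real NBA numbers): one feature and the target.
train_vorp = [0.5, 1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
train_ws = [2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 15.0]

# Two hypothetical stars whose feature values lie beyond the training range.
print(knn_predict(train_vorp, train_ws, 8.0))  # 13.0
print(knn_predict(train_vorp, train_ws, 9.5))  # 13.0
```

Both queries pick up the same three neighbors, so they receive the same prediction, and that prediction sits below the best training value of 15.0.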

Another weird outcome of the predicted values was that LeBron, Russell Westbrook, Giannis Antetokounmpo, and Davis were tied with 12.24 win shares!

Conclusion

Here were some of the things I learned from this analysis:

Using mean absolute error as a measure, one of the models we created was able to predict the individual win shares of an NBA player to within 0.761 win shares.

Value Over Replacement Player (VORP) was the most significant factor for predicting win shares.

3-Point % (3P%), games played (G), and assists (AST) did not have strong relationships with win shares.

LeBron James is still KING! Well, not exactly, but he’s 33 years old and still producing!

Although the best model did a great job at predicting win shares of NBA players, there were some issues that I learned along the way that could help me out in future projects:

More data could have helped.

To train the models, I only used data from the 2016–17 NBA season.

I could have used data from past seasons but I wanted the predictions to be true to today’s NBA.

The NBA is not what it was a few years ago.

Like I stated in the analysis, 3-Point shooting has taken over.

More teams rely on shooting the long ball.

In fact, last season’s semifinalists (the Golden State Warriors, the Cleveland Cavaliers, the Houston Rockets, and the Boston Celtics) were among the league leaders in 3-point attempts.

Initially, I was afraid of including data from previous seasons because it would not accurately predict today’s NBA.

However, it could have actually helped, given that 3-Point shooting is not as important as I assumed.

Stat padding doesn’t necessarily mean success.

Russell Westbrook is the king of the triple-double, which is when a player records double-digit totals in three different statistical categories in a single game.

In fact, he finished the previous two seasons averaging a triple double! Westbrook was in the top 10 in all win shares predictions I made using the models, but he was not even top 10 in the past season.

Since his stats were so great, his win shares predictions were a lot higher than they should have been.

Win shares might not be the best stat to measure individual success.

Basketball is a team sport, and measuring individual performances isn’t that easy.

James Harden was MVP for his amazing offensive production, but his defense? NON-EXISTENT.

His teammates helped pick up many of his defensive inabilities.

But what if they didn’t? Or what if the defensive strategy wasn’t a good fit? Harden would still produce great offensive numbers, but if his team lacked in defense his win shares would drop.

Overall, I learned that, yes, one can use basketball metrics to predict win shares.

However, win shares might not be a great metric to measure individual success, since it takes a whole team to win in basketball, not just one player.

Unless you’re LeBron.

GO LAKERS!!!

Please feel free to give feedback or constructive criticism! 🙂 Also, follow me on my personal blog or Twitter @dataosanch.
