Obtaining Insights From Data: Optimizing an NBA Career

The question itself seems open-ended, so in order to better scope this endeavor, I’m going to measure success by dollars earned.

All of the code in this post can be found at https://github.com/aaronfrederick/B-Tier-Basketball-Career-Modeling for reference and additional context.

In order to answer this question of how to maximize earnings over time, we must know which teams have the highest median salary and recommend our player make their way to that team.

Luckily, a kind user on data.world has curated a CSV with each player’s annual salary and team for the past ~30 years.

This allows us to gain our first insight — teams with the highest median salary as shown below.

import matplotlib.pyplot as plt
import seaborn as sns

# pay_df: DataFrame loaded from the data.world CSV, with 'team' and 'salary' columns
# Median salary per team, then the ten highest-paying teams
money_group = pay_df[['team', 'salary']].groupby(by=['team']).median()
top10 = money_group.sort_values(by='salary', ascending=False).head(10)
top10['Team'] = top10.index
top10['Salary'] = top10.salary
# One bar color per team, in the order the teams are plotted
color_order = ['xkcd:cerulean', 'xkcd:ocean', 'xkcd:black', 'xkcd:royal purple', 'xkcd:royal purple', 'xkcd:navy blue', 'xkcd:powder blue', 'xkcd:light maroon', 'xkcd:lightish blue', 'xkcd:navy']
sns.barplot(x=top10.Team, y=top10.Salary, palette=color_order).set_title('Teams with Highest Median Salary')
# Show the salary axis in scientific notation
plt.ticklabel_format(style='sci', axis='y', scilimits=(0, 0))

The code above allows us to visualize our top-salaried teams by median salary, and with a little bit of work adding logos, our final graph is shown below:

[Figure: Teams with Highest Median Salary]

Clearly, the team that our player is on is not the only factor that decides salary, so we will explore the box score statistics to see what lends itself to the highest earnings.

Now that we have answered this first question, it is time to move on to the things more easily controlled by a basketball player: what skills they can improve to earn a higher salary.

After exploring box scores over the last 20 years or so, I will show a trend which leads to the conclusion that we should be aiming to maximize our playing time, as that gives our player the most potential for higher earnings and is much more under the control of said player.

[Figure: Dollars earned per average minutes played in the NBA]

The box scores I obtained were scraped from basketball-reference.com using code on my GitHub.

Having since found a more efficient way of scraping the box scores than the one I used originally, I would highly recommend the Pandas library’s read_html function.

In the code below, I grab an arbitrary box score and clean it for use and aggregation with similar dataframes:

import pandas as pd

url = 'https://www.basketball-reference.com/boxscores/200905240ORL.html'
# read_html returns one DataFrame per HTML table on the page
df_list = pd.read_html(url)
cavs = df_list[0]
# Strip multi-indexed column headers
cavs = pd.DataFrame(cavs.values)
# Drop null columns
cavs = cavs.dropna(axis=1)
# Get new column headers from the 'Reserves' row
colnames = cavs.values[5]
colnames[0] = 'Name'
cavs.columns = colnames

In just a few lines, we can take the basic box scores from a website (with tables in its HTML) and convert them into a cleaned, ML-friendly dataframe.
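The aggregation step mentioned above could look roughly like the sketch below. It assumes each cleaned box score already has a 'Name' column and numeric stat columns; the two small frames, the player names, and the 'MP' and 'PTS' column names are invented for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Two hypothetical cleaned box scores (values invented for illustration)
game1 = pd.DataFrame({"Name": ["Player A", "Player B"], "MP": [34, 12], "PTS": [22, 4]})
game2 = pd.DataFrame({"Name": ["Player A", "Player B"], "MP": [30, 18], "PTS": [15, 9]})

# Stack the per-game frames into one long table
all_games = pd.concat([game1, game2], ignore_index=True)

# Aggregate per player: totals and per-game averages
totals = all_games.groupby("Name")[["MP", "PTS"]].sum()
averages = all_games.groupby("Name")[["MP", "PTS"]].mean()

print(totals)
print(averages)
```

In practice the scraped columns usually arrive as strings, so a conversion to numeric types would be needed before summing or averaging.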

After scraping a few hundred box scores, performing some feature engineering (converting the box scores statistics into rates), and z-score scaling the data, we are ready to examine the effects of each metric on playing time.
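As a rough sketch of the feature engineering and scaling described above, the snippet below converts counting stats into per-minute rates and then z-scores them. The column names ('MP', 'AST', 'PF'), the sample values, and the helper functions per_minute_rates and z_score are assumptions made for illustration; the actual pipeline lives in the repository.

```python
import pandas as pd

# Hypothetical aggregated box score totals (values invented for illustration)
box = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "MP": [30.0, 20.0, 10.0, 40.0],
    "AST": [6, 2, 1, 4],
    "PF": [3, 2, 2, 2],
})

def per_minute_rates(df, stat_cols, minutes_col="MP"):
    """Divide each counting stat by minutes played to get a rate."""
    rates = df[stat_cols].div(df[minutes_col], axis=0)
    return rates.add_suffix("_per_min")

def z_score(df):
    """Scale each column to mean 0 and (population) standard deviation 1."""
    return (df - df.mean()) / df.std(ddof=0)

rates = per_minute_rates(box, ["AST", "PF"])
scaled = z_score(rates)
print(scaled)
```

Scaling every feature to the same spread is what later makes the regression coefficients comparable with one another.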

Below is a heatmap showing the correlations among the features, with our target (playing time) in the bottom row:

[Figure: correlation heatmap of the features and playing time]

The shade of red doesn’t tell us the magnitude of a correlation, only how much evidence there is of a relationship.

In other words, the more deeply colored a box, the more evidence there is of a correlation (of any magnitude) between the row and column features.

We can see from the red along the bottom row (and down the last column) that most of the statistics show a positive correlation with playing time. The exceptions are 3-pointers attempted, rebounds, steals, blocks, and turnovers (all per minute played), which show no discernible correlation, and fouls and offensive rebounds, which show a negative correlation with playing time.
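To make the distinction between magnitude and evidence concrete, here is a minimal sketch that reports both for each feature against playing time. It assumes that "evidence" means something like a p-value, which the heatmap description does not confirm, and it runs on synthetic data with invented column names rather than the scraped box scores.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-in for the scaled features (not real box score data)
rng = np.random.default_rng(0)
n = 200
minutes = rng.normal(size=n)
data = pd.DataFrame({
    "fg_pct": minutes + rng.normal(size=n),          # built to rise with minutes
    "fouls_per_min": -minutes + rng.normal(size=n),  # built to fall with minutes
    "blocks_per_min": rng.normal(size=n),            # built to be unrelated
    "minutes": minutes,
})

# Pearson r gives the sign and magnitude, the p-value gives the evidence
rows = []
for col in data.columns.drop("minutes"):
    r, p = pearsonr(data[col], data["minutes"])
    rows.append({"feature": col, "r": r, "p_value": p})

summary = pd.DataFrame(rows).set_index("feature")
print(summary)
```

A small p-value with a modest r is the situation the heatmap is built to highlight: strong evidence that a relationship exists, without a claim about how large it is.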

It would be nice to end our analysis here, but we can be more specific than just telling our basketball player to improve their 2-point shooting percentage, 3-point shooting percentage, shooting rate, free throw percentage, and assist rate.

By performing a linear regression on these data, we can get into the specifics of which of these stats is most important to cultivate in order to play more, and by proxy earn more.

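After fitting linear regression models with Scikit-Learn’s ElasticNetCV class, we can obtain values for the coefficients that contribute to our linear fit:

from sklearn.linear_model import ElasticNetCV

# X_train, X_test, y_train, y_test: z-scaled features and playing time, split beforehand
# Candidate regularization strengths, kept as in the original list;
# the 0 entry means no regularization and scikit-learn may warn about it
lambdas = [0.01, 0.1, 0, 1, 10, 100, 1000, 10000, 100000]
score = 0
# Try l1_ratio values 0.0, 0.1, ..., 1.0 and keep the best-scoring model
for ratio in range(11):
    model = ElasticNetCV(alphas=lambdas, l1_ratio=ratio/10)
    model.fit(X_train, y_train)
    if model.score(X_test, y_test) > score:
        score = model.score(X_test, y_test)
        optimal_model = model
        optimal_ratio = ratio
print(score)
print(optimal_ratio)

Examining the output of optimal_model.coef_ shows us an array of coefficients, one for each feature, in the order the features appear in our input matrix.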

Because the input matrix was z-scaled (mean of 0, standard deviation of 1), the coefficients can be ranked from high to low to indicate importance in achieving more playing time.
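The ranking step can be sketched as follows. This is an illustration on synthetic data, not the fitted model from the post: the feature names, the "true" coefficients, and the alpha grid are all invented so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV

# Synthetic, z-scaled features with hypothetical names
rng = np.random.default_rng(0)
feature_names = ["fg_pct", "three_pt_pct", "ft_pct", "fouls_per_min"]
X = rng.normal(size=(300, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Invented relationship between the features and playing time
true_coefs = np.array([3.0, 2.0, 1.0, -1.5])
y = X @ true_coefs + rng.normal(scale=0.5, size=300)

model = ElasticNetCV(alphas=[0.01, 0.1, 1.0], l1_ratio=0.5, cv=5)
model.fit(X, y)

# Pair each coefficient with its feature name and sort from high to low
ranking = pd.Series(model.coef_, index=feature_names).sort_values(ascending=False)
print(ranking)
```

Because every feature shares the same scale, a larger positive coefficient means a one-standard-deviation improvement in that stat is associated with a larger gain in predicted playing time.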

From our list earlier, we can narrow the most important skills to improve down to three: field goal percentage, 3-point field goal percentage, and free-throw percentage.

In summary, we want our professional basketball player to improve their shooting.

By improving their shot, our model predicts they will increase their playing time.

By increasing their playing time, our player is increasing their potential for salary.

While our player is cultivating their skills on the court, the agent should be pitching to the top ten teams as shown above to maximize earnings in off-court efforts.
