Assessing NBA player similarity with Machine Learning (R)

In this project, we will look at player statistics from the NBA’s last complete regular season. We will try to reach conclusions that we couldn’t have reached from superficially looking at the data, and hopefully arrive at findings that are interesting to fans and analysts alike.

Data-sets used

We are using player data from the 2017–18 NBA regular season, combining three data-sets from two websites, Basketball-reference and NBAmining.com.

We are looking at player statistics normalized to 36 minutes of game-time rather than per-game averages. If we looked at per-game statistics instead, players who play more minutes, like LeBron James, would look better on paper than players like Steph Curry, who play on elite teams and often don’t play at the end of games that are blow-out wins. There are also arguments for stats per 100 possessions being the best metric for evaluating a player’s impact, as they normalize player stats by the team’s pace.

The third data-set provides miscellaneous statistics (fast break points, points in the paint, etc.) that offer useful insights on player impact and were not included in the first 2 data-sets.

The “traditional” data-set has data on 540 players (all active players) with 29 features, the “advanced” data-set has data on the same players with 27 features, and the “miscellaneous” data-set contains data on 521 players with 14 features.

If you know of any more data-sets or sources of statistics that could be incorporated into this model to better represent player impact, feel free to let me know.

Pre-processing: Data-cleaning & Feature Engineering

This project is completed in R. My R code and plots are publicly available on GitHub.

Merging data sets and fixing mismatches: The first 2 data-sets we use are from the same source and have data on the same players. But when we match on player names, we find that 45 players in the first 2 data-sets don’t seem to have any data in the third, and 26 players in the third have no data in the first 2. We don’t have to worry about these players, as we would have excluded them anyway when we subset our data (more on this step later).

Removing features: After merging the aforementioned data-sets, we remove repeated features (e.g. games played).

One way to subset the data is to set a threshold for the minimum number of minutes per game and include only the players who meet it. We set this threshold at 28.5 minutes per game, which limits our data-set to 104 players.

After combining the 3 tables, feature engineering and sub-setting our data, we have data on 104 players with the following 55 features:

Player: Player name
Pos: Position
Age: Age of player at the start of February 1st of that season
Tm: Team name
G: Games Played
GS: Games Started
MP: Minutes Played over the entire season
MPG: Minutes averaged per game
FG: Field Goals Per 36 Minutes
FGA: Field Goal Attempts Per 36 Minutes
FG%: Field Goal Percentage
3P: 3-Point Field Goals Per 36 Minutes
3PA: 3-Point Field Goal Attempts Per 36 Minutes
3P%: FG% on 3-Pt FGAs
2P: 2-Point Field Goals Per 36 Minutes
2PA: 2-Point Field Goal Attempts Per 36 Minutes
2P%: FG% on 2-Pt FGAs
FT: Free Throws Per 36 Minutes
FTA: Free Throw Attempts Per 36 Minutes
FT%: Free Throw Percentage
ORB: Offensive Rebounds Per 36 Minutes
DRB: Defensive Rebounds Per 36 Minutes
TRB: Total Rebounds Per 36 Minutes
AST: Assists Per 36 Minutes
STL: Steals Per 36 Minutes
BLK: Blocks Per 36 Minutes
TOV: Turnovers Per 36 Minutes
A2TO: Assist-to-turnover ratio
PF: Personal Fouls Per 36 Minutes
PTS: Points Per 36 Minutes
PER: Player Efficiency Rating. A measure of per-minute production standardized such that the league average is 15.
TS%: True Shooting Percentage. A measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.
3PAr: 3-Point Attempt Rate. Percentage of FG attempts from 3-point range.
FTr: Free Throw Attempt Rate. Number of FT attempts per FG attempt.
ORB%: Offensive Rebound Percentage. An estimate of the percentage of available offensive rebounds a player grabbed while he was on the floor.
DRB%: Defensive Rebound Percentage. An estimate of the percentage of available defensive rebounds a player grabbed while he was on the floor.
TRB%: Total Rebound Percentage. An estimate of the percentage of available rebounds a player grabbed while he was on the floor.
AST%: Assist Percentage. An estimate of the percentage of teammate field goals a player assisted while he was on the floor.
STL%: Steal Percentage. An estimate of the percentage of opponent possessions that end with a steal by the player while he was on the floor.
BLK%: Block Percentage. An estimate of the percentage of opponent two-point field goal attempts blocked by the player while he was on the floor.
TOV%: Turnover Percentage. An estimate of turnovers committed per 100 plays.
USG%: Usage Percentage. An estimate of the percentage of team plays used by a player while he was on the floor.
OWS: Offensive Win Shares. An estimate of the number of wins contributed by a player due to his offense.
DWS: Defensive Win Shares. An estimate of the number of wins contributed by a player due to his defense.
WS: Win Shares. An estimate of the number of wins contributed by a player.
WS/48: Win Shares Per 48 Minutes. An estimate of the number of wins contributed by a player per 48 minutes (league average is approximately .100).
OBPM: Offensive Box Plus/Minus. A box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team.
DBPM: Defensive Box Plus/Minus. A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team.
BPM: Box Plus/Minus. A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team.
VORP: Value over Replacement Player. A box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season.
Fast Break PTS: Fast break points per game
Points in Paint: Points scored in the paint per game
Points off TO: Points scored after the opposing team turned over the ball
2nd chance points: Any points scored during a possession after an offensive player has already attempted one shot and missed
Points scored per shot: Calculated by dividing the total points (2P made and 3P made) by the total field goal attempts

Normalizing data: We are also normalizing our data-set. This ensures that features with high-value ranges (such as Points scored) do not have a greater impact on our overall similarity comparison than features with low-value ranges (such as Blocks or Steals). Note that by normalizing the data, we are affecting the outcomes with our own biases. Would it be incorrect to leave the data as it is, or to manipulate it so defensive statistics such as Steals have higher ranges? It wouldn’t. It would just mean that the similarity results we assess at the end would be weighted more towards those particular statistics.

Data Analysis Methods

We are going to apply the following statistical methods to investigate our data:

1) Principal Component Analysis
2) K-means Clustering
3) Hierarchical Clustering

For each method, we are going to compare the players in terms of overall impact (from all available statistics).
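To make the normalization step and the three methods more concrete, here is a minimal sketch in base R. It is not the project's actual code: the data is randomly generated, the player names are placeholders, and the choices of z-score scaling, k = 3 clusters, Euclidean distance and Ward linkage are all assumptions made for illustration only.

```r
# Illustrative sketch only: the numbers below are random, not real NBA data.
set.seed(42)

n_players <- 20  # hypothetical size; the article's subset has 104 players
stats <- data.frame(
  PTS = runif(n_players, 10, 30),   # stand-in for points per 36 minutes
  TRB = runif(n_players, 3, 14),    # stand-in for rebounds per 36 minutes
  AST = runif(n_players, 1, 10),    # stand-in for assists per 36 minutes
  STL = runif(n_players, 0.5, 2.5), # stand-in for steals per 36 minutes
  BLK = runif(n_players, 0.1, 2.5)  # stand-in for blocks per 36 minutes
)
rownames(stats) <- sprintf("Player_%02d", seq_len(n_players))

# Normalize (assumed here to be z-scoring): each column gets mean 0, sd 1,
# so wide-range features like PTS do not dominate distance calculations.
stats_scaled <- scale(stats)

# 1) Principal Component Analysis on the normalized features.
pca <- prcomp(stats_scaled)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# 2) K-means clustering; k = 3 is an arbitrary choice for this sketch.
km <- kmeans(stats_scaled, centers = 3, nstart = 25)

# 3) Hierarchical clustering on Euclidean distances with Ward linkage
#    (both are assumptions, not confirmed by the article).
d <- dist(stats_scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Simple similarity lookup: the closest other player to Player_01.
d_mat <- as.matrix(d)
diag(d_mat) <- Inf  # exclude each player's zero distance to himself
nearest <- names(which.min(d_mat["Player_01", ]))

print(round(var_explained, 3))
print(table(kmeans = km$cluster, hclust = hc_groups))
print(nearest)
```

Swapping `scale()` for a different rescaling, or multiplying selected columns by a weight before clustering, is how the weighting question raised in the normalization paragraph would show up in practice: the distances, and therefore the clusters, shift towards whichever features have the larger ranges.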
