March Madness — Predicting the NCAA Tournament

Each added layer of complexity improved our model's predictive accuracy, and we have visualized that progress below.

[Figure: Submission improvement across iterations — baseline model (season averages), added historical data, added upset margin features, incorporated logistic regression penalty.]

Some of the major steps were incorporating the tournament's historical data into our model and using turnover and rebound margin features to help predict upsets. These metrics rank among our highest-weighted features, a testament to their importance in determining tournament outcomes. Although we experimented with other models, including an MLP and a gradient-boosted decision tree, our original logistic regression produced the most accurate results by log-loss score. To improve our final result further, we fine-tuned the logistic regression, adding an ideal penalty value to constrain the model. This led to our best log-loss score of 0.425 with an accuracy of 73.9%. We were very pleased with this result: compared to the baseline model, it was a 2.1% increase in accuracy. Measured against the Kaggle public leaderboard, our late submission would be in line with a top-seven rank.

VI. Conclusion

Although much of the focus each year is on the statistics of current teams, we determined that historical data is a large factor in identifying key correlations among winning teams and in recognizing the sustained success of premier programs. March Madness is a tournament known for its incredible upsets and unpredictable nature; however, we found that with certain metrics, such as rebound and turnover margin, it may be possible to tell when a heavy favorite should be on upset alert. Using our logistic regression model, we identified the most important features and improved considerably on the baseline accuracy. Along the way, we were able to learn a great amount 
through seeing which approaches worked and which didn't. We saw immense growth in understanding how to incorporate temporal data into a complex model, as well as in exploring correlations in our dataset and extracting new features.
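The penalty tuning and log-loss evaluation described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, assuming an L2-penalized logistic regression tuned over the inverse regularization strength C; the feature names and synthetic data are ours, not the project's actual dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import log_loss, accuracy_score

rng = np.random.default_rng(0)
# Toy matchup features standing in for e.g. rebound margin,
# turnover margin, and seed difference (illustrative only).
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Search over C (smaller C = stronger penalty), scoring by log loss.
search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X_tr, y_tr)
best = search.best_estimator_

proba = best.predict_proba(X_te)
print("log loss:", log_loss(y_te, proba))
print("accuracy:", accuracy_score(y_te, best.predict(X_te)))
# Coefficient magnitudes hint at which margins the model weights most.
print("coefficients:", best.coef_[0])
```

A smaller C constrains the coefficients more aggressively, which is one way a penalty can be tuned to trade variance for bias and lower the held-out log loss.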
