Rating Sports Teams — Maximizing A Generic System

The Glicko system was invented in 1995 by Professor Mark Glickman, whom I mentioned earlier.

It’s worth reading the paper itself, but I’ll do my best to summarize.

Glicko introduces two main components.

One, it introduces uncertainty.

This is extremely useful.

An overall rating in the Elo ratings system is just captured by one number.

Let’s take a player (or team) that’s rated 1750.

The issue is that you can’t be 100% certain that the player’s rating, on a random day, in random conditions, two weeks after the last match, is exactly 1750.

Instead, you can be pretty sure that the player’s “true” rating is between 1650 and 1850.

In the Glicko system, the ratings deviation term would be 50 in this case.

That represents roughly a 95% confidence interval: the player’s true ability is within 100 points (two ratings deviations) of 1750.
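As a quick sketch of that arithmetic, the band is just the rating plus or minus two ratings deviations. The function name is made up for illustration and is not part of any Glicko library.

```python
def rating_interval(rating, rd, width=2.0):
    """Return the (low, high) band `width` ratings deviations around a rating."""
    return rating - width * rd, rating + width * rd


low, high = rating_interval(1750, 50)
print(low, high)  # 1650.0 1850.0
```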

Uncertainty can also be modified.

If a player hasn’t played in 3 months, we can increase uncertainty.

If a player has very erratic match results, Glicko keeps uncertainty high.
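The layoff adjustment can be sketched with the rule from the original Glicko paper, which widens the ratings deviation for every ratings period without a match. Glicko 2 handles this through a volatility term instead, so treat this as a simplified illustration. The constant `c` and the cap of 350 are example values, not the ones tuned for this post.

```python
import math


def inflate_rd(rd, periods_inactive, c=34.6, max_rd=350.0):
    """Grow a ratings deviation for each ratings period without a match.

    Original Glicko rule: RD' = min(sqrt(RD^2 + c^2 * t), max_rd).
    `c` and `max_rd` are assumed example values.
    """
    return min(math.sqrt(rd ** 2 + c ** 2 * periods_inactive), max_rd)


print(inflate_rd(50, 0))  # 50.0, no layoff means no change
print(inflate_rd(50, 3))  # a bit over 78 after three idle periods
```

A larger `c` means uncertainty returns faster after a layoff, so it is the knob to tune per sport.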

The other introduction is the concept of a “ratings period”.

As a mild spoiler, I’ve empirically found that the ratings period works better in some sports than others.

Glicko treats all matches in a ratings period as having occurred simultaneously.

This is useful in golf, when you’re competing against over 100 other players simultaneously.

Not only does it reduce calculation time, but by using all the other players as reference, Glicko is able to nail the context of a round score.
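To make the "simultaneous" idea concrete, here is a minimal sketch of a one-period update using the original Glicko formulas (not Glicko 2, which adds a volatility step, and not the sublee implementation). Every result in the period is scored against the player's rating from before the period, and the rating moves once at the end.

```python
import math

Q = math.log(10) / 400


def g(rd):
    """Discount factor for an opponent's ratings deviation."""
    return 1 / math.sqrt(1 + 3 * Q ** 2 * rd ** 2 / math.pi ** 2)


def expected(r, opp_r, opp_rd):
    """Expected score against one opponent."""
    return 1 / (1 + 10 ** (-g(opp_rd) * (r - opp_r) / 400))


def update_period(r, rd, results):
    """Apply one ratings period.

    results: list of (opp_rating, opp_rd, score), score in {1, 0.5, 0}.
    All results use the pre-period rating, as if played at the same time.
    """
    if not results:
        return r, rd
    d2_inv = Q ** 2 * sum(
        g(o_rd) ** 2 * expected(r, o_r, o_rd) * (1 - expected(r, o_r, o_rd))
        for o_r, o_rd, _ in results
    )
    precision = 1 / rd ** 2 + d2_inv
    delta = sum(g(o_rd) * (s - expected(r, o_r, o_rd)) for o_r, o_rd, s in results)
    return r + Q / precision * delta, math.sqrt(1 / precision)


# Worked example from Glickman's paper: a 1500 player with RD 200
# beats a 1400 (RD 30), then loses to a 1550 (RD 100) and a 1700 (RD 300).
new_r, new_rd = update_period(1500, 200, [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)])
print(round(new_r), round(new_rd, 1))  # roughly 1464 and 151.4
```

In a golf round, `results` would hold one entry per other player in the field, which is why a single pass per period is cheap.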

On the other hand, in college basketball, it works less well.

I found that a ratings period of three games works best there.

It doesn’t make much sense to use a ratings period in basketball.

Depending on how you group past games into threes, you might end up with different ratings.

You’re also waiting three games to update the rankings, so they often leave out the most recent game or two.
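Both problems show up in a small sketch of the grouping step. The helper below is hypothetical (it is not from my code or from any Glicko library): it splits a chronological list of games into periods of three, with an optional offset for where the first period starts, and returns the leftover games that have not been rated yet.

```python
def split_into_periods(games, size=3, offset=0):
    """Split a chronological list of games into ratings periods.

    The first `offset` games form a short opening period. Games at the
    end that do not fill a period are returned as `pending` (unrated).
    """
    head = [games[:offset]] if offset else []
    body = games[offset:]
    full = len(body) // size * size
    periods = head + [body[i:i + size] for i in range(0, full, size)]
    return periods, body[full:]


games = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
for offset in (0, 1):
    periods, pending = split_into_periods(games, offset=offset)
    print(offset, periods, pending)
```

With offset 0, games g7 and g8 are still pending; with offset 1, the periods contain different games and only g8 is pending. Same schedule, different inputs to the ratings.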

An implementation note: I’m using Glicko 2, but I refer to it as Glicko.

I heavily referenced this implementation by sublee.

College Basketball Results

Error per game, averaged from 2003–2019. Less is better!

In the graph above I show the error per game of the ratings systems I tried.

The weekly error is averaged over all games during that week of all seasons from 2003 to 2019.

First of all: Wow!

Improved Elo, when finely tuned, does very well.

Crucially, it maintains a clear gap with the other ratings systems late into the season.

In sports models, when you’re fighting for decimal places, that’s really impressive!

An important note is that it even beat Glicko without preseason rankings, which is something I could’ve implemented in the Glicko system as well.

Improved Elo seems better in any case!

In my opinion, the weirdest finding is the effectiveness of my improved Elo implementation early in the season.

The error early in the season is smaller than it is late in the season.

Do teams develop and change skill more during the season (thus being harder to predict) than during the longer offseason?

Even with offseason roster turnover?

I could also be overfitting my parameters.

I think this is worth exploring further, because it could also be mid-season injuries or evidence that coaching matters.

I’m somewhat surprised that after much tuning, I couldn’t get Glicko to come close to beating improved Elo.

It’s supposed to be a better system!

Only after plotting this did I realize the shortcomings of Glicko in the college basketball universe.

Let’s not throw it out yet though!

Also, I isolated the individual improvements to Elo to show that they all contributed:

Improvement Isolation

I didn’t fully optimize the individual improvements, but the point is that they all contributed something.

It’s also impressive that as a whole they seem the same or better than the sum of their parts.

I kind of expected decaying K and priors to be two solutions to the same problem, but it turns out, they’re both useful even in concert.

The code I used, sans data, can be found here.

Golf Results

Note: .301 is equivalent to guessing 50% every time.

Round by round scores are very random so it’s tough to beat.
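The .301 figure is what a coin-flip guess scores under a base-10 log loss, since -log10(0.5) is about 0.301. The sketch below assumes that is the error metric in the chart; the function name is illustrative.

```python
import math


def log10_loss(predicted_prob, won):
    """Base-10 log loss for one matchup (assumed metric, lower is better)."""
    p = predicted_prob if won else 1 - predicted_prob
    return -math.log10(p)


print(round(log10_loss(0.5, True), 3))   # 0.301
print(round(log10_loss(0.5, False), 3))  # 0.301
```

A confident correct pick scores lower than .301, and a confident wrong pick scores much higher, which is why beating the coin flip on noisy round scores is hard.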

In golf, to my surprise, I couldn’t get the improved Elo algorithm to come close to Glicko!

In my opinion, that’s an important finding.

It seems the Glicko system is clearly better in many-competitor events, and Elo is better in head to head over short seasons.

For golf, I gave adjusting Improved Elo my absolute best effort.

There were many manipulations of K I tried, such as changing it round by round, increasing it after long layoffs, and adjusting it based on field size.

However, no matter what, the order of magnitude of the improvements suggested I could never beat Glicko.

Glicko, to the best of my knowledge, is superior.

Even if I could get Elo to match Glicko, Elo requires tweaking many more parameters and more computation.

Note: If you’re wondering how I used Log 5 in golf, I did it by summing the result of every matchup in a tournament.

So in a 144 person tournament, 1 round involves 143 matchups for every player.

The very best players (i.e., Tiger Woods) maintain close to a 70% win rate in these matchups.

By 2019 some players have over 150,000 matchups.

Since it’s a career metric, it doesn’t weight recent form over matchups that happened 18 years ago.

Taking the Log 5 over a smaller historical window would probably be better, like past 3 years, but it shouldn’t beat Elo.
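The matchup counting and the Log 5 formula described in this note can be sketched as follows. This is a simplified illustration, not the exact code behind the results: the function names are made up, and counting a tie as half a win is an assumption.

```python
def round_matchups(scores):
    """Return {player: (wins, matchups)} for one round; lower score wins.

    Every player is compared against every other player in the field.
    Ties count as half a win (an assumption for this sketch).
    """
    out = {}
    for player, score in scores.items():
        wins = 0.0
        for other, other_score in scores.items():
            if other == player:
                continue
            if score < other_score:
                wins += 1.0
            elif score == other_score:
                wins += 0.5
        out[player] = (wins, len(scores) - 1)
    return out


def log5(p_a, p_b):
    """Log 5 estimate that A beats B, given each player's overall win rate.

    Undefined (division by zero) when both rates are 0 or both are 1.
    """
    return p_a * (1 - p_b) / (p_a * (1 - p_b) + p_b * (1 - p_a))


print(round_matchups({"A": 68, "B": 70, "C": 70, "D": 73}))
print(log5(0.70, 0.50))  # about 0.70: a 70% player against an average one
```

Career win rate is then total wins divided by total matchups across all rounds, which is why old results carry the same weight as recent ones.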

I haven’t seen it spelled out anywhere that Bayesian ranking systems like Glicko have shortcomings in head to head matches over short seasons.

I can’t rule out there are adjustments that I could make to improve Glicko’s performance there.

Microsoft uses TrueSkill and TrueSkill 2 in most of their games, both of which are closely related to Glicko.

This makes sense, because many online gaming multiplayer matches involve more than two players.

From my experiments, game designers might be better off employing a version of Elo in head to head games.

There’s another thing that interested me.

People who pay attention to golf (I understand, it might be a small subset of this audience) would look at the Glicko ratings and tilt their heads.

They don’t make any sense.

Typically dominant golfers like Dustin Johnson and Rory McIlroy aren’t at the top of the rankings.

Just by looking at it, it looks like Glicko weighs recent results too heavily.

But the error is smaller than the Elo rankings, which have Dustin and Rory where you would expect, #1 and #2.

In the graph below, I compare the correlation of 4 ranking systems.

One is the Official World Golf Ranking.

It’s not the best system, but has been used for a really long time and is familiar to any golf fan.

It weights winning golf tournaments and doing well in big tournaments more heavily than it probably should (therefore it loves Brooks Koepka).

The other ranking system, “DG Ranking”, is the set of rankings created by Data Golf.

Data Golf does great work and I think even non-golf fans would enjoy seeing some of the visualizations they’ve created.

Their overall rankings are usually close to Vegas odds and are a good reference to see how golfers compare at any given time.

Correlations of top 50 ranked players with different systems

As you can see, Improved Elo and Data Golf have 97% correlation!

That’s pretty incredible.

It suggests that they probably have extremely similar predictive qualities.

I don’t say this lightly, but the correlation between Data Golf rankings and Glicko rankings absolutely shocked me.

45%!

That means they will almost surely give you different matchup probabilities for the same matchup between two golfers.

As I showed above, the low Glicko error means you have to pay attention to it.
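For reference, a correlation between two systems can be computed along these lines. This sketch assumes a plain Pearson correlation over the two systems' values for the same players; the chart above may have used a different measure, and the numbers below are made up purely to show the call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical ratings for the same five golfers under two systems.
system_a = [1900, 1880, 1850, 1845, 1800]
system_b = [2.1, 2.0, 1.6, 1.7, 1.2]
print(round(pearson(system_a, system_b), 2))
```

The inputs must be aligned, meaning position i in both lists refers to the same golfer.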

There are many factors that could be at play here.

One, I could have made a miscalculation.

It’s possible, but I don’t think it’s likely because it’s consistent with results produced by Mark Glickman.

Golf is an ideal sport for his ratings.

Two, Glicko might do better for certain types of matchups.

Brief experiments I’ve done show that it’s better at predicting matchups between less experienced golfers than Elo (which might be why the margin between the two decreases as more ratings are made).

It also might specialize in matchups between mid-tier golfers and not be able to predict the golfers fans care about more at the upper end.

This would require more experimentation, and that’s beyond the scope of what I wanted to achieve here.

The code I used for golf analysis can be found here.

Future Direction

There are a couple of directions I can go from here.

This is kind of just a wishlist; I can’t guarantee any of them:

I want to implement the Stephenson system, which is supposed to have improved on Glicko.

I didn’t try it here because the only implementation I’ve found is coded in R.

I’m not an R guy, but I think I might try to either write a Python version or learn enough R to use it.

I want to make these rankings available to the public.

It’s really just a matter of automating some things and creating UI.

It’d be cool to figure out the relative strengths of Glicko vs. Elo, and combine them into a single model that draws from both of them.

If, for example, Elo becomes better after 1,000 golf rounds played, you could weight it more heavily for more experienced golfers.

The true power of these rankings will be when you factor in more advanced inputs.

Controlling for travel is one priority I have in golf since I have course location data.

Sometimes golfers fly to Dubai in between two tournaments in the United States for example.

Also, the PGA’s strokes gained data would really add predictive power, but I’m not sure the best way to access it.

I might have to switch to hockey for advanced analytics.

