Gridiron in the Cloud: Hacking on Football with GCP

Elissa Lerner
Jan 28

If you had a week to wring as much out of Google Cloud as you could, what might you be able to build?

And could you get it done in time for Thursday Night Football?

That’s what a group of front-end software engineers and UXers attempted to find out.

Though we each regularly work on different Google Cloud products, our wider team sets aside a Make Week toward the end of the year to experiment with different products, test out new engineering and UX skills, work with different colleagues, and generally hack on a project using GCP (read: without internal tools).

The idea seemed simple: come up with a machine learning model that could predict the next play of a football game in a real-time UI.

At best, we’d invent the greatest sports bar app of all time (our initial goal was 90% accuracy and 90% confidence on each play call — a surefire tell that we were not expert data scientists).

At worst, we’d have greater empathy for the challenges of predictive modeling.

In reality, we’d probably land somewhere in the middle — maybe not with the greatest model, but at least with one that could deliver something of value in a real-time UI.

In practice, this would be pretty complicated.

We’d need to create an ingestion and ETL pipeline, build a functional data warehouse, train and serve a model, and develop a real-time UI nearly simultaneously in order to pull this off in under a week.

Fortunately, each person gravitated towards different project areas based on where they had the most skill or energy to contribute.

With the team thus distributed, we had about four days to get it all working in time for the December 13th game between the Kansas City Chiefs and Los Angeles Chargers.

If all you want to know is whether we hit our audaciously high threshold, surprise: we didn’t.

But if you want to know how you might go about building a machine learning model with a real-time predictive UI in a week (and some of the stumbling blocks you might hit along the way), then let’s dive in.

Ingestion and exploration

Given some prior examples, we had a rough idea of how to get started.

Before anything could begin, we’d need data.

We used Sportradar’s API to get our base data set of NFL regular season games from 2011–2018, which came out to 1,529 games, or 255,834 plays (we didn’t end up using data from 2013 and 2014 in our model due to inconsistencies).

While this is a pretty small data set, it was enough for us to play around with training and testing a nascent model.

To get each game’s play-by-play data into Google Cloud Storage (GCS), we used a bash script — mostly because it was very quick to write and execute.

GCS became the primary reference for the rest of our work, letting us store the raw data nearby while also giving us the ability to pivot to BigQuery for exploratory analysis and Cloud Machine Learning Engine (CMLE) for modeling (which in turn would feed into Firestore, the backbone of our UI).

But raw play-by-play data is hard to query and model, so we built a Dataflow job that transformed the data from an unstructured JSON format to structured CSV files, allowing us to extract and develop features of interest.
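As a rough illustration of that transformation, here is a minimal sketch in plain Python that flattens one nested game document into CSV rows. The field names (`periods`, `plays`, `yards_to_go`, and so on) are assumptions made for the example, not Sportradar’s actual schema, and in a Dataflow pipeline this logic would sit inside a transform step rather than a standalone function.

```python
import csv
import io

# Hypothetical column names; the real play-by-play schema differs.
COLUMNS = ["game_id", "quarter", "down", "yards_to_go", "yard_line", "play_type"]


def flatten_game(game):
    """Yield one flat dict per play from a nested game document."""
    for period in game.get("periods", []):
        for play in period.get("plays", []):
            yield {
                "game_id": game.get("id"),
                "quarter": period.get("number"),
                "down": play.get("down"),
                "yards_to_go": play.get("yards_to_go"),
                "yard_line": play.get("yard_line"),
                "play_type": play.get("play_type"),
            }


def game_to_csv(game):
    """Render every play of one game as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(flatten_game(game))
    return buf.getvalue()
```

Missing fields come out as empty cells, which keeps every row the same width for a later bulk load.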

While Dataflow made the process repeatable and fairly quick, it’s optimized for much bigger data problems than what we had.

We weren’t really dealing with the scale that it’s meant for, and in hindsight we realized that this work actually would have been faster on a local machine.

From GCS, we imported everything into BigQuery using the bq load job, which zapped 1,500 files into BigQuery in about five seconds of wall clock time.

(While it’s possible to load directly from Dataflow, it didn’t let us use a recursive wildcard path to match specific patterns within the bucket; hence, bq load.)

BigQuery allowed our team to conduct some exploratory analysis in conjunction with the development of our model.

Since we were mostly interested in finding patterns around runs and passes, most of our BigQuery work revolved around trying to understand these plays better.

For instance, given the received wisdom that the NFL has trended toward a passing game in recent years, we wondered if we ought to weight our model to account for this shift.

Our data showed that while there has been a shift, depending on how you look at it, it’s been pretty gradual (of all rush and pass plays, the percent that are passes has grown from 46% to 48%).
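The kind of aggregate behind that number can be sketched in a few lines of Python. We ran this sort of question as BigQuery queries; the version below is only an equivalent illustration over in-memory rows, and the `season` and `play_type` keys are assumed names.

```python
from collections import defaultdict


def pass_share_by_season(plays):
    """Return {season: passes / (passes + rushes)}, ignoring other play types."""
    counts = defaultdict(lambda: {"pass": 0, "rush": 0})
    for play in plays:
        if play["play_type"] in ("pass", "rush"):
            counts[play["season"]][play["play_type"]] += 1
    return {
        season: c["pass"] / (c["pass"] + c["rush"])
        for season, c in sorted(counts.items())
    }
```

Adding a team key to the grouping would give the per-team breakdown in the same way.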

That said, it’s also varied quite a bit by team and by year.

Figure: Percent of all rush and pass plays across the NFL since 2011

Figure: Ratio of rushes to passes broken out by team

Moreover, we found that while completed passes have been getting longer (one way to think about the trend toward passing), runs have as well, and those metrics also vary by team and by year.

Figure: Total rush yardage and pass yardage across the NFL

Figure: Total pass yardage by team

Since we didn’t have time to train a unique model for each of the 32 teams (or investigate by conference), we decided not to weight the model for passing.

Modeling and refinement

With our pipeline established, we could start building our model.

While there are several tools available for building a model, the team members working on the model had the most experience and familiarity with TensorFlow.

Conveniently, TensorFlow also has less latency than other modeling tools (e.g., BQML), and that suited us well for our real-time purposes.

Preprocess: We began by taking the CSV data extracted from the Sportradar API and preprocessing it using tf.Transform.

Using tf.Transform let us write a Dataflow job that was able to normalize across our entire data set for all numeric data, as well as create pre-defined vocabularies for our categorical data (such as the current offensive / defensive team on a play).

After normalizing the data, our Dataflow job wrote those preprocessed examples into TFRecords so that they could be read easily and efficiently during model training.
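To make the idea concrete, here is a minimal sketch of full-pass preprocessing in plain Python rather than tf.Transform itself: one pass collects the mean and standard deviation of each numeric column and a vocabulary for each categorical column, and a second function applies them to a single row. The column names and the choice of z-score scaling are assumptions for illustration.

```python
import statistics


def fit_preprocessing(rows, numeric_cols, categorical_cols):
    """One full pass over the data to collect the statistics needed later."""
    stats = {}
    for col in numeric_cols:
        values = [row[col] for row in rows]
        stats[col] = (statistics.fmean(values), statistics.pstdev(values))
    vocabs = {}
    for col in categorical_cols:
        distinct = sorted({row[col] for row in rows})
        vocabs[col] = {value: index for index, value in enumerate(distinct)}
    return stats, vocabs


def apply_preprocessing(row, stats, vocabs):
    """Scale numeric columns to z-scores and map categories to integer ids."""
    out = {}
    for col, (mean, std) in stats.items():
        out[col] = (row[col] - mean) / std if std else 0.0
    for col, vocab in vocabs.items():
        out[col] = vocab.get(row[col], -1)  # -1 marks an out-of-vocabulary value
    return out
```

Because the statistics are computed once and then reused, the same `apply_preprocessing` call can be used on training rows and on a live play at prediction time.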

Train: Once we had preprocessed all of our data, we began training the model using TensorFlow estimators.

The TensorFlow estimator library ships with a pre-built linear classifier that we were able to train on our data set.

Given our use of a linear model and the distributed training built into TensorFlow’s estimators, we knew our training could easily scale to far more examples.

We adapted our training program largely from the CloudML census estimator code sample.

Using a canned estimator made it particularly easy to tweak the sample code to suit our needs (a nice perk when you’re trying to hit a tight deadline).

Test: In addition to the common practice of testing the model against a randomly selected 30% of the data, we created a simulated game replay feature on play-by-play data.

We checked that the model was not overfitting (our validation accuracy tracked well with training accuracy), and confirmed this by running simulated real-time replays that the model had never seen before.

By testing the model in this way, we could validate that it did have predictive accuracy, and was not overfitting the training data.
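The randomized hold-out part of this can be sketched as follows. This is a generic illustration rather than our training code: `predict` stands in for any function that maps features to a play type, and the seed is an arbitrary choice that only makes the split repeatable.

```python
import random


def split_examples(examples, test_fraction=0.3, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, examples):
    """Share of (features, label) pairs where predict(features) equals label."""
    if not examples:
        return 0.0
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)
```

Comparing `accuracy` on the training portion with `accuracy` on the held-out portion is the simplest check for overfitting: a large gap between the two is the warning sign.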

Refine: After getting the model up and running, we went back to BigQuery, using it as an evaluative tool to help us determine what new features to add to (and help improve) our model.

We started with 17 baseline features (things like play type, field position, offensive and defensive team), but created another nine based on the kinds of questions we wanted to ask of the data (and removed three that weren’t helping).

Twenty-three features isn’t much — other attempts to create a predictive play model have used upwards of 100 features — but given the looming kickoff on Thursday night, 23 seemed fine for a proof-of-concept.

One derived feature in particular, something we called SERA (situational expected rush attempts, or the likelihood of a team running the ball given a particular down and yards-to-first-down) later formed the basis of our naive model and helped us measure the skillfulness of our model.
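A feature like SERA can be sketched as a lookup table built from historical plays. The code below is an assumption-laden illustration, not our exact definition: it keys on the raw (down, yards-to-go) pair, uses hypothetical field names, and falls back to an arbitrary default for situations it has never seen.

```python
from collections import defaultdict


def build_sera_table(plays):
    """Map (down, yards_to_go) to the historical share of rush attempts."""
    counts = defaultdict(lambda: [0, 0])  # [rushes, total plays]
    for play in plays:
        key = (play["down"], play["yards_to_go"])
        counts[key][0] += play["play_type"] == "rush"
        counts[key][1] += 1
    return {key: rushes / total for key, (rushes, total) in counts.items()}


def naive_predict(sera, down, yards_to_go, default=0.5):
    """Predict rush when the historical rush share is at least one half."""
    rush_share = sera.get((down, yards_to_go), default)
    return "rush" if rush_share >= 0.5 else "pass"
```

A naive model of this kind makes a useful baseline: a trained model is only skillful to the extent that it beats the table lookup.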

Serve: To serve our model, we leaned on tf.Transform yet again.

This ensured that any input received during inference would be preprocessed in the exact same way our training data was, thereby mitigating any training-serving skew from data preprocessing differences.

Figure: Machine learning architecture

UI development

If a model predicts a play and there’s no UI to let you check it in real time, did it make a sound?

We didn’t want to find out, which meant we needed to build the UI in parallel with the data pipeline and model.

In our brief research around predictive play models in football, we’d found several examples of sophisticated machine learning, but none that attempted to make play predictions in real time, or tried to connect to a graphic UI.

And as fun as it may be to make armchair predictions based on your own knowledge about (or aspirations for) your favorite team, we wanted to actually see how a trained model would stack up throughout the course of a game.

We ran an abbreviated design sprint to decide what information might be interesting and useful in such a UI, and how to present it.

In addition to including the play prediction and our model’s confidence score for it, we wanted to include all the prior plays in the game by drive, and our predictions — including confidence and accuracy — for those plays as well.

Finally, we wanted to include a chart that would update our cumulative predictions for each play’s accuracy, and total confidence and accuracy for the game.
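The data behind such a chart is just a running tally. A minimal sketch, assuming each finished play is available as a (predicted, actual) pair:

```python
def running_accuracy(outcomes):
    """Return the cumulative prediction accuracy after each play."""
    series = []
    correct = 0
    for play_number, (predicted, actual) in enumerate(outcomes, start=1):
        correct += predicted == actual
        series.append(correct / play_number)
    return series
```

Each value in the returned list is one point on the chart, so the UI can append a point whenever a new play is resolved.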

Figure: Design iteration

To build this, we started by creating a Kubernetes container to subscribe to the Sportradar API so that it could push real-time, raw play-by-play data into a Firestore collection.

A Cloud Function normalized each raw play into our expected structure and fed it into CMLE, then wrote back a prediction into a second collection for predictions.

When the next play occurred, another Cloud Function matched it to the prediction made, and then updated the data in the play-by-play collection.

The UI then displayed the model’s prediction for the next play from the most recent document in the predictions collection, and all previous plays and predictions from the play-by-play collection.
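The matching step can be sketched with in-memory lists standing in for the two Firestore collections. The class and field names here are hypothetical, and the real system did this work in Cloud Functions triggered by document writes rather than by direct method calls.

```python
class GameState:
    """In-memory stand-in for the predictions and play-by-play collections."""

    def __init__(self):
        self.predictions = []
        self.play_by_play = []
        self._pending = None  # prediction waiting for the next play to occur

    def on_prediction(self, predicted_type, confidence):
        """Record the model's prediction for the upcoming play."""
        self._pending = {"predicted": predicted_type, "confidence": confidence}
        self.predictions.append(self._pending)

    def on_play(self, play):
        """Store a finished play, scoring it against the pending prediction."""
        record = dict(play)
        if self._pending is not None:
            record.update(self._pending)
            record["correct"] = self._pending["predicted"] == play["play_type"]
            self._pending = None
        self.play_by_play.append(record)
        return record
```

The UI then needs only two reads: the last entry of `predictions` for the upcoming play, and `play_by_play` for the history with its scored predictions.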

Throughout the week, our simulated game replay feature pulled double duty for both model testing and UI testing — we relied on it heavily to check on the UI’s behavior when there were no games being played.

By testing the UI’s functionality against simulated game replays, we prepped for Thursday night.

Figure: Real-time UI architecture

Putting it together

As fun as it was just to get a model and UI working together in test scenarios, running our system during Thursday Night Football took on another level of suspense.

After commandeering some projection screens (not that there was much resistance in the office at 8 pm), we queued up the game and our application side by side.

Following the model was possibly even more exciting than the game (and it was an exciting one!).

During the game, our model performed with 66% accuracy in real time, though we re-ran the model the next day with some updated features, which bumped up the accuracy to 73%.

To get a sense of whether this was any good, we checked to see how our naive model (which relied strictly on NFL play history in the past seven years with a given down and distance-to-first-down) performed on that game — it came in at only 55.


One quirk of designing a real-time UI was determining what “real time” actually means to a viewer: we needed to reconcile the latency between the Sportradar API, the television broadcast, and any system processing on our end.

Given that we were building a serverless UI, we weren’t losing that much time to server lags or notification delays.

To make up for this difference (and prevent a viewer from seeing a prediction on a future play when the current one hadn’t happened yet or even crossed the 40-second play clock threshold), we gave the UI a brief padding between receiving an update from Firestore and updating the underlying data.
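One way to picture that padding is a small buffer that holds each update for a fixed delay before releasing it to the display. This is a sketch under assumptions: the class name and the delay value are invented for the example, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import deque


class DelayedFeed:
    """Hold incoming updates for a fixed delay before showing them."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self._pending = deque()  # (release_time, update), oldest first

    def receive(self, update, now):
        """Queue an update that arrived at time `now` (in seconds)."""
        self._pending.append((now + self.delay_seconds, update))

    def release(self, now):
        """Return every queued update whose delay has elapsed, in order."""
        ready = []
        while self._pending and self._pending[0][0] <= now:
            ready.append(self._pending.popleft()[1])
        return ready
```

A UI loop would call `release` on a timer and render whatever comes back, showing placeholder content while the queue is still holding an update.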

(We covered the lag with a rotation of some common play calls so that it wouldn’t look like the UI was frozen.)

Our final model netted out with an overall 70.1% accuracy and an average of loss of .


Not too bad for a week’s work.

Postseason

Given the lack of infinite time and resources, we didn’t hit 90% confidence and 90% accuracy, numbers that surely would raise the eyebrows of any experienced data scientist.

But we did build a functional, end-to-end system that offered and evaluated play predictions in a real-time UI, using a serverless cloud environment and only generally available cloud products.

And it took a team of different skill sets to make it happen.

Still, the model can always be improved, and there’s a myriad of directions to explore:

More and better features: 23 is only the tip of what’s possible. With enough time, you could build features around season tendencies and game trends, not to mention weather conditions, home and away factors, and more fine-grained features around time and score. (And that’s before looking at individual players or coaches for signal.)

More and better data: we only used regular season games; the postseason is another beast entirely.

More and better models: football teams are highly idiosyncratic; creating and controlling distinct models for players and coaches could be enormously helpful.

More and better modeling: our knowledge of TensorFlow was limited. Digging into the TensorFlow Model Analysis tool to understand how certain features were affecting our model was high on our list for next steps to improve our system. Moreover, we ran the model only using a linear classifier.

A better model combined with a real-time UI would not only be fun for fans, but could have a real impact on in-game decisions.

If a model eventually lands on skillful features that are more complicated than a human can process in the time it takes to call a play, you might imagine a scenario where sports commentators could use that information to improve their coverage of, and insight into, a game.

You might imagine a scenario where a quarterback or an offensive coordinator might use that knowledge to inform their strategy — to go for a predictable play, or a surprise.

You might even imagine a scenario that’s not about football at all.

Just some food for thought to keep you busy during all those epic commercial breaks on Sunday.

Thanks to Austin Bergstrom, Kerry Cheng, Joseph Rollins, Ibrahim Abdus-Sabur, Ivey Padgett, Aya Damary, Nicholas Kelman, Asaph Arnon and Eric Schmidt.

