Reverse Engineering the Walk Score Algorithm

I can tell that I’ve moved to a considerably less walkable neighborhood, but it’s unclear how to quantify the difference or what goes into a walkability score.

I’ve previously used the Walk Score API as a data source for predicting clustering of electric scooter locations.

Walk Score is a website that takes an address and computes a measure of its walkability on a scale from 0–100 using proprietary algorithms and various data streams.

As someone who enjoys walking to get places (and hiking!), I’ve become curious as to what fuels these proprietary algorithms that generate a walkability score.

I set out to ask the following questions:

1. Can the proprietary Walk Score algorithms be reverse engineered?

2. What features are important in building a walkability score?

To answer these questions, I built a custom data set by collecting a diverse set of granular Seattle city data and Walk Score API data to train machine learning models to predict a walkability score.

I was able to train a model that achieved an R² of 0.95 on test-set data.

R-squared measures the proportion of the variance in the Walk Score that the model explains from its feature set.
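As a minimal sketch of how this metric is computed, the function below implements the standard definition, R² = 1 - SS_res / SS_tot. The score values are toy numbers chosen for illustration, not data from this project.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: what the predictions fail to explain.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: the variance of the target around its mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values for illustration only (not project data).
actual_scores = [90, 70, 50, 30, 10]
predicted_scores = [85, 72, 55, 28, 12]
r2 = r_squared(actual_scores, predicted_scores)
print(round(r2, 4))
```

A value close to 1 means the predictions track the true scores closely; a value of 0 means the model does no better than always predicting the mean.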

In essence, I was able to reverse engineer the Walk Score methodology and recreate the proprietary algorithms that power their Walk Score.

The most important features for a location’s walkability are the number of restaurants within 1000 meters, population density within that census tract, number of supermarkets within 1000 meters, and proximity in meters to the nearest commercial zoning.

Data Source and Machine Learning Pipeline

Figure: The Full Data Pipeline for Reverse Engineering the Walk Score Methodology

Data

I started by randomly generating latitude and longitude coordinates in the Greater Seattle area.
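One simple way to generate such coordinates is to sample uniformly inside a bounding box. The sketch below assumes a rough box around Seattle; the bounds, the function name and the seed are all hypothetical, since the post does not state how the sampling region was defined.

```python
import random

# Hypothetical bounding box roughly covering Seattle; the actual bounds
# used in the project are not stated in the post.
LAT_MIN, LAT_MAX = 47.49, 47.74
LON_MIN, LON_MAX = -122.44, -122.24


def random_geolocations(n, seed=0):
    """Return n unique (lat, lon) pairs sampled uniformly inside the box."""
    rng = random.Random(seed)
    points = set()  # a set guarantees the geolocations are unique
    while len(points) < n:
        lat = round(rng.uniform(LAT_MIN, LAT_MAX), 6)
        lon = round(rng.uniform(LON_MIN, LON_MAX), 6)
        points.add((lat, lon))
    return sorted(points)


sample_points = random_geolocations(5)
for lat, lon in sample_points:
    print(lat, lon)
```

Uniform sampling in a box will also produce points over water or outside the city boundary, so a real pipeline would likely need to filter those out.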

Once armed with a list of ~7800 unique geolocations, I queried the Walk Score API, which returned a walkability score for each one.
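A request for a single geolocation might look like the sketch below. The endpoint and the parameter names (`format`, `lat`, `lon`, `address`, `wsapikey`) and the `walkscore` response field reflect my reading of the public Walk Score API documentation and should be checked against the current docs; the helper names and the API key placeholder are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as described in the public Walk Score API docs (verify before use).
WALKSCORE_ENDPOINT = "https://api.walkscore.com/score"


def build_walkscore_url(lat, lon, api_key, address=None):
    """Build the request URL for one geolocation."""
    params = {"format": "json", "lat": lat, "lon": lon, "wsapikey": api_key}
    if address:
        params["address"] = address
    return f"{WALKSCORE_ENDPOINT}?{urlencode(params)}"


def fetch_walkscore(lat, lon, api_key):
    """Call the API and return the 'walkscore' field, or None if absent."""
    with urlopen(build_walkscore_url(lat, lon, api_key), timeout=10) as resp:
        payload = json.load(resp)
    return payload.get("walkscore")


# Only the URL is built here; no network request is made.
example_url = build_walkscore_url(47.6085, -122.3295, "YOUR_API_KEY")
print(example_url)
```

Looping `fetch_walkscore` over the list of geolocations, with a pause between calls to respect rate limits, would produce the target variable for the models.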

I then set out to collect data that reflected the walkability of a location’s surrounding area.

Data Sources:

OSMnx: Python package that lets you download spatial geometries and model, project, visualize, and analyze street networks from OpenStreetMap’s APIs.

Walk Score API: Returns a walkability score for any location.

LocationIQ API: Nearby Points of Interest (PoI) API returns specified PoIs or places around a given coordinate.

Seattle City Zoning: Zoning districts specify a category of use (e.g., single-family residential, multifamily residential, commercial, industrial, etc.).

Seattle Census Data: Provides population and area in square miles for census tracts within Census Tracts and Geographic Identifiers.

U.S. Census Geocoder API: For a given geolocation, the API returns Census tracts and unique Geographic Identifiers.

This was crucial for correctly merging in Zoning and Census data.
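To illustrate the merge step, the sketch below joins census records onto geolocations by a tract identifier and derives population density from population and area. All identifiers, field names and numbers are made up for the example; the real project's schema is not shown in the post.

```python
# Hypothetical census records keyed by geographic identifier (values made up).
census_by_geoid = {
    "53033000100": {"population": 6000, "area_sq_mi": 0.5},
    "53033000200": {"population": 4500, "area_sq_mi": 1.5},
}

# Hypothetical geolocations, each tagged with the identifier a geocoder returned.
locations = [
    {"lat": 47.72, "lon": -122.29, "geoid": "53033000100"},
    {"lat": 47.70, "lon": -122.32, "geoid": "53033000200"},
    {"lat": 47.60, "lon": -122.33, "geoid": "53033999999"},  # no census match
]


def add_population_density(rows, census):
    """Left-join census data onto each row and compute people per square mile."""
    enriched = []
    for row in rows:
        tract = census.get(row["geoid"])
        density = tract["population"] / tract["area_sq_mi"] if tract else None
        enriched.append({**row, "pop_density_per_sq_mi": density})
    return enriched


merged = add_population_density(locations, census_by_geoid)
for row in merged:
    print(row["geoid"], row["pop_density_per_sq_mi"])
```

Rows without a matching tract keep a `None` density rather than being silently dropped, which makes missing joins easy to spot.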

Feature Engineering

Because of the LocationIQ API’s daily request limits, I had to spread the data collection phase over two weeks.

This left me with ~7800 unique geolocations, for which I then engineered 27 features to train machine learning models to predict walkability throughout Seattle.
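Two kinds of feature used in this project, a count of amenities within a fixed radius (such as restaurants within 1000 meters) and the distance in meters to the nearest place of some type, can be sketched with a great-circle distance. The coordinates and function names below are hypothetical, and the haversine formula is my choice for the illustration; the post does not say how distances were computed.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def count_within(origin, places, radius_m=1000):
    """Amenity-count feature: number of places within radius_m of origin."""
    return sum(
        1 for lat, lon in places
        if haversine_m(origin[0], origin[1], lat, lon) <= radius_m
    )


def nearest_distance_m(origin, places):
    """Distance-based feature: meters to the closest place."""
    return min(haversine_m(origin[0], origin[1], lat, lon) for lat, lon in places)


# Hypothetical coordinates for illustration.
origin = (47.6085, -122.3295)
restaurants = [(47.6090, -122.3300), (47.6150, -122.3295), (47.6500, -122.3295)]
n_close = count_within(origin, restaurants)
closest = nearest_distance_m(origin, restaurants)
print(n_close, round(closest))
```

Straight-line distance ignores the street network, so it is only a rough proxy for how far someone would actually have to walk.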

Full Feature Set

The features break down into four categories:

1) Amenity-based: number of bus stations, parks, restaurants, schools, total amenities within a specified radius (1000-meter radius was used for most amenities)

2) Census Derived: zoning category and population density

3) Distance Based: proximity to closest highway, closest primary road, closest secondary road, closest residential road, closest industry zoning

4) Walk Network Structure: intersection count, average circuity, street length average, average streets per node

Figure: Geolocation Observations Grouped by Zoning Category

Figure: A single geolocation plotted on top of the OSMnx library for Walk Network Structure feature generation

Model Development

I trained three machine learning models: a random forest regression, a gradient boosting regression and an extreme gradient boosting regression.

I trained each of these models on two-thirds of the data collected and reserved the remaining one-third for testing.
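The two-thirds / one-third split can be sketched in plain Python as below. The function name, seed and row count are hypothetical; in practice scikit-learn's `train_test_split` with `test_size=1/3` does the equivalent, and the post does not say which tool was used.

```python
import random


def split_rows(rows, test_fraction=1 / 3, seed=42):
    """Shuffle rows reproducibly, then hold out test_fraction for testing."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in row indices; a real run would pass the feature rows instead.
rows = list(range(300))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters here because the held-out third should be a random sample of locations, not a contiguous block of the collected data.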

Figure: Extreme gradient boosting model predictions (R² of 0.95) of Walk Score against the one-third test set.

The extreme gradient boosting regression did a great job of predicting the Walk Score, achieving an R² of 0.95 on the one-third test set (~2300 samples).

This model also had the lowest test-set error (RMSE) of the three.
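For reference, RMSE is the square root of the mean squared prediction error, so lower is better. The sketch below compares two sets of predictions by RMSE; the numbers and model labels are invented for the example and are not results from this project.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between true and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Invented values for illustration only.
actual = [90, 70, 50, 30]
predictions = {
    "model_a": [88, 73, 49, 34],
    "model_b": [80, 75, 60, 20],
}

errors = {name: rmse(actual, preds) for name, preds in predictions.items()}
best = min(errors, key=errors.get)  # the model with the lowest RMSE
print(best, round(errors[best], 3))
```

Because RMSE is in the same units as the target, an RMSE of 5 here would mean predictions are off by roughly 5 Walk Score points on typical samples.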

What’s in the Black Box?

The purpose of reverse engineering the Walk Score methodology was to gain an understanding of the key features that go into their algorithms.

We want to know what really makes a location walkable, not just a score! Examining the feature importance of the extreme gradient boosting model showed that the number of restaurants within 1000 meters dominated as the most important feature.

Additional important model features were population density in a given census tract, count of total amenities, the number of supermarkets within 1000 meters, and proximity in meters to nearest commercial zoning.

The number of restaurants, supermarkets and total amenities within 1000 meters, population density and proximity to commercial zoning are the most important features for predicting a location’s Walk Score.
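The ranking above comes from the feature importance of the extreme gradient boosting model. As a rough illustration of what such a ranking expresses, the sketch below uses permutation importance, a model-agnostic technique that I am substituting for the model's built-in measure: shuffle one feature at a time and record how much the prediction error grows. The toy model and data are hypothetical stand-ins, not the trained model.

```python
import random


def mse(model, X, y):
    """Mean squared error of model predictions over rows X with targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_features, seed=0):
    """Increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and the target
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(mse(model, X_perm, y) - baseline)
    return importances


def toy_model(row):
    # Hypothetical stand-in: depends strongly on feature 0, weakly on
    # feature 1, and not at all on feature 2.
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(1)
X = [[data_rng.uniform(0, 10) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]

scores = permutation_importance(toy_model, X, y, n_features=3)
ranking = sorted(range(3), key=lambda j: scores[j], reverse=True)
print(ranking)
```

Features the model leans on heavily produce a large error increase when shuffled, while a feature the model ignores scores zero.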

Conclusion

The Walk Score is already a useful tool for deciding where to live and where to develop real estate based on walking, biking and transit preferences.

It’s now helpful to have an understanding of the Walk Score methodology and what features go into building their algorithms.

We now know the inputs that make a location walkable according to Walk Score.

The current model is trained only on locations within the Seattle city limits, where urban characteristics are similar.

Additional features, such as topographical measurements and distances to the closest amenity of each type, could be collected to improve the model’s power to predict the Walk Score.

Model predictions could easily be expanded to other areas, as the Walk Score API and the underlying data sources for generating features (US Census, OSMnx, City Zoning, LocationIQ) are widely available.

Code

The code for this project can be found on my GitHub.

Comments or Questions?

Please email me at: perryrjohnson7@gmail.

