How to use K-Means clustering in BigQuery ML to understand and describe your data better

A common way to use clustering to understand your data is to pick a field to cluster on and then:

- Cluster the data on attributes of that field.
- Find which cluster a given customer/item/etc. belongs to.
- Understand something about a cluster’s attributes.
- Use this understanding to make decisions.

In this article, I’ll demonstrate the clustering approach on London bike share data.

Let’s say that we make a lot of decisions around the design and stocking of bike share stations and we would like to understand the different types of stations we have in our system so that we can make these decisions in a data-driven way.

We need to do these four things:

1. Cluster bicycle rental stations. The field we need to cluster is station_name, and we need to find attributes of a station in order to do so.
2. Find which cluster each station belongs to.
3. Understand the characteristics of each cluster.
4. Use this understanding to make data-driven decisions.

But before all that, we first have to gather the data we need.

0. Gather data

The data we will use consists of bike rentals in London. It’s a public dataset in the EU region, so make sure to create a dataset in the EU region called demos_eu if you want to follow along with my queries.
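If you haven’t created that dataset yet, here’s a minimal sketch of how you could do it with a DDL statement (you can also create the dataset through the BigQuery console UI):

-- Create the dataset in the EU region, since the public bike-share tables live there
CREATE SCHEMA IF NOT EXISTS demos_eu
OPTIONS(location = 'EU');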

We want to cluster stations based on these factors:

- Duration of rentals
- Number of trips per day
- Number of bike racks
- Distance from city center

We often make decisions such as stocking the station (i.e., making sure the station has bikes for rent) differently on weekdays than on weekends. So, we are actually going to cluster instances whose “primary key” is the combination of station_name and isweekday.

Here’s a query that’ll pull together the relevant data:

WITH hs AS (
  SELECT
    h.start_station_name AS station_name,
    IF(EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR
       EXTRACT(DAYOFWEEK FROM h.start_date) = 7,
       "weekend", "weekday") AS isweekday,
    h.duration,
    s.bikes_count,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude),
                ST_GEOGPOINT(-0.1, 51.5)) / 1000 AS distance_from_city_center
  FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON h.start_station_id = s.id
  WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
                         AND CAST('2016-01-01 00:00:00' AS TIMESTAMP)
),
stationstats AS (
  SELECT
    station_name,
    isweekday,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(bikes_count) AS bikes_count,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM hs
  GROUP BY station_name, isweekday
)
SELECT * FROM stationstats
ORDER BY distance_from_city_center ASC

We are extracting data on cycle hires (start_station_name, weekend/weekday, duration) in 2015 (see the WHERE clause) and joining it against station information (bike count, distance from city center).

Then, we compute attributes of the station in stationstats (the average duration of rides, number of trips) and pass through the station attributes (bike count, distance-from-city-center).

Our dataset looks like this:

[Figure: Dataset that we will use for clustering; the primary key for our data is the combination of station_name and isweekday]
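Since we will query these station attributes a few times below, it can be handy to save them as a table. This is optional, and the table name demos_eu.stationstats is just an illustrative choice; it simply wraps the query above in a CREATE TABLE statement:

-- Materialize the per-station, per-day-type features so later queries can reuse them
CREATE OR REPLACE TABLE demos_eu.stationstats AS
WITH hs AS (
  SELECT
    h.start_station_name AS station_name,
    IF(EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR
       EXTRACT(DAYOFWEEK FROM h.start_date) = 7,
       "weekend", "weekday") AS isweekday,
    h.duration,
    s.bikes_count,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude),
                ST_GEOGPOINT(-0.1, 51.5)) / 1000 AS distance_from_city_center
  FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON h.start_station_id = s.id
  WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
                         AND CAST('2016-01-01 00:00:00' AS TIMESTAMP)
)
SELECT
  station_name,
  isweekday,
  AVG(duration) AS duration,
  COUNT(duration) AS num_trips,
  MAX(bikes_count) AS bikes_count,
  MAX(distance_from_city_center) AS distance_from_city_center
FROM hs
GROUP BY station_name, isweekday;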

1. BigQuery ML Clustering

Doing the clustering simply involves adding a CREATE MODEL statement to the SELECT query above and removing the “id” fields in our data:

CREATE OR REPLACE MODEL demos_eu.london_station_clusters
OPTIONS(model_type='kmeans', num_clusters=4, standardize_features = true) AS
WITH hs AS (
  SELECT
    h.start_station_name AS station_name,
    IF(EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR
       EXTRACT(DAYOFWEEK FROM h.start_date) = 7,
       "weekend", "weekday") AS isweekday,
    h.duration,
    s.bikes_count,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude),
                ST_GEOGPOINT(-0.1, 51.5)) / 1000 AS distance_from_city_center
  FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON h.start_station_id = s.id
  WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
                         AND CAST('2016-01-01 00:00:00' AS TIMESTAMP)
),
stationstats AS (
  SELECT
    station_name,
    isweekday,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(bikes_count) AS bikes_count,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM hs
  GROUP BY station_name, isweekday
)
SELECT * EXCEPT(station_name, isweekday)
FROM stationstats

This query processed 1.2 GB and took 54 seconds. The model schema lists the 4 factors that were used in the clustering:

[Figure: The K-Means model schema]

Note that we have specified the desired number of clusters (num_clusters=4) in the CREATE MODEL statement and removed the factors (station_name and isweekday) that we don’t want to cluster on.
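Once the model is trained, you can also peek at how the training went. This is a quick optional check; ML.TRAINING_INFO reports per-iteration statistics such as the loss for the model you just created:

-- Inspect training iterations for the k-means model
SELECT *
FROM ML.TRAINING_INFO(MODEL demos_eu.london_station_clusters);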

Standardizing features is a good option to include if your features have different dynamic ranges (bikes_count is in the 10–50 range while num_trips is in the thousands), so we turn it on here.
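If you want to see those ranges for yourself, a small illustrative query (assuming you saved the features as demos_eu.stationstats earlier) is:

-- Compare the dynamic ranges of two of the clustering features
SELECT
  MIN(bikes_count) AS min_bikes,
  MAX(bikes_count) AS max_bikes,
  MIN(num_trips)   AS min_trips,
  MAX(num_trips)   AS max_trips
FROM demos_eu.stationstats;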

If we leave out the number of clusters, BigQuery ML will choose a reasonable default based on the total number of rows in the training dataset.

We could also do hyperparameter tuning to find a good number.

Essentially, we will run the clustering for different values of num_clusters, find the error measure, and pick the point at which the graph plateaus out.
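Here’s one way to sketch that loop using BigQuery scripting. This is only a sketch: it assumes the features have been saved as demos_eu.stationstats (as above), and the candidate model names london_station_clusters_3 through london_station_clusters_8 are just illustrative:

-- Train one candidate k-means model per value of num_clusters
DECLARE k INT64 DEFAULT 3;
WHILE k <= 8 DO
  EXECUTE IMMEDIATE FORMAT("""
    CREATE OR REPLACE MODEL demos_eu.london_station_clusters_%d
    OPTIONS(model_type='kmeans', num_clusters=%d, standardize_features = true) AS
    SELECT * EXCEPT(station_name, isweekday)
    FROM demos_eu.stationstats
  """, k, k);
  SET k = k + 1;
END WHILE;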

The error measure can be obtained using:

SELECT davies_bouldin_index
FROM ML.EVALUATE(MODEL demos_eu.london_station_clusters)

The lower the number, the better the clustering.
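If you trained several candidate models as sketched above, you can evaluate each one the same way and compare the scores (the model names here are the illustrative ones from that sketch):

-- Compare the Davies-Bouldin index across a few candidate cluster counts
SELECT 4 AS num_clusters, davies_bouldin_index
FROM ML.EVALUATE(MODEL demos_eu.london_station_clusters_4)
UNION ALL
SELECT 5 AS num_clusters, davies_bouldin_index
FROM ML.EVALUATE(MODEL demos_eu.london_station_clusters_5)
UNION ALL
SELECT 6 AS num_clusters, davies_bouldin_index
FROM ML.EVALUATE(MODEL demos_eu.london_station_clusters_6)
ORDER BY num_clusters;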

2. Which cluster does a station belong to?

Which cluster does a particular station belong to? You get that using ML.PREDICT. Here’s a query to find the cluster of every station that has ‘Kennington’ in its name:

WITH hs AS (...),           -- same hs CTE as in the queries above
stationstats AS (...)       -- same stationstats CTE as above
SELECT * EXCEPT(nearest_centroids_distance)
FROM ML.PREDICT(MODEL demos_eu.london_station_clusters,
  (SELECT * FROM stationstats
   WHERE REGEXP_CONTAINS(station_name, 'Kennington')))

This yields:

[Figure: Which cluster do the Kennington stations belong to?]
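Because ML.PREDICT only needs rows with the feature columns the model was trained on, you can also ask about a hypothetical station by passing literal values. The numbers below are purely illustrative, not real data:

-- What-if: which cluster would a station with these attributes fall into?
SELECT centroid_id
FROM ML.PREDICT(MODEL demos_eu.london_station_clusters,
  (SELECT
     1800.0 AS duration,                  -- made-up average rental duration
     30000  AS num_trips,                 -- made-up number of trips over the year
     25     AS bikes_count,               -- made-up number of bike racks
     2.5    AS distance_from_city_center  -- made-up distance from the center
  ));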

3. Examine clusters

You can look at the cluster centroids (essentially the values of the 4 factors in the model schema) using:

SELECT * FROM ML.CENTROIDS(MODEL demos_eu.london_station_clusters)
ORDER BY centroid_id

With a little bit of SQL manipulation, we can pivot the above table:

WITH T AS (
  SELECT
    centroid_id,
    ARRAY_AGG(STRUCT(numerical_feature AS name,
                     ROUND(feature_value, 1) AS value)
              ORDER BY centroid_id) AS cluster
  FROM ML.CENTROIDS(MODEL demos_eu.london_station_clusters)
  GROUP BY centroid_id
)
SELECT
  CONCAT('Cluster#', CAST(centroid_id AS STRING)) AS centroid,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'duration') AS duration,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'num_trips') AS num_trips,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'bikes_count') AS bikes_count,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'distance_from_city_center') AS distance_from_city_center
FROM T
ORDER BY centroid_id ASC

This yields:

[Figure: Clusters’ attributes]

To visualize this table, click on “Explore in Data Studio” and select “Table with bars”. Make the centroid column the “metric” and the remaining columns the dimensions. Then you will get:

[Figure: Cluster attributes, visualized in Data Studio]

This allows us to interpret the clusters.

Essentially, Cluster #1 consists of suburban stations from which people go for long rides, Cluster #2 consists of other suburban stations, Cluster #3 consists of busy city stations, and Cluster #4 consists of not-so-busy city stations.

Based on these characteristics and some knowledge of London, we can come up with descriptive names for these clusters.

Cluster 1 would probably be “Day trips”, Cluster 2 would be “Bedroom communities”, Cluster 3 would be “Tourist areas”, and Cluster 4 would be “Business district”.
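If you want those descriptive names available to later queries and dashboards, one way to attach them is with a CASE expression over the predicted centroid_id. Note that this assumes the demos_eu.stationstats table from earlier, and that the mapping of centroid_id to name depends on your particular training run:

-- Label every station/day-type combination with a human-readable cluster name
SELECT
  station_name,
  isweekday,
  CASE centroid_id
    WHEN 1 THEN 'Day trips'
    WHEN 2 THEN 'Bedroom communities'
    WHEN 3 THEN 'Tourist areas'
    WHEN 4 THEN 'Business district'
  END AS cluster_name
FROM ML.PREDICT(MODEL demos_eu.london_station_clusters,
  (SELECT * FROM demos_eu.stationstats));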

4. Make data-driven decisions

Let’s use these clusters to make different decisions.

We just got funding and can expand the bike racks.

Which stations should we put extra capacity in? If we didn’t have the clustering data, we’d be tempted to go with stations with lots of trips and not enough bikes — stations in Cluster #3.

But we have done the clustering, and discovered that this group of stations serves mostly tourists.

They don’t vote, so we’ll put the extra capacity in Cluster #2 (bedroom communities).
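As a sketch of how we might pull that candidate list (again assuming the demos_eu.stationstats table, and with centroid_id = 2 matching the “bedroom communities” cluster in this particular run):

-- Busiest stations in Cluster #2: candidates for extra bike racks
SELECT station_name, num_trips, bikes_count
FROM ML.PREDICT(MODEL demos_eu.london_station_clusters,
  (SELECT * FROM demos_eu.stationstats))
WHERE centroid_id = 2
ORDER BY num_trips DESC
LIMIT 20;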

We need to experiment with a new type of lock.

Which cluster of stations should we do this experiment on? The business district stations seem logical, and sure enough, those are the stations with lots of bikes that are busy enough to support an A/B test.

We want to stock some stations with racing bikes.

Which ones? Cluster 1, the stations that serve people going on day trips out of the city.

Obviously, we could have made these decisions individually by doing custom data analysis each time.

But clustering the stations, coming up with descriptive names, and using the names to make decisions is much simpler and more explainable.

The fact that you can do all this using just SQL? That’s pretty cool.
