Geolocation with BigQuery: De-identify 76 million IP addresses in 20 seconds

We published our first approach to de-identifying IP addresses four years ago (GeoIP geolocation with Google BigQuery), and it’s time for an update that includes the best and latest BigQuery features, like using the latest SQL standards, dealing with nested data, and handling joins much faster.

Felipe Hoffa, Jul 8

BigQuery is Google Cloud’s serverless data warehouse designed for scalability and fast performance.

Using it lets you explore large datasets to find new and meaningful insights.

To comply with current policies and regulations, you might need to de-identify the IP addresses of your users when analyzing datasets that contain personal data.

For example, under GDPR, an IP address might be considered PII or personal data.

Replacing collected IP addresses with a coarse location is one method to help reduce risk, and BigQuery is ready to help.

Let’s see how.

How to de-identify IP address data

For this example of how you can easily de-identify IP addresses, let’s use:

- 76 million IP addresses collected by Wikipedia from anonymous editors between 2001 and 2010
- MaxMind’s GeoLite2 free geolocation database
- BigQuery’s improved byte and networking functions NET.SAFE_IP_FROM_STRING() and NET.IP_NET_MASK()
- BigQuery’s new superpowers that deal with nested data, generate arrays, and run incredibly fast joins
- The new BigQuery Geo Viz tool that uses Google Maps APIs to chart geopoints around the world

Let’s go straight into the query.

Use the code below to replace IP addresses with the generic location.

Top countries editing Wikipedia

Here’s the query that lists the countries where users are making edits to Wikipedia:

#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT null
  GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
  SELECT ip, country_name, c
  FROM (
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
  USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC

Query complete (20.9 seconds elapsed, 1.14 GB processed)

Top cities editing Wikipedia

Here’s the query that lists the top cities where users are making edits to Wikipedia, collected from 2001 to 2010:

# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT null
  GROUP BY 1
)
SELECT city_name, SUM(c) c, ST_GeogPoint(AVG(longitude), AVG(latitude)) point
FROM (
  SELECT ip, city_name, c, latitude, longitude, geoname_id
  FROM (
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
  USING (network_bin, mask)
)
WHERE city_name IS NOT null
GROUP BY city_name, geoname_id
ORDER BY c DESC
LIMIT 5000

Exploring some new BigQuery features

These new queries are compliant with the latest SQL standards, enabling a few new tricks that we’ll review here.

New MaxMind tables: Goodbye math, hello IP masks

The downloadable GeoLite2 tables are not based on ranges anymore.

Now they use proper IP networks, like in “156.33.241.0/22”.

Using BigQuery, we parsed these into binary IP addresses with integer masks.
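To make the parsing step concrete outside BigQuery, here is a minimal Python sketch of the same idea: split a network string at the slash, turn the address part into 4 raw bytes, and keep the suffix as an integer mask. The function name parse_network is hypothetical and only for illustration; it uses the standard ipaddress module rather than anything from BigQuery.

```python
import ipaddress
from typing import Tuple


def parse_network(cidr: str) -> Tuple[bytes, int]:
    """Split 'a.b.c.d/n' into (4-byte address, integer mask length)."""
    address, prefix = cidr.split("/")
    # .packed gives the big-endian 4-byte form, comparable to a BYTES column
    return ipaddress.IPv4Address(address).packed, int(prefix)


network_bin, mask = parse_network("156.33.241.0/22")
print(network_bin.hex(), mask)
```

The address and the mask are split by hand, much like the two regular expressions in the SQL, instead of being validated as a strict network.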

We also did some pre-processing of the GeoLite2 tables, combining the networks and locations into a single table, and adding the parsed network columns, as shown here:

#standardSQL
SELECT *
  , NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin
  , CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask
FROM `fh-bigquery.geocode.201806_geolite2_city_ipv4`
JOIN `fh-bigquery.geocode.201806_geolite2_city_locations_en`
USING(geoname_id)

Geolocating one IP address out of millions

To find one IP address within this table, like “103.230.141.7”, something like this might work:

SELECT country_name, city_name, mask
FROM `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
WHERE network_bin = NET.IP_FROM_STRING('103.230.141.7')

But that doesn’t work, because network_bin holds network addresses rather than individual host addresses.

We need to apply the correct mask:

SELECT country_name, city_name, mask
FROM `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
WHERE network_bin = NET.IP_FROM_STRING('103.230.141.7') & NET.IP_NET_MASK(4, 24)

And that gets an answer: this IP address seems to live in Antarctica.
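To see what applying a mask actually does to the address, here is a small Python sketch of the same bitwise AND, using the standard ipaddress module. The helper name mask_ipv4 is hypothetical; it is meant as an illustration of the arithmetic, not as a replacement for NET.IP_NET_MASK().

```python
import ipaddress


def mask_ipv4(ip: str, prefix_len: int) -> str:
    """Zero out the host bits of an IPv4 address, keeping the first prefix_len bits."""
    ip_int = int(ipaddress.IPv4Address(ip))
    # Build a 32-bit mask with prefix_len leading ones, e.g. 24 -> 255.255.255.0
    net_mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(ip_int & net_mask))


masked = mask_ipv4("103.230.141.7", 24)
print(masked)  # 103.230.141.0
```

With a 24-bit mask, the last byte is cleared, so the masked value can be compared directly against a stored network address.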

Scaling up

That looked easy enough, but we need a few more steps to figure out the right mask and joins between the GeoLite2 table (more than 3 million rows) and a massive source of IP addresses.

And that’s what the next line in the main query does:

SELECT *
  , NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask

This is basically applying a CROSS JOIN with all the possible masks (numbers between 9 and 32) and using these to mask the source IP addresses.
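The fan-out that this step produces can be sketched in Python for a single address. The function below is a hypothetical illustration, assuming the same mask range of 9 to 32: it returns one (network_bin, mask) candidate per mask, and returns nothing for input that is not a valid IPv4 address, which is roughly the effect of combining NET.SAFE_IP_FROM_STRING() with the BYTE_LENGTH filter.

```python
import ipaddress


def candidate_networks(ip: str, min_mask: int = 9, max_mask: int = 32):
    """Return one (network_bin, mask) pair per mask length for an IPv4 address."""
    try:
        ip_int = int(ipaddress.IPv4Address(ip))
    except ipaddress.AddressValueError:
        return []  # not a valid IPv4 string: the row is dropped
    candidates = []
    for mask in range(min_mask, max_mask + 1):
        net_mask = (0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF
        candidates.append(((ip_int & net_mask).to_bytes(4, "big"), mask))
    return candidates


candidates = candidate_networks("103.230.141.7")
print(len(candidates))  # 24 candidates, one per mask from 9 to 32
```

Each source address turns into 24 rows, so the intermediate result is 24 times the size of the source before the join filters it back down.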

And then comes the really neat part: BigQuery manages to handle the correct JOIN in a massively fast way:

USING (network_bin, mask)

BigQuery here picks up only one of the masked IPs: the one where the masked IP and the network with that given mask match.

If we dig deeper, we’ll find in the execution details tab that BigQuery did an “INNER HASH JOIN EACH WITH EACH ON”, which requires a lot of shuffling resources, while still not requiring a full CROSS JOIN between two massive tables.
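As a rough analogy for why an equality join on (network_bin, mask) avoids comparing every address with every network, here is a Python sketch that probes a dictionary once per candidate mask. This is not how BigQuery implements its hash join internally; it only shows the lookup pattern. The table contents are hypothetical: the networks are reserved documentation ranges and the labels are placeholders, not GeoLite2 data.

```python
import ipaddress

# Hypothetical stand-in for the geolocation table: (network_bin, mask) -> label
NETWORKS = {
    (ipaddress.IPv4Address("192.0.2.0").packed, 24): "Location A",
    (ipaddress.IPv4Address("198.51.100.0").packed, 25): "Location B",
}


def locate(ip: str, table: dict, min_mask: int = 9, max_mask: int = 32):
    """Return the label of the network containing ip, or None if there is no match."""
    ip_int = int(ipaddress.IPv4Address(ip))
    for mask in range(min_mask, max_mask + 1):
        net_mask = (0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF
        key = ((ip_int & net_mask).to_bytes(4, "big"), mask)
        if key in table:  # constant-time hash probe, no scan of the table
            return table[key]
    return None


print(locate("192.0.2.77", NETWORKS))
print(locate("198.51.100.200", NETWORKS))
```

The second address falls outside the /25 in the toy table, so no candidate key matches and the lookup returns None, just as an inner join would drop that row.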

Go further with anonymizing data

This is how BigQuery can help you to replace IP addresses with coarse locations and also provide aggregations of individual rows.

This is just one technique that can help you reduce the risk of handling your data.

GCP provides several other tools, including Cloud Data Loss Prevention (DLP), that can help you scan and de-identify data.

You now have several options to explore and use datasets that let you comply with regulations.

In what interesting ways are you using de-identified data? Let us know.

Find the latest MaxMind GeoLite2 table in BigQuery, thanks to our Google Cloud Public Datasets.

Originally published at https://dev.to on July 8, 2019.
