Fast IPv4 to Host Lookups

                      ' % (ip2int(ip), tld_ex(domain)[1]))
        except:
            pass

filenames = [(csv_gz, 'out.%s.csv' % str(uuid4())[:6])
             for csv_gz in glob('rdns_*.json.gz.csv.gz')]

pool = Pool(4)
pool.map(extract, filenames)

$ python domain_stub.py

The above took over a day to run on my machine.
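The excerpt above refers to an ip2int helper whose definition isn't shown here. Presumably it converts a dotted-quad IPv4 address into the 32-bit integer form used throughout the rest of this post; a minimal sketch of such a conversion, with the body being my own assumption, could look like this.

import socket
import struct


def ip2int(ip):
    # Pack the dotted-quad string into 4 network-order bytes, then unpack
    # those bytes as a single unsigned 32-bit integer.
    return struct.unpack('!I', socket.inet_aton(ip))[0]


# ip2int('1.10.230.137') returns 17491593, the form seen in the CSV sample below.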

I believe the tldextract library is well written and reasonable steps have been taken to optimise it. Nonetheless, I raised a ticket to see if anything could be done to speed up its performance even further. I suspect a string-focused library like this could be ported to C, C++, Rust or Go and yield greater performance.

The above produced 23 GB of uncompressed CSV files. Here is a sample from one of the resulting output files.

$ head -n3 out.7dcbe9.csv

17491593,totinternet
17491594,totinternet
17491595,totinternet

Reducing the Search Space

In the above example, totinternet is listed as the hostname stub for at least three IPv4 addresses that run in sequential order.

I want to see if I can build ranges where a hostname stub is paired with the lowest and highest IPv4 addresses that run sequentially and uninterrupted. This way there will be fewer records describing the same dataset, which ultimately shrinks the search space for the lookup tasks later on in this post.

I'll import the dataset into a single table in SQLite3, apply an index and then step through each record ordered by the value of the IPv4 address, printing out any uninterrupted sequences for any given hostname stub.

$ sqlite3 lookup.db

CREATE TABLE ip_host (
    "ipv4"        INTEGER,
    "domain_stub" TEXT
);

The following produced a 28 GB database prior to the index being applied.

$ cat out.*.csv | sqlite3 -csv -separator ',' lookup.db '.import /dev/stdin ip_host'

$ sqlite3 lookup.db

CREATE INDEX ip_host_ipv4 ON ip_host (ipv4);

$ vi shrink.py

import sqlite3


lookup_conn = sqlite3.connect('lookup.db')
lookup_cur = lookup_conn.cursor()

sql = '''SELECT ipv4, domain_stub
         FROM ip_host
         ORDER BY ipv4'''
lookup_cur.execute(sql)

last_ip, last_domain, consecutive = 0, None, 0

for ipv4, domain_stub in lookup_cur:
    if ipv4 != last_ip + (consecutive + 1):
        if last_domain:
            print '%s,%d,%d' % (last_domain, last_ip, (last_ip + consecutive))

        last_ip = ipv4
        last_domain = domain_stub
        consecutive = 0
    else:
        consecutive = consecutive + 1

if consecutive:
    print '%s,%d,%d' % (last_domain, last_ip, (last_ip + consecutive))

$ python shrink.py > shrunken.csv

The resulting CSV file is 508 MB uncompressed and is made up of 17,784,359 lines, 71x fewer than the source dataset.

Here is a sample of the file produced.

$ head -n3 shrunken.csv

one,16777217,16777217
bigredgroup,16778244,16778244
gtelecom,16778501,16778501

I'll produce a random set of records that will be used to do lookups in the benchmarks below. This will ensure that every lookup is a hit and every query is unique.

$ sort -R shrunken.csv | head -n1000 > unique_ips

Populating PostgreSQL

I'll set up a PostgreSQL account and create a database that will be populated by the "shrunken" dataset.

$ sudo -u postgres bash -c "psql -c \"CREATE USER mark WITH PASSWORD 'test' SUPERUSER;\""
$ createdb ip_ranges
$ psql ip_ranges

CREATE TABLE ip_ranges (
    domain_stub VARCHAR(255),
    "start"     BIGSERIAL,
    "end"       BIGSERIAL
);

\copy ip_ranges FROM 'shrunken.csv' DELIMITER ',' CSV

PostgreSQL can scan its indices both forwards and backwards at nearly the same speed, but when running a scan it cannot change direction without starting a new scan. I'll create an index on the dataset that sorts the data by the first IPv4 address in each range and then by the last IPv4 address in the range in reverse. That way, when there is a hit on the first IPv4 address in a range, the hit on the second column will always come afterwards, avoiding PostgreSQL having to start a second scan.

CREATE INDEX ip_ranges_inverse ON ip_ranges ("start", "end" DESC);

I'll then re-order the table based on the ordering in the above index.

ALTER TABLE ip_ranges CLUSTER ON ip_ranges_inverse;

I'll also update the statistics used by PostgreSQL's query planner.

VACUUM ANALYZE ip_ranges;

The resulting database is 1,479 MB in PostgreSQL's internal format.

\l+ ip_ranges

                                            List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges |  Size   | Tablespace | Description
-----------+-------+----------+-------------+-------------+-------------------+---------+------------+-------------
 ip_ranges | mark  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                   | 1479 MB | pg_default |

Below I'll be using a prepared query for the lookup process.

The PREPARE statement will parse the SQL, analyse it and perform any rewriting / macro expansion of the query ahead of time, and only once during this exercise. Later on, when the EXECUTE statement is called, the query only needs to be planned and executed, saving a good amount of overhead.

$ vi run_pg.py

from psycopg2 import connect as PG


unique_ips = [int(x.split(',')[1])
              for x in open('unique_ips', 'r+b')
                       .read()
                       .strip()
                       .splitlines()]

conn = PG(database='ip_ranges')
cursor = conn.cursor()

cursor.execute('''PREPARE find_stub AS
                      SELECT domain_stub
                      FROM ip_ranges
                      WHERE "start" <= $1 AND "end" >= $1
                      LIMIT 1;''')

sql = 'EXECUTE find_stub(%(ip)s);'

for ip in unique_ips:
    cursor.execute(sql, {'ip': ip})
    resp = cursor.fetchone()
    assert len(resp[0].strip()) > 0

cursor.execute('''DEALLOCATE find_stub;''')

$ time python run_pg.py

The above completed in 11 minutes and 4 seconds. Across the 1,000 unique lookups, that works out to a lookup rate of roughly 5,421 lookups per hour.

Populating ClickHouse

I'll first load the CSV data into a Log Engine table in ClickHouse.

$ clickhouse-client

CREATE TABLE ip_ranges_log (
    domain_stub String,
    start       UInt32,
    end         UInt32
) ENGINE = Log;

$ cat shrunken.csv | clickhouse-client --query="INSERT INTO ip_ranges_log FORMAT CSV"

I'll then produce a MergeTree Engine table which will convert the row-oriented data from the previous table into a form that will be faster to search against. The MergeTree Engine demands a date to partition the data against, so I've picked a placeholder date of January 1st, 1970.

$ clickhouse-client

CREATE TABLE ip_ranges
ENGINE = MergeTree(const_date, (start, end), 8192) AS
SELECT toDate('1970-01-01') AS const_date,
       domain_stub,
       start,
       end
FROM ip_ranges_log;

The Log Engine table is 166 MB in ClickHouse's internal format and the MergeTree Engine table is 283 MB.

I couldn't find an equivalent of PostgreSQL's prepared statements in ClickHouse, so the following will simply execute a SELECT statement in full for each and every record.

$ vi run_ch.py

from clickhouse_driver import Client as CH


unique_ips = [int(x.split(',')[1])
              for x in open('unique_ips', 'r+b')
                       .read()
                       .strip()
                       .splitlines()]

client = CH('localhost',
            port=9000,
            password='root',
            secure=False,
            verify=False,
            compression=False)

sql = '''SELECT domain_stub
         FROM ip_ranges
         WHERE start <= %d AND end >= %d
         LIMIT 1'''

for ip in unique_ips:
    resp = client.execute(sql % (ip, ip))[0]
    assert len(resp[0].strip()) > 0

$ time python run_ch.py

The above completed in 9.19 seconds. Against the same 1,000 lookups, that is a rate of just under 392K lookups per hour, 72x faster than PostgreSQL.

Closing Thoughts

The performance gap here is almost two orders of magnitude. Even if I ran four threads querying PostgreSQL, the combined rate would still be an order of magnitude short, and that's without considering any performance increase the same number of lookup threads might produce with ClickHouse.

PostgreSQL has Zedstore, a columnar storage engine, in development. It could go some way towards speeding up the above queries once it's released. PostgreSQL is a fantastic tool when a dataset can't enforce append-only operations, but it would be nice to see if use cases like the above could be optimised further than they already have been.

If the code above were ever used in production, a Least Recently Used (LRU) cache on the results from either PostgreSQL or ClickHouse could go a long way towards returning subsequent matching records more quickly.
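As a sketch of that idea, and assuming Python 3's functools.lru_cache rather than the Python 2 used elsewhere in this post, the ClickHouse lookup could be wrapped so that repeated queries for the same IPv4 integer never leave the process. The cache size and the trimmed-down connection settings are my own choices.

from functools import lru_cache

from clickhouse_driver import Client as CH


client = CH('localhost', port=9000)


@lru_cache(maxsize=65536)
def lookup_stub(ip):
    # Only reaches ClickHouse the first time a given IPv4 integer is seen;
    # repeat lookups for that integer are served from the in-process cache.
    rows = client.execute('SELECT domain_stub '
                          'FROM ip_ranges '
                          'WHERE start <= %d AND end >= %d '
                          'LIMIT 1' % (ip, ip))
    return rows[0][0] if rows else None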

The processing of the source data into a usable form lends itself well to systems with many CPU cores. Using an EC2 spot instance, like the 72-vCPU c5.18xlarge, would be a good choice to process the data quickly while only spending a few dollars to do so. The above instructions would need the four files and the process counts expanded to match the number of vCPUs on the given machine. I/O might become constrained at some point and should also be taken into consideration.
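A minimal sketch of the process-count side of that change: size the worker pool from the machine rather than hard-coding Pool(4) as domain_stub.py does above. The extract stub below is a placeholder for the real function and the file-splitting side isn't shown.

from glob import glob
from multiprocessing import Pool, cpu_count
from uuid import uuid4


def extract(args):
    # Placeholder for the extract() function defined in domain_stub.py.
    pass


if __name__ == '__main__':
    filenames = [(csv_gz, 'out.%s.csv' % str(uuid4())[:6])
                 for csv_gz in glob('rdns_*.json.gz.csv.gz')]

    # One worker per vCPU instead of a fixed pool of four.
    pool = Pool(cpu_count())
    pool.map(extract, filenames)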

The shrinking operation would be a lot quicker if it were partitioned by the class A of each IPv4 address and run in parallel. Running everything sequentially means it's bottlenecked by the speed of a single CPU core.
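A sketch of that partitioning, assuming the same SQLite3 database built above: each worker opens its own connection and walks one /8 of the address space, so the sorted scans run concurrently. The shrink_block helper, its output file naming and the pool size are my own.

import sqlite3
from multiprocessing import Pool


def shrink_block(class_a):
    # Only scan IPv4 addresses whose first octet equals class_a.
    conn = sqlite3.connect('lookup.db')
    cur = conn.cursor()
    cur.execute('SELECT ipv4, domain_stub '
                'FROM ip_host '
                'WHERE ipv4 >= ? AND ipv4 < ? '
                'ORDER BY ipv4',
                (class_a * 2 ** 24, (class_a + 1) * 2 ** 24))

    with open('shrunken.%03d.csv' % class_a, 'w') as out:
        last_ip, last_domain, consecutive = 0, None, 0

        for ipv4, domain_stub in cur:
            if ipv4 != last_ip + (consecutive + 1):
                if last_domain:
                    out.write('%s,%d,%d\n' %
                              (last_domain, last_ip, last_ip + consecutive))
                last_ip, last_domain, consecutive = ipv4, domain_stub, 0
            else:
                consecutive += 1

        if consecutive:
            out.write('%s,%d,%d\n' %
                      (last_domain, last_ip, last_ip + consecutive))


if __name__ == '__main__':
    # One task per class A block; the per-block CSVs can be concatenated afterwards.
    Pool(8).map(shrink_block, range(256))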

Also, I suspect ClickHouse could import and sort this data much quicker than SQLite3. ClickHouse has standalone client support for feeding JSON in via UNIX pipes, transforming it using SQL and outputting the results to CSV, Parquet or even native ClickHouse Log Engine format. I suspect a few of the data processing stages above could be collapsed into a single, albeit more elaborate, ClickHouse stage.

The Rapid7 database does have gaps in the dataset from unsuccessful lookups. If your objective is to identify cloud providers, then using the IPv4 address lists that have either been crowd-sourced or officially published for AWS, Google Cloud and Azure to enrich this dataset would go a long way towards achieving a more complete picture of global IPv4 usage.
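As an example of what that enrichment could look like, AWS publishes its ranges at https://ip-ranges.amazonaws.com/ip-ranges.json. The sketch below, written in Python 3, converts each CIDR prefix into the same stub,start,end integer layout as shrunken.csv; the "aws" stub label and the script itself are my own rather than anything Rapid7 or AWS provide.

import json
import urllib.request
from ipaddress import ip_network


URL = 'https://ip-ranges.amazonaws.com/ip-ranges.json'

with urllib.request.urlopen(URL) as resp:
    ranges = json.load(resp)

for prefix in ranges['prefixes']:
    network = ip_network(prefix['ip_prefix'])

    # Emit the first and last addresses of the CIDR block as 32-bit integers,
    # matching the layout of shrunken.csv.
    print('%s,%d,%d' % ('aws',
                        int(network.network_address),
                        int(network.broadcast_address)))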

Another enrichment could involve looking for class C ranges with a lot of gaps and doing a WHOIS lookup on the first IPv4 address. It might show that a single entity owns the entire class C, in which case all IPs in that range could be marked as such.
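A sketch of that check, shelling out to the system's whois client rather than using any particular Python WHOIS library; the integer-to-dotted-quad helper and the choice of fields to keep are assumptions on my part.

import socket
import struct
import subprocess


def int2ip(ip_int):
    # Inverse of the ip2int conversion used earlier in this post.
    return socket.inet_ntoa(struct.pack('!I', ip_int))


def whois_summary(ip_int):
    # Run whois against the first address of a sparse class C and keep
    # only the lines naming the network or organisation.
    output = subprocess.check_output(['whois', int2ip(ip_int)])
    return [line
            for line in output.decode('utf-8', 'replace').splitlines()
            if line.lower().startswith(('netname', 'orgname', 'org-name'))]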

I also noted that there were a lot of small gaps where it may be safe to assume that the missing IPv4 addresses between two ranges are owned by the same entity, meaning those ranges could be merged.

The following is an example of this:

$ grep megaegg shrunken.csv | head

megaegg,16793601,16793855
megaegg,16793857,16794111
megaegg,16794113,16794367
megaegg,16794369,16794623
megaegg,16794625,16794879
megaegg,16794881,16795135
megaegg,16795137,16795391
megaegg,16795393,16795647
megaegg,16795649,16795903
megaegg,16795905,16796159

I took a look at what sort of reduction could be found by merging ranges that shared the same domain name stub and were within 12 IPv4 addresses of one another.

$ vi merge_ip_ranges.py

lines = [(x.split(',')[0],
          int(x.split(',')[1]),
          int(x.split(',')[2]))
         for x in open('shrunken.csv', 'r+b')
                  .read()
                  .strip()
                  .lower()
                  .splitlines()]

banked_first = None

for num, (host, first, last) in enumerate(lines):
    if num + 1 >= len(lines):
        continue

    if host == lines[num + 1][0] and lines[num + 1][1] - last < 12:
        if banked_first is None:
            banked_first = first
    else:
        if banked_first:
            print '%s,%d,%d' % (host, banked_first, last)
            banked_first = None
        else:
            print '%s,%d,%d' % (host, first, last)

The result was a reduction of 17,784,359 records down to 4,969,340. This would also produce a database with much wider coverage of the global IPv4 space than the dataset initially offered by Rapid7, although its accuracy would be questionable.

The above could be simplified further: if you don't see any other domain name stubs between two records with matching stubs, you could assume the entire space between those two records is owned by the same entity. Here's an example of such a scenario:

tepco,16781382,16781392
tepco,16781419,16781422
tepco,16781453,16781471

The first two records have 27 IPv4 addresses between them; the second and third have 31 addresses between them. When I removed the 12-IP-address range predicate from the above Python script, I was presented with 3,693,710 records.
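A minimal sketch of that gap-free merge, written as a standalone function rather than an edit to merge_ip_ranges.py; the merge_ranges name and its list-of-tuples input are my own. Because the records are sorted by starting address, merging adjacent records that share a stub is equivalent to merging whenever no other stub appears between them.

def merge_ranges(records):
    # records is a list of (stub, first_ip, last_ip) tuples sorted by first_ip.
    merged = []

    for stub, first, last in records:
        if merged and merged[-1][0] == stub:
            # Same stub as the previous range: extend it to cover this one,
            # regardless of how large the gap between the two ranges is.
            merged[-1] = (stub, merged[-1][1], last)
        else:
            merged.append((stub, first, last))

    return merged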

Finally, JSON has wide tooling support and many developers understand what it is and how to work with it, but for the delivery of a billion rows or more it would be nice to see Parquet as one of the delivery formats. Parquet's tooling support has grown immensely in the last few years and manipulating data stored in Parquet is far more efficient than JSON. There's an argument that a Parquet distributable could come with instructions on converting it into JSON or other formats using open source tooling.
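As a rough illustration of how little tooling such a conversion needs, here is a sketch using the pyarrow library to turn the shrunken CSV from earlier into a Parquet file. The column names are my own labels; pyarrow is my choice of tool rather than anything prescribed by Rapid7.

import pyarrow.csv as pa_csv
import pyarrow.parquet as pq


# Read the headerless three-column CSV produced by shrink.py.
table = pa_csv.read_csv(
    'shrunken.csv',
    read_options=pa_csv.ReadOptions(
        column_names=['domain_stub', 'start', 'end']))

# Write the same table out as a single Parquet file.
pq.write_table(table, 'shrunken.parquet')

Converting the Parquet file back out to CSV or JSON is just as short, which is the sort of instruction a Parquet distributable could ship with.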
