Bulk IP Address WHOIS Collection with Python and Hadoop

    SELECT registry, number_ips, COUNT(*)
    FROM ips
    WHERE registry != 'ripencc'
    GROUP BY 1, 2
    ORDER BY 1, 2;

     registry | number_ips | count
    ----------+------------+-------
     arin     |        256 | 47324
     arin     |        512 |  8236
     arin     |       1024 | 13010
     arin     |       2048 |  7272
     arin     |       4096 | 10384
     arin     |       8192 |  7082
     arin     |      16384 |  3696
     arin     |      32768 |  1986
     arin     |      65536 | 12618
     arin     |     131072 |   994
     arin     |     262144 |   664
     arin     |     524288 |   334
     arin     |    1048576 |   230
     arin     |    2097152 |   108
     arin     |    4194304 |    48
     arin     |    8388608 |    14
     arin     |   16777216 |    60
     apnic    |        256 | 12578
     apnic    |        512 |  4088
     apnic    |       1024 |  8448
     apnic    |       2048 |  3382
     apnic    |       4096 |  3798
     apnic    |       8192 |  3464
     apnic    |      16384 |  1900
     apnic    |      32768 |  1410
     apnic    |      65536 |  3306
     apnic    |     131072 |  1248
     apnic    |     262144 |   892
     apnic    |     524288 |   450
     apnic    |    1048576 |   240
     apnic    |    2097152 |    98
     apnic    |    4194304 |    40
     apnic    |    8388608 |     2
     apnic    |   16777216 |     4
     lacnic   |        256 |  2100
     lacnic   |        512 |   430
     lacnic   |       1024 |  2090
     lacnic   |       2048 |  2078
     lacnic   |       4096 |  4134
     lacnic   |       8192 |  1494
     lacnic   |      16384 |   768
     lacnic   |      32768 |   580
     lacnic   |      65536 |  1006
     lacnic   |     131072 |   336
     lacnic   |     262144 |   354
     lacnic   |     524288 |    30
     lacnic   |    1048576 |    24
     lacnic   |    2097152 |     6
     afrinic  |        256 |  1388
     afrinic  |        512 |   149
     afrinic  |        768 |    21
     afrinic  |       1024 |   925
     afrinic  |       1280 |    36
     afrinic  |       1536 |    24
     afrinic  |       1792 |    20
     afrinic  |       2048 |   408
     afrinic  |       2304 |    15
     afrinic  |       2560 |    41
     afrinic  |       2816 |     4
     afrinic  |       3072 |     6
     afrinic  |       4096 |   473
     afrinic  |       5120 |    17
     afrinic  |       7680 |     9
     afrinic  |       7936 |     6
     afrinic  |       8192 |   517
     afrinic  |       8960 |     4
     afrinic  |      12800 |    17
     afrinic  |      16384 |   224
     afrinic  |      24576 |     3
     afrinic  |      25600 |    10
     afrinic  |      32768 |    98
     afrinic  |      65536 |   354
     afrinic  |     131072 |    69
     afrinic  |     196608 |     3
     afrinic  |     262144 |    48
     afrinic  |     393216 |     3
     afrinic  |     524288 |    34
     afrinic  |    1048576 |    20
     afrinic  |    2097152 |     8

RIPE NCC, on the other hand, has very granular assignments with large numbers of cases within each:

     registry | number_ips | count
    ----------+------------+-------
     ripencc  |          8 |    30
     ripencc  |         16 |    26
     ripencc  |         32 |   120
     ripencc  |         48 |     3
     ripencc  |         64 |   126
     ripencc  |         96 |     3
     ripencc  |        128 |   176
     ripencc  |        192 |     3
     ripencc  |        256 | 28458
     ripencc  |        384 |     6
     ripencc  |        512 | 10449
     ripencc  |        640 |     4
     ripencc  |        768 |   498
     ripencc  |       1024 | 12591
     ripencc  |       1120 |     4
     ripencc  |       1152 |     5
     ripencc  |       1280 |   229
     ripencc  |       1536 |   263
     ripencc  |       1792 |    80
     ripencc  |       2048 | 13419
     ripencc  |       2304 |    47
     ripencc  |       2560 |   142
     ripencc  |       2816 |    41
     ripencc  |       3072 |   128
     ripencc  |       3328 |    25
     ripencc  |       3584 |    20
     ripencc  |       3840 |    40
     ripencc  |       4096 |  9447
     ripencc  |       4352 |    23
     ...

Why Not Use One Machine & IP?

It's a valid point that one computer on one IP address could possibly perform this job. To find out how well it would perform, I generated a file of 1,000 random IP addresses (1000_ips.txt) and used a pool of 40 workers to perform WHOIS queries.

    $ pip install eventlet

    from eventlet import GreenPool, patcher
    # Patch the standard library so blocking network calls
    # yield to the other green threads in the pool.
    patcher.monkey_patch(all=True)

    from ipwhois import IPWhois


    def whois(ip_address):
        # Look up a single address over RDAP and print the parsed record.
        obj = IPWhois(ip_address, timeout=10)
        results = obj.lookup_rdap(depth=1)
        print(results)


    if __name__ == "__main__":
        pool = GreenPool(size=40)  # 40 concurrent workers

        # The file holds one IP address per line.
        ip_addresses = open('1000_ips.txt').read().split()

        for ip_address in ip_addresses:
            pool.spawn_n(whois, ip_address)

        pool.waitall()

The task took 11 minutes and 58 seconds to complete on my machine, which works out to roughly 0.7 seconds per lookup. I occasionally got an HTTPLookupError exception, which wasn't the end of the world, but then I also saw the following:

    HTTPRateLimitError: HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/x.x.x.x. Rate limit exceeded, wait and try again (possibly a temporary block).

If I could use more than one IP address, I could avoid these exceptions for longer.

Generating A List of IPs

My plan is to generate ~4-5 million IPv4 addresses that will be used as a first pass. Once I've collected all the WHOIS records, I can then see how many black spots remain in the IPv4 spectrum. I'll run a Python script to generate this list.
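To give a feel for what a first-pass list of this size could look like, here is a minimal sketch. It is one possible approach, not necessarily the one used for this project. It assumes Python 3 and the standard-library `ipaddress` module, and the function name `sample_ipv4` and its `stride`, `start` and `stop` parameters are made up for illustration. It walks the IPv4 space at a fixed stride and skips anything that isn't publicly routable. A stride of 1,024 gives 4,194,304 candidates before the reserved ranges are filtered out.

```python
import ipaddress
from itertools import islice


def sample_ipv4(stride=1024, start=0, stop=2 ** 32):
    """Yield one address every `stride` addresses between `start` and `stop`,
    skipping private, loopback, multicast and other non-global ranges.

    Hypothetical helper for illustration only.
    """
    for n in range(start, stop, stride):
        ip = ipaddress.IPv4Address(n)
        if ip.is_global:
            yield str(ip)


if __name__ == "__main__":
    # Print only the first few candidates; the full walk yields millions.
    for ip in islice(sample_ipv4(), 3):
        print(ip)
```

Writing the generator's output to a file one address per line would produce input in the same shape as the 1000_ips.txt file used above. A fixed stride spreads the samples evenly, but it will miss any assignment smaller than the stride, which matters for registries with very granular allocations.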
