Python Web Scraping with Virtual Private Networks

Web scrapers often find their IPv4 addresses showing up in aggregated traffic metrics and then being subjected to rate limiting.

Using a larger number of IPv6 addresses can help mitigate this but not all websites support IPv6.

Being able to spread connections across many IPv4 addresses can help reduce the risk of any one address being subjected to rate limiting.

To add to this, Cloud-originating IPv4 addresses are easily identifiable and they're often assumed to host synthetic traffic.

Residential IPv4 addresses face less scrutiny.

Using VPNs and/or other tunnelling techniques can go a long way to keeping crawlers under the radar and collecting data as productively as possible.

These can be hosted both in Cloud as well as residential environments.

In this post I'll explore two solutions: the first using WireGuard and the second using an OpenSSH SOCKS5 proxy.

WireGuard: A Modern VPN

WireGuard is a modern VPN solution which has been built by Jason A. Donenfeld over the past five years.

It breaks from the traditional prime number-based cryptography schemes by using Elliptic Curves.

For the past few decades, prime number schemes have been plagued by side-channel, padding, replay and forgery attacks as well as implementation errors that in some cases left contents unencrypted.

In 2017, researchers developed an attack named ROBOT that allowed them to sign messages with Facebook's and PayPal's private keys.

WireGuard uses Curve25519 which was developed by Daniel J. Bernstein in 2005.

The key exchange function built on this curve is called X25519 and the related digital signature scheme is called Ed25519.

Curve25519 requires much less computation than previous prime number-based schemes.

To add to that, Curve25519 is the fastest curve not covered under any patents and the implementation is in the public domain.

Client-side tools may not notice much of a reduction in computational requirements but servers handling a large number of encrypted requests will be able to handle a far larger number of workloads thanks to these efficiencies.

Low-powered Raspberry Pis can happily sustain 20 Mbps when tunnelling WireGuard traffic and I've witnessed the battery consumption of WireGuard's Android VPN client match that of Spotify's and WhatsApp's.

Ed25519 public keys are short: encoded the way OpenSSH stores them, they only need 68 characters of base64.

This contrasts with the 717 characters needed for a 4096-bit RSA public key.
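The size gap can be reproduced with a few lines of Python. This is a minimal sketch that assumes the OpenSSH public key encoding (a length-prefixed algorithm name followed by length-prefixed key fields) and uses dummy key bytes, since only the lengths matter here. The helper name ssh_string is made up for this illustration and the sketch assumes Python 3.

```python
import base64
import struct


def ssh_string(data):
    """Length-prefix a byte string with a 4-byte big-endian length."""
    return struct.pack(">I", len(data)) + data


# An Ed25519 public key is always 32 bytes; zeros stand in for a real key.
ed25519_blob = ssh_string(b"ssh-ed25519") + ssh_string(bytes(32))

# A 4096-bit RSA modulus is 512 bytes plus one leading zero byte so the
# number is read as positive. 65537 is the commonly used public exponent.
rsa_exponent = (65537).to_bytes(3, "big")
rsa_modulus = b"\x00" + b"\xff" * 512
rsa_blob = (
    ssh_string(b"ssh-rsa")
    + ssh_string(rsa_exponent)
    + ssh_string(rsa_modulus)
)

ed25519_chars = len(base64.b64encode(ed25519_blob))
rsa_chars = len(base64.b64encode(rsa_blob))

print("Ed25519 :", ed25519_chars, "base64 characters")
print("RSA-4096:", rsa_chars, "base64 characters")
```

Under these assumptions the Ed25519 blob comes to 68 characters and the RSA one to a little over 700; the exact RSA figure can shift by a character or so depending on how the key is measured.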

Curve25519 is one of many Elliptic Curves in use.

Some of the others are suspected of, at best, not being misuse-resistant and lacking rigidity and, at worst, containing back doors.

Daniel J. Bernstein and Tanja Lange have painstakingly catalogued a database of eleven mathematical characteristics of rigidness and judged a wide variety of curves against these criteria.

This exercise aimed to show that each curve's underlying discrete logarithm problem is sufficiently difficult, or to flag it when it isn't.

Curves meeting all of the criteria, such as Curve25519, have been deemed to be "Safe Curves".

WireGuard also supports peers pre-sharing 256-bit symmetric encryption keys which adds an additional layer of protection against future quantum computing-based attacks.
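A pre-shared key of this kind is nothing more than 256 random bits written out in base64, the same way the other keys appear in a WireGuard configuration file. In practice the wg genpsk command is the usual way to produce one; the sketch below, assuming Python 3.6 or later, only illustrates the shape of the value. The function name is made up for this example.

```python
import base64
import secrets


def generate_preshared_key():
    """Return 32 random bytes (256 bits) encoded as base64 text."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


psk = generate_preshared_key()

# 32 bytes always encode to 44 base64 characters, padding included.
print(len(psk))
```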

RSA's prime number-based schemes started life in the 1970s and pre-date the maturity of Elliptic Curves by some 20 years.

By 2005, America's National Security Agency (NSA) had promulgated a suite of cryptographic algorithms that included Elliptic Curves.

This gave them both credibility and proof that these schemes can be used to protect sensitive information.

In 2017, the National Institute of Standards and Technology (NIST) approved Curve25519 for use by the US Federal government.

As of this writing, WireGuard is made up of 5,478 lines of C code and headers making it one of the simplest VPN solutions to date.

By contrast, OpenVPN, when compiled with OpenSSL, which in turn can be compiled with MIT Kerberos, sits at over one million lines of C code and headers combined.

This is even before you begin to count its compression library dependencies like LZO.

WireGuard will be embedded into version 5.6 of the Linux Kernel.

This will remove the overhead of context switching between the Kernel and User space while enjoying a very wide installation base.

WireGuard also ships as a standalone package for anyone using a previous version of the Kernel.

Other popular applications implementing Curve25519 include Facebook Messenger, OpenSSH, Signal, Tor, Viber and WhatsApp.

OpenSSH: Ubiquitous Encrypted Tunnelling

The main feature of focus in OpenSSH in this post is the SOCKS5 proxy support.

It allows users to set up local ports that can tunnel TCP traffic through a remote OpenSSH server.

The Secure Shell protocol (SSH) was invented in Finland by Tatu Ylönen in 1995.

Though he had produced an open source implementation, it came with various restrictions and has since become proprietary software.

In 1999, Damien Miller and Darren Tucker forked the SSH code base and created OpenSSH, a suite of tools designed to bring compressed and encrypted tunnelling to various web-centric communication protocols.

They released their work under a BSD license.

The OpenSSH suite is probably better known by the tools it bundles.

These include SSH, a telnet replacement, SFTP, an FTP replacement, SCP, an RCP replacement and SSHD, a server daemon for the above tools.

These tools are installed on nearly every internet-facing UNIX system.

Both Damien and Darren went on to be employed by Google where they've been working as an Information Security Engineer and Site Reliability Engineer respectively for the better part of the past two decades.

Damien Miller's LinkedIn describes his role as "Helping prevent Google from getting hacked".

OpenSSH supports a wide variety of prime number-based cryptography schemes and added support for Curve25519 in 2013.
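Which scheme ends up being used on a given connection depends on both ends. Each side sends an ordered list of the key exchange algorithms it supports and, roughly speaking, the first entry in the client's list that the server also offers is chosen. The sketch below is a simplification of that rule (the real protocol also checks that the choice is compatible with the host key algorithm) and the two lists are shortened, illustrative examples rather than the output of any particular server.

```python
def negotiate(client_algorithms, server_algorithms):
    """Return the first client preference that the server also offers."""
    offered = set(server_algorithms)
    for name in client_algorithms:
        if name in offered:
            return name
    return None  # No overlap: the connection attempt would fail.


# Illustrative lists, ordered from most to least preferred.
client = [
    "curve25519-sha256@libssh.org",
    "ecdh-sha2-nistp256",
    "diffie-hellman-group14-sha1",
]
server = [
    "ecdh-sha2-nistp256",
    "diffie-hellman-group14-sha1",
    "diffie-hellman-group1-sha1",
]

chosen = negotiate(client, server)
print(chosen)
```

Here the client would prefer Curve25519 but this server doesn't offer it, so the pair falls back to the client's next choice.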

To see which digital signature, encryption and compression schemes are supported by both your client and any given SSH server you can connect to, adjust the following with the hostname of the target server.

$ ssh -vvv <hostname> uptime 2>&1 | grep -i kex

Below you can see the algorithms supported by your client.

debug2: local client KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,ext-info-c

These are the algorithms supported by the Server.

debug2: peer server KEXINIT proposal
debug2: KEX algorithms: ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1

This is what your client and the Server have agreed upon using.

debug1: kex: algorithm: ecdh-sha2-nistp256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none
debug1: kex: client->server cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none

Setup a WireGuard Server

For this example, I'll set up a WireGuard VPN Server on AWS EC2.

If you want to run the following on a Raspberry Pi running Raspbian on a residential internet connection the steps will be much the same.

I'll launch an on-demand t3.micro instance in eu-west-1 running Ubuntu 16.

It'll cost $8.32 / month + VAT and has 1 GB of RAM, 2 vCPUs and up to 5 Gbps of network connectivity.

I'll set up 8 GB of Magnetic EBS storage which will incur additional costs.

I'll create a new security group called vpn-farm.

I'll open up TCP port 22 to my IP address and UDP port 51220 (not TCP, UDP) to the vpn-farm security group.

The external address of this EC2 instance is 54.246.243.162 and the private address, which is accessible across the VPC this EC2 instance lives in, is 172.30.2.186.

To set up the machine I'll first SSH into it.

Note, for all my efforts championing Ed25519 above, AWS IAM doesn't support it at this time.

As far as I can find only RSA keys are supported.

Apologies for the link being behind an AWS login screen.

$ ssh ubuntu@54.246.243.162

I'll refresh the packages list and then install WireGuard via PiVPN's installer.

When installing WireGuard clients I tend to use WireGuard and Linux tooling directly but for WireGuard servers PiVPN wraps up a lot of complexity and edge case coverage into its installer.

$ sudo apt update
$ wget -qO- https://install.pivpn.io | bash

You'll be given the option of installing either OpenVPN or WireGuard, choose WireGuard.

Select UDP port 51220 as WireGuard's default port.

You'll be presented with a list of DNS providers such as Quad9, OpenDNS, Level3, DNS.WATCH, Norton, FamilyShield, CloudFlare, Google or Custom.

Choose what you're comfortable with using.

You can configure WireGuard to work with a domain name or IPv4 address; for this exercise I'm using the private IPv4 address alone.

I'll enable unattended upgrades of security patches.

Following all the above, PiVPN asked to reboot the system.

Once WireGuard was set up I created a new user account called scrape.

$ pivpn add --name scrape
::: Client Keys generated
::: Client config generated
::: Updated server config
::: WireGuard restarted
======================================================================
::: Done! scrape.conf successfully created!
::: scrape.conf was copied to /home/ubuntu/configs for easy transfer.
::: Please use this profile only on one device and create additional
::: profiles for other devices. You can also use pivpn -qr
::: to generate a QR Code you can scan with the mobile app.
======================================================================

Below is a truncation of the configuration file generated.

I'll use this on the scraping machine to connect to the WireGuard Server.

$ sudo cat /etc/wireguard/configs/scrape.conf
[Interface]
PrivateKey = A...=
Address = 10.6.0.2/24
DNS = 1.1.1.1, 1.0.0.1

[Peer]
PublicKey = A...=
PresharedKey = A...=
Endpoint = 172.30.2.186:51220
AllowedIPs = 0.0.0.0/0, ::0/0

Setup WireGuard's Client

I'll set up another Ubuntu 16 Server for scraping.

This should have more vCPUs, RAM and disk space as it'll be used for parsing and storing data collected from scraping.

The specifications of these sorts of machines are very much dependent on their workloads so I'll refrain from making generic recommendations.

I'll install Python and a utility that will allow us to install WireGuard from a 3rd-party repository.

$ sudo apt update
$ sudo apt install python-pip python-virtualenv software-properties-common

The following will give the system the details of the 3rd-party repository hosting the WireGuard package we're interested in and then install it along with OpenResolv.

$ sudo add-apt-repository ppa:wireguard/wireguard
$ sudo apt update
$ sudo apt install openresolv wireguard

OpenResolv triggered the removal of resolvconf which requires a system reboot.

$ sudo reboot

The scrape.conf file generated on the WireGuard Server has been saved to the home folder I'm using on this machine.

I'll copy the configuration into WireGuard's configuration folder.

$ cd ~
$ sudo install -o root -g root -m 600 scrape.conf /etc/wireguard/wg0.conf

I'll then launch WireGuard and tell the system to launch it after any reboot.

$ sudo systemctl start wg-quick@wg0
$ sudo systemctl enable wg-quick@wg0

Run the following to make sure the service launched without issue.

$ sudo systemctl status wg-quick@wg0 | tail -n1
Apr 12 00:11:03 ubuntu systemd[1]: Started WireGuard via wg-quick(8) for wg0.

If you see anything other than the above try running the following again.

$ sudo systemctl start wg-quick@wg0

WireGuard should now be able to report its telemetry.

$ sudo wg
interface: wg0
  public key: a...=
  private key: (hidden)
  listening port: 53806
  fwmark: 0xca6c

peer: A...=
  preshared key: (hidden)
  endpoint: 172.30.2.186:51220
  allowed ips: 0.0.0.0/0, ::/0
  latest handshake: 30 seconds ago
  transfer: 204 B received, 292 B sent

Any networking software on the machine making any new connections will automatically tunnel via WireGuard.

$ wget -qO- https://ipv4.icanhazip.com
54.246.243.162

Here is a short Python example.

I'll create a virtual environment with requests, which will handle all HTTP and HTTPS calls, and BeautifulSoup, which will handle parsing of any HTML returned.

$ virtualenv ~/.scrape
$ source ~/.scrape/bin/activate
$ pip install beautifulsoup4 requests

The following sets the User-Agent header to that of a recent version of Chrome.

Replace the <hostname> with a server of your choice.

I've set up a session so cookies will follow any subsequent requests.

$ python
from bs4 import BeautifulSoup
import requests

# Present the crawler as a desktop Chrome browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/78.0.3904.97 Safari/537.36'
}

# A session keeps cookies between requests.
session = requests.Session()
resp = session.get('https://<hostname>/', headers=headers)
assert resp.status_code == 200, 'Unexpected HTTP %d' % resp.status_code

The following will parse and print out the contents of any H1 tags found in the above call.

# Name the parser explicitly so BeautifulSoup doesn't have to guess one.
soup = BeautifulSoup(resp.text, 'html.parser')
print([x.text.strip().lower() for x in soup.find_all('h1')])

All of the above ran via WireGuard automatically.
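No proxy settings were needed in the Python code because of the AllowedIPs line in the client configuration: 0.0.0.0/0 and ::0/0 between them cover every IPv4 and IPv6 destination, so all outbound traffic is handed to the tunnel. The sketch below, assuming Python 3, illustrates that check using only the standard library. The configuration text is a trimmed copy of the one shown earlier with the keys left out, and the function name is made up for this example.

```python
import configparser
import ipaddress

# Trimmed copy of the client configuration, keys omitted.
CONFIG_TEXT = """
[Interface]
Address = 10.6.0.2/24
DNS = 1.1.1.1, 1.0.0.1

[Peer]
Endpoint = 172.30.2.186:51220
AllowedIPs = 0.0.0.0/0, ::0/0
"""

# WireGuard's configuration format is INI-like, so configparser can read it.
config = configparser.ConfigParser(delimiters=("=",))
config.read_string(CONFIG_TEXT)

allowed_networks = [
    ipaddress.ip_network(entry.strip())
    for entry in config["Peer"]["AllowedIPs"].split(",")
]


def routed_via_tunnel(destination):
    """Return True if the destination falls inside any AllowedIPs range."""
    address = ipaddress.ip_address(destination)
    return any(
        address.version == network.version and address in network
        for network in allowed_networks
    )


print(routed_via_tunnel("93.184.216.34"))
print(routed_via_tunnel("2001:db8::1"))
```

Narrowing AllowedIPs to specific ranges would send only those destinations through the tunnel and leave everything else on the machine's normal route.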

Setting up an OpenSSH SOCKS5 Proxy

The machines set up on AWS EC2 already come with OpenSSH installed and will have SSH public keys dropped into /home/ubuntu/.ssh/authorized_keys.

This means I can launch a SOCKS5 proxy with the following on the client system.

$ ssh -D9090 -o ServerAliveInterval=50 ubuntu@172.30.2.186

Once connected to the server, run top to keep the connection from going stale.

$ top

The ssh command above will open up TCP port 9090 locally.

Tools that use libcurl should support SOCKS5 proxy settings being defined in the ALL_PROXY environment variable.

$ sudo apt install curl
$ export ALL_PROXY=socks5h://localhost:9090
$ curl https://ipv4.icanhazip.com

For Python we'll need the SOCKS package included in the Requests installation.

$ source ~/.scrape/bin/activate
$ pip install -U 'requests[socks]'
$ python
import requests

# Present the crawler as a desktop Chrome browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/78.0.3904.97 Safari/537.36'
}

# Send both HTTP and HTTPS traffic through the local SOCKS5 port.
proxies = {
    'http': 'socks5://localhost:9090',
    'https': 'socks5://localhost:9090'
}

session = requests.Session()
resp = session.get('https://ipv4.icanhazip.com',
                   headers=headers,
                   proxies=proxies)
assert resp.status_code == 200, 'Unexpected HTTP %d' % resp.status_code
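Two refinements may be worth considering. The curl example used the socks5h scheme, which asks the proxy to resolve hostnames, while the Python example used socks5, which resolves them locally; Requests accepts either. And since the aim is to spread connections across several addresses, more than one tunnel can be opened and rotated through. The sketch below assumes three tunnels listening on local ports 9090 to 9092, each started with its own ssh -D command against a different server. Those ports and the helper name are made up for this illustration.

```python
from itertools import cycle


def socks_proxies(port, host="localhost", remote_dns=True):
    """Build a Requests-style proxies dict for one local SOCKS5 port."""
    # socks5h resolves hostnames on the proxy side, socks5 resolves locally.
    scheme = "socks5h" if remote_dns else "socks5"
    url = "%s://%s:%d" % (scheme, host, port)
    return {"http": url, "https": url}


# Hypothetical local ports, one per `ssh -D<port>` tunnel.
TUNNEL_PORTS = [9090, 9091, 9092]

# cycle() hands out the tunnels in round-robin order, forever.
rotation = cycle([socks_proxies(port) for port in TUNNEL_PORTS])

first = next(rotation)
second = next(rotation)
print(first["https"])
print(second["https"])
```

Each request would then pick up the next tunnel with something like session.get(url, headers=headers, proxies=next(rotation)).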
