Web Data Scraping in Python with Regex and BeautifulSoup

Learn how to quickly scrape data off the internet

Costas Andreou · May 17

What is data scraping?

Scraping is defined in the dictionary as a ‘small amount of something that has been obtained by scraping it from a surface’.

When it comes to data, scraping refers to the process of collecting data from the web so that it can be put to some use.

Photo by Hal Gatewood on Unsplash

Why would anyone want to scrape data off the net?

Imagine that you would like to build yourself a portfolio monitoring tool which requires the latest stock prices for you to value your investments.

One way that you could source those stock prices would be to scrape them off the web.

Another possibility would be that you are building an investment strategy based on the job advertisements of a specific company, and you would like to monitor all new job postings.

Doing this manually would be a very time-consuming and boring task.

Instead, you could write a script that creates time series data for your analysis.

Introduction to Data Scraping

Before we jump into it, there is one point to be made about copyright.

Many websites copyright their content, and we may not be allowed to save their data.

As such, it would be wise to read the policies set by each website before attempting to scrape their data.
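One practical way to check a site’s crawling policy programmatically is its robots.txt file. As a small sketch, Python’s standard urllib.robotparser can tell you whether a given path is open to crawlers; the rules and URLs below are made up purely for illustration, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; in practice you would
# fetch https://<site>/robots.txt and feed its lines in the same way.
robots_txt = """User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Paths matching a Disallow rule are off-limits to crawlers.
print(rp.can_fetch('*', 'https://example.com/quotes.html'))     # True
print(rp.can_fetch('*', 'https://example.com/private/x.html'))  # False
```

Note that robots.txt only covers automated crawling etiquette; a site’s terms of use may impose stricter conditions on what you can do with the data.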

In this instance, since this is an exercise in how to scrape data and we will be using the data not for commercial purposes but for our own personal analysis, we can explore how to scrape stock data from NASDAQ’s website.

Photo by José Alejandro Cuffia on Unsplash

Connecting to a web page and retrieving data

The first thing we need to do is connect to a web page and retrieve some data.

There are many ways to do that, but the easiest is to use the urllib library.

We can use the following script:

```python
import urllib.request

fileobject = urllib.request.urlopen('https://www.nasdaq.com/')
```

This has essentially downloaded the web page into our ‘fileobject’ variable, in exactly the same fashion as when we open a file.

Thus, to play it back, we need to loop through it line by line:

```python
import urllib.request

fileobject = urllib.request.urlopen('https://www.nasdaq.com/')

for line in fileobject:
    print(line)
```

Processing the data using Regex

Now that’s a lot of data that we simply don’t care about.

We are only interested in the NASDAQ Index ticker information that comes back; that is the price and the net change.

To filter out the information we don’t want, we can use some simple regex (don’t forget to check my blog on regex: Introduction To Regex in Python).

Attempting to use the regex directly on the data, however, will not work out of the box; we first need to decode it using the .decode() method.

The reason for this is that regex requires a string, whereas the html returned is in bytes.
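A minimal offline illustration of the bytes-versus-string issue, using a made-up byte string rather than a live page:

```python
import re

# urlopen yields raw bytes, something like this fabricated line:
raw = b'nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","7916.94");'

# re.findall('...', raw) would raise TypeError: cannot use a string
# pattern on a bytes-like object. Decoding first gives us a str.
text = raw.decode()  # defaults to UTF-8

print(re.findall('storeIndexInfo', text))  # ['storeIndexInfo']
```

Alternatively, you could keep everything in bytes and use a bytes pattern such as b'storeIndexInfo', but decoding once up front keeps the rest of the script simpler.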

```python
import urllib.request, re

fileobject = urllib.request.urlopen('https://www.nasdaq.com/')

i = 1
for line in fileobject:
    if re.findall('nasdaqHomeIndexChart.storeIndexInfo', line.decode()):
        print(i, ': ', line)
        i = i + 1
```

This returns the following:

```
857 :  b'nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","7916.94","6.35","0.08","2,342,496,944","7949.34","7759.34");'
858 :  b'nasdaqHomeIndexChart.storeIndexInfo("DJIA","25942.37","114.01","0.44","","26019.32","25469.86");'
859 :  b'nasdaqHomeIndexChart.storeIndexInfo("S&P 500","2881.40","10.68","0.37","","2891.31","2825.39");'
860 :  b'nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100","7586.53","3.78","0.05","","7623.01","7426.75");'
```

This is good for a first pass, but really, we need to extract the information out of the string.

We could do that by using another regex statement:

```python
import urllib.request, re

fileobject = urllib.request.urlopen('https://www.nasdaq.com/')

for line in fileobject:
    if re.findall('nasdaqHomeIndexChart.storeIndexInfo', line.decode()):
        temp = re.findall('"(.+?)"', line.decode())
        print(line.decode())
        for tmp in temp:
            print(tmp)
```

Example of the data returned:

```
NASDAQ
7916.94
6.35
0.08
2,342,496,944
7949.34
7759.34
```

As we have seen, this is a quick and easy way to scrape data off a website using regular expressions.
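To make the extracted values easier to work with, the same quoted-field regex can feed a small helper that labels each one. The field names below are my own guesses based on the sample output, not something the page documents; note also that I use '"(.*?)"' rather than '"(.+?)"', since .*? also matches the empty volume field ("") seen in the DJIA line:

```python
import re

def parse_index_line(line):
    """Split a storeIndexInfo(...) line into labelled fields.

    The labels are assumptions inferred from the sample output.
    """
    fields = re.findall('"(.*?)"', line)  # .*? tolerates empty "" fields
    labels = ['name', 'price', 'net_change', 'pct_change',
              'volume', 'high', 'low']
    return dict(zip(labels, fields))

sample = ('nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","7916.94",'
          '"6.35","0.08","2,342,496,944","7949.34","7759.34");')
parsed = parse_index_line(sample)
print(parsed['name'], parsed['price'])  # NASDAQ 7916.94
```

Returning a dict means later code can say parsed['price'] instead of remembering positional indices.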

However, what if we wanted to parse the HTML more natively?

Photo by Ella Olsson on Unsplash

Processing the data using BeautifulSoup

One of the most popular libraries for parsing HTML in Python is BeautifulSoup.

Let us explore how you can use it for data scraping.

Unless you have previously used BeautifulSoup, you will most likely not have it installed on your machine.

To install it, execute the following command in your command line window:

```
pip install beautifulsoup4
```

Now that we have the library installed, we want to extract the HTML and inspect it, so we can determine exactly what it is we want to extract.

Before we examine the website we were looking at earlier, we had better start with the basics.

Firstly, we will load some HTML in our script, and then we will see how we can navigate it.

We can use the following script as a step-by-step guide:

```python
import urllib.request
from bs4 import BeautifulSoup

input = '''<html><head><title>Learning BeautifulSoup</title></head>
<body>
<p id="FirstPTag" align="center">Text chunk <b>1</b></p>
<p id="SecondPTag" align="notcentre">Text chunk numero duo <b>2</b></p>
</body>
</html>'''

soup = BeautifulSoup(input, 'html.parser')

print('--------Soup:--------')
print(soup.prettify())
print('--------soup.html.head:--------', soup.html.head)
print('--------soup.html.head.string:--------', soup.html.head.string)
print("--------soup('p'):--------", soup('p'))
print("--------soup('p', {'align' : 'center'}):--------", soup('p', {'align' : 'center'}))
print("--------soup('p', {'align' : 'center'})[0]['id']:--------", soup('p', {'align' : 'center'})[0]['id'])
```

As you can see, the above example shows you how to navigate the HTML tree using BeautifulSoup and how to extract different pieces of information.

This should hopefully enable us to replicate what we have previously done with regex.

Looking for our Prices with BeautifulSoup

The first step in our analysis is to grab the HTML and examine it closely.

This can be achieved relatively easily:

```python
import urllib.request
from bs4 import BeautifulSoup

fileobject = urllib.request.urlopen('https://www.nasdaq.com/')
soup = BeautifulSoup(fileobject.read(), 'html.parser')
print(soup.prettify())
```

The structure is quite similar to what we have seen before in the regex example, but this time around we are using the .read() method on the fileobject and the .prettify() method on the soup variable. This allows us to observe the HTML in a nicely formatted (indented) way. Doing so immediately gives us some more context.

For one, we know the website is using JavaScript, meaning that our life might be a tad more difficult when it comes to scraping data.

We first attempt to reduce the data set further with:

```python
import urllib.request
from bs4 import BeautifulSoup

fileobject = urllib.request.urlopen('https://www.nasdaq.com/')
soup = BeautifulSoup(fileobject.read(), 'html.parser')

container = soup("script", type="text/javascript")
for item in container:
    print('--------------------------------')
    print(item)
```

This will quickly reveal that the data we are after is within CDATA.

CDATA is truly a pain to work with using BeautifulSoup.

At this point, it is easier to use string manipulation to get our data, than it is to try and continue using BeautifulSoup.
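The string-manipulation idea can be sketched offline on a fabricated script block that mimics what the page returns; the CDATA content below is invented for illustration:

```python
# A made-up stand-in for the <script> contents BeautifulSoup hands back.
cdata_block = """//<![CDATA[
nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","7916.94","6.35");
someOtherFunction();
//]]>"""

if "![CDATA[" in cdata_block:
    for line in cdata_block.split('\n'):
        if 'nasdaqHomeIndexChart.storeIndexInfo' in line:
            # Splitting on the double quote leaves the quoted values
            # at the odd indices: 1, 3, 5, ...
            data = line.split('"')
            print(data[1], data[3], data[5])  # NASDAQ 7916.94 6.35
```

This works because each value in the JavaScript call is wrapped in double quotes, so a plain split gets us to the fields without any HTML parsing.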

In the following script, we filter the returned script blocks for the ones that include CDATA. Then we break them up line by line and filter the lines for the string nasdaqHomeIndexChart.storeIndexInfo. At that point, it is a straightforward .split() to return the data we are after.

```python
import urllib.request
from bs4 import BeautifulSoup

fileobject = urllib.request.urlopen('https://www.nasdaq.com/')
soup = BeautifulSoup(fileobject.read(), 'html.parser')

container = soup.find_all("script", type="text/javascript", text=True)
for item in container:
    if "![CDATA[" in item.contents[0]:
        for line in item.contents[0].split('\n'):
            if 'nasdaqHomeIndexChart.storeIndexInfo' in line:
                data = line.split('"')
                print(data[1], data[3], data[5], data[7], data[9], data[11], data[13])
```

This will give us the values for each index, one index per line.

Storing the Data

Now that we have all the necessary data to carry out our analysis, we should probably save it somewhere.

I personally like to use either Excel or SQL.
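If you lean towards SQL, a minimal sketch using Python’s built-in sqlite3 module could look like the following. The table name, column names, and sample rows here are my own invention, shaped like the scraped data above:

```python
import sqlite3

# Sample rows in the same shape as the scraped data (all strings, since
# the volumes contain thousands separators).
rows = [
    ('NASDAQ', '7916.94', '6.35', '0.08', '2,342,496,944', '7949.34', '7759.34'),
    ('DJIA', '25942.37', '114.01', '0.44', '', '26019.32', '25469.86'),
]

conn = sqlite3.connect(':memory:')  # use a file path instead to persist
conn.execute("""CREATE TABLE IF NOT EXISTS prices
                (name TEXT, price TEXT, net_change TEXT, pct_change TEXT,
                 volume TEXT, high TEXT, low TEXT)""")
conn.executemany('INSERT INTO prices VALUES (?, ?, ?, ?, ?, ?, ?)', rows)
conn.commit()

for name, price in conn.execute('SELECT name, price FROM prices'):
    print(name, price)
```

With the data in a table, building the time series for the portfolio monitor becomes a matter of inserting a timestamped row on each run and querying by index name.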

We can export our data into a CSV (pipe-delimited, as the volumes come with commas every thousand) using the following script:

```python
import urllib.request
from bs4 import BeautifulSoup

fileobject = urllib.request.urlopen('https://www.nasdaq.com/')
soup = BeautifulSoup(fileobject.read(), 'html.parser')

container = soup.find_all("script", type="text/javascript", text=True)
fout = open('output.csv', 'w')
for item in container:
    if "![CDATA[" in item.contents[0]:
        for line in item.contents[0].split('\n'):
            if 'nasdaqHomeIndexChart.storeIndexInfo' in line:
                data = line.split('"')
                print(data[1], data[3], data[5], data[7], data[9], data[11], data[13])
                lineout = (data[1] + '|' + data[3] + '|' + data[5] + '|' + data[7] +
                           '|' + data[9] + '|' + data[11] + '|' + data[13] + '\n')
                fout.write(lineout)
fout.close()
```
