Easy Web Scraping with Python BeautifulSoup

Felicia, Jan 3

Possibly Faster than Selenium WebDriver to Master

Web Scraping

I first started learning about web scraping using Selenium, an open-source framework for automated testing.

We needed a way to test the browser user interface of legacy applications for correctness.

Selenium IDE is so simple, it can be learned in minutes for basic functionality.

If you need more “advanced” programming features, such as for or while loops, you need to graduate to Selenium WebDriver coupled with one of the many programming languages it supports.

In my case, I used Java to write automated testing scripts.

However, Java isn’t the fastest language to learn, and the Eclipse IDE configuration wasn’t the easiest to set up.

Now that I have learned Python, web scraping seems much simpler with Beautiful Soup, an open-source library for parsing HTML.

You don’t have to tirelessly “walk” the DOM if the elements do not have proper ID attributes.

In Beautiful Soup, the DOM elements (<a>, <div>, <p>, etc.) can be collected into a list with one command.

Here’s a quick tutorial based on the work created by Antonia Blair.

Her explanations helped me learn Beautiful Soup in an amazingly short amount of time.

Setup

I use Vagrant as my Linux environment, running Ubuntu 14.04 (trusty64).

You will need to install:

- python3 (my version is 3.4.3)
- requests module (version 2.2.1)
- BeautifulSoup module (version 4.2.1)

I used this to check my module versions.

[Screenshot: Check python module versions]

Note: if you use Python 2, you will use pip, not pip3.
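The versions can also be checked from inside Python. This is a minimal sketch, not the command from the screenshot; it assumes Python 3.8 or newer (for importlib.metadata), which is later than the 3.4.3 used in this post. The helper name installed_version is my own.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI; BeautifulSoup ships as "beautifulsoup4".
for name in ("requests", "beautifulsoup4"):
    print(name, installed_version(name) or "not installed")
```

If either line prints "not installed", install the package with pip3 before continuing.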

Basic BeautifulSoup Code

Once everything is set up, let’s see what the HTML content looks like at the PyLadies (https://www.pyladies.com) homepage.

With just a few lines of python code, we include the modules, retrieve the contents, and then print out the HTML code to the screen.

It is remarkable how short this python program is.

Let’s call this program beautifulSoup.py. Make sure you set the right Linux file permissions with:

    $ chmod 755 beautifulSoup.py

The program is below.

    #!/usr/bin/python3
    import requests                 # Include HTTP Requests module
    from bs4 import BeautifulSoup   # Include BS web scraping module

    url = "http://www.pyladies.com"              # Website / URL we will contact
    r = requests.get(url)                        # Sends HTTP GET Request
    soup = BeautifulSoup(r.text, "html.parser")  # Parses HTTP Response
    print(soup.prettify())                       # Prints user-friendly results

To run this program, type:

    $ ./beautifulSoup.py

A small screenshot of running the program is below.

The HTTP response sent back from the pyladies.com web server is below:

[Screenshot: Results of running a simple Beautiful Soup program]

You can see what your browser requires to display the index.html page.
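If you want to try the parsing step without contacting a web server, the same two calls work on any HTML string. The sketch below uses a made-up HTML snippet as a stand-in for r.text; the snippet and its contents are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text; any HTML string works here.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")  # same parser as the program above
pretty = soup.prettify()                   # one tag per line, indented by depth
print(pretty)
```

This is handy for experimenting, since it runs instantly and does not depend on the live page staying the same.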

If you want to display the HTTP status code, just add the single command below; 200 is the standard response for a successful HTTP request.

The program now looks like:

    #!/usr/bin/python3
    import requests                 # Include HTTP Requests module
    from bs4 import BeautifulSoup   # Include BS web scraping module

    url = "http://www.pyladies.com"              # Website / URL we will contact
    r = requests.get(url)                        # Sends HTTP GET Request
    print(r.status_code)                         # --> Print HTTP status code <--
    soup = BeautifulSoup(r.text, "html.parser")  # Parses HTTP Response
    print(soup.prettify())                       # Prints user-friendly results

You can see only one line of code was added.

    print(r.status_code)

The result is below.

[Screenshot: HTTP status code of 200 (successful HTTP request) is now output]

In this blog, the parsed page is stored in my soup variable.

You, of course, can name your variable any name you want.
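Going back to the status code for a moment: a bare number like 200 is easier to read with its standard name next to it. Here is a minimal sketch using only the standard library; the helper names describe_status and is_success are my own, and in the program above you would pass in r.status_code.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '200 OK' for a numeric HTTP status code."""
    try:
        return "{} {}".format(code, HTTPStatus(code).phrase)
    except ValueError:
        # Codes that are not registered in the standard library enum
        return "{} (unknown status)".format(code)


def is_success(code):
    """2xx codes mean the request succeeded."""
    return 200 <= code < 300


print(describe_status(200), is_success(200))
print(describe_status(404), is_success(404))
```

The requests module also offers r.raise_for_status(), which raises an exception for 4xx and 5xx responses if you would rather stop than parse an error page.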

Finding a Match in the BeautifulSoup object

find() Method

find() is one of the best features in BeautifulSoup. It returns the first DOM element that matches your criteria, so you can manipulate what you need.
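As a minimal sketch of what find() gives back, here it is on a small made-up HTML snippet instead of a live page. The ids "intro", "details", and "no_such_id" are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; the ids below are made up for illustration.
html = """
<div id="intro">First div</div>
<div id="details">Second div</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")                 # first <div> in document order
details = soup.find("div", id="details")     # first <div> with a matching id
missing = soup.find("div", id="no_such_id")  # no match gives None

print(first_div)
print(details)
print(missing)
```

Because a failed search returns None, it is worth checking the result before calling methods on it.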

Knowing which HTML element you want on a webpage is half the battle.

To do this I like to use the Google Chrome browser’s Inspect feature.

On a Mac, if you hover over the element you want to grab (in this instance, the “Buy Stickers” button on pyladies.com) and two-finger press, a menu opens with the “Inspect” option.

On a Windows machine, it’s a right-click while hovering over the element with a similar menu option.

To access web page elements in other browsers, read more here.

[Screenshot: How to Inspect the DOM of a webpage]

[Screenshot: Identifying the “Buy Stickers” button in the webpage’s HTML code]

Once you uniquely identify the element, you can use BeautifulSoup’s find() to locate it. In this case, it’s:

    soup.find('div', id="stickers_btn")  # Use print() for the results

Printing the results displays the following.

[Screenshot: Adding print(soup.find('div', id="stickers_btn"))]

title, h1, body Attributes

Here are other useful ways of locating the right HTML element.

    # returns the first div on the page
    soup.find('div')

    # find the first div with id='welcome_message'
    soup.find('div', id='welcome_message')

    # finds the respective HTML tag element
    soup.title
    soup.h1
    soup.body.div

find_all() Method

Now, if you want to put all of the same type of elements into a list, BeautifulSoup has find_all().

    soup.find_all('a')     # finds all <a> elements
    soup.find_all('a')[0]  # reference the first <a> element
    soup.find_all('a')[1]  # reference the second <a> element

Once you have them in a list, you can iterate over your data.

This is the power of using a programming language.

This is when I found Selenium IDE lacking and shifted over to Selenium WebDriver and Java.

Looping through elements was vital for manipulating the data and applying program logic.
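To make the find_all() indexing above concrete, here is a minimal offline sketch. The HTML snippet and its link targets are made up for illustration; on a real page the results depend on what the site serves.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<a href="/about">About</a>
<a href="/events">Events</a>
<a href="/blog">Blog</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")     # list-like result holding every <a> tag
print(len(links))              # number of matches
print(links[0])                # first <a> element
print(links[-1])               # last <a> element
print(soup.find_all("table"))  # no matches gives an empty list, not None
```

Unlike find(), which returns None when nothing matches, find_all() returns an empty result, so a loop over it simply does nothing.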

    for link in soup.find_all('a'):  # iterate over every <a> tag
        print(link)                  # print it to the screen

[Screenshot: Print each <a> tag in pyladies.com]

get_text() Method

But this can be hard to read.

BeautifulSoup’s get_text() comes to the rescue.

Changing the code to:

    for link in soup.find_all('a'):  # iterate over every <a> tag
        print(link.get_text())       # print its text to the screen

[Screenshot: Print the text in each <a> tag in pyladies.com]

get() Method

If you want to get all the links on a page, get() is very useful.
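One detail worth knowing is that get() returns None when a tag lacks the attribute, and that many hrefs are relative paths. The sketch below shows one way to handle both, assuming a made-up base URL and HTML snippet; urljoin comes from the standard library and is my addition, not part of Beautiful Soup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and HTML, chosen only for illustration.
base_url = "http://www.example.com/"
html = """
<a href="/about">About</a>
<a href="http://www.example.org/events">Events</a>
<a name="top">No href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

absolute_links = []
for link in soup.find_all("a"):
    href = link.get("href")  # None when the attribute is absent
    if href is None:
        continue             # skip anchors that are not links
    absolute_links.append(urljoin(base_url, href))  # resolve relative paths

print(absolute_links)
```

The simplest form of the loop, without these checks, looks like this: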

    for link in soup.find_all('a'):
        print(link.get('href'))

[Screenshot: Using get() to find all links on a webpage]

Discovering the Rest of BeautifulSoup’s Methods

If you want to see the many possible commands in Beautiful Soup, you can use Python’s interactive mode and its double-tab feature: press <tab><tab> after the [object name] and the period “.” to list the possibilities.

Enter the program above into python.

    $ python3
    Python 3.4.3 (default, Nov 28 2017, 16:41:13)
    [GCC 4.8.4] on linux
    Type "help", "copyright", "credits" or "license" for more information.

Hit <tab><tab> quickly at the “soup.” text you just entered (including the period without spaces).
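If tab completion is not available in your interpreter, the built-in dir() function gives a similar listing. This is a minimal sketch on a tiny made-up HTML string; the exact names printed depend on your installed Beautiful Soup version.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

# Public attribute and method names, skipping those that start with "_".
public_names = [name for name in dir(soup) if not name.startswith("_")]
print(public_names)

# The docstring of a single method gives a short description of it.
print(soup.find_all.__doc__)
```

The same approach works on any tag object, for example dir(soup.p).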

[Screenshot: Generate a list of Beautiful Soup commands in python Interactive Mode using <tab><tab>]

In Summary

Python is a wonderful language, and the many modules help to make it easier to achieve your programming goals.

I hope this was useful to those who, like me, just started learning about BeautifulSoup.

Many thanks to Antonia Blair (antoniablair@gmail.com) for her tutorial, upon which this was based, and to PyLadies (New York City chapter), which is helping me master Python.

Felicia Hsieh is a software engineer in career transition, looking for a software engineering / devops role in the NYC/NJ area (or remote).

She has an MBA, BSCS, and BSEE.

Github: www.github.com/feliciahsieh

Email: 214@holbertonschool.com
