Python Web Scraping Refactored

How do we get started?

Let’s start with our imports and getting the web page.

We import requests to obtain the web_page with requests.get().

pandas will be used later to clean up our scraped data.

BeautifulSoup from the bs4 library will be used to parse our soupy HTML to help us get the information we want.

We will do that below.
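Here is a minimal sketch of that setup. The helper name fetch_soup is hypothetical (it is not from the original code), and the sample HTML string is a made-up stand-in for a downloaded page so the parsing step can be tried without a network connection.

```python
import requests
import pandas as pd  # not used yet; needed later for cleaning the scraped data
from bs4 import BeautifulSoup


def fetch_soup(url):
    """Download a page and return it parsed as a BeautifulSoup object."""
    web_page = requests.get(url, timeout=10)
    web_page.raise_for_status()  # stop early on 4xx/5xx responses
    return BeautifulSoup(web_page.text, "html.parser")


# Offline stand-in for a downloaded page (made-up markup, not the real site).
sample_html = "<html><head><title>Staff</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
page_title = soup.title.text
```

With a real page you would call fetch_soup with the staff page's address instead of parsing the sample string.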

Next, we will investigate the HTML code from the web_page to find the section of code we need.

To view a webpage’s HTML code, go to the webpage, right-click and select “View page source”.

You can then use ctrl-f to find a staff member’s name and see the piece of HTML code where their name and info are embedded.

If you scroll a bit through the code, you should notice that pieces of information are enclosed by lines of code such as <title>……</title> or <p>….</p>

These are known as tags in the HTML code.

Between some of these tags lies the information we want to scrape.

Since we see the desired information sits inside a <div> tag with class='matrix-content', we can assume that the info for each teacher is in a tag with that class.

That is why we pass the tag and the class as arguments to the find_all method of soup.
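A small sketch of that call, using made-up markup that only mimics the structure described here (the names and the extra "other" class are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described in the article.
sample_html = """
<div class="matrix-content"><h5>Ms. Admin</h5></div>
<div class="other"><h5>Not a profile</h5></div>
<div class="matrix-content"><h5>Mr. Brogan</h5></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class_ has a trailing underscore because class is a reserved word in Python.
results = soup.find_all("div", class_="matrix-content")
names = [div.find("h5").text for div in results]
```

Only the two <div> tags with the matching class end up in results; the third one is skipped.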

We need to start at the index where the first teacher profile occurs since we are only scraping teacher information.

The first teacher to appear is “Mr. Brogan”.

You can use ctrl-f to search for his name in the HTML code.

If you count (starting from 0 of course), Mr. Brogan’s index is 29.

That is why we are redefining results starting from index 29.
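As a sketch of the slicing step, here is a toy page where the first teacher sits at index 2 rather than 29. Looking the index up in code instead of counting by hand is an optional extra of mine, and the non-teacher names are invented.

```python
from bs4 import BeautifulSoup

# Made-up page: two non-teaching staff entries followed by two teachers.
sample_html = """
<div class="matrix-content"><h5>Principal Example</h5></div>
<div class="matrix-content"><h5>Secretary Example</h5></div>
<div class="matrix-content"><h5>Mr. Brogan</h5></div>
<div class="matrix-content"><h5>Ms. Example</h5></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
results = soup.find_all("div", class_="matrix-content")

# Find the index of the first teacher instead of counting by hand.
first_teacher = next(
    i for i, div in enumerate(results) if div.find("h5").text == "Mr. Brogan"
)

# Keep only the entries from the first teacher onward
# (on the article's page this index is 29, so the slice is results[29:]).
results = results[first_teacher:]
```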

A check on the length of results and a mental count of the removed staff members confirms we can move to the next step!

Now to get all the teacher data! Right?

We will.

Before that, we should see how we would get the information we want from a single teacher first.

Then we will generalize that to our list comprehensions.

Let's look at the HTML code for one of the teachers’ profiles.

We will again inspect Mr. Brogan’s info:

<div class="matrix-content"> <h5>Mr. Brogan</h5> <div class="matrix-copy"><p> Special Education: Geometry, Particular Topics of Geometry</p><p> <em>rbrogan31@charter.newvisions.org</em></p></div> </div>

Again, we need to determine the tags that contain the teacher’s name, position(s), and email.

Take a second to try and answer this question yourself, then read on to see if you were right.

This will provide good practice for investigating what parts of the HTML you need to indicate in your Python scraping code.

Remember the examples I showed you earlier.

Teacher Name tag: The name is between the tags marked <h5>.

Position(s) tag: The position(s) is located between the <p> tags after the <div class="matrix-copy"> tag.

Email tag: The email is between the tags <p> and <em>.

Since the <em> tag directly encases the email, that is the tag we will indicate in our scraping code.

Great! Now that we have found the tags we need to indicate, let's write the code for our first teacher to determine how we will loop through the teacher entries to get all of the data!

To begin, we will define our first teacher as test_result.

Teacher Name: By using the find method on the <h5> tag, we get the line of code with our teacher’s name.

But this doesn’t give us the name without the tags.

We don’t want the tags in our data.

So to extract just the name text, we will add .text to the find call to get the text attribute of the tag.

Position(s): We will use the same find method as with the name, but this time our parameter will be the tag <p>.

Doing so gets us our position, but again we don’t want the tags attached.

Using .text again returns the following:

'\n\tSpecial Education: Geometry, Particular Topics of Geometry'

This gave us more than we wanted.

Specifically, we were given the escape codes for a new line (\n) and a tab (\t) at the beginning.

Since our info is in a string, we can remove the parts we don’t need by adding .strip('\n\t') to our line of code, which removes these characters from both ends of the string.

Email: Obtaining this information was much more straightforward.

Again, we use the find method, this time with the <em> tag as our parameter.

Using the .get_text() method helps us here since some of the emails are embedded in multiple <em> tags.
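A sketch of the email step on two made-up snippets, one with a plain <em> tag and one where the address is split across nested <em> tags (the addresses are invented):

```python
from bs4 import BeautifulSoup

# Two made-up profiles: a plain <em>, and an address split across nested <em> tags.
simple = BeautifulSoup("<p><em>teacher1@example.org</em></p>", "html.parser")
nested = BeautifulSoup("<p><em>teacher2<em>@example.org</em></em></p>", "html.parser")

# get_text() joins every piece of text found inside the outer <em> tag.
simple_email = simple.find("em").get_text()
nested_email = nested.find("em").get_text()
```

In Beautiful Soup the .text property returns the same string as .get_text(), so either spelling works here.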

Now we get all the data we want!

That’s right. So let’s get straight to it.

First, we create an empty data frame object.

We then use list comprehensions combined with our code from the test_result to get all the teacher names and positions.

We also take these list comprehensions to create the first two columns of our data frame df.
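A sketch of those list comprehensions on two made-up profiles. The column names name and position are my assumption, and the template string is only there to build sample markup.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Template for a made-up profile that mimics the structure shown earlier.
profile = (
    '<div class="matrix-content"><h5>{name}</h5><div class="matrix-copy">'
    "<p>\n\t{position}</p><p><em>{email}</em></p></div></div>"
)
sample_html = profile.format(
    name="Mr. Brogan", position="Special Education: Geometry", email="a@example.org"
) + profile.format(name="Ms. Example", position="English", email="b@example.org")

results = BeautifulSoup(sample_html, "html.parser").find_all(
    "div", class_="matrix-content"
)

df = pd.DataFrame()
# The same expressions used on test_result, applied to every profile.
df["name"] = [result.find("h5").text for result in results]
df["position"] = [result.find("p").text.strip("\n\t") for result in results]
```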

When I first ran the code for the email collection, I was greeted with an AttributeError.

This is where a variable explorer is handy.

Inspecting the webpage or the HTML code will reveal that “Ms. Veninga” does not have an email address within <em> tags, so find returns None and calling .get_text() on it raises the error.

It is between a second set of <p> tags.

Since the page is small you can do this, but for larger info collection you are better off printing where the error occurs as the list comprehension is generated.

To address this, we will create a get_email function with a try/except block: when no <em> tag is found, it uses the find_all method on <p> and then indexing to get the <p> tag that holds the email.

We will also strip away the excess text as we did with obtaining the positions info.

Meanwhile, everyone else will have their email scraped as normal.
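A sketch of what such a get_email function could look like. The index 1 (the second <p> tag) and the characters passed to strip are assumptions based on the description above, and both sample profiles are invented.

```python
from bs4 import BeautifulSoup


def get_email(result):
    """Return the email from <em>, falling back to the second <p> tag."""
    try:
        return result.find("em").get_text()
    except AttributeError:
        # find("em") returned None, so this profile has no <em> tag.
        return result.find_all("p")[1].text.strip("\n\t ")


# Made-up profiles: one with the email in <em>, one with it in a plain <p>.
with_em = '<div class="matrix-content"><h5>Mr. A</h5><p>Math</p><p><em>a@example.org</em></p></div>'
without_em = '<div class="matrix-content"><h5>Ms. B</h5><p>Art</p><p>\n\tb@example.org</p></div>'

emails = [get_email(BeautifulSoup(h, "html.parser")) for h in (with_em, without_em)]
```

Catching only AttributeError keeps other problems, such as a profile with fewer than two <p> tags, visible instead of silently hiding them.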

Running the code again allowed us to obtain all the entries successfully.

You can confirm this by checking the length of the records list (it should return 66).

You can also use df.shape[0] to check the number of rows in your data (66 for 66 teachers).

That was quick. Let’s go analyze this data!

We could… but then we would find an error with our collected data.

One thing that you could check is if you have duplicate entries and remove them.

There is a chance some of the teachers may teach multiple subjects (such as Math and English) and thus have their name occur multiple times.

By summing all the boolean values of df.duplicated we get a value of 11.

So we have 11 entries that repeat a teacher name already in the data.

We then use df.drop_duplicates to keep the first entry of that teacher name and discard the rest.

Lastly, we export our data frame to a CSV to use for future analysis.
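A sketch of the duplicate check and the export on a tiny made-up data frame. Matching duplicates on the name column only (subset="name") is my assumption, since the text talks about repeated teacher names, and the CSV is written to an in-memory buffer here so nothing lands on disk.

```python
import io

import pandas as pd

# Made-up data: Mr. A appears twice, once per subject.
df = pd.DataFrame({
    "name": ["Mr. A", "Ms. B", "Mr. A"],
    "position": ["Math", "English", "English"],
    "email": ["a@example.org", "b@example.org", "a@example.org"],
})

# duplicated() marks every repeat after the first occurrence; True counts as 1.
n_duplicates = int(df.duplicated(subset="name").sum())

# Keep the first entry for each name and discard the rest.
df = df.drop_duplicates(subset="name", keep="first").reset_index(drop=True)

# Export without the index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing a file name to to_csv instead of the buffer writes the same content to a file.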

Final Thoughts

Web scraping gives you that feeling of magic since you can pull info from any website once you find the tags you need.

I hope that this walkthrough was helpful to those considering learning how to web scrape in Python.

Of course, there is some feature engineering that could be done to aid in the analysis.

I am not going to include detail on these since I want to focus on the web scraping aspect.

Options include:

1. Creating a gender column by splitting the names by their titles at the period, then using pandas to map the title to the appropriate gender.

2. Separating the positions (since most of the teachers seem to teach more than one type of class).

For pandas practice, you can try doing the above yourself.

We could then turn to Tableau or matplotlib for visualization and statistics to answer questions regarding those pieces of data compared to the teacher population versus other charter and public Bronx schools.

Until next time,

John DeJesus
