Web Scraping Craigslist: A Complete Tutorial

Sounds like a job for… Python and web scraping! In this article, I’m going to walk you through my code that scrapes East Bay Area Craigslist for apartments.

The code here, or rather the URL parameters, can be modified to pull listings from any region, category, property type, and so on.

Pretty cool, huh? I’m going to share GitHub gists of each cell in the original Jupyter Notebook.

If you’d like to just see the whole code at once, clone the repo.

Otherwise, enjoy the read and follow along!

Getting the Data

First things first, we need the get module from the requests package.

Then we’re going to define a variable, response, and assign it the result of the get method called on the base URL, that is, the URL of the page you want to pull from.

I went to the apartments section for the East Bay and went ahead and checked the Has Picture filter to narrow it down just a little.

We’ll import BeautifulSoup from bs4, which is the module that can actually parse the HTML of the webpage we got from the server.

Then check the type and length of the result to make sure it matches the number of posts on the page (there are 120 posts per page).

Below are my import statements and setup.
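Here’s a minimal sketch of that cell. The exact search URL is an assumption on my part, so copy yours from the address bar after setting your filters.

```python
from requests import get
from bs4 import BeautifulSoup

# East Bay apartments with the Has Picture filter on (hasPic=1).
# This URL is an example; swap in the one from your own search.
url = 'https://sfbay.craigslist.org/search/eby/apa?hasPic=1'
response = get(url)

# Parse the raw HTML we got back from the server
html_soup = BeautifulSoup(response.text, 'html.parser')

# Each post lives in an <li class="result-row"> element
posts = html_soup.find_all('li', class_='result-row')

print(type(posts))  # <class 'bs4.element.ResultSet'>
print(len(posts))   # 120 posts per page
```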

It prints out the length of posts, which is 120, as expected.

Using the find_all method on the newly created html_soup variable in the code above, we find the posts.

We need to examine the website’s structure to find the parent tag to the posts.

Look at the screenshot below and you’ll see that it’s <li class="result-row">.

That is the tag for one single post, literally the box that contains all the elements we will grab!

Element inspection with Chrome (Ctrl+Shift+C shortcut!)

In order to scale this, you want to work in the following way: grab the first post and all the variables you want from it.

Make sure you know how to access them for one before you loop the whole page, and lastly, before you loop all the pages.

The class bs4.element.ResultSet is indexed, so let’s look at the first apartment with posts[0].

Surprise, it’s all the code that belongs to that <li> tag! You should have this output for the first post in posts (posts[0]), assigned to post_one.
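In code, that’s just an index into the ResultSet:

```python
# Grab the first post and inspect its HTML
post_one = posts[0]
print(post_one.prettify())
```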

The price of the post is easy to grab: .strip() removes whitespace before and after the string. For the date and time posted, we’re going to grab the 'datetime' attribute, because this saves a data cleaning step.

The URL and post title are easy: the href attribute holds the link, and we pull it by specifying that attribute.

The title is just the text of that tag.

The number of bedrooms and square footage are in the same tag, so we need to split it and grab them element-wise.

The neighborhood is the span tag of class 'result-hood', so we just grab the text of that.
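Here’s a sketch of all of those grabs on post_one. The class names result-price, result-date, result-title, and housing are assumptions based on Craigslist’s markup at the time; inspect the page if they’ve changed.

```python
# Price: grab the text and strip surrounding whitespace
post_one_price = post_one.find('span', class_='result-price').text.strip()

# Date and time: the 'datetime' attribute is already clean
post_one_datetime = post_one.find('time', class_='result-date')['datetime']

# URL and title: href holds the link, the tag's text holds the title
title_tag = post_one.find('a', class_='result-title')
post_one_link = title_tag['href']
post_one_title = title_tag.text

# Bedrooms and square footage share one tag, e.g. '2br - 800ft2 -',
# so split it and grab the pieces element-wise
housing = post_one.find('span', class_='housing').text.split()
post_one_bedrooms = housing[0].replace('br', '')   # '2br'    -> '2'
post_one_sqft = housing[2].replace('ft2', '')      # '800ft2' -> '800'

# Neighborhood: the text of the 'result-hood' span, e.g. '(berkeley)'
post_one_hood = post_one.find('span', class_='result-hood').text.strip()
```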

The next block is the loop for all the pages for the East Bay.

Since there isn’t always information on square footage and number of bedrooms, we will go ahead and build in a series of if statements embedded within the for loop.

Go ahead and grab the code for the loop below.

The loop starts on the first page, and for each post on that page it works through the logic you’ll see in the code below. I included some data cleaning steps in the loop so that we get clean data out for some of the variables before we even touch them.

Elegant code is the best! I wanted to do more, but the code would become too specific to this region and might not work across areas.
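Here’s a sketch of the loop. The paging parameter s and the total of roughly 3,000 results are assumptions; tune them for your own search, and keep the sleep call in to be polite to Craigslist’s servers.

```python
from time import sleep

from requests import get
from bs4 import BeautifulSoup

post_timing = []
post_hoods = []
post_title_texts = []
post_links = []
post_prices = []
bedroom_counts = []
sqfts = []

# Craigslist pages results 120 at a time via the 's' offset parameter
for page in range(0, 3000, 120):
    response = get('https://sfbay.craigslist.org/search/eby/apa',
                   params={'s': page, 'hasPic': 1})
    sleep(1)  # be kind to the server

    html_soup = BeautifulSoup(response.text, 'html.parser')
    posts = html_soup.find_all('li', class_='result-row')

    for post in posts:
        # Skip posts with no neighborhood tag so the lists stay aligned
        if post.find('span', class_='result-hood') is None:
            continue

        post_timing.append(post.find('time', class_='result-date')['datetime'])
        post_hoods.append(post.find('span', class_='result-hood').text.strip())

        title = post.find('a', class_='result-title')
        post_title_texts.append(title.text)
        post_links.append(title['href'])

        # Clean the price right in the loop: drop the '$' and any commas
        price = post.find('span', class_='result-price').text
        post_prices.append(int(price.replace('$', '').replace(',', '')))

        # Bedrooms and square footage aren't always listed, so a few
        # if statements keep missing values as None
        housing = post.find('span', class_='housing')
        if housing is not None:
            parts = housing.text.split()
            beds = [p for p in parts if p.endswith('br')]
            feet = [p for p in parts if p.endswith('ft2')]
            bedroom_counts.append(float(beds[0].replace('br', '')) if beds else None)
            sqfts.append(float(feet[0].replace('ft2', '')) if feet else None)
        else:
            bedroom_counts.append(None)
            sqfts.append(None)
```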

The code below creates the dataframe from the lists of values!
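A sketch of that cell; the column names here are my own choices, so rename to taste.

```python
import pandas as pd

# Build the dataframe from the parallel lists filled in the loop
eb_apts = pd.DataFrame({'posted': post_timing,
                        'neighborhood': post_hoods,
                        'title': post_title_texts,
                        'URL': post_links,
                        'price': post_prices,
                        'bedrooms': bedroom_counts,
                        'sqft': sqfts})

print(eb_apts.info())
eb_apts.head(10)
```

Awesome! There it is.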

Admittedly, there is still a little bit of data cleaning to be done.

I’ll go through that real quick and then it’s time to explore the data!

Exploratory Data Analysis

Sadly, after removing the duplicate URLs we are down to 120 instances.
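The dedupe step is one line; keying it on the listing URL column is my assumption.

```python
# Posts repeat across pages, so drop rows that share a URL
eb_apts = eb_apts.drop_duplicates(subset='URL')
print(len(eb_apts))  # 120 in my run; yours will differ
```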

These numbers will be different if you run the code, since there will be different posts at different times of scraping.

There were also about 20 posts that didn’t have bedrooms or square footage listed.

For statistical reasons, this isn’t an incredible data set, but we’ll keep that in mind and push forward.
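Here’s a sketch of the cell that draws the price distribution. The original likely used seaborn’s distplot, which has since been removed from the library, so this version uses histplot instead.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# Histogram of prices with a smoothed density curve on top
sns.histplot(eb_apts['price'], kde=True)
plt.title('Distribution of East Bay apartment prices')
plt.show()
```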

We can see the distribution of the pricing for the East Bay in the plot above.

Calling the .describe() method, we can see more detail.

The cheapest place is $850, and the most expensive is $4,800.
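That detail comes from one line:

```python
# Summary statistics for price: count, mean, std, min, quartiles, max
eb_apts['price'].describe()
```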

The next code block generates a scatter plot, where the points are colored by the number of bedrooms.

This shows a clear and understandable stratification: we see layers of points clustered around particular prices and square footages, and as price and square footage increase, so does the number of bedrooms.
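A sketch of that cell, with seaborn’s scatterplot and the hue argument doing the coloring; the exact plotting call is my assumption.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Price against square footage, one color per bedroom count
plt.figure(figsize=(10, 6))
sns.scatterplot(x='sqft', y='price', hue='bedrooms', data=eb_apts)
plt.title('Price vs. square footage, colored by bedrooms')
plt.show()
```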

Let’s not forget the workhorse of Data Science: linear regression.

We can call regplot() on these two variables to get a regression line, with a bootstrapped confidence interval shaded around the line, like so:
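A sketch, reusing the column names from the dataframe above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Regression line with a bootstrapped confidence interval shaded around it
sns.regplot(x='sqft', y='price', data=eb_apts)
plt.title('Price vs. square footage with a fitted line')
plt.show()
```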

It looks like we have an okay fit of the line on these two variables. Let’s check the correlations.

I called eb_apts.corr() to get these:
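In a cell, with a guard for newer pandas:

```python
# Pairwise correlations; numeric_only keeps newer pandas
# from choking on the text columns
eb_apts.corr(numeric_only=True)
```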

Correlation matrix for our variables

As suspected, correlation is strong between the number of bedrooms and square footage.

That makes sense, since square footage necessarily increases as the number of bedrooms does.

Pricing By Neighborhood Continued

Let’s next go ahead and group by neighborhood and find the mean for each variable.

This will give us a sense of how location affects the price.

The following is produced with this single line of code: eb_apts.groupby('neighborhood').mean(), where 'neighborhood' is the by= argument and the aggregator function is the mean.

I noticed that there are two North Oaklands: North Oakland and Oakland North, so I will recode one of them into the other like so: eb_apts['neighborhood'].replace('North Oakland', 'Oakland North', inplace=True).
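Both steps together in one cell; newer pandas wants numeric_only=True on the mean.

```python
# Merge the duplicate neighborhood label
eb_apts['neighborhood'].replace('North Oakland', 'Oakland North', inplace=True)

# Average every numeric variable within each neighborhood
eb_apts.groupby('neighborhood').mean(numeric_only=True)
```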

Grabbing just the price and sorting those values can show us the cheapest and most expensive places to live. The full line of code is now eb_apts.groupby('neighborhood').mean()['price'].sort_values(), which results in:
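As a runnable cell, again with the numeric_only guard for newer pandas:

```python
# Average price per neighborhood, cheapest at the top
eb_apts.groupby('neighborhood').mean(numeric_only=True)['price'].sort_values()
```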

Average price by neighborhood, sorted in ascending order

Lastly, I’m going to take a look at the spread of each neighborhood in terms of price.

By doing this, we can see more visually how prices in neighborhoods can vary, and to what degree.

Here’s the code that produces the plot that follows.
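The original plot looks like a boxplot of price by neighborhood, so here’s a sketch under that assumption.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One box per neighborhood shows the median, quartiles, and outliers
plt.figure(figsize=(12, 10))
sns.boxplot(x='price', y='neighborhood', data=eb_apts)
plt.title('Price spread by neighborhood')
plt.show()
```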

Berkeley has a huge spread.

This is probably because it includes South Berkeley, West Berkeley, and Downtown Berkeley.

In a future version of this project, it may be important to consider changing the scope of each of the variables, so they are more or less aggregated within any one neighborhood.

Well, there you have it! Take a look at this the next time you’re in the market and want a good price on housing (if that’s possible in the Bay Area).

Feel free to check out the repo and try it for yourself, or fork the project and do it for your city! Let me know what you come up with! Scrape responsibly.

Happy coding!

Riley
