Another thing you will notice, which calls for Selenium or a similar tool, is the infamous infinite scroll.
The Blue and Infinite Airbnb

Spider Logic
The logic of the scraper simulates the actions a user would take on the website: it starts from the home screen and performs the search (let's call this direction top-to-bottom for simplicity), then scrapes the content (say, the catalog page or the property page).
At the same time, you may choose to start building your crawler from bottom to top to see if you collect any data at all and identify any issues.
While it is useful to check whether you are on the right track by starting to scrape the destination page (e.g. the catalog page), some websites, such as Airbnb, look for specific information in your requests, and if suspicious activity is spotted, they may ban you, modify the web elements they show, or even change the content of the page.
Testing Scrapy vs Selenium
As Scrapy is nice and swift, I always first attempt to use Scrapy alone, so let's do a small experiment: we'll try to get some response using Scrapy only.
You can install the required modules using pip. After the modules are installed, we can start the project with scrapy startproject projectname, which will create a "projectname" folder in your working directory. Next, you can generate the default spider by changing into the project folder (cd projectname) and running scrapy genspider nameofthespider yourdomain.com, which will create a default Scrapy crawler skeleton in your project; the commands are collected in the snippet below.
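For convenience, here is the whole setup in one place; I name the spider homes to match the rest of this post and use airbnb.com as the stand-in domain:

```
pip install scrapy selenium
scrapy startproject projectname
cd projectname
scrapy genspider homes airbnb.com
```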
Tip: calling scrapy on the command line will show you the basic Scrapy commands; some of them are used below, so check them out!

Install Packages
Apart from installing the modules, we need to download chromedriver, unzip it, and have Google Chrome installed on the machine, if it is not there yet.
Chromedriver is not the only webdriver you can use (e.g. Firefox's is quite popular), but in this post I will only cover Chrome.
Scrapy Architecture in a File Directory
As a note, in this tree the spider's "root directory" is where scrapy.cfg resides, so whenever we want to launch the crawler, the working directory should be the one holding scrapy.cfg.
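For reference, the tree that scrapy startproject generates looks roughly like this (homes.py appears once genspider has been run):

```
projectname/
├── scrapy.cfg
└── projectname/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── homes.py
```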
Further on, settings.py (the spider's settings) and homes.py (the spider's script) will be the focus of this post.
Now we are set to see whether we can grab any information from the catalog page using Scrapy. Let's go to the Scrapy shell first, which is a sort of command line for Scrapy. It is convenient for testing Scrapy code, i.e. checking whether the selectors return any result at all. scrapy shell 'URL' is the way to go. The output contains logs, Scrapy summary stats and, finally, a Scrapy shell prompt, which is similar to an ipython shell.
You can also try scrapy fetch "url" right in bash.
The result will be bulkier, with a bunch of statuses and an HTML tree. Either way, the important thing is the downloader/response_status_count/200 line, where 200 means that we successfully received a server response. 500s and 400s mean the request was rejected by the server: either you have been too aggressive, or the bot has been identified and blocked.
Now, let’s focus on the scrapy shell.
Let's try to get the first property name using different xpaths, which we can find by pointing at the property name >> right-click >> Inspect.
There are multiple ways of getting to the same element.
For example, the property type ("PRIVATE ROOM * 1 BED", "ENTIRE APARTMENT * 3 BEDS", etc.) can be located using class="_ng4pvpo", itemtype="http://schema.org/ListItem", and itemprop="name", along with other options.
Despite the fact that scrapy fetch produces a GET status 200, the selectors return blank lists.
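As an illustration, here is the kind of check you can run inside the shell; the URL and the itemprop-based xpath are assumptions from the time of writing:

```python
# Started with: scrapy shell 'https://www.airbnb.com/s/Dubai/homes'
response.status                                              # 200
response.xpath('//div[@itemprop="name"]/text()').extract()   # [] -- a blank list
```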
This is what people mean when they say that Scrapy cannot handle JS-heavy pages. Blank lists can also appear when the xpaths themselves are incorrect (try any random ones), but that was not the case at the time of writing. Note that no errors are produced in either case.
Getting the same element with a Scrapy Selector over Selenium's driver.page_source, however, returns a list of selectors. The takeaway is: as Airbnb relies on a JS-heavy React framework, Scrapy cannot get to the needed web elements and extract data from them.
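A minimal sketch of that comparison, assuming chromedriver is on your PATH and reusing the same hypothetical xpath:

```python
from scrapy.selector import Selector
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.airbnb.com/s/Dubai/homes')

# Feed the JS-rendered HTML into a Scrapy Selector
sel = Selector(text=driver.page_source)

# The xpath that returned [] with plain Scrapy now yields a list of selectors
print(sel.xpath('//div[@itemprop="name"]'))
driver.quit()
```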
This is where Selenium comes in handy: it effectively makes the request to the server through a real browser, sending headers the server will accept without blocking your bot or distorting the data. That said, there is a difference between real browser headers and Selenium's headers, the user-agent in particular, but we will take care of that below.
On Architecture — Some Tweaks
The Scrapy architecture consists of several parts, and as I mentioned before, I am going to focus on two of them.
spider.py
The actual name of the file after you call scrapy genspider nameofthespider will be nameofthespider.py. This is the file that will contain the script for the crawler to follow.
settings.py
settings.py holds the settings for the spider. The documentation is pretty comprehensive; give it a read if you have time. Settings allow you to modify the crawler's behaviour when necessary, avoid getting blocked and, importantly, not overstress the website's server, which could harm its performance.
Generally, it is a good idea to tweak at least the following settings (a minimal sketch follows the list):
1) DEFAULT_REQUEST_HEADERS, which are a part of any request your browser sends to the web server (check out the wiki, and an example here).
2) USER_AGENT, which is actually a part of the headers, but can also be set separately.
3) DOWNLOAD_DELAY, which sets a delay between consecutive requests to pages of the same website. It won't be necessary for this exercise, but it is a good idea to keep it in mind.
4) ROBOTSTXT_OBEY, which gives the option to follow or ignore the robots.txt file on the website. The robots.txt file, stored at the website's root, describes the desired behaviour of bots on the website, and it is considered "polite" to obey it.
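Here is a minimal settings.py sketch covering the four settings above; the header and user-agent values are illustrative, so copy the real ones from your own browser:

```python
# settings.py (excerpt)
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36')
DOWNLOAD_DELAY = 2       # seconds between consecutive requests to the same site
ROBOTSTXT_OBEY = False   # weigh the politeness trade-offs discussed below
```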
While it is best practice to scrape politely (there's a good article here), you can't always obtain the data by following the full list of recommendations, partly because some websites will block any bot that does not look like a browser, for example when your request's headers do not match the expected ones.
You can check the expected request headers in Chrome, under the Network tab of the developer tools. To point out, hand-set headers are most relevant for Scrapy: Selenium launches a real browser instance, and although its headers are quite similar to a regular browser's, they do indicate the use of automated software.
A Step-by-Step Scraper

Navigation: Searching and Loading
The backbone of scraping is web element selectors.
You can choose to work with CSS selectors or with xpaths; out of habit I prefer the latter and will use them in this post.
If you use Google Chrome or Firefox, getting an xpath is easy: go to the element >> right-click >> choose "Inspect" (as we did with the property type above). The console will open, where you'll see the chosen element and many more, so you can go up and down the tree if you don't find what you need immediately or if you're in search of something else.
The search field is one example. Ultimately, we want our scraper to go through this page and perform the search.
Now that we have an understanding of how to find the element on the page, let’s test how it works.
To test it, we can keep using ipython.
A cool feature of Selenium is its ability to reproduce a user's actions on the website, including filling in forms, pressing buttons, hitting "Enter" and other keys, taking screenshots, etc.
The snippet below initiates the webdriver, goes to the Airbnb home page, enters the city name (Dubai), and hits "Enter" by sending the '\ue007' key code (Selenium's Keys.ENTER). Next, it stops for a bit (sleep(0.8)) and clicks the search button, found by an xpath with type="submit", using the click() method. Then the script pauses for 8.7 seconds while you are being redirected to the next page. Once on the next page, the driver searches for the "Homes" button by the xpath data-veloute="explore-nav-card:/homes". That last action finally takes us to the catalog page, an infinite-scroll page that we are going to scroll through using a while statement. Once we are done, it's time to extract selectors for the elements of the catalog.
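Here is a sketch of that sequence, using Selenium 3-style lookups; the xpaths are the ones that worked at the time of writing and are likely to have changed since:

```python
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.airbnb.com/')

# Type the destination and hit "Enter" (Keys.ENTER is the '\ue007' key code)
search_field = driver.find_element_by_xpath('//input[@type="text"]')
search_field.send_keys('Dubai')
search_field.send_keys(Keys.ENTER)

sleep(0.8)
driver.find_element_by_xpath('//button[@type="submit"]').click()
sleep(8.7)  # wait out the redirect to the results page

driver.find_element_by_xpath('//*[@data-veloute="explore-nav-card:/homes"]').click()

# Scroll the infinite-scroll catalog until its height stops growing
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
```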
After the navigation part is tested and runs as expected, we can add it to the homes.py file with one small modification: every time the driver instance is called, it needs the self. prefix (Corey explains the intuition behind self here). For the final piece, go to The Final Piece section.
Getting the Data: Product Page vs Catalog Details
In general, if you are presented with a set of items in the form of a list or a grid, where the elements live at the same xpath (e.g., share the same class, id, div, etc.) and are repeated on the page, it is easy to iterate over them using a for loop. If an element is unique to the page, you locate it on the page directly.
Thinking back to Airbnb, you can get this site's data by going through the catalog page or by visiting the homes' profiles one by one. However, home data points are more easily accessible from the home profile, as some of the elements in the catalog do not share a common structure. For example, when iterating over catalog elements of the same class, all the property type tags found on a page come back in a single span. There is a pattern, but unfortunately it breaks constantly, so you cannot rely on it for getting the data. It does take longer to go from profile to profile; however, if you need the property type or any additional info from the property profile, scrape the "homes" one by one.
Getting Homes' URLs: the Catalog Page
While you can certainly scrape Airbnb from the catalog page, some elements, like the property type, are tricky to get from there. Moreover, not all the information is available in the catalog view: getting the full info requires going to the profile pages. Therefore, once the catalog is loaded, we will collect the homes' profile URLs, as sketched below. I am excluding "plus" homes from the equation for the purpose of this post, so we'll go for regular Airbnb listings.
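A sketch of that step; the assumption is that listing links contain /rooms/ in their href, which held at the time of writing:

```python
from scrapy.selector import Selector

sel = Selector(text=driver.page_source)  # the driver still holds the loaded catalog
hrefs = sel.xpath('//a[contains(@href, "/rooms/")]/@href').extract()

# Deduplicate, strip query strings, and skip "plus" listings
urls = sorted({'https://www.airbnb.com' + h.split('?')[0]
               for h in hrefs if '/plus/' not in h})
```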
Scraping: the "Home's" Page
As I mentioned in the intro, I noticed Airbnb changing catalog layouts every now and then, so apart from yielding more details about a property, an advantage of visiting each home's page is the relative ease of coping with the different layouts Airbnb generates for the catalog.
Once the list of URLs from the catalog page is there, there are several ways of getting to the product page; the two simplest are: 1) Scrapy's yield Request(url, callback=self.parse_home_page_function, ...), which is described in the official documentation, or 2) driver.get('url'), which takes you to the property page directly.
The two have their own advantages and disadvantages. A huge advantage of Scrapy's Request is its ability to request several URLs in parallel (the exact number can be changed in the spider's settings.py; see CONCURRENT_REQUESTS). On the other hand, sending parallel requests will open multiple windows on the machine, and you might run into resource allocation issues. If it's your own machine, the windows are likely to be thrown over whatever you are working on, so it's hard to carry on. Secondly, Scrapy's requests have a higher chance of being denied by the server, especially when they hit it aggressively, or just not rarely enough.
As for Selenium's driver.get, it is slow, and the speed depends on your connection; however, it opens pages in a single window, and I never had problems with it being blocked. For the reasons above, I chose to go through the profiles using driver.get('url').
On the home page, let's get the property type, name, price per night, and overall rating, as well as the rating splits per category, the summary, and a brief description of the home and the neighbourhood.
I wanted to stop there, but then I remembered the reviews section. Extracting reviews is a good exercise for three reasons: 1) reviews provide valuable information for further analysis, such as NLP; 2) the reviews section has a different navigation compared to the catalog; 3) extracting information from this section required me to travel along the HTML tree, back and forth, until I got to the point where I could compose a dictionary of reviews, so I decided to share my thoughts on it.
There are going to be two gists for everything highlighted above. I provide them as if they were part of the original code; if you want to try them out in the console, don't forget to remove the self. prefix. The first one, which comes right below, gets everything but the reviews: it's fairly simple and very similar to retrieving the selectors on the search and catalog pages. Of the elements, link_to_home, property_type, property_name, price_night, and rating_overall come back as strings (use extract_first()); summary, rating_categories, rating_stars, and home_neighborhood_short are lists (use extract(); by the way, the output is subsettable); finally, rating_categories and rating_stars are organised into the dictionary rating_split (with rating_categories:rating_stars key-value pairs).
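A compressed sketch of that first gist; every xpath here is a placeholder, since the real class names rotated over time:

```python
from scrapy.selector import Selector

driver.get(url)  # one of the profile URLs collected from the catalog
link_to_home = url
sel = Selector(text=driver.page_source)

# Scalar fields: take the first match as a string
property_type = sel.xpath('//small/span/text()').extract_first()
property_name = sel.xpath('//h1/text()').extract_first()
price_night = sel.xpath('//span[contains(@class, "price")]/text()').extract_first()
rating_overall = sel.xpath('//span[contains(@class, "rating")]/text()').extract_first()

# List fields: keep every match (the output is subsettable)
summary = sel.xpath('//div[@data-section-id="DESCRIPTION"]//text()').extract()
rating_categories = sel.xpath('//div[contains(@class, "category")]/text()').extract()
rating_stars = sel.xpath('//span[contains(@class, "stars")]/@aria-label').extract()

# Pair each rating category with its star value
rating_split = dict(zip(rating_categories, rating_stars))
```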
The gist below navigates the reviews section by clicking the "next" button. The returned object is a list of lists; each inner list contains the reviews of a single reviews page, in string form.
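A minimal sketch of that loop; the xpaths for the review text and the "next" button are assumptions:

```python
from time import sleep

from scrapy.selector import Selector
from selenium.common.exceptions import NoSuchElementException

reviews_all = []  # list of lists: one inner list per reviews page
while True:
    sel = Selector(text=driver.page_source)
    # Collect the current page's review texts as strings
    reviews_all.append(sel.xpath('//div[@data-review-id]//span/text()').extract())
    try:
        driver.find_element_by_xpath('//li[@data-id="page-next"]/button').click()
        sleep(1.5)  # let the next page of reviews load
    except NoSuchElementException:
        break  # no "next" button left, so the last reviews page is reached
```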
Originally, I was planning to build a dictionary with the reviewer and review date as the key and the review as the value. I am leaving my experiment with it commented out in the gist, in case you need a starting point to play with it or have a suggestion on how to implement it; I decided to forgo it for the purpose of this post. If you feel like trying it at home, two hints: 1) a home owner's response to a review is stored under the same class and dir in the HTML tree, so the navigation to a review has to be precise enough to differentiate between the two cases; 2) some reviews are empty while the reviewer's name is still there, leaving the potential key-value pair without a value. In the gist's comments, the first issue is resolved, but you'll still stumble across the second one. If you have a beautiful idea of how to improve the code, you're very much welcome.
Here, I first navigate to the section to make sure it's loaded, and grab the page_source to make sure the section is captured. After getting the first reviews page, I try clicking the "next" button (you can also go button by button), and if it's not available, the loop stops.

The Final Piece
This part of the post merely connects all the pieces together in one chunk of code.
This is the content of the homes.py file in the "spiders" folder. One thing that is different is the yield statement, which yields a dictionary object to form the output. When you run the scraper as scrapy crawl homes -o output_file_name.json (or .csv), the yielded items are passed to the output file. The file will be stored in your project root directory (where scrapy.cfg lives).
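The original gist is not reproduced here, but a compressed skeleton of how homes.py connects the pieces, under the same assumptions as the snippets above, might look like this:

```python
import scrapy
from scrapy.selector import Selector
from selenium import webdriver

class HomesSpider(scrapy.Spider):
    name = 'homes'
    start_urls = ['https://www.airbnb.com/s/Dubai/homes']

    def parse(self, response):
        self.driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
        self.driver.get(response.url)
        # ...navigation and infinite scrolling go here, as sketched earlier...
        sel = Selector(text=self.driver.page_source)
        hrefs = sel.xpath('//a[contains(@href, "/rooms/")]/@href').extract()
        urls = sorted({'https://www.airbnb.com' + h.split('?')[0] for h in hrefs})
        for url in urls:
            self.driver.get(url)
            home = Selector(text=self.driver.page_source)
            # One dictionary per home becomes one record in the output file
            yield {
                'link_to_home': url,
                'property_name': home.xpath('//h1/text()').extract_first(),
            }
        self.driver.quit()
```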
On a Closing Note
What Was Different About Airbnb
There are two things that were thought-provoking in this Airbnb experience. One, they change the website layout quite frequently, so one can expect that maintaining the scraper will require effort. I finished the first version of the scraper several weeks before finalizing the current one, and I do hope it lasts! Two, in some sections, like "reviews", many elements are nested in the HTML tree under the same element name and share the same class, id, etc., but represent logically different content (e.g., reviews vs responses to reviews). In this case, collecting and storing the data efficiently becomes a crucial part of the crawler.
Taking it Further
Last but not least, if you want to containerize your application using Docker, you might want to have a look at this blog post, which I found comprehensive, with one caveat: the author uses R instead of Python.
And finally, Airbnb has an API, so if you have your API access approved and it fits your purpose, use it! Thanks for reading this post and getting this far! I hope it was helpful, and if you have any comments, please leave them below.
You can also contact me on LinkedIn: https://www.linkedin.com/in/areusova/ (mention that you're coming from this Medium post).