An introduction to web scraping with Python

To make them complete, we just need to prefix them with the URL of the main page, http://books.toscrape.com/index.html, after removing the index.html part.

Now let’s use this to define a function that retrieves the book links on any given page of the website.

Find book category URLs on the main page

Now let’s try retrieving the URLs corresponding to the different product categories.

[Figure: Inspecting HTML code]

By inspecting the HTML, we can see that they all follow the same URL pattern: ‘catalogue/category/books’. We can tell BeautifulSoup to match the URLs that contain this pattern in order to easily retrieve the category URLs:

50 fetched categories URLs
Some examples:
[u'http://books.toscrape.com/index.htmlcatalogue/category/books/travel_2/index.html',
 u'http://books.toscrape.com/index.htmlcatalogue/category/books/mystery_3/index.html',
 u'http://books.toscrape.com/index.htmlcatalogue/category/books/historical-fiction_4/index.html',
 u'http://books.toscrape.com/index.htmlcatalogue/category/books/sequential-art_5/index.html',
 u'http://books.toscrape.com/index.htmlcatalogue/category/books/classics_6/index.html']

We managed to retrieve the 50 category URLs! Remember to always check what you fetched to be sure that all the information is relevant: in the output above, the index.html part was not removed from the prefix, so a correct link would read http://books.toscrape.com/catalogue/category/books/travel_2/index.html.

Getting the URLs of the subsections of a website can be very useful if we want to scrape a specific part of it.

Scrape all book data

For the last part of this tutorial, we will finally tackle our main objective: gathering data about all the books on the website.

We know how to get the links of the books within a given page. If all the books were displayed on the same page, this would be easy. However, this situation is unlikely, as showing the whole catalog on a single page is not very user friendly. Usually products are spread over multiple pages, or shown on a single page that reveals more items as the user scrolls.

We can see at the bottom of the main page that there are 50 product pages and a ‘next’ button to go to the next product page.

[Figure: End of the main page]

On the following pages there is also a ‘previous’ button to go back to the previous product page.

[Figure: End of the second page]

Get all page URLs

In order to fetch all the product URLs, we need to be able to go through all the pages. To do so, we can follow the ‘next’ buttons iteratively.

[Figure: Inspecting HTML code]

The link behind the ‘next’ button contains the pattern ‘page’. We can use this to retrieve the URLs of the next pages. But let’s be careful: the ‘previous’ button also contains this pattern! If we get two results when matching ‘page’, we should take the second one, as it corresponds to the next page. For the first and the last pages we will get only one result, because they have either the ‘next’ button or the ‘previous’ button, but not both.

50 fetched URLs
Some examples:
['http://books.toscrape.com/index.html',
 u'http://books.toscrape.com/catalogue/page-2.html',
 u'http://books.toscrape.com/catalogue/page-3.html',
 u'http://books.toscrape.com/catalogue/page-4.html',
 u'http://books.toscrape.com/catalogue/page-5.html']

We successfully managed to get the URLs of the 50 pages. What is interesting here is that these URLs are highly predictable. We could have created this list simply by incrementing X in ‘page-X.html’ up to 50. That would work for this exact example, but it would break if the number of pages changed (e.g. if the website decided to display more products per page, or if the catalog changed).

One solution is to increment the value until we land on a 404 page.

[Figure: 404 error page]

Here we can see that trying to go to the 51st page does indeed give us a 404 error. Fortunately, the result of a request has a very useful attribute that gives us the status code of the HTTP response.

status code for page 50: 200
status code for page 51: 404

The 200 code indicates that there is no error. The 404 code tells us that the page was not found. We can use this information to get all our page URLs: we iterate until we get a 404 code. Let’s try this method now:

50 fetched URLs
Some examples:
['http://books.toscrape.com/catalogue/page-1.html',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html']

We obtained the same 50 pages using this simpler method (the first one now appears as page-1.html rather than index.html).

Get all product URLs

The next step consists in fetching the product URLs from every page. This step is quite simple, as we already have the list of all pages and the function that gets the product URLs from a page. Let’s iterate through the pages and apply our function:

1000 fetched URLs
Some examples:
[u'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 u'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 u'http://books.toscrape.com/catalogue/soumission_998/index.html',
 u'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 u'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html']

We finally got the 1000 book URLs.
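Two of the URL-handling steps described above, completing relative links and probing page numbers until a 404 appears, can be sketched as follows. This is a minimal illustration, not the tutorial's own code: it uses only the Python standard library, and the names `absolute_url`, `collect_page_urls`, `fake_status` and `PAGE_TEMPLATE` are hypothetical. The status lookup is passed in as a function so the loop can be tried offline; `fake_status` simply pretends that pages 1 to 50 exist.

```python
from typing import Callable, List
from urllib.parse import urljoin

MAIN_URL = "http://books.toscrape.com/index.html"
# Assumed URL pattern, taken from the example output in the text.
PAGE_TEMPLATE = "http://books.toscrape.com/catalogue/page-{}.html"


def absolute_url(page_url: str, href: str) -> str:
    """Resolve a relative href against the URL of the page it was found on."""
    # urljoin replaces the last path segment of page_url (here "index.html")
    # with href, so the prefix does not have to be trimmed by hand.
    return urljoin(page_url, href)


def collect_page_urls(get_status: Callable[[str], int],
                      template: str = PAGE_TEMPLATE,
                      max_pages: int = 1000) -> List[str]:
    """Probe page-1, page-2, ... and stop at the first non-200 status."""
    urls = []
    for number in range(1, max_pages + 1):  # max_pages is a safety cap
        url = template.format(number)
        if get_status(url) != 200:
            break
        urls.append(url)
    return urls


def fake_status(url: str) -> int:
    """Offline stand-in for an HTTP request: only pages 1 to 50 'exist'."""
    number = int(url.rsplit("-", 1)[1].split(".")[0])
    return 200 if 1 <= number <= 50 else 404


pages = collect_page_urls(fake_status)
print(len(pages), "fetched URLs")
print(absolute_url(MAIN_URL, "catalogue/category/books/travel_2/index.html"))
```

Against the live site, the stand-in could be replaced by a real lookup, for example `collect_page_urls(lambda url: requests.get(url).status_code)` with the requests library installed. Note that this sketch stops at any status other than 200, which is slightly stricter than waiting specifically for a 404.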
