Web Scraping: A Brief Overview of Scrapy and Selenium, Part I

Thoughts on a scraper design that could save your time

Anastasia Reusova, Dec 4

In this post, I am sharing my first experience with web scraping and the tools I have used (Scrapy and Selenium). For the most part, the course covers the use of Scrapy for web crawling, but also touches upon the use of Selenium.

Another peculiarity of Scrapy is that it goes through pages by accessing their URLs. However, you will find that some buttons have no URL linked to them when you inspect the element or look at the source code (through XPath or CSS). One example is the “show all” button on airbnb.ae. In these cases, if you want to use Python, you will turn to other tools, like Selenium, which I found to be a fairly beginner-friendly but less optimised scraping tool. Specifically, Selenium makes it easy to interact with the website, or simply click through pages, while getting to the element I am interested in.

At the same time, Selenium is clumsy at handling certain exceptions that Scrapy handles gracefully. For example, consider the review count for homes on Airbnb (airbnb.ae). If a property has a review, the counter is displayed, and you can see it in the class="_1lykgvlh", inside the span. Another property, however, has no reviews: the counter is not there as an element of the source code, and there is nothing to “inspect” in the same class="_1lykgvlh". So if you are looping through all these classes to get all the elements from them, such as the “new” tag, the review count and the “free cancellation” tag, Selenium will return all of these elements for the first property and drop all of them for the second one (even if only one missing element triggers the NoSuchElementException).

For example, this USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' will work for Mac, but will not work for Ubuntu.

There are plenty of tools out there; Scrapy and Selenium are not the only options for web crawling. Therefore, I do recommend taking an online course, like this Udemy course, which I found really helpful, and building up your understanding gradually if you are a beginner.

As this was Part I of this post, I will follow up with Part II, where I will share Python code and explain what it does, so you can replicate it. Comment below if you have questions and connect with me on LinkedIn if you want to network.

LinkedIn: https://www.linkedin.com/in/areusova/
GitHub: https://github.com/khunreus
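
To make the missing review counter described above more concrete, here is a minimal Python sketch of one possible workaround. It is not the code from this post. It assumes Selenium 4, where find_elements returns an empty list when nothing matches instead of raising NoSuchElementException. The helper names optional_text and scrape_cards are hypothetical, and the ._1lykgvlh selector is simply the class from the example, which may have changed on the live site.

```python
# The locator strategy string below is the same value as
# selenium.webdriver.common.by.By.CSS_SELECTOR, so the helpers
# work on real Selenium WebElements without importing selenium here.
CSS = "css selector"


def optional_text(parent, selector, default=None):
    """Return the text of the first match under `parent`, or `default`.

    find_elements (plural) returns an empty list when nothing matches,
    so a listing without a review counter does not raise an exception.
    """
    matches = parent.find_elements(CSS, selector)
    return matches[0].text if matches else default


def scrape_cards(cards):
    """Collect one row per listing card, keeping cards with missing fields."""
    rows = []
    for card in cards:
        rows.append({
            # Assumption: the review counter carries the class from the post.
            "reviews": optional_text(card, "._1lykgvlh"),
        })
    return rows
```

With a real driver, cards would be the list of listing elements returned by driver.find_elements for whatever selector wraps a single property; that selector is not given in this post. A property without reviews then yields None for the review count instead of losing the whole row.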
