Hands-On Introduction to Web Scraping in Python: A Powerful Way to Extract Data for your Data Science Project

One of the most effective and simple ways to gather data for your data science project is through web scraping.

I have personally found web scraping a very helpful technique to gather data from multiple websites.

Some websites these days also provide APIs for many different types of data you might want to use, such as Tweets or LinkedIn posts.

But there might be occasions when you need to collect data from a website that does not provide a specific API.

This is where having the ability to perform web scraping comes in handy.

As a data scientist, you can code a simple Python script and extract the data you’re looking for.

So in this article, we will learn the different components of web scraping and then dive straight into Python to see how to perform web scraping using the popular and highly effective BeautifulSoup library.

A note of caution here – web scraping is subject to a lot of guidelines and rules.

Not every website allows the user to scrape content, so there are certain legal restrictions at play.

Always ensure you read the website’s terms and conditions on web scraping before you attempt to do it.

Table of Contents

- 3 Popular Tools and Libraries used for Web Scraping in Python
- Components of Web Scraping (Crawl, Parse and Transform, Store)
- Scraping URLs and Email IDs from a Web Page
- Scraping Images
- Scraping Data on Page Load

3 Popular Tools and Libraries used for Web Scraping in Python

You'll come across multiple libraries and frameworks in Python for web scraping.

Here are three popular ones that do the task with efficiency and aplomb:

BeautifulSoup

BeautifulSoup is an amazing parsing library in Python that enables us to extract data from HTML and XML documents.

It can automatically detect encodings and gracefully handle HTML documents even with special characters.

We can navigate a parsed document and find what we need, which makes it quick and painless to extract data from webpages.

In this article, we will learn in detail how to build web scrapers using BeautifulSoup.

Scrapy

Scrapy is a Python framework for large-scale web scraping.

It gives you all the tools you need to efficiently extract data from websites, process it as you want, and store it in your preferred structure and format.

You can read more about Scrapy in its official documentation.

Selenium

Selenium is another popular tool for automating browsers.

It’s primarily used for testing in the industry but is also very handy for web scraping.

Check out Selenium's Python documentation to learn more about how it works.

Components of Web Scraping

Web scraping is made up of three main components: crawl, parse and transform, and store. Let's understand these components in detail.

We'll do this by scraping hotel details, like the name of the hotel and the price per room, from the goibibo website.

Note: Always follow the robots.txt file of the target website, also known as the robots exclusion protocol. It tells web robots which pages not to crawl.

Checking goibibo's robots.txt, it looks like we are allowed to scrape the data from our target URL. We are good to go and write the script of our web robot. Let's begin!

Step 1: Crawl

The first step is to navigate to the target website and download the source code of the web page.

We are going to use the requests library to do this. A couple of other libraries that can make requests and download the source code are http.client and urllib.

Once we have downloaded the source code of the webpage, we need to filter out the contents we need.
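The code for this step was embedded as a Gist in the original post; here is a minimal sketch of what it does, assuming an illustrative goibibo listing URL:

    import requests

    # Illustrative target URL - replace with the listing page you want to scrape
    url = "https://www.goibibo.com/hotels/hotels-in-shimla-ct/"

    # Identify ourselves; some sites block the default requests user agent
    headers = {"User-Agent": "Mozilla/5.0"}

    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early if the request did not succeed

    html_text = response.text    # the raw HTML source of the page
    print(html_text[:500])       # peek at the first few hundred characters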

Step 2: Parse and Transform

The next step is to parse this data with an HTML parser, and for that, we will use the BeautifulSoup library.

Now, if you look at our target web page, the details of each hotel appear on a separate card, as on most web pages of this kind.

So the next step would be to filter this card data from the complete source code.

Next, we will select a card and click on the 'Inspect Element' option to get the source code of that particular card.

The class name of all the cards is the same, so we can get a list of those cards by passing the tag name and attributes, like the class attribute with its name, as shown below.
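In place of the original Gist, a minimal sketch, continuing from the html_text downloaded in Step 1 (the class name is a placeholder - copy the real one from the Inspect Element window):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_text, "html.parser")

    # The class name is a placeholder - use the one shown in Inspect Element
    cards = soup.find_all("div", attrs={"class": "hotel-card"})

    print(len(cards))  # number of hotel cards found on the page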

We have now filtered the card data from the complete source code of the web page, and each card contains the information about a separate hotel.

Select the Hotel Name, perform the Inspect Element step, and do the same with the Room Price.

Now, for each card, we have to find the Hotel Name, which can be extracted from the <p> tag alone because there is only one <p> tag per card, and the Room Price from the <li> tag along with its class attribute and class name.
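Again, in place of the original Gist, a sketch of extracting both fields from a single card (the price class name is a placeholder):

    first_card = cards[0]

    # Only one <p> tag per card, so no class name is needed for the hotel name
    hotel_name = first_card.find("p").text.strip()

    # The room price sits in an <li> tag, located by its class attribute
    room_price = first_card.find("li", attrs={"class": "room-price"}).text.strip()

    print(hotel_name, room_price)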

Step 3: Store the Data

The final step is to store the extracted data in a CSV file. For each card, we will extract the Hotel Name and Price, store them in a Python dictionary, and append that dictionary to a list. We will then transform this list into a Pandas dataframe, since a dataframe can easily be converted into CSV or JSON files.
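A sketch of this step, looping over all the cards found earlier and writing the result to CSV (the price class name is again a placeholder):

    import pandas as pd

    scraped_data = []
    for card in cards:
        # Store each hotel's details in a dictionary and append it to the list
        scraped_data.append({
            "Hotel Name": card.find("p").text.strip(),
            "Price": card.find("li", attrs={"class": "room-price"}).text.strip(),
        })

    # A dataframe converts easily to CSV or JSON
    df = pd.DataFrame(scraped_data)
    df.to_csv("hotels.csv", index=False)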

Congrats! We have successfully created a basic web scraper.

I want you to try out these steps yourself and extract more data, like the rating and address of each hotel.

Now let's see how to perform some common tasks, like scraping URLs, email IDs, and images, and scraping data on page load.

Scrape URLs and Email IDs from a Web Page

Two of the most common things we try to scrape are website URLs and email IDs.

I’m sure you’ve worked on projects or challenges where extracting email IDs in bulk was required (see marketing teams!).

So let’s see how to scrape these aspects in Python.

Using the Console of the Web Browser

Let's say we want to keep track of our Instagram followers and want to know the username of anyone who unfollows our account.

First, log in to your Instagram account and click on followers to check the list. Then:

- Scroll all the way down so that all the usernames are loaded into the browser's memory
- Right-click on the browser's window and click 'Inspect Element'
- In the Console window, type this command:

    urls = $$('a'); for (url in urls) console.log(urls[url].href);

With just one line of code, we can find all the URLs present on that particular page. Next, save this list at two different timestamps, and a simple Python program will tell you the difference between the two.

We would then know the username of whoever unfollowed our account! There are multiple ways to use this hack to simplify your tasks. The main idea is that with a single line of code we can get all the URLs in one go.

Using the Chrome Extension Email Extractor

Email Extractor is a Chrome plugin that captures the email IDs present on the page we are currently browsing. It even allows us to download the list of email IDs as a CSV or text file.

BeautifulSoup and Regex

The above solutions are efficient only when we want to scrape data from just one page.

But what if we want the same steps done on multiple webpages? There are many services that will do that for us, at a price. But here's the good news: we can also write our own web scraper in Python! Let's see how.
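The live coding window from the original page doesn't carry over here, so here is a minimal sketch of the idea, using an illustrative URL: BeautifulSoup collects the URLs from anchor tags, and a regex pulls email IDs out of the raw text.

    import re
    import requests
    from bs4 import BeautifulSoup

    # Illustrative URL - replace with a page you are allowed to scrape
    html_text = requests.get("https://example.com").text
    soup = BeautifulSoup(html_text, "html.parser")

    # All URLs on the page: the href attribute of every anchor tag
    links = [a.get("href") for a in soup.find_all("a", href=True)]

    # All email IDs: a simple regex over the raw page text
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html_text)

    print(links[:10], emails[:10])

To cover multiple webpages, run the same snippet inside a loop over a list of URLs.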

Scrape Images in Python

In this section, we will scrape all the images from the same goibibo webpage. The first step is the same: navigate to the target website and download the source code. Next, we will find all the images using the <img> tag.
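A sketch of this step (the URL is again illustrative):

    import requests
    from bs4 import BeautifulSoup

    # Same crawl step as before - the URL is illustrative
    html_text = requests.get("https://www.goibibo.com/hotels/hotels-in-shimla-ct/").text
    soup = BeautifulSoup(html_text, "html.parser")

    # Find every image tag in the parsed page
    images = soup.find_all("img")
    print(len(images))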

From all the image tags, select only the src attribute. Also, notice that the hotel images are available in jpg format, so we will select only those.
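Continuing from the previous snippet:

    # Keep only the src attribute, and only the jpg images (the hotel photos)
    image_urls = [
        img.get("src")
        for img in images
        if img.get("src") and img.get("src").endswith(".jpg")
    ]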

Now that we have a list of image URLs, all we have to do is request the image content and write it to a file. Make sure you open the file in 'wb' (write binary) mode.
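A sketch of the download loop, continuing from the image_urls list above:

    import requests

    for i, image_url in enumerate(image_urls):
        img_data = requests.get(image_url).content  # raw bytes of the image

        # 'wb' because image content is binary, not text
        with open(f"hotel_image_{i}.jpg", "wb") as f:
            f.write(img_data)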

You can also update the page number in the initial URL and request the pages iteratively to gather data in large amounts.

Scrape Data on Page Load

Let's have a look at the Steam Community web page for Grand Theft Auto V reviews.

You will notice that the complete content of the webpage will not get loaded in one go.

We need to scroll down to load more content on the web page (the age of endless scrolling!).

This is an optimization technique called Lazy Loading used by the backend developers of the website.

But the problem for us is that when we try to scrape the data from this page, we will only get limited content from the webpage.

Some websites use a 'Load More' button instead of endless scrolling.

This will load more content only when you click that button.

The problem of limited content still remains.

So let’s see how to scrape these kinds of web pages.

Navigate to the target URL and open the Network tab of the 'Inspect Element' window.

Next, click on the reload button and it will record the network activity for you: the order of image loads, API requests, POST requests, and so on.

Clear the current records and scroll down.

You will notice that as you scroll down, the webpage sends requests for more data. Scroll further and you will see the pattern in which the website makes these requests: only some of the parameter values change between URLs, and you can easily generate such URLs with simple Python code. You then follow the same steps as before to crawl and store the data, sending requests to each of these pages one by one.
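As a sketch, assuming hypothetical parameter names (copy the real base URL and parameters from the requests recorded in the Network tab):

    import requests

    # Hypothetical URL template - only the offset and page number change
    base_url = ("https://steamcommunity.com/app/271590/homecontent/"
                "?userreviewsoffset={offset}&p={page}")

    for page in range(1, 6):
        url = base_url.format(offset=(page - 1) * 10, page=page)
        response = requests.get(url)
        # ...parse response.text with BeautifulSoup, as before...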

End Notes

This was a simple and beginner-friendly introduction to web scraping in Python using the powerful BeautifulSoup library.

I’ve honestly found web scraping to be super helpful when I’m looking to work on a new project or need information for an existing one.

As I mentioned, there are other libraries as well which you can use for performing web scraping.

I would love to hear your thoughts on which library you prefer (even if you use R!) and your experience with this topic.

Let me know in the comments section below and we'll connect!
