Learn Web Scraping using Python in under 5 minutes

Kaustumbh Jaiswal, Jan 28

Figure 1: Image Source - The Data School

What is Web Scraping?

Web scraping is harvesting or extracting desired information from a webpage.

Scraping using BeautifulSoup

For web scraping we are going to use the very popular Python library called BeautifulSoup.

For web scraping you first need to have some basic knowledge about the HTML tags.

Some of the tags used in HTML are shown below.

For more information on HTML tags please refer to https://www.w3schools.com/tags/.
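As a rough illustration of what a few common tags look like and how BeautifulSoup sees them, here is a minimal sketch. The HTML snippet and the variable names are made up for this example and are not taken from the BBC page.

```python
from bs4 import BeautifulSoup

# A small, hypothetical HTML document using a few common tags.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Main heading</h1>
    <div class="intro">
      <p>First paragraph.</p>
      <p>Second paragraph with a <a href="https://example.com">link</a>.</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.text                        # text inside <title>
heading = soup.h1.text                              # text inside the first <h1>
paragraphs = [p.text for p in soup.find_all("p")]   # text of every <p> tag
link_target = soup.a["href"]                        # href attribute of the first <a>

print(page_title)
print(heading)
print(paragraphs)
print(link_target)
```

Each tag becomes an object you can navigate by name, and attributes such as href are read with dictionary-style access.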

Getting Started

To get started with scraping, make sure you have Python (version 3+) and BeautifulSoup installed on your system.
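One way to check both requirements from Python itself is sketched below. The function name check_environment is hypothetical; it only reports what it finds and installs nothing.

```python
import importlib.util
import sys


def check_environment():
    """Report whether Python 3+ is running and whether bs4 can be imported."""
    return {
        "python_ok": sys.version_info >= (3, 0),
        # find_spec returns None when the package is not installed
        "bs4_installed": importlib.util.find_spec("bs4") is not None,
    }


status = check_environment()
print(status)
```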

If you don’t have BeautifulSoup installed, then just type the following command in your Terminal/Command Prompt:

    pip install beautifulsoup4

Let’s scrape!

Inspecting

The first step in scraping is to select the website you wish to scrape data from and inspect it.

In this tutorial we will try to scrape information from this article published on BBC.

To inspect a website, right click anywhere on the page and choose ‘Inspect Element’ / ‘View Page Source’.

To view the location of a particular entity on a webpage, like text or an image, select that portion of the webpage, then right click and choose ‘Inspect Element’ / ‘View Page Source’.

Figure 2: Webpage to be scraped

After you inspect a webpage, a window will pop up showing you the exact location of the selected content in the HTML code of the page as shown below.

Figure 3: HTML code of the webpage

Since our aim is to extract the entire body of the article, it is important to make a note of the <div> tag under which the entire text of the article is enclosed.

Now let’s take a closer look at the webpage and identify the <div> tag.

Figure 4: HTML code showing the required tags

As we can see, <div class="story-body sp-story-body gel-body-copy"> is the tag we are looking for.
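To make this concrete, here is a minimal sketch on a cut-down, hypothetical snippet that reuses the class names from the tag above (the paragraph text is invented). It shows two ways to locate such a <div>: find() with the exact class string, and a CSS selector via select_one(), which keeps matching even if the classes appear in a different order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names come from the article's page.
sample_html = """
<div class="site-header"><p>Navigation text</p></div>
<div class="story-body sp-story-body gel-body-copy">
  <p>Body paragraph one.</p>
  <p>Body paragraph two.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Option 1: match the full class attribute string exactly.
by_exact_string = soup.find(
    "div", {"class": "story-body sp-story-body gel-body-copy"}
)

# Option 2: CSS selector requiring all three classes, in any order.
by_selector = soup.select_one("div.story-body.sp-story-body.gel-body-copy")

# Only paragraphs inside the matched <div> are collected.
body_paragraphs = [p.text for p in by_selector.find_all("p")]
print(by_exact_string == by_selector)
print(body_paragraphs)
```

Because the search is scoped to the matched <div>, the navigation paragraph in the other <div> is left out.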

Now we have all we need, so let’s dive straight into the code and do some scraping!

Parsing

Now we can begin parsing the webpage and searching for the specific elements we need using BeautifulSoup.

For connecting to the website and getting the HTML we will use Python’s urllib.
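As a rough sketch of what urlopen() hands back, the helper below wraps the call and returns the raw bytes, or None when the URL cannot be opened. The helper name fetch_html, the timeout value, and the data: URL used to try it without network access are assumptions for illustration, not part of this article’s code.

```python
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the URL cannot be opened."""
    try:
        # urlopen returns a file-like response object; read() yields bytes.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (URLError, ValueError) as error:
        # URLError covers most network and HTTP failures; ValueError covers
        # malformed URLs such as one with a missing scheme.
        print("Error opening the URL:", error)
        return None


# A data: URL carries its content inline, so this runs without a network.
inline_url = "data:text/html," + quote("<p>Hello, scraper!</p>")
html_bytes = fetch_html(inline_url)
bad_result = fetch_html("not-a-valid-url")

print(html_bytes)
print(bad_result)
```

The same helper can be pointed at an http or https address; the bytes it returns can be passed straight to BeautifulSoup.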

Let us import the required libraries:

    from urllib.error import URLError
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

Get the url:

    url = "https://www.bbc.com/sport/football/46897172"

Connecting to the website:

    # We use try-except in case the request is unsuccessful,
    # for example because of a wrong URL. We stop here on failure,
    # since nothing below can run without the page.
    try:
        page = urlopen(url)
    except (URLError, ValueError):
        raise SystemExit("Error opening the URL")

Create a BeautifulSoup object for parsing:

    soup = BeautifulSoup(page, 'html.parser')

Extracting the required elements

We now use BeautifulSoup’s soup.find() method to search for the tag <div class="story-body sp-story-body gel-body-copy"> which contains the text of the article we are interested in.

    content = soup.find('div', {"class": "story-body sp-story-body gel-body-copy"})

We now iterate through content to find all the <p> (paragraph) tags in it to get the entire body of the article.

    article = ''
    for i in content.find_all('p'):
        article = article + ' ' + i.text

Saving the parsed text

We can save the information we scraped in a .txt or .csv file.
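Only the .txt route is demonstrated in this article. For the .csv route, a minimal sketch with Python’s built-in csv module might look like the following. The sample paragraphs, the file name scraped_text.csv, and the one-paragraph-per-row layout are assumptions for illustration.

```python
import csv

# Hypothetical scraped paragraphs standing in for the article text.
paragraphs = [
    "First scraped paragraph.",
    "Second paragraph, with a comma.",
]

# newline='' lets the csv module control line endings itself.
with open("scraped_text.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["paragraph_number", "text"])   # header row
    for number, text in enumerate(paragraphs, start=1):
        writer.writerow([number, text])

# Read the file back to confirm what was written.
with open("scraped_text.csv", newline="", encoding="utf-8") as file:
    rows = list(csv.reader(file))

print(rows)
```

The csv module quotes any field that contains a comma, so paragraph text survives the round trip intact.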

Here we write it to a .txt file:

    with open('scraped_text.txt', 'w', encoding='utf-8') as file:
        file.write(article)

Output of the entire code:

Cristiano Ronaldo’s header was enough for Juventus to beat AC Milan and claim a record eighth Supercoppa Italiana in a game played in Jeddah, Saudi Arabia.

The Portugal forward nodded in Miralem Pjanic’s lofted pass in the second half to settle a meeting between Italian football’s two most successful clubs.

It was Ronaldo’s 16th goal of the season for the Serie A leaders.

Patrick Cutrone hit the crossbar for Milan, who had Ivorian midfielder Franck Kessie sent off.

Gonzalo Higuain, reportedly the subject of interest from Chelsea, was introduced as a substitute by Milan boss Gennaro Gattuso in Italy’s version of the Community Shield.

But the 31-year-old Argentina forward, who is currently on loan from Juventus, was unable to deliver an equalising goal for the Rossoneri, who were beaten 4–0 by Juve in the Coppa Italia final in May.
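Putting the steps above together, the whole script can be sketched roughly as follows. The function names extract_article and scrape_to_file are hypothetical, and the check for a missing <div> is an addition that the individual steps do not include. The class names are the ones noted during inspection and may not match the live page today.

```python
from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup

URL = "https://www.bbc.com/sport/football/46897172"
ARTICLE_CLASS = "story-body sp-story-body gel-body-copy"


def extract_article(html, css_class=ARTICLE_CLASS):
    """Return the text of all <p> tags inside the article <div>."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", {"class": css_class})
    if content is None:
        # find() returns None when no matching tag exists.
        return ""
    # Join the paragraph texts with single spaces.
    return " ".join(p.text for p in content.find_all("p"))


def scrape_to_file(url=URL, path="scraped_text.txt"):
    """Fetch the page, extract the article body, and save it as text."""
    try:
        page = urlopen(url)
    except (URLError, ValueError):
        print("Error opening the URL")
        return None
    article = extract_article(page)
    with open(path, "w", encoding="utf-8") as file:
        file.write(article)
    return article


# Offline check on a hypothetical snippet; calling scrape_to_file()
# instead would run the same steps against the live URL.
sample_html = (
    '<div class="story-body sp-story-body gel-body-copy">'
    "<p>One.</p><p>Two.</p></div>"
)
print(extract_article(sample_html))
```

Keeping the parsing in its own function means it can be tried on a saved or hand-written HTML string before any request is made.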

Conclusion

Web scraping can be really useful when you want to gather data from multiple sources for analysis or for research.

BeautifulSoup is an excellent web scraping library for small projects, but for large projects frameworks like Scrapy are more suitable.

Hope you have understood the concept of web scraping and can now scrape data from different websites as per your need.

Thanks for reading.

Happy scraping!
