Web Scraping news articles in Python

    pip install requests

We will use the requests module to get the HTML code from the page and then navigate through it with the BeautifulSoup package.

We will learn two commands that will be enough for our task:

- find_all(element tag, attribute): locates HTML elements on a webpage given their tag and attributes. This command returns all the elements that match. To get only the first one, we can use find() instead.
- get_text(): once we have located a given element, this command extracts the text inside it.
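To make the two commands concrete, here is a minimal sketch run against a small HTML string instead of a live page. The HTML, the class name "title" and the variable names are invented for illustration, and the built-in "html.parser" is used here only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for illustration only.
html = """
<html><body>
  <h2 class="title"><a href="/first">First story</a></h2>
  <h2 class="title"><a href="/second">Second story</a></h2>
  <h2 class="other">Not a story</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every element matching the tag and attribute.
all_titles = soup.find_all("h2", class_="title")

# find returns only the first match (or None if nothing matches).
first_title = soup.find("h2", class_="title")

# get_text extracts the text inside an element.
title_texts = [h2.get_text() for h2 in all_titles]
first_text = first_title.get_text()

print(title_texts)
print(first_text)
```

Note that the keyword is class_ with a trailing underscore, because class is a reserved word in Python.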

So, at this point, we need to look through the HTML code of our webpage and locate the elements we want to scrape. In Google Chrome, for example, we open the webpage, right-click and choose View page source.

Once the source code is open, we can simply search it with Ctrl+F or Cmd+F.

Once we have identified the elements of interest, we will get the HTML code with the requests module and extract those elements with BeautifulSoup.

We will carry out an example with the El Pais English newspaper.

We will first scrape the news article titles from the front page and then extract their text.

Once we enter the website, we need to inspect the HTML code to locate the news articles.

After a quick look we can see that each article on the front page has the same structure. The title is an <h2> (heading-2) element with itemprop="headline" and class="articulo-titulo" attributes.

It contains an <a> element, which holds the title text and has an href attribute with the link to the article.

So, in order to extract the text, we need the following commands:

    # importing the necessary packages
    import requests
    from bs4 import BeautifulSoup

With the requests module we can get the HTML content and save it into the coverpage variable (url is a string with the address of the front page):

    r1 = requests.get(url)
    coverpage = r1.content

Next, we need to create a soup in order to allow BeautifulSoup to work:

    soup1 = BeautifulSoup(coverpage, 'html5lib')

And finally, we can locate the elements we are looking for:

    coverpage_news = soup1.find_all('h2', class_='articulo-titulo')

This returns a list in which each element is a news article (because with find_all we get all occurrences). With the following command, we can extract the text:

    coverpage_news[4].get_text()

If we want to access the value of an attribute (in this case, the link), we go through the <a> element inside the title, since that is the element that carries href:

    coverpage_news[4].find('a')['href']

And we’ll get the link in plain text.
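The same steps can be tried offline. The sketch below is only an illustration: the sample markup is modelled on the description above rather than copied from the real page, the example.com link is a placeholder, and extract_headlines is a hypothetical helper name. It also skips titles that have no link instead of failing on them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the description above; the real page may differ.
sample_html = """
<h2 itemprop="headline" class="articulo-titulo">
  <a href="https://example.com/inenglish/story.html">An example headline</a>
</h2>
"""


def extract_headlines(html):
    """Return (title, link) pairs for every matching headline."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    for h2 in soup.find_all("h2", class_="articulo-titulo"):
        anchor = h2.find("a")
        if anchor is None or not anchor.has_attr("href"):
            continue  # skip headlines without a link
        headlines.append((anchor.get_text(strip=True), anchor["href"]))
    return headlines


headlines = extract_headlines(sample_html)
print(headlines)
```

Passing strip=True to get_text removes the whitespace that surrounds the title in the source code.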

If you have followed along up to this point, you are ready to web scrape any content you want.

The next step is to follow the href attribute to each news article, get its source code, and find the paragraphs in the HTML so we can extract them with BeautifulSoup.

It’s the same idea as before, but we need to locate the tags and attributes that identify the news article content.

The code of the full process is the following.

I will show the code but won’t go into the same detail as before, since it’s exactly the same idea.

    import numpy as np

    # Scraping the first 5 articles
    number_of_articles = 5

    # Empty lists for content, links and titles
    news_contents = []
    list_links = []
    list_titles = []

    for n in np.arange(0, number_of_articles):

        # only news articles (there are also albums and other things)
        if "inenglish" not in coverpage_news[n].find('a')['href']:
            continue

        # Getting the link of the article
        link = coverpage_news[n].find('a')['href']
        list_links.append(link)

        # Getting the title
        title = coverpage_news[n].get_text()
        list_titles.append(title)

        # Reading the content (it is divided in paragraphs)
        article = requests.get(link)
        article_content = article.content
        soup_article = BeautifulSoup(article_content, 'html5lib')
        body = soup_article.find_all('div', class_='articulo-cuerpo')
        x = body[0].find_all('p')

        # Unifying the paragraphs
        list_paragraphs = []
        for p in np.arange(0, len(x)):
            paragraph = x[p].get_text()
            list_paragraphs.append(paragraph)
        final_article = " ".join(list_paragraphs)

        news_contents.append(final_article)

All the details can be found in my GitHub repo.
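The paragraph-joining step can also be isolated into a small function and tested without any network access. This is a sketch under assumptions, not the code from the repo: extract_article_text is a hypothetical helper, the sample HTML is invented, and the function returns None when the body element is missing instead of raising an IndexError.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, body_class="articulo-cuerpo"):
    """Join the paragraphs of the article body into one string.

    Returns None if no <div> with the given class is found.
    """
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_=body_class)
    if body is None:
        return None
    paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
    return " ".join(paragraphs)


# Hypothetical article markup, invented for illustration only.
sample_article = """
<div class="articulo-cuerpo">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

article_text = extract_article_text(sample_article)
missing_text = extract_article_text("<div class='other'><p>Ignored.</p></div>")
print(article_text)
```

Checking for a missing body matters in practice, because pages such as photo albums may not contain the element at all.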

It is important to mention that this code is only useful for this webpage in particular.

If we want to scrape another one, we should expect that elements are identified with different tags and attributes.

But once we know how to identify them, the process is exactly the same.
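One way to picture this is to keep the tags and attributes in a small table and leave the rest of the code unchanged. The sketch below is hypothetical throughout: the site names, the second selector and the extract_titles helper are made up to illustrate the idea.

```python
from bs4 import BeautifulSoup

# Hypothetical selector table: one entry per website we want to scrape.
SITE_SELECTORS = {
    "site_a": {"tag": "h2", "class": "articulo-titulo"},
    "site_b": {"tag": "h3", "class": "headline"},
}


def extract_titles(html, site):
    """Extract title texts using the tag and class registered for a site."""
    selector = SITE_SELECTORS[site]
    soup = BeautifulSoup(html, "html.parser")
    elements = soup.find_all(selector["tag"], class_=selector["class"])
    return [element.get_text(strip=True) for element in elements]


# Invented snippets standing in for two different websites.
html_a = '<h2 class="articulo-titulo"><a href="/a">Story A</a></h2>'
html_b = '<h3 class="headline">Story B</h3>'

titles_a = extract_titles(html_a, "site_a")
titles_b = extract_titles(html_b, "site_b")
print(titles_a, titles_b)
```

Only the table changes from one website to another; the extraction logic stays the same.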

At this point, we are able to extract the content of different news articles.

The final step is to apply the machine learning model we trained in the first post to predict their categories and show a summary to the user.

This will be covered in the final post of this series.
