Web scraping with Python — A to Z

Handling BeautifulSoup, avoiding blocks, enriching with an API, storing in a DB and visualizing the data

Shai Ardazi · Feb 7

Photo by michael podger on Unsplash

Introduction

What is web scraping and when would you want to use it?

- The act of going through web pages and extracting selected text or images.

- An excellent tool for getting new data or enriching your current data.

- Usually the first step of a data science project which requires a lot of data.

- An alternative to API calls for data retrieval, useful when you don't have an API or when it's limited in some way.

For example: tracking and predicting the stock market's prices by enriching up-to-date stock prices with the latest news stories.

These news stories may not be available from an API and therefore would need to be scraped from a news website.

This is done by going through a web page and extracting text (or images) of interest.

Background

Our web scraping project was part of the Data Science Fellows program at ITC (Israel Tech Challenge), which was designed to expose us to the real-world problems a data scientist faces as well as to improve our coding skills.

In this post, we show our main steps and challenges along the way.

We have included code snippets and recommendations on how to create an end to end pipeline for web scraping.

The code snippets we show here are not OOP (Object Oriented Programming) for the sake of simplicity, but we highly recommend writing OOP code in your web scraper implementation.

Main tools we used:

- Python (3.5)
- The BeautifulSoup library for handling the text extraction from the web page's source code (HTML and CSS)
- The requests library for handling the interaction with the web page (using HTTP requests)
- A MySQL database for storing our data (mysql.connector is the MySQL API for Python)
- API calls for enriching our data
- Proxy and header rotations: generating random headers and getting free proxy IPs in order to avoid IP blocks

Workflow

Web scraping timeline

The website

In this project we were free to choose any website.

The websites chosen by the rest of the cohort ranged from e-commerce to news websites, showing the different applications of web scraping.

We chose a website for scientific articles because we thought it would be interesting to see what kind of data we could obtain and furthermore what insights we could gather as a result of this data.

We have chosen to keep the website anonymous.

In any case, the goal of this post is to outline how to build a pipeline for any website of interest.

Scraping

BeautifulSoup

First, one must inspect the website in order to determine which data one would like to scrape.

This involves a basic understanding of the website's structure so that your code can scrape the data you want.

In order to inspect the structure of the website, open the browser's inspector: right-click on the page and hit "Inspect element".

Inspect element of a web page

Then, locate the data you want to scrape and click on it.

The highlighted part in the inspector pane shows the underlying HTML text of the webpage section of interest.

The CSS class of the element is what BeautifulSoup will use to extract the data from the HTML.

In the following screenshot one can see that the “keywords” section is what needs to be scraped.

Using the inspector, one can locate the HTML element of the “keywords” section and its CSS class.

Getting the exact location of keywords

The structure is as follows: div (class="keywords-section") → div (class="keyword").

Using BeautifulSoup, the code to get all keywords out of an article is as follows.
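A minimal sketch of that snippet, assuming the div structure shown above; the article URL is a placeholder since we keep the website anonymous:

```python
import requests
from bs4 import BeautifulSoup

article_url = "https://www.example.com/article/123"  # placeholder URL
soup = BeautifulSoup(requests.get(article_url).text, "html.parser")

# div (class="keywords-section") -> div (class="keyword"), as seen in the inspector
keywords_section = soup.find("div", class_="keywords-section")
keywords = ([div.get_text(strip=True)
             for div in keywords_section.find_all("div", class_="keyword")]
            if keywords_section else [])
```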

From here, it's pretty much the same: locate the desired section, inspect the HTML element and get the data.

Full documentation and many more examples of BeautifulSoup can be found here (very friendly).

The scraping process involves many HTTP GET requests in a short amount of time because in many cases one may need to navigate automatically between multiple pages in order to get the data.

Moreover, having an awesome scraper is not just about getting the data one wants, it's also about getting new data or updating existing data frequently — this might lead to being blocked by the website.

This leads us to the next section.

How to avoid blocks?

In general, websites don't like bot scrapers, but they probably don't prevent them completely because of the search engine bots that scrape websites in order to categorize them.

There's a robots exclusion standard that defines the website's terms and conditions with bot crawlers, which is usually found in the robots.txt file of the website. For example, the robots.txt file of Wikipedia can be found here: https://en.wikipedia.org/robots.txt.

The first few lines of Wikipedia's robots.txt:

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.

As you can see, Wikipedia’s restrictions are not too strict.

However, some websites are very strict and do not allow crawling part of the website or all of it.

Their robots.txt would include this:

User-agent: *
Disallow: /

How to deal with blocks?

One way of doing this is by rotating through different proxies and user agents (headers) when making requests to the website.

Also, it is important to be considerate in how often you make requests to the website to avoid being a ‘spammer’.

Note — This is only for learning purposes.

We do not encourage you to breach terms of any website.

See below on how to implement this method in just a few simple steps.

Proxies pool

Using a proxy can be done easily in Python. A list of free proxies can be found here (note that free proxies are usually less stable and slower than paid ones; if you don't find the free ones good enough for your needs, you may consider getting a paid service).

Looking at the free proxies list, one can use BeautifulSoup in order to get the IP addresses and ports.

The structure of the above-mentioned website can be seen below.

Table of free proxies

The following function retrieves all the proxies' IPs and ports and returns a list of them.
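A minimal sketch of such a function; the listing URL and the table layout are assumptions and may need adjusting for the site you use:

```python
import requests
from bs4 import BeautifulSoup

def get_proxies():
    """Scrape a free-proxy listing page and return a list of 'ip:port' strings."""
    url = "https://free-proxy-list.net/"  # assumed listing site
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    proxies = []
    table = soup.find("table")  # assume the first table on the page holds the proxy list
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) >= 2:  # skip header rows, which use <th> instead of <td>
            proxies.append("{}:{}".format(cells[0].text, cells[1].text))
    return proxies
```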

Headers pool

There are many HTTP headers that can be passed as part of a request when using the requests package in Python. We passed two header elements (which were sufficient for us), namely the Accept header (user permissions) and the User-Agent header (pseudo-browser).

The pool of pseudo-random headers was created as follows (see the code below):

1. Create a dictionary object of "accepts", where each Accept header is related to a specific browser (depending on the user agent). A list of accept headers can be found here. This list contains default values for each user-agent and can be changed.
2. Get a random user-agent using the fake-useragent package in Python. This is super easy to use, as seen in the code below. We suggest creating a list of user-agents beforehand, just in case fake-useragent is unavailable. An example of a user-agent: 'Mozilla/5.0 (Windows NT 6.2; rv:21.0) Gecko/20130326 Firefox/21.0'
3. Create a dictionary object with accept and user-agent as keys and the corresponding values.

The partial code (full function in the appendix below):
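A minimal sketch of that partial code; the accepts dictionary here is illustrative and only covers two browser families:

```python
from fake_useragent import UserAgent

# Illustrative Accept headers per browser family (see the list linked above).
ACCEPTS = {
    "Firefox": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Chrome": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
}

def random_header():
    """Return a headers dict with a random User-Agent and a matching Accept header."""
    user_agent = UserAgent().random
    browser = "Firefox" if "Firefox" in user_agent else "Chrome"
    return {"User-Agent": user_agent, "Accept": ACCEPTS[browser]}
```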

Using the headers and proxies pools

The following code shows an example of how to use the functions we wrote before. We did not include the OOP code for the sake of simplicity.

See Appendix for the full function random_header().
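A sketch of how the two pools can be created and used together when making a request; the target URL is a placeholder, and get_proxies() and random_header() are the functions sketched above:

```python
import random
import requests

proxies = get_proxies()            # pool of 'ip:port' strings
url = "https://www.example.com"    # placeholder target URL

proxy = random.choice(proxies)     # pick a random proxy for this request
headers = random_header()          # pick a random header set for this request

response = requests.get(url,
                        headers=headers,
                        proxies={"http": "http://" + proxy,
                                 "https": "http://" + proxy},
                        timeout=10)
```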

Up until here we gave a brief introduction to web scraping and spoke about more advanced techniques on how to avoid being blocked by a website.

In the following section we show two examples of how to use API calls for data enrichment: Genderize.io and Aylien Text Analysis.

Using an API for data enrichment

Genderize

Genderize uses the first name of an individual to predict their gender (limited to male and female).

The output of this API is structured as JSON, as seen in the example below:

{"name": "peter", "gender": "male", "probability": "0.99", "count": 796}

This makes it very convenient to enrich the author data with each one's gender.

Since the probability of the predicted gender is included, one can set a threshold to ensure better quality predictions (we set our threshold at 60% — see below for code snippets).

The value this API brings is the ability to determine the gender distribution of authors for a specified topic.

We did not have to worry about the API limit (1000 calls/day) since we were only able to scrape around 120 articles/day which on average resulted in less than 500 authors per day.

If one is able to exceed this daily limit, the API limit would have to be taken into account.

One way of avoiding this daily limit would be to check if the first name being evaluated has already been enriched in our database.

This would allow us to determine the gender based on the existing data without wasting an API call.

Some code snippets for the tech hungry, connecting to Genderize and enriching the author gender:
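A minimal sketch of both steps, calling the public Genderize endpoint with requests; the 60% threshold matches the one mentioned above, and the author structure is illustrative:

```python
import requests

GENDERIZE_URL = "https://api.genderize.io/"
THRESHOLD = 0.6  # minimum probability we accept for a prediction

def get_gender(first_name):
    """Query Genderize for a first name and return 'male', 'female' or 'unknown'."""
    result = requests.get(GENDERIZE_URL, params={"name": first_name}).json()
    if result.get("gender") and float(result.get("probability", 0)) >= THRESHOLD:
        return result["gender"]
    return "unknown"

# Enrich a list of author dicts before inserting them into the DB.
authors = [{"first_name": "peter"}, {"first_name": "alex"}]
for author in authors:
    author["gender"] = get_gender(author["first_name"])
```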

Aylien Text Analysis

We were interested in seeing the growth of keywords over time for a specified topic (think Google Trends) and therefore decided that we should enrich our data with more keywords. To do this, we used an API called Aylien Text Analysis, specifically the concept extraction API.

This API allows one to input text which, after processing, outputs a list of keywords extracted from the text using NLP.

Two of the various fields we scraped for each article were the title and abstract; these fields were concatenated and used as the input for the API.

An example of the output JSON can be seen below:

{
  "text": "Apple was founded by Steve Jobs, Steve Wozniak and Ronald Wayne.",
  "language": "en",
  "concepts": {
    "http://dbpedia.org/resource/Apple_Inc.": {
      "surfaceForms": [
        { "string": "Apple", "score": 0.9994597361117074, "offset": 0 }
      ],
      "types": [
        "http://www.wikidata.org/entity/Q43229",
        "http://schema.org/Organization",
        "http://dbpedia.org/ontology/Organisation",
        "http://dbpedia.org/ontology/Company"
      ],
      "support": 10626
    }
  }
}

In order to avoid duplicate keywords, we checked that the keyword did not already exist in the keyword table of our database.

In order to avoid adding too many keywords per article, two methods were instituted.

The first was a simple keyword limit as seen in the code snippet below.

The other made use of the score (probability of relevance) available in the output for each keyword — this allows one to set a threshold (we used 80%) to ensure the most relevant keywords were added for each article.

An example of how the API works is seen in the figure below. Below is a snippet of the code we used to connect to the Aylien Text API service and enrich the keywords.
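A sketch of both steps, assuming the official aylien-apiclient SDK; the credentials are placeholders, and the 80% score threshold and keyword limit follow the values mentioned above:

```python
from aylienapiclient import textapi

client = textapi.Client("YOUR_APP_ID", "YOUR_APP_KEY")  # placeholder credentials

SCORE_THRESHOLD = 0.8  # keep only concepts scoring above 80%
MAX_KEYWORDS = 10      # simple per-article keyword limit

def extract_keywords(title, abstract):
    """Concatenate title and abstract, run concept extraction and return keyword strings."""
    response = client.Concepts({"text": title + ". " + abstract, "language": "en"})
    keywords = []
    for concept in response.get("concepts", {}).values():
        for surface_form in concept.get("surfaceForms", []):
            if surface_form["score"] >= SCORE_THRESHOLD:
                keywords.append(surface_form["string"])
            if len(keywords) >= MAX_KEYWORDS:
                return keywords
    return keywords
```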

Let's move to the final part. So far we gave an introduction to web scraping and how to avoid being blocked, as well as using API calls in order to enrich one's data.

In the final part of this post we will go through how to set up a database in order to store the data and how to access this data for visualization.

Visualizations are a powerful tool one can use to extract insights from the data.

Store data — MySQL DB

When setting up the database for a web scraping project (or others in general), the following should be taken into account:

- Tables creation
- New data insertion
- Data update (every hour/day…)

Tables creation

This stage of the pipeline should be done with caution, and one should validate that the chosen structure (in terms of column types, lengths, keys etc.) is suitable for the data and can handle extreme cases (missing data, non-English characters etc.).

Avoid relying on an ID that is used by the website as the primary/unique key unless you have a really good reason (in our case, the doi_link of an article is a unique string that is accepted everywhere, so we used it as the unique identifier of an article).

An example of tables creation using the mysql.connector package, with the SQL command and the function for building the database:
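A minimal sketch of the idea with a simplified articles table; our real schema had more tables (authors, keywords, junction tables) and columns:

```python
import mysql.connector

# Simplified schema; doi_link is the unique identifier of an article.
CREATE_ARTICLES_TABLE = """
CREATE TABLE IF NOT EXISTS articles (
    ID INT AUTO_INCREMENT PRIMARY KEY,
    doi_link VARCHAR(255) UNIQUE,
    title VARCHAR(500),
    publication_date DATE,
    citations INT
);
"""

def build_database(host, user, password, db_name):
    """Create the database (if needed) and the tables inside it."""
    connection = mysql.connector.connect(host=host, user=user, password=password)
    cursor = connection.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS {}".format(db_name))
    cursor.execute("USE {}".format(db_name))
    cursor.execute(CREATE_ARTICLES_TABLE)
    connection.commit()
    cursor.close()
    connection.close()
```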

Note — in several places in this function we use the logger object. This is for logging to an external logs file and it's super important.

Creating a Logger class is recommended; you can see more below in this post, or click here.

Data Insertion

Insertion of new data differs a bit from updating existing data.

When new data is inserted into the DB, one should make sure there are no duplicates.

Also, in case of an error, one should catch it, log it and save the portion of data that caused that error for future inspection.

As seen below, we again used the cursor of mysql.connector in order to execute the SQL INSERT command.
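A sketch of the insertion step with duplicate handling and error logging; the table and column names follow the simplified schema above, and the logger is assumed to exist:

```python
import mysql.connector

INSERT_ARTICLE = ("INSERT IGNORE INTO articles "
                  "(doi_link, title, publication_date, citations) "
                  "VALUES (%s, %s, %s, %s);")

def insert_articles(connection, articles, logger):
    """Insert article dicts; INSERT IGNORE skips rows whose doi_link already exists."""
    cursor = connection.cursor()
    for article in articles:
        try:
            cursor.execute(INSERT_ARTICLE, (article["doi_link"],
                                            article["title"],
                                            article["publication_date"],
                                            article["citations"]))
        except mysql.connector.Error as err:
            # Log the failure and keep the offending record for future inspection.
            logger.error("Failed to insert {}: {}".format(article.get("doi_link"), err))
    connection.commit()
    cursor.close()
```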

Data Update

Dynamic data requires frequent updates.

One should define the time deltas (differences) between two updates, which depend on the data type and source.

In our project, we had to take into account that the number of citations for all articles would have to be updated periodically.

The following piece of code illustrates the update process.
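A sketch of a periodic update of the citation counts, keyed on the unique doi_link and again following the simplified schema above:

```python
import mysql.connector

UPDATE_CITATIONS = "UPDATE articles SET citations = %s WHERE doi_link = %s;"

def update_citations(connection, scraped_articles, logger):
    """Refresh the citation count of articles that were re-scraped."""
    cursor = connection.cursor()
    for article in scraped_articles:
        try:
            cursor.execute(UPDATE_CITATIONS, (article["citations"], article["doi_link"]))
        except mysql.connector.Error as err:
            logger.error("Failed to update {}: {}".format(article.get("doi_link"), err))
    connection.commit()
    cursor.close()
```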

Visualizations

In order to help make sense of the collected data, one can use visualizations to provide an easy-to-understand overview. The visualizations we created enabled us to gain insights into the following use cases:

- High-level trends
- Identify leading institutions/countries in the specified topic
- Identify top researchers in the specified topic

The above use cases allow for a data-driven approach to R&D investment, consultation and general partnerships.

Redash — an open source tool for visualizations

In order to explore the above use cases we created visualizations of our data.

We did this by using a simple but powerful open-source tool called Redash that was connected to our AWS machine (other kinds of instances are also available).

In order to set up Redash, do the following:

1. Click on the following link: https://redash.io/help/open-source/setup#aws
2. Choose the relevant AWS instance in order to create the Redash image on your machine.

Before moving on, here is an overview of the data we collected for the topic “Neural Networks”.

As you can see, not a lot of data was retrieved — this was because of the limited time we had available on the AWS (Amazon Web Services) machines.

Due to the lack of sufficient data, the reader should evaluate the results with a pinch of salt — this is at this stage a proof of concept and is by no means a finished product.

Addressing the use cases above — high-level trends: for the high-level trends we simply plotted the number of research papers published per month for the past 5 years.

SELECT publication_date, count(id) AS num FROM articles GROUP BY publication_date ORDER BY publication_date;

Gender Distribution

This visualization makes use of enriched author data from Genderize to view the gender distribution of authors within the specified topic.

As seen in the figure below, there is a large proportion of authors whose gender is unknown due to limitations of the Genderize API.

SELECT gender, count(ID) FROM authors GROUP BY gender;

Identifying leading countries in the field

When scraping affiliations for each author, we were able to extract each one's country, allowing us to create the visualization below.

China publishes the majority of research papers for the topic “Neural Networks” as is expected due to their keen interest in AI.

This information could be interesting for policy makers since one can track the advancements in AI in leading countries.

Firstly, it can be helpful to monitor these leading countries to find opportunities for partnership in order to advance AI in both countries.

Secondly, policy makers can use these insights in order to emulate leading countries in advancing AI within their own country.

Leading countries in the field of Neural Networks

SELECT country, count(affiliation_id) AS counter FROM affiliations GROUP BY country ORDER BY counter DESC;

Identifying top lead authors in the field

As a first approach to identifying the top researchers, we decided to compare the lead authors with the most citations associated with their names.

Number of citations for each author

SELECT CONCAT(authors.first_name, ' ', authors.last_name) AS name, SUM(articles.citations) AS num_citations FROM authors JOIN authors_article_junction JOIN articles WHERE authors_article_junction.author_ID = authors.ID AND articles.ID = authors_article_junction.article_ID AND authors_article_junction.importance = 1 GROUP BY authors.ID ORDER BY num_citations DESC LIMIT 10;

Keywords map

The larger the word, the more frequent it is in the database.

SELECT keywords.keyword_name, COUNT(keywords_ID) AS num FROM keyword_article_junction JOIN keywords WHERE keyword_article_junction.keywords_ID = keywords.ID GROUP BY keywords.keyword_name ORDER BY num DESC LIMIT 20;

Snapshot of our dashboard in Redash

Conclusion

We have reached the end of our Web Scraping with Python A to Z series.

In the first part we gave a brief introduction to web scraping and spoke about more advanced techniques on how to avoid being blocked by a website.

Also, we showed how one can use API calls in order to enrich the data to extract further insights.

And lastly, we showed how to create a database for storing the data obtained from web scraping and how to visualize this data using an open source tool — Redash.

Appendix

Future notes

Consider using grequests for parallelizing the GET requests.

This can be done as follows:
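A sketch of what that could look like, sending a handful of GET requests concurrently through random proxies and headers; the URLs are placeholders, and get_proxies() and random_header() are the functions from the post:

```python
import random
import grequests

urls = ["https://www.example.com/page/{}".format(i) for i in range(1, 6)]  # placeholders
proxies = get_proxies()  # the proxy pool from earlier

# Build unsent requests, each with a random header and proxy, then send them concurrently.
pending = (grequests.get(url,
                         headers=random_header(),
                         proxies={"http": "http://" + random.choice(proxies)})
           for url in urls)
responses = grequests.map(pending, size=5)  # size caps the number of concurrent requests
```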

This may not be as effective as it should be due to the limited speed of the free proxies, but it is still worth trying.

Also consider using Selenium for handling JavaScript elements.

Complete function — random_header

The full function to create random headers is as follows:
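A fuller sketch of random_header, adding a fallback list of user-agents and a log message for when fake-useragent fails; the accepts dictionary, the fallback list and the logger name are illustrative:

```python
import random
import logging
from fake_useragent import UserAgent

logger = logging.getLogger("scraper")

# Fallback user-agents in case fake-useragent cannot fetch its online data.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.2; rv:21.0) Gecko/20130326 Firefox/21.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/60.0.3112.90 Safari/537.36",
]

# Illustrative Accept headers per browser family.
ACCEPTS = {
    "Firefox": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Chrome": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
}

def random_header():
    """Return a headers dict with a random User-Agent and a matching Accept header."""
    try:
        user_agent = UserAgent().random
    except Exception as err:
        # fake-useragent is unavailable; fall back to the static list and log it.
        logger.error("fake-useragent failed ({}), using fallback list".format(err))
        user_agent = random.choice(FALLBACK_USER_AGENTS)
    browser = "Firefox" if "Firefox" in user_agent else "Chrome"
    return {"User-Agent": user_agent, "Accept": ACCEPTS[browser]}
```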

Note — inside the function we also save a message into the logs file when fake-useragent fails. It's super important to have logs in your code! We suggest using the logging package, which is pretty simple to use.

Logging the flow

The Logger class that we built and used everywhere in our code, and an example of how to use it:
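A sketch of such a Logger wrapper around the standard logging package; the log file name and format are arbitrary choices:

```python
import logging

class Logger:
    """Thin wrapper around logging that writes to an external logs file."""

    def __init__(self, name, log_file="scraper.log"):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(logging.INFO)
        handler = logging.FileHandler(log_file)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
        self.logger.addHandler(handler)

    def info(self, message):
        self.logger.info(message)

    def error(self, message):
        self.logger.error(message)

# Example usage:
logger = Logger("scraper")
logger.info("Starting the scraping run")
```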

