Tutorial: From The Hypothesis To The Analysis With Web Scraping

A tutorial for a real data science project: scraping the needed data, cleaning, transforming and finally analyzing it. By Benedikt Droste, Jul 4.

Often people have interesting ideas for an analysis and then find that there is no freely available data to answer the question.

An interesting way to get needed data is web scraping.

There is a lot of data on the internet, but it is often not available in a form that can be analyzed in a structured way.

Often, tutorials start with reading the data and end with saving it.

In this tutorial, I would like to take a different path.

I would like to take you on the journey of a real-world data science project.

We start with our idea: we scrape, clean up and transform the data and answer our question with the scraped data set.

A much more detailed analysis without the scraping part can be found here.

The Idea And Our Hypothesis

Recently, the national and international soccer competitions of the 2018/2019 season came to an end.

Traditionally, in the summer break the transfer window opens and the clubs compete for the best players in the market.

It is striking that, again and again, the same clubs spend record sums on their stars.

Real Madrid alone invested 135 million euros in Eden Hazard.

Therefore, I wanted to pursue the question of how strongly squad value and sporting success are related.

Will there be champions other than Barcelona and Real Madrid in the next few years? What about the other leagues? My hypothesis is that a higher squad value also leads to greater success.

So I had an idea for an interesting analysis, but no data.

However, I knew that there are databases for squad values on the web.

So let's get some data first!

We Need Some Data — Data Collection Via Web Scraping

There are different packages for web scraping.

Two of the best known libraries are certainly Scrapy and BeautifulSoup.

Scrapy is a powerful framework built on an asynchronous networking library, which makes it very performant.

It also has many features to avoid typical scraping problems.

These include, for example, handling redirects, retrying requests, avoiding overloading the servers, etc.

Due to the complexity, however, the learning curve for Scrapy is also significantly higher than for BeautifulSoup.

BeautifulSoup, by contrast, is primarily used for parsing and extracting data from HTML.

All exceptions and problems that arise when scraping must be identified by the programmer and taken into account in the coding.

However, this has the advantage that the user actually has to engage with the details and learn web scraping from the ground up.

Therefore, we will use BeautifulSoup here.

The Database

The required data can be found on transfermarkt.com [1].

For example, in Figure 1, you can see the squad values for the German Bundesliga in the 2018/2019 season.

On the same page you will find a table with the results of the current season: placement, goal difference and points.

If we take the trouble to scrape the data, we should extract all the information we need for further analysis.

Figure 1: The red-marked area contains the required information

When it comes to web scraping, we are usually interested in a section of the page from which we would like to selectively extract information.

In our example, we first want to extract the contents of the two tables (Figure 1).

To do this, we identify the HTML tags that enclose our elements.

Every browser offers a way to inspect the HTML tags.

In this example I am using Google Chrome (Figure 2).

First, right-click on the area that contains the required data. Then click Inspect, and a panel opens on the right.

If you move the mouse over each row in the menu, the site highlights the areas associated with the HTML code (Figure 3).

Figure 2: Inspecting the region of interest with your browser / Figure 3: The HTML code of the area of interest

So the first table is in the HTML element table with the class name items. Therein are the elements thead, tfoot and tbody, which in turn contain further elements. The essential contents are found in tbody, which contains a tr element for each row of the table.

We do not need any more information for this first step.

Hence, we first import the required modules:

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

Then we assign the URL to a variable and set a header that simulates a real user; otherwise, some pages block the request directly. We use the requests module to call up the URL and load the contents. Then we parse the content of the response object with BeautifulSoup to extract the content we need:

url = 'https://www.transfermarkt.com/bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2018'
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')

The soup element now contains the parsed page.

From the soup element we are able to extract the desired table:

table = soup.find('table', {'class' : 'items'})

Now we can save every single row of the table in a list:

row = table.findAll('tr')

By typing len(row) we find out that there are 20 elements in total: the 18 teams of the Bundesliga, plus the header and the footer.

With row[2] we should find Bayern Munich:

Figure 4: Bayern Munich

Now we have the data of the first club. In the red-marked areas in Figure 4, we have the name, the size of the squad, the average age, the number of foreigners, the total market value and the average market value. But how do we get the data into a reasonable form to analyze it? To do this, we first have to remove all the HTML code with the .text attribute. row[2].text gives us the information in a more readable form:

'Bayern Munich Bayern Munich 3224,715835,55 Mill. €26,11 Mill. €835,55 Mill. €26,11 Mill. €'

That looks a lot better, but now we have all the information in one string. Thus, we split each row into its columns with findAll('td') so that we can address and save every single cell.

Now, analogous to the rows, we can also address the individual columns, or combine both:

In:
row[2].findAll('td')[1].text
row[2].findAll('td')[3].text
row[2].findAll('td')[4].text
row[2].findAll('td')[5].text
row[2].findAll('td')[6].text
row[2].findAll('td')[7].text

Out:
'Bayern Munich '
'32'
'24,7'
'15'
'835,55 Mill. €'
'26,11 Mill. €'

Now we have everything we need to read the whole table with a loop and create a DataFrame from the resulting lists; we always skip the first row, as it is the header we do not need.
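A minimal sketch of such a loop, assuming the column positions shown above (the list names and DataFrame column names are placeholders of my own choosing):

# Collect the cells of every team row; index 0 is the header row we skip
teams, squads, ages, foreigners, totals, averages = [], [], [], [], [], []

for tr in table.findAll('tr')[1:]:
    cells = tr.findAll('td')
    if len(cells) < 8:
        continue  # skip footer or filler rows that carry no team data
    teams.append(cells[1].text)
    squads.append(cells[3].text)
    ages.append(cells[4].text)
    foreigners.append(cells[5].text)
    totals.append(cells[6].text)
    averages.append(cells[7].text)

df = pd.DataFrame({'Team': teams, 'Squad': squads, 'Age': ages,
                   'Foreigners': foreigners, 'Total Value': totals,
                   'Average Value': averages})

The results table on the same page, with placement, goal difference and points, can be collected analogously and added to the DataFrame.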

Scraping The Data More Dynamically

Of course, we do not go to all that trouble just to read a single table. We will read out the top 10 leagues [2] over a period of seven years. Then we have a solid database for our calculations.

First of all, we create a dictionary for each league, in which we store the data for every single season, as sketched below.
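A minimal sketch of this step; the dictionary names are hypothetical:

# One dictionary per league; each will map a season year to that season's data
bundesliga, premier_league, la_liga = {}, {}, {}
# ... and so on for the remaining top-10 leagues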

Then we create a list with the URL for each league. Can you still remember the Bayern URL for the 2018/2019 season?

https://www.transfermarkt.com/bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2018

In the list we only deposit:

https://www.transfermarkt.com/bundesliga/startseite/wettbewerb/L1/plus/?saison_id=

We add the year using a loop. In this way, we can write a flexible script for our project. The complete list for the top 10 leagues looks like the sketch below. Next, we need a nested loop: the outer loop iterates over the list with the league URLs, and the inner loop iterates over the years 2012 through 2018.
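A minimal sketch of the URL list and the nested loop. Only the Bundesliga stem is taken from the article; the other league paths are illustrative, and scrape_season() is a hypothetical helper wrapping the table-reading code from above:

# URL stems; the season year is appended inside the loop
urls = ['https://www.transfermarkt.com/bundesliga/startseite/wettbewerb/L1/plus/?saison_id=',
        'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1/plus/?saison_id=',
        'https://www.transfermarkt.com/laliga/startseite/wettbewerb/ES1/plus/?saison_id=']
# ... plus the stems of the remaining top-10 leagues

league_dicts = [bundesliga, premier_league, la_liga]  # aligned with urls

for url_stem, league_dict in zip(urls, league_dicts):  # outer loop: leagues
    for year in range(2012, 2019):                     # inner loop: seasons
        try:
            season_df = scrape_season(url_stem + str(year))
            season_df['Year'] = year  # remember the season for the analysis
            league_dict[year] = season_df
        except Exception:
            pass  # no data for this league/season: skip instead of aborting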

The Data Transformation

As a result, we have ten dictionaries, one per league, each holding the seasons from 2012 to 2018.

First, we will combine all seasons for each league and then unite all the leagues into one data set.

The question arises as to why we did not collect everything in one data set directly, if we combine it all in the end anyway. I have included try and except blocks in the scraping part to avoid errors if no data is available for a year or league. Thus, we still get the largest possible amount of data at the best possible quality.

First, we connect each season for each league, deposit the respective country in the individual data sets and then merge all leagues:
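A minimal sketch of this step, following on from the structures assumed above (the country labels are illustrative):

# Stack all seasons of each league and tag every row with its country
countries = ['Germany', 'England', 'Spain']  # ... aligned with league_dicts

frames = []
for league_dict, country in zip(league_dicts, countries):
    league_df = pd.concat(league_dict.values())  # all seasons of one league
    league_df['Country'] = country
    frames.append(league_df)

df_final = pd.concat(frames, ignore_index=True)  # all leagues in one data set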

Let's take a look at three selected clubs:

df_final.loc[df_final['Team'] == 'Bayern Munich']
df_final.loc[df_final['Team'] == 'KV Oostende']
df_final.loc[df_final['Team'] == 'Real Madrid']

Figure 5: String values that still need to be converted

The data looks good so far, but Total Value and Average Value still contain strings instead of numbers.

In Figure 5, we see the suffixes Mill. €, Th. € and Bill. € in each cell. We have to remove these before we can convert the columns to float. In the Age column, the decimal delimiter is still a comma instead of a dot, so we have to clean up this column as well. Then we can convert all columns to float, as sketched below.
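A minimal sketch of the cleaning and conversion, assuming the column names used in the earlier sketches ('Points' comes from the results table scraped alongside):

# Turn strings like '835,55 Mill. €' into floats in millions of euros
def to_millions(value):
    value = value.replace('€', '').strip()
    if value.endswith('Bill.'):
        return float(value[:-5].replace(',', '.')) * 1000
    if value.endswith('Mill.'):
        return float(value[:-5].replace(',', '.'))
    if value.endswith('Th.'):
        return float(value[:-3].replace(',', '.')) / 1000
    return float(value.replace(',', '.'))

for col in ['Total Value', 'Average Value']:
    df_final[col] = df_final[col].apply(to_millions)

# Replace the decimal comma in Age, then convert the numeric columns
df_final['Age'] = df_final['Age'].str.replace(',', '.')
for col in ['Squad', 'Age', 'Foreigners', 'Points']:
    df_final[col] = df_final[col].astype(float)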

For safety, we save the data set once:

df_final.to_pickle('…path/df_final.pkl')
df_final.to_excel('…path/df_final.xlsx')

Finally We Can Answer Our Question

After scraping and cleaning the data, we have a clean data set with which to check our hypothesis.

First, let's take a look at the evolution of the average squad values with the following code:
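A minimal sketch with matplotlib (my choice of library), assuming the Year column introduced in the scraping sketch:

import matplotlib.pyplot as plt

# Average squad value (in millions of euros) per season, across all leagues
avg_per_year = df_final.groupby('Year')['Total Value'].mean()

avg_per_year.plot(kind='bar')
plt.xlabel('Season')
plt.ylabel('Average squad value (Mill. €)')
plt.show()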

We see that the average squad values keep increasing: across all leagues, the teams are investing more and more in their squads.

Now let's see how strongly the two variables correlate with each other:

df_final.corr()['Total Value']['Points']

The correlation coefficient is 0.664, which means that we have a strong positive correlation.

The correlation does not say anything about causality.

We assume that higher squad values also lead to greater sporting success.

We will do a regression to see how strong the influence of the squad value is on the number of points scored:
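A minimal sketch of such a regression with statsmodels (my choice of library):

import statsmodels.api as sm

# OLS regression: points explained by total squad value
X = sm.add_constant(df_final['Total Value'])  # add an intercept term
y = df_final['Points']
model = sm.OLS(y, X).fit()
print(model.summary())  # reports adjusted R², coefficient and p-values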

The adjusted R² is 0.569. This means that the squad value explains 56.9% of the variance in the points. The result is statistically significant, and for every million euros, the number of points scored increases by 0.2029 on average.

Conclusion

In this article, we first developed a hypothesis and then identified a database on the web.

We looked at the basics of web scraping and developed a script for a specific problem, with which we can read the website's data in a targeted way.

Then we prepared and cleaned up the data to form the basis for our analysis.

Using basic statistical methodology, we checked our hypothesis.

The correlation between squad value and sporting success is significant. Furthermore, the influence of the squad value on the points scored is very high.

A much more detailed analysis can be found here.

Github:

The archive: https://github.com/bd317/tutorial_scraping
The first steps in the tutorial: https://github.com/bd317/tutorial_scraping/blob/master/Soccer_Scrape_Tutorial_Part_1%20(1).ipynb
The full script: https://github.com/bd317/tutorial_scraping/blob/master/Soccer_Scrape_Tutorial_Part_2.ipynb

Sources:

[1] https://www.transfermarkt.com
[2] https://www.uefa.com/memberassociations/uefarankings/country/#/yr/2019
