Expensive Players Then and Now — An Introductory Scraping and Pandas DataFrame Example

Expensive Players Then and Now — An Introductory Scraping and Pandas DataFrame ExampleThe Most Expensive January Signings for Each Position in a Football TeamStephen FordhamBlockedUnblockFollowFollowingMay 31In Football, January offers a chance for teams in the League to make changes to their squad.

With this in mind, this tutorial will focus on the most expensive January Premier League signings for each position in a standard 4–4–2 formation, and explore the change in their Market value from when they were first purchased, to their market value now, in 2019.

This Tutorial will use a guided example to illustrate how to scrape data, transform it into a Pandas DataFrame and group data.

I first stumbled on the Forbes Web-page shown below.

This Web-page has all the information of interest.

Notably, all the January signings in all positions, the club the players transferred from, and the club they transferred to.

In addition, the year that transfer took place, and their transfer fees in £’s and $’s, when the players were originally purchased, and their value now in £’s and $’s based on a ‘football player inflation rate’.

Note: ‘The football player inflation rate is based on the average change for each Premier League transfer per player in each season’For this tutorial example, we will be using the figures calculated and provided on the Forbe’s Web-page.

The inflation rate is used to calculate the cost of a player if they were to be purchased in today’s market, as opposed to their original cost.

To begin, I import the requests library.

I then create a url variable, and assign the URL link from the Forbes website to this variable.

This link contains information relating to the most expensive transfers in the January Transfer window.

I use the get method from the requests library to begin extracting my data.

To confirm the connection from my python environment to the Web-page, I print the status_code, which returns 200.

200 is the HTTP status code for “OK”, a successful response.

(Other codes you may be familiar with are 404 Not Found, 403 Forbidden, and 500 Internal Server Error.

)Once the connection is validated, I use the text attribute to retrieve the underlying html source code of the Forbes Web-page.

When I print the html text, the output is quite large in my console.

To help narrow down my area of interest, I use ‘CTRL-F’ whilst in the console, and search for the name of a player I know will lead me to where I would like to extract data from.

This search leads me to the information I require, immediately after the <strong> tag.

From a glance at the source output, I can see Shay Given, Newcastle to Manchester City, his transfer fee in £’s and $’s, when the player was originally purchased, and his value now in £’s and $’s.

What is particularly useful in this scenario, is the regularity in the underlying HTML source code.

For example, each player’s name is embedded the same way within a <strong> tag.

In addition, the Team the player transferred from is always after the ‘nbsp;’ characters, and before the word ‘to’.

This pattern means I can create a for loop to iterate over my code and extract the relevant information.

Initially, I like to narrow down my search area.

The output from htmltext = response.

text returns a giant string, which I can split, using the split string method.

I split on ‘*Fees reported’ and take everything after that string by indexing at 1, and split again on a unique string, and take the string before that by indexing at 0.

This leave me with a shortened string.

To convert the string, ‘secondsplit’ into a list, I split it at a strong tag and take everything after the second element (python index 1) until the end of my list.

This leaves me a list (thirdsplit) with all the information I want to extract inside for each player.

Creating the DataFrameI can now iterate over my list.

To find the original transfer fee for example, I use the split method to split each element in my thirdsplit list, and extract the string after the ‘($’ sign and before the ‘m)’ characters by using substring notation.

I then append each transfer fee to a newly constructed list called transfer cost.

I repeat this same process for get the players, their previous club, year of transfer, new club, and fee as of 2019 when a football player inflation rate is applied.

In some instances, the for loop does not quite remove all the necessary characters.

When I run the for loop as shown, the prefix, ‘Jos&amp;eacute;’ appears before the player Antonio Reyes.

To create the new players list, I simply add a condition to remove these characters, and generate a new list.

Once the lists have been assembled and cleaned up, I can create a pandas DataFrame.

To do so, I import pandas, and zip my lists together inside the pd.

DataFrame method.

I assign appropriately labelled column names for the lists so that they are representative of their content, for example the ‘Players’ column refers to the list, ‘Players_refined’.

The datatypes of the transfer_df DataFrame columns, Transfer_Fee, Transfer_Year and Cost_2019 are changed to appropriate numeric types using the .

astype() method, before the DataFrame is sorted based on highest Transfer_Fee using the .

sort_values() method.

I finally create two additional columns.

One which determines the ratio between the Cost of the player in 2019 (based on player inflation rate) and their transfer fee at the time of purchase.

The second column refers to the player’s respective position.

When the transfer_df DataFrame is viewed all the necessary changes have been implemented.

The data has been successfully extracted from the Forbes Webpage and inputted into a Pandas DataFrame!An aspect of the pandas library which I really like is the groupby method.

Here, the data is in a very clear and readable form.

It is clear that in January, Arsenal signed the most expensive left-winger(LW), Antonio Reyes for $23.

9M.

Further to this, Chelsea have signed 4 of the 11 most expensive January signings (based on the 4–4–2 formation), 2 of which were in 2011.

This data is easier to interpret for comparison purposes that the original text displayed on the Web-page.

A bar-plot can now be produced to help visualize the relative change in market value, between time of original purchase to the present day.

Both the Transfer_Fee_change_column and the schematic illustrate this point more clearly than the raw text depicted on the Forbes Web-page.

This schematic shows that if Fernando Torres had been purchased in today’s transfer market, he would cost 2.

10 times more than he did in 2011.

However, from our table and schematic we can clearly see that Aymeric Laporte would cost less, 0.

84 times what was paid for him in 2018.

At the extreme end, if Scott Parker was purchased today, he would be 3.

37 times more expensive than his original 2004 transfer fee.

A normalization dotted line has been added at 1.

0 on the y-axis to shown the relative change in ratio.

ConclusionThis introductory example described how we can scrape data and transform it into a pandas DataFrame.

By doing so, the data can easily be manipulated and analysed.

In this case, it is now easy to group, based on club, create a new column which calculates the transfer fee change ratio between original purchase and the present market value.

Further to this, we can create informative plots, like the bar-plot shown in this article.

.

. More details

Leave a Reply