Transforming Xs and Ys (Mostly Ys) into Football Formations

If the Ys describe how low or high up the pitch the players are, the Xs become irrelevant.

Again, for simplicity’s sake, let’s focus on transforming the home team’s formation, and keep the match ID for later use.

    home = match[['match_api_id',
                  'home_player_Y1', 'home_player_Y2', 'home_player_Y3',
                  'home_player_Y4', 'home_player_Y5', 'home_player_Y6',
                  'home_player_Y7', 'home_player_Y8', 'home_player_Y9',
                  'home_player_Y10', 'home_player_Y11']].copy()
    home.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 25979 entries, 0 to 25978
    Data columns (total 12 columns):
    match_api_id       25979 non-null int64
    home_player_Y1     24158 non-null float64
    home_player_Y2     24158 non-null float64
    home_player_Y3     24147 non-null float64
    home_player_Y4     24147 non-null float64
    home_player_Y5     24147 non-null float64
    home_player_Y6     24147 non-null float64
    home_player_Y7     24147 non-null float64
    home_player_Y8     24147 non-null float64
    home_player_Y9     24147 non-null float64
    home_player_Y10    24147 non-null float64
    home_player_Y11    24147 non-null float64
    dtypes: float64(11), int64(1)
    memory usage: 2.4 MB

Calling info on the new dataframe, we see that about 1,800 matches do not have complete information on where the players lined up.

Let’s remove those:

    home = home.dropna()
    home.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 24147 entries, 144 to 25978
    Data columns (total 12 columns):
    match_api_id       24147 non-null int64
    home_player_Y1     24147 non-null float64
    home_player_Y2     24147 non-null float64
    home_player_Y3     24147 non-null float64
    home_player_Y4     24147 non-null float64
    home_player_Y5     24147 non-null float64
    home_player_Y6     24147 non-null float64
    home_player_Y7     24147 non-null float64
    home_player_Y8     24147 non-null float64
    home_player_Y9     24147 non-null float64
    home_player_Y10    24147 non-null float64
    home_player_Y11    24147 non-null float64
    dtypes: float64(11), int64(1)
    memory usage: 2.4 MB

Perfect.
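To see what dropna() does on a smaller scale, here is a minimal sketch on a made-up three-match frame (the values and the two-column layout are assumptions for illustration, not rows from the real dataset). It counts the missing values per column first, then drops any row with at least one gap:

```python
import numpy as np
import pandas as pd

# Made-up toy data: three matches, only two Y columns for brevity.
toy = pd.DataFrame({
    "match_api_id": [1, 2, 3],
    "home_player_Y1": [1.0, np.nan, 1.0],
    "home_player_Y2": [3.0, np.nan, np.nan],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value (the default how="any").
cleaned = toy.dropna()

print(missing_per_column)
print(cleaned)
```

Only match 1 survives, because matches 2 and 3 each have at least one missing coordinate.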

Now some feature engineering!

Above we see that the Y coordinates are floats, but eventually I want to turn them into categorical strings that identify formations. So let’s make a helper function that keeps only the digits of a string:

    def convert_num_str(s, oth=''):
        # oth is unused; kept from the original signature
        nums = '0123456789'
        for c in s:
            if (c in nums) == False:
                s = s.replace(c, '')
        return s

The parentheses around c in nums are integral here. Without them, Python reads c in nums == False as a chained comparison, which is never true in this case, so nothing gets stripped and the result keeps its dict_values form instead of becoming a clean formation string.
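The effect of those parentheses is easy to check in isolation. The sketch below uses a sample string of the kind the function will receive later; keep_digits is a hypothetical name for a more compact spelling of the same filter, not something from the original notebook:

```python
s = "dict_values([4, 4, 2])"
nums = "0123456789"

# Without parentheses, Python chains the comparison:
# `c in nums == False` means `(c in nums) and (nums == False)`,
# and a non-empty string never equals False.
unparenthesized = [c for c in s if c in nums == False]

# With parentheses, the test is the intended "c is not a digit".
parenthesized = [c for c in s if (c in nums) == False]

# A more compact spelling of the same filter (hypothetical helper name).
def keep_digits(text):
    return "".join(c for c in text if c in nums)

print(unparenthesized)   # []
print(keep_digits(s))    # 442
```

The unparenthesized version selects no characters at all, which is why the original function would return its input unchanged without the parentheses.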

Next, initiate an empty dictionary to store the formations. Every row represents a game, so by keying the dictionary on the match ID, we can later map it back onto the dataframe and keep everything aligned.

To iterate through each row and extract the data, we will use .iterrows(), which yields the index and the row for each match, so we can pair the match IDs and formations together:

    from collections import Counter

    home_formation = {}

    # every row is a match, take out formation according to Y coordinates
    for index, row in home.iterrows():
        home_player_y = list()
        for i in range(2, 12):
            home_player_y.append(row['home_player_Y%d' % i])
        c_home = Counter(home_player_y)
        formation_home = Counter(sorted(c_home.elements())).values()
        formation_home = convert_num_str(str(formation_home))
        home_formation.update({row['match_api_id'] : formation_home})

That’s a lot of code to read through, so let’s break it down!

The player columns in the dataframe are numbered 1–11, with the goalie being number 1.

Since formations don’t ever change the goalie, we can skip it when building the formation, so the for loop starts at 2 while appending to our list of Ys:

    for i in range(2, 12):
        home_player_y.append(row['home_player_Y%d' % i])

Example output of the first iteration:

    [3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 7.0, 7.0, 10.0, 10.0]

Now that we have the Ys together, we count them using the Counter function.

    c_home = Counter(home_player_y)

This outputs a dictionary-like object where the key:value pairs are element (Y):count. To keep the order from lowest to highest, we also call the sorted function on the elements (the Ys), so that we get something like a 4–4–2 (four 3s, four 7s, and two 10s).

    formation_home = Counter(sorted(c_home.elements())).values()

Now we have the formation as a dict_values view of the counts 4, 4, 2, and to transform this into the string ‘442’, we can use the handy function we created earlier:

    formation_home = convert_num_str(str(formation_home))

And finally, the last line updates the dictionary, home_formation, using the row’s match_api_id as the key and the formation string for that match as the value:

    home_formation.update({row['match_api_id'] : formation_home})

Calling on home_formation, we can see the fruit of our labor. Success! We now have a dictionary of the match IDs and their formations.
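The counting step can also be tried on its own. This is a minimal sketch, assuming Python 3.7+ (where a Counter keeps keys in first-seen order); formation_from_ys is a hypothetical helper name, and it joins the counts directly instead of going through convert_num_str:

```python
from collections import Counter

def formation_from_ys(ys):
    """Hypothetical helper: outfield Y coordinates -> formation string."""
    # Sorting first means the Counter sees the lowest line (defence) first,
    # and Counter keeps keys in first-seen order (Python 3.7+).
    counts = Counter(sorted(ys))
    return "".join(str(n) for n in counts.values())

example_ys = [3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 7.0, 7.0, 10.0, 10.0]
unordered_ys = [10.0, 7.0, 3.0, 7.0, 3.0, 10.0, 3.0, 7.0, 3.0, 7.0]

print(formation_from_ys(example_ys))    # 442
print(formation_from_ys(unordered_ys))  # 442
```

Both lists give the same result, which shows why the sort matters: the order of the player columns in a row does not affect the formation string.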

To finally rein it all in and add it back into our dataframe, we can create the new column ‘home_formation’ by using the match_api_id to map the home_formation dictionary:

    home['home_formation'] = home['match_api_id'].map(home_formation)

Let’s now delete the columns that contain the Ys:

    home.drop(columns=['home_player_Y1', 'home_player_Y2', 'home_player_Y3',
                       'home_player_Y4', 'home_player_Y5', 'home_player_Y6',
                       'home_player_Y7', 'home_player_Y8', 'home_player_Y9',
                       'home_player_Y10', 'home_player_Y11'], inplace=True)

And calling on our dataframe one last time, we can see we are successful!

While this blog post was long-winded, I hope that my instructions are clear enough for you to try it out for yourself.
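If you want to try the mapping step without downloading the dataset, here is a minimal end-to-end sketch on a made-up two-match frame. The match IDs and coordinates are invented for illustration, and the variable names toy, y_cols, and formations are assumptions rather than names from my notebook:

```python
from collections import Counter
import pandas as pd

# Made-up toy data: two matches, goalkeeper in Y1, outfield players in Y2-Y11.
y_cols = ["home_player_Y%d" % i for i in range(1, 12)]
toy = pd.DataFrame(
    [
        [101, 1, 3, 3, 3, 3, 7, 7, 7, 7, 10, 10],   # 4-4-2
        [102, 1, 3, 3, 3, 3, 6, 6, 8, 8, 8, 11],    # 4-2-3-1
    ],
    columns=["match_api_id"] + y_cols,
)

# Build the match ID -> formation dictionary, one row at a time.
formations = {}
for _, row in toy.iterrows():
    outfield = sorted(row[c] for c in y_cols[1:])   # skip the goalkeeper
    counts = Counter(outfield).values()
    formations[row["match_api_id"]] = "".join(str(n) for n in counts)

# Map the dictionary back onto the frame, then drop the raw Y columns.
toy["home_formation"] = toy["match_api_id"].map(formations)
toy = toy.drop(columns=y_cols)
print(toy)
```

The result has just two columns, match_api_id and home_formation, with the formations stored as strings.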

Remember, by transforming our crude X,Y coordinates into usable team formations, we can now use this refined product for future analysis, model training, and/or hypothesis testing.

If you would like to see the code in my project, check out the repository on my GitHub, where I created a separate Jupyter notebook for this blog post!
