The Magic of Tidying Up

It could.

We aren’t sure exactly what we’re trying to find.

My reference earlier was to commentators using statistics during games.

If we use the data for that purpose, the Date is very important.

However, 282 rows out of 16,781 is minimal.

As with all null values, we can get rid of them, or replace them with something like the mean, the mode, or the median.

That decision depends on the data you are working with.

Let’s take a quick look at the minimum, maximum, average, mode, median and quartile values in this data.

When we do this, we will only get values for the columns that contain numbers.

We’ll talk about the categorical columns next.
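Here is a minimal sketch of that summary step on a tiny made-up frame, not the real data. The column names home and hgoal are my assumptions for illustration; Season and vgoal come from the text.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample, not the real data set.
# "home" and "hgoal" are assumed column names for illustration.
df = pd.DataFrame({
    "Season": [1990, 1990, 1991, 1991],
    "home": ["Team A", "Team B", "Team A", "Team C"],
    "hgoal": [2.0, np.nan, 1.0, 3.0],
    "vgoal": [0.0, 1.0, np.nan, 2.0],
})

# By default describe() summarises numeric columns only:
# count, mean, std, min, 25%, 50% (the median), 75%, max.
summary = df.describe()
print(summary)

# The mode is not part of describe(), so compute it separately.
print(df[["hgoal", "vgoal"]].mode())
```

Notice that the string column home is left out of the summary, and that the count for each goal column skips its null value.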

The season’s year has no null values here.

The number of goals scored at home and away has the same number of null values.

Replacing those null values with the mean would not affect our data for those columns much at all, so the mean is a good value to use.

For home goals the mean is 1.880657.

For the away goals scored, the mean is 1.230657.

We can use a command in pandas called .replace() to replace all the null values in each column with their means.

We also know the mean of vgoal.

We can do the exact same thing to get rid of the null values in that column.

If we check for null values again right after we do this, you’ll see they are gone.
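A minimal sketch of that replace-and-check step might look like this. It runs on a tiny made-up frame, and hgoal is my assumed name for the home goals column (vgoal comes from the text). I compute the mean in code rather than typing it in by hand.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; "hgoal" is an assumed column name.
df = pd.DataFrame({
    "hgoal": [2.0, np.nan, 1.0, 3.0],
    "vgoal": [0.0, 1.0, np.nan, 2.0],
})

for col in ["hgoal", "vgoal"]:
    col_mean = df[col].mean()  # mean() skips null values by default
    # Swap every NaN in the column for the column's mean.
    df[col] = df[col].replace(np.nan, col_mean)

# Count the remaining null values per column; both should be 0.
print(df.isnull().sum())
```

Using df[col].fillna(col_mean) would do the same job here; which one to use is mostly a matter of taste.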

We are now only left with the Date, visitor, FT, and tie columns to contend with.

Remember, the original data frame had 16,781 rows.

With only three null values in the tie column out of over 16,000, dropping those three rows will likely not affect our data much if at all.

So, we will drop those three rows.
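Dropping rows by a single column can be sketched like this. The frame and the values in the tie column are made up for illustration; only the column name tie comes from the text.

```python
import pandas as pd

# Tiny made-up sample; the values in "tie" are placeholders.
df = pd.DataFrame({
    "Season": [1990, 1990, 1991, 1991],
    "tie": ["round-1", None, "round-1", "round-2"],
})

rows_before = len(df)

# Drop only the rows where "tie" is null; other columns are ignored.
df = df.dropna(subset=["tie"])

print(rows_before - len(df), "row(s) dropped")
```

The subset argument matters: without it, dropna() would drop every row that has a null value in any column.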

What should we do with the others?

The date is the exact date of the game.

We have another column called Season with no null values.

Do you think the date is important?

This also depends on what we want to know from this data.

The visitor column represents the away team playing against the home team.

Because this column holds strings rather than numbers, we can’t take its average like we did with the goal columns.

There are 163 missing values for the visiting team out of 16,781 games in this set.

That is just under 1% of the total number of games.
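A quick way to check shares like this one is to divide the missing counts by the total number of rows. The sketch below only uses the counts quoted in this post.

```python
# Counts quoted in the post: missing values per column and total rows.
total_rows = 16781
missing = {"Date": 282, "visitor": 163, "tie": 3}

# Share of rows with a missing value, as a percentage.
shares = {col: n / total_rows * 100 for col, n in missing.items()}

for col, pct in shares.items():
    print(f"{col}: {pct:.2f}% of rows missing")
```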

We could drop those rows as well.

Do you think it would make a difference in our overall data and conclusions?

We could fill the null values with the mode (the team that played most often in the visitor’s column).

Ultimately, the decision is up to the data scientist.

Which is the better option? Should we drop null values? Should we fill them with other information? And, if we do fill them, what do we fill them with?

These are all good questions to ask yourself.

In this case, I’m going to fill them with the mode of the visitor column.

You don’t need to find the mode first.

You can replace the null values with the mode by using the following code.

We’ve now reassigned the visitor column to itself but with the null values filled with the mode.
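A minimal sketch of that mode fill, on a tiny made-up frame with placeholder team names, could look like this:

```python
import pandas as pd

# Tiny made-up sample; the team names are placeholders.
df = pd.DataFrame({
    "visitor": ["Team A", "Team B", None, "Team A", None],
})

# mode() returns a Series (there can be ties), so take the first entry.
most_common = df["visitor"].mode()[0]

# Reassign the column to itself with the null values filled in.
df["visitor"] = df["visitor"].fillna(most_common)

print(df["visitor"].isnull().sum())
```

Keep in mind that if two teams are tied for most common, mode()[0] simply picks the first one in sorted order.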

We have only the Date and Final Score to contend with.

The final score has 283 null values.

We can replace those with a 0–0 score, the mode of the scores, drop those rows, or choose a different value to fill them with.

For the final score (FT), I’m going to fill them again with the mode of the final scores.

Let’s deal with the date at the same time.

We have 282 null values in the date but we never had any null values in the Season.

This means we know what season each game took place in; we just don’t know the specific dates of some games.

Is the specific date important?

It could be, depending on what we’d like to find out.

I’ve already shown you how to fill null values with a different value.

Here, I’m going to drop the Date column as I’m content having at least the season that the game took place in.

After filling the final score and dropping the Date, we should have no null values left! Yay! No more null values!

Dropping the column just used drop() after the data frame name.

We specified which column to drop, and then specified the axis.

axis=1 is for columns.

axis=0 is for rows.
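Both steps, filling FT with its mode and dropping the Date column, can be sketched together as follows. The frame and its values are made up for illustration; only the column names Date, Season and FT come from the text.

```python
import pandas as pd

# Tiny made-up sample; dates and scores are placeholders.
df = pd.DataFrame({
    "Date": ["1990-08-25", None, "1991-08-17"],
    "Season": [1990, 1990, 1991],
    "FT": ["1-0", None, "1-0"],
})

# Fill missing final scores with the most common score.
df["FT"] = df["FT"].fillna(df["FT"].mode()[0])

# Drop the whole Date column; axis=1 means "columns".
df = df.drop("Date", axis=1)

# Every count should now be 0.
print(df.isnull().sum())
```

Writing df.drop(columns="Date") does the same thing and saves you from having to remember which axis is which.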

It is at this point we can use the data to find out additional information.

A question I might ask is, “Is there a correlation between winning games and playing them on your home turf?” Was there any trend in games lost and won depending on the season? What team won the most games?
