Pandas for People In A Hurry

If you don't have Pandas installed yet, you can install it from inside Jupyter Notebook with !pip install pandas. The ! character tells Jupyter Notebook to treat the code as an OS shell command.
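For example, a typical first cell might look like this (a minimal sketch; the pd alias is the standard convention, and NumPy is imported here because a later example uses np.sqrt):

!pip install pandas   # run once; the ! sends this command to the OS shell
import pandas as pd
import numpy as np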

Expert tip: If you happen to have any columns that are of the datetime type, add parse_dates = [column_name] as an argument inside the parentheses, so that Pandas can recognize the column as a date.

For example, if you have a date column called Collection_Date, you would use pd.read_excel("Soils.xls", parse_dates = ['Collection_Date']). Here's a snippet of my dataframe once I've successfully imported the data.

Exploring Your Data

Now that your data is inside a Pandas dataframe, you're ready to investigate. Here are several useful methods to call every time you bring in a data set. These steps should help you understand what you have in your data.

Exploring DataFrame

df.head() returns the first 5 rows of the dataframe. To override the default, you may insert a value between the parentheses to change the number of rows returned. Example: df.head(10) will return 10 rows.

df.tail() returns the last 5 rows of the dataframe. You may insert a value between the parentheses to change the number of rows returned.

df.shape returns a tuple representing the dimensions. For example, an output of (48, 14) represents 48 rows and 14 columns.

df.info() provides a summary of the data, including the index data type, column data types, non-null values, and memory usage.

df.describe() provides descriptive statistics that summarize the central tendency, dispersion, and shape of the numeric columns.
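Taken together, a typical first-look cell might read (a sketch, assuming your data is already loaded into df):

df.head()      # first 5 rows
df.shape       # (rows, columns)
df.info()      # dtypes, non-null counts, memory usage
df.describe()  # summary statistics for numeric columns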

Exploring Columns

One of my favorite methods is .value_counts(). This method provides counts of each unique value in the column you selected.

Expert tip: If you would rather see the breakdown as percentages instead of absolute numbers, try .value_counts(normalize=True) instead; .value_counts(1) is shorthand for the same thing, since normalize is the method's first parameter.
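For instance, on the soil data (a sketch, using the 'Contour' column that appears in the examples below):

df['Contour'].value_counts()                # absolute counts per unique value
df['Contour'].value_counts(normalize=True)  # the same breakdown as proportions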

Here are a couple more useful methods to see what's in your dataframe.

df['Contour'].isnull().sum() returns the count of null values in the column 'Contour'.

df['pH'].notnull().sum() returns the count of non-null values in the column 'pH'.

df['Depth'].unique() returns the unique values in the column 'Depth'.

df.columns returns the names of all the columns.

Selecting Data

There are many different ways to select the data you are interested in.

Here are some of the ones that I use most often.

Column Selection

If you want to select just one column, you can use df['Group'], where 'Group' is the column name. To select multiple columns, you would use df[['Group', 'Contour', 'Depth']]. Note that double brackets are used when selecting multiple columns; just pass in the names of the columns that you want.

Subset Selection/Indexing

To select specific subsets of data, we rely on the .loc and .iloc methods. I recommend that you play around with these two methods in order to better understand how they work.

We can use df.loc[] to select by the labels of the rows and columns. df.loc[:, ['Contour']] selects all the rows and the column 'Contour'. The single colon : selects all of the rows. To the left of the comma, you designate the rows you want; to the right of the comma, you designate the columns.

df.loc[0:4, ['Contour']] selects the rows labeled 0 to 4 (inclusive, since .loc works on labels) and the column 'Contour'.

On the other hand, we use df.iloc[] to select rows and columns by position, passing in integer locations. df.iloc[:, 2] selects all rows and the column at position 2. df.iloc[3, :] selects the row at position 3 and all columns.
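A quick side-by-side makes the difference concrete (a sketch; the column position used here is hypothetical and depends on your dataframe's layout):

df.loc[0:4, ['Contour']]   # label-based: rows labeled 0 through 4, inclusive
df.iloc[0:4, [1]]          # position-based: rows 0 through 3, end exclusive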

Filtering

In addition, we can use a mask to select specific rows of our dataframe that fit certain criteria. In the mask below, each value in the column is evaluated against the statement; that is, does the value equal "Top"?

mask = df['Contour'] == "Top"
df[mask]

A boolean value of True or False is returned from the code df['Contour'] == "Top". When we use our boolean mask in df[mask], it prints only the rows in which the mask reads True.

The mask can also be used to evaluate numeric columns. In the example below, a boolean value of True is returned when the value in the 'Conduc' column is greater than 10.

mask2 = df['Conduc'] > 10
df[mask2]

Another useful way to use a mask is to filter for items in a list.

mask3 = df['Contour'].isin(['Depression', 'Slope'])
df[mask3]

Expert Tip: If you want to select the opposite of your mask, you can use the tilde symbol (~). For example, df[~mask3] would select all rows that do not contain Depression or Slope in the 'Contour' column.
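Masks can also be combined when one condition isn't enough (a sketch; each condition needs its own parentheses, with & for "and" and | for "or"):

df[(df['Contour'] == 'Top') & (df['Conduc'] > 10)]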

Cleaning Data

A data science project can only be as good as the data that goes into it. That is, we want good data quality before we plug it into a machine learning algorithm. Below are some useful methods to help speed up your data cleaning.

Notice that I pass in inplace=True to many of the methods below. This ensures that the change takes place in the actual dataframe; if you do not pass in this argument, the changes are not applied to your dataframe.
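If you would rather avoid inplace=True, you can assign the result back instead; the effect is the same (a sketch using the replace example below):

df = df.replace({'Topk': 'Top'})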

Replacing weird characters
df.replace({'Topk': 'Top'}, inplace=True)

Deleting null values
df.dropna(subset=['pH'], inplace=True)  # drops the rows where 'pH' is null
Note that calling df['pH'].dropna(inplace=True) would only operate on a copy of that column and leave the dataframe itself unchanged, so use the subset argument on the dataframe instead.

Imputing null values
df['pH'] = df['pH'].fillna(df['pH'].mean())  # nulls are imputed with the mean of the pH column

Deleting rows and columns
df.drop(columns=['Na'], inplace=True)  # this drops the 'Na' column
df.drop(2, axis=0, inplace=True)  # this drops the row at index 2

Something to note is that axis=0 tells Pandas to drop by row; you can use axis=1 to drop by column. This applies to many of the other Pandas methods available.

Changing column names
df.rename(columns={'Conduc': 'Cond', 'Dens': 'Density'}, inplace=True)

Manipulating data
You can use .apply to apply a function across a row or column of your dataframe. The code below applies the square root to all the values in the 'Cond' column (this assumes NumPy has been imported as np):
df['Cond'].apply(np.sqrt)
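.apply also accepts your own functions, including lambdas (a sketch with a hypothetical transformation):

df['Cond'].apply(lambda x: x * 2)  # doubles every value in 'Cond'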

Grouping Data, Data Aggregation

Sometimes, differences and patterns can be better identified if we group the data for observation. Here are a couple of different ways in which you can group the data.

The examples below group the data by the 'Contour' column and calculate the mean, count, or sum of the records in the 'Ca' column. Run them in a Jupyter Notebook to better grasp what each one returns.

df.groupby(by=['Contour'])['Ca'].mean()
df.groupby(by=['Contour'])['Ca'].count()
df.groupby(by=['Contour'])['Ca'].sum()

You can group by multiple columns as well.

df.groupby(by=['Contour', 'Gp'])['Ca'].mean()
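If you want several statistics at once, .agg computes them in a single pass (a sketch):

df.groupby(by=['Contour'])['Ca'].agg(['mean', 'count', 'sum'])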

Combining Multiple DataFrames

There are several ways to combine two dataframes, and depending on your particular task, one may be better suited than the others.

The top two that I tend to use are concat and merge.

Concat is good for stacking rows of multiple dataframes.

Concatenate data by column:
pd.concat([df, df2], axis=1)

Concatenate data by row:
pd.concat([df, df2], axis=0)

Merge is good for combining dataframes when you have common columns between your dataframes.

Merge data:
pd.merge(df, df2, left_on='Contour', right_on='Contour', how='outer')

I won't go into too much detail about merge, but you can find great examples of how to better utilize it in the pandas documentation.
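Since both dataframes share the key column name here, on='Contour' is an equivalent shorthand:

pd.merge(df, df2, on='Contour', how='outer')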

Output DataFrame

Once you're done cleaning and manipulating your data, you can output it to a CSV or Excel file.

To output the data to a tab-delimited CSV file, use the code below. sep='\t' indicates that you want it to be tab delimited; if you prefer a comma instead, use sep=','.

df.to_csv('myDataFrame.csv', sep='\t')

To output to Excel, use ExcelWriter as a context manager so the file is saved and closed automatically (writing .xlsx files may require the openpyxl package):

with pd.ExcelWriter('myDataFrame.xlsx') as writer:
    df.to_excel(writer, sheet_name='DataFrame')

Thanks for reading.

Don't forget to comment if you have some cool Pandas tips to share.

Cheers.

