10 Python Pandas tricks to make data analysis more enjoyable

There are ways to fix these issues.

A. Highlight all negative values in a dataframe. (example revised from https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html)

import pandas as pd

def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

df = pd.DataFrame(dict(col_1=[1.53, -2.5, 3.53],
                       col_2=[-4.1, 5.9, 0]))
df.style.applymap(color_negative_red)

B. Hide the index. Try df.head().style.hide_index()!

C. Add hovering effects. (example revised from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.set_table_styles.html)

import numpy as np

df = pd.DataFrame(np.random.randn(5, 3))
df.style.set_table_styles(
    [{'selector': 'tr:hover',
      'props': [('background-color', 'yellow')]}])

D. More CSS styles. You can use CSS to change the appearance of the table.

df = pd.DataFrame(dict(departure=['SFO', 'SFO', 'LAX', 'LAX', 'JFK', 'SFO'],
                       arrival=['ORD', 'DFW', 'DFW', 'ATL', 'ATL', 'ORD'],
                       airlines=['Delta', 'JetBlue', 'Delta', 'AA', 'SouthWest', 'Delta']),
                  columns=['airlines', 'departure', 'arrival'])
df.style.set_table_styles(
    [{'selector': 'tr:nth-of-type(odd)',
      'props': [('background', '#eee')]},
     {'selector': 'tr:nth-of-type(even)',
      'props': [('background', 'white')]},
     {'selector': 'th',
      'props': [('background', '#606060'),
                ('color', 'white'),
                ('font-family', 'verdana')]},
     {'selector': 'td',
      'props': [('font-family', 'verdana')]}]
).hide_index()

2. Pandas options

The reader may have experienced the following issues when using .head(n) to check the dataframe:

(1) There are too many columns / rows in the dataframe, and some columns / rows in the middle are omitted.
(2) Columns containing long texts get truncated.
(3) Columns containing floats display too many / too few digits.

One can set the following options to solve these issues:

import pandas as pd

pd.options.display.max_columns = 50   # None -> no restriction
pd.options.display.max_rows = 200     # None -> be careful with this
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3

3. Group by with multiple aggregations

In SQL we can do aggregations like

SELECT A, B, max(A), avg(A), sum(B), min(B), count(*)
FROM table
GROUP BY A, B

In Pandas it can be done with .groupby() and .agg():

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite', 'sprite', 'coke', 'coke'],
                       B=['alpha', 'gamma', 'alpha', 'beta', 'gamma', 'beta', 'beta'],
                       col_1=[1, 2, 3, 4, 5, 6, 7],
                       col_2=[1, 6, 2, 4, 7, 9, 3]))
tbl = df.groupby(['A', 'B']).agg({'col_1': ['max', np.mean],
                                  'col_2': ['sum', 'min', 'count']})
# 'count' will always be the count for number of rows in each group.

Both the rows and the columns of the result are multi-indexed. A quick solution to change it to a dataframe without multi-indices is

tbl = tbl.reset_index()
tbl.columns = ['A', 'B', 'col_1_max', 'col_1_mean', 'col_2_sum', 'col_2_min', 'col_2_count']

If you would like to have the column renaming process automated, you can use tbl.columns.get_level_values(0) and tbl.columns.get_level_values(1) to extract the indices in each level and combine them.
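For instance, the two levels can be joined with an underscore. The join expression below is a minimal sketch, not from the original post; it keeps the plain names of the grouping keys, whose second-level label is an empty string:

```python
import pandas as pd

df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite', 'sprite', 'coke', 'coke'],
                       B=['alpha', 'gamma', 'alpha', 'beta', 'gamma', 'beta', 'beta'],
                       col_1=[1, 2, 3, 4, 5, 6, 7],
                       col_2=[1, 6, 2, 4, 7, 9, 3]))
tbl = df.groupby(['A', 'B']).agg({'col_1': ['max', 'mean'],
                                  'col_2': ['sum', 'min', 'count']}).reset_index()

# each column label is now a (level_0, level_1) tuple, e.g. ('col_1', 'max');
# join the non-empty parts so 'A' and 'B' keep their plain names
tbl.columns = ['_'.join(part for part in col if part) for col in tbl.columns]
```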

4. Column slicing

Some of you might be familiar with this already, but I still find it very useful when handling a dataframe with a ton of columns.

df.iloc[:, 2:5].head()         # select the columns at positions 2-4 (0-based)
df.loc[:, 'column_x':].head()  # select all columns starting from 'column_x'

5. Add row ID / random row ID to each group

To add a row ID / random row ID for each group defined by (A, B), one can first append an ID / random ID to all rows:

import numpy as np

# df: target dataframe
np.random.seed(0)  # set random seed
df['random_ID_all'] = np.random.permutation(df.shape[0])
df['ID_all'] = [i for i in range(1, df.shape[0] + 1)]

One can then rank those IDs within each group (by A, B):

df['ID'] = df.groupby(['A', 'B'])['ID_all'].rank(method='first', ascending=True).astype(int)
df['random_ID'] = df.groupby(['A', 'B'])['random_ID_all'].rank(method='first', ascending=True).astype(int)

to get a distinct 1-based (random) row ID within each group.

6. List all unique values in a group

Sometimes after we perform a group by, we'd like to aggregate the values in the target column as a list of unique values instead of max, min, etc. This is how it's done.

df = pd.DataFrame(dict(A=['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                       B=[1, 1, 1, 2, 2, 1, 1, 1, 2],
                       C=['CA', 'NY', 'CA', 'FL', 'FL', 'WA', 'FL', 'NY', 'WA']))
tbl = (df[['A', 'B', 'C']].drop_duplicates()
       .groupby(['A', 'B'])['C']
       .apply(list)
       .reset_index())
# list to string (separated by commas)
tbl['C'] = tbl.apply(lambda x: ','.join([str(s) for s in x['C']]), axis=1)

If you'd like to save the result, don't forget to change the separator to anything other than commas.
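For example, a pipe separator cannot collide with the commas inside the joined lists. A small sketch; the in-memory buffer below just stands in for a real file path:

```python
import io
import pandas as pd

tbl = pd.DataFrame(dict(A=['A', 'B'], B=[1, 1], C=['CA,NY', 'WA,FL']))

buf = io.StringIO()  # stands in for a real file path
tbl.to_csv(buf, sep='|', index=False)  # '|' avoids clashing with the commas inside C
```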

7. Add row total and column total to a numerical dataframe

This is another common data manipulation. All you need is .apply().

df = pd.DataFrame(dict(A=[2, 6, 3], B=[2, 2, 6], C=[3, 2, 3]))
df['col_total'] = df.apply(lambda x: x.sum(), axis=1)
df.loc['row_total'] = df.apply(lambda x: x.sum())

8. Check memory usage

.memory_usage(deep=True) can be used on Pandas dataframes to see the amount of memory used (in bytes) for each column. It's useful when building machine learning models, which may require a lot of memory in training.
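A quick sketch (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(dict(ints=[1, 2, 3], words=['a', 'bb', 'ccc']))

# deep=True also counts the Python objects stored in object columns (e.g. strings);
# without it, only the 8-byte pointers to those objects are counted
mem = df.memory_usage(deep=True)
print(mem)
```

Columns that dominate the report are good candidates for downcasting or conversion to the category dtype.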

9. Cumulative sum

From time to time, a cumulative sum is required when you generate some statistical outcomes. Simply do df['cumulative_sum'] = df['target_column'].cumsum().
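The same idea also works within groups via .groupby(). A minimal sketch with a made-up store/sales frame:

```python
import pandas as pd

df = pd.DataFrame(dict(store=['a', 'a', 'b', 'b'], sales=[1, 2, 3, 4]))

# running total over the whole column, then restarted within each store
df['running_total'] = df['sales'].cumsum()
df['running_by_store'] = df.groupby('store')['sales'].cumsum()
```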

10. Crosstab

When you need to count the frequencies for groups formed by 3+ features, pd.crosstab() can make your life easier.
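A minimal sketch with made-up flight data, counting flights for each combination of departure, airline, and arrival:

```python
import pandas as pd

df = pd.DataFrame(dict(departure=['SFO', 'SFO', 'LAX', 'LAX', 'JFK'],
                       arrival=['ORD', 'DFW', 'DFW', 'ATL', 'ATL'],
                       airline=['Delta', 'AA', 'AA', 'Delta', 'Delta']))

# rows: departure; columns: (airline, arrival) pairs; cells: frequencies
tbl = pd.crosstab(df['departure'], [df['airline'], df['arrival']])
```

Passing a list as the second argument produces multi-indexed columns, so all three features contribute to the grouping.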

Thanks for reading! Comment below if you find bugs / better solutions.
