There are ways to fix these issues.
A. Highlight all negative values in a dataframe. (example revised from https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html)

```python
import pandas as pd

def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

df = pd.DataFrame(dict(col_1=[1.53, -2.5, 3.53],
                       col_2=[-4.1, 5.9, 0]))
df.style.applymap(color_negative_red)
```

B. Hide the index.
Try df.head().style.hide_index()!

C. Add hovering effects.
(example revised from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.set_table_styles.html)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3))
df.style.set_table_styles([{'selector': 'tr:hover',
                            'props': [('background-color', 'yellow')]}])
```

D.
More CSS styles. You can use CSS to change the appearance of the table.

```python
df = pd.DataFrame(dict(departure=['SFO', 'SFO', 'LAX', 'LAX', 'JFK', 'SFO'],
                       arrival=['ORD', 'DFW', 'DFW', 'ATL', 'ATL', 'ORD'],
                       airlines=['Delta', 'JetBlue', 'Delta', 'AA', 'SouthWest', 'Delta']),
                  columns=['airlines', 'departure', 'arrival'])
df.style.set_table_styles([
    {'selector': 'tr:nth-of-type(odd)', 'props': [('background', '#eee')]},
    {'selector': 'tr:nth-of-type(even)', 'props': [('background', 'white')]},
    {'selector': 'th', 'props': [('background', '#606060'), ('color', 'white'), ('font-family', 'verdana')]},
    {'selector': 'td', 'props': [('font-family', 'verdana')]},
]).hide_index()
```

2.
Pandas options

The reader may have experienced the following issues when using .head(n) to check the dataframe:

(1) There are too many columns / rows in the dataframe, and some columns / rows in the middle are omitted.
(2) Columns containing long texts get truncated.
(3) Columns containing floats display too many / too few digits.

One can set

```python
import pandas as pd

pd.options.display.max_columns = 50   # None -> no restrictions
pd.options.display.max_rows = 200     # None -> be careful with this
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3
```

to solve these issues.
3. Group by with multiple aggregations

In SQL we can do aggregations like

```sql
SELECT A, B, max(A), avg(A), sum(B), min(B), count(*)
FROM table
GROUP BY A, B
```

In Pandas it can be done with .groupby() and .agg():

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite', 'sprite', 'coke', 'coke'],
                       B=['alpha', 'gamma', 'alpha', 'beta', 'gamma', 'beta', 'beta'],
                       col_1=[1, 2, 3, 4, 5, 6, 7],
                       col_2=[1, 6, 2, 4, 7, 9, 3]))
tbl = df.groupby(['A', 'B']).agg({'col_1': ['max', np.mean],
                                  'col_2': ['sum', 'min', 'count']})
# 'count' will always be the number of rows in each group.
```
The resulting table has multi-indexed rows and columns. A quick solution to change it to a dataframe without multi-indices is

```python
tbl = tbl.reset_index()
tbl.columns = ['A', 'B', 'col_1_max', 'col_1_mean', 'col_2_sum', 'col_2_min', 'count']
```

If you would like to have the column renaming process automated, you can use tbl.columns.get_level_values(0) and tbl.columns.get_level_values(1) to extract the labels in each level and combine them.
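One way to sketch that automation (the small dataframe here is made up, but shaped like the one above; the underscore-joining rule is one reasonable choice, not the only one):

```python
import pandas as pd

df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke'],
                       B=['alpha', 'gamma', 'alpha'],
                       col_1=[1, 2, 3],
                       col_2=[1, 6, 2]))
tbl = df.groupby(['A', 'B']).agg({'col_1': ['max', 'mean'],
                                  'col_2': ['sum', 'min', 'count']}).reset_index()

# Join the two header levels with '_'; keep the level-0 name alone
# where level 1 is empty (the former index columns A and B).
tbl.columns = [lvl0 if lvl1 == '' else '%s_%s' % (lvl0, lvl1)
               for lvl0, lvl1 in zip(tbl.columns.get_level_values(0),
                                     tbl.columns.get_level_values(1))]
```

After this, tbl.columns is a flat list like ['A', 'B', 'col_1_max', 'col_1_mean', ...], so the names track the aggregations automatically instead of being typed by hand.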
4. Column slicing

Some of you might be familiar with this already, but I still find it very useful when handling a dataframe with a ton of columns.

```python
df.iloc[:, 2:5].head()         # select columns at positions 2 through 4 (zero-based)
df.loc[:, 'column_x':].head()  # select all columns starting from 'column_x'
```

5.
Add row ID / random row ID to each group

To add a row ID / random row ID for each group (by A, B), one can first append an ID / random ID to all rows:

```python
import numpy as np
# df: target dataframe
np.random.seed(0)  # set random seed
df['random_ID_all'] = np.random.permutation(df.shape[0])
df['ID_all'] = [i for i in range(1, df.shape[0] + 1)]
```

To then add an ID / random ID within each group (by A, B), one can do

```python
df['ID'] = df.groupby(['A', 'B'])['ID_all'].rank(method='first', ascending=True).astype(int)
df['random_ID'] = df.groupby(['A', 'B'])['random_ID_all'].rank(method='first', ascending=True).astype(int)
```

to get the per-group IDs.

6.
List all unique values in a group

Sometimes after we perform a group by, we'd like to aggregate the values in the target column as a list of unique values instead of max, min, etc. This is how it's done:

```python
df = pd.DataFrame(dict(A=['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                       B=[1, 1, 1, 2, 2, 1, 1, 1, 2],
                       C=['CA', 'NY', 'CA', 'FL', 'FL', 'WA', 'FL', 'NY', 'WA']))
tbl = df[['A', 'B', 'C']].drop_duplicates() \
                         .groupby(['A', 'B'])['C'] \
                         .apply(list) \
                         .reset_index()

# list to string (separated by commas)
tbl['C'] = tbl.apply(lambda x: ','.join([str(s) for s in x['C']]), axis=1)
```

If you'd like to save the result, don't forget to change the separator to anything other than commas.
7. Add row total and column total to a numerical dataframe

This is another common data manipulation. All you need is .apply():

```python
df = pd.DataFrame(dict(A=[2, 6, 3], B=[2, 2, 6], C=[3, 2, 3]))
df['col_total'] = df.apply(lambda x: x.sum(), axis=1)
df.loc['row_total'] = df.apply(lambda x: x.sum())
```

8.
Check memory usage

.memory_usage(deep=True) can be used on Pandas dataframes to see the amount of memory used (in bytes) for each column. It's useful when building machine learning models, where training may require a lot of memory.
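A quick sketch of the difference deep=True makes (the column names and data here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 3],
                   'str_col': ['a', 'bb', 'ccc']})

# Without deep=True, object columns only count the pointer size;
# deep=True also counts the memory of the Python objects themselves.
shallow = df.memory_usage()
deep = df.memory_usage(deep=True)
print(deep)
```

For numeric columns the two agree, but for string (object) columns the deep figure is the one that reflects actual memory consumption.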
9. Cumulative sum

From time to time, a cumulative sum is required when you generate statistical outcomes. Simply do

```python
df['cumulative_sum'] = df['target_column'].cumsum()
```
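A minimal sketch of the above ('target_column' stands in for whatever numeric column you have; the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'target_column': [1, 2, 3, 4]})
# Each row holds the running total of everything up to and including it.
df['cumulative_sum'] = df['target_column'].cumsum()
# -> 1, 3, 6, 10
```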
10. Crosstab

When you need to count the frequencies of groups formed by 3+ features, pd.crosstab() can make your life easier.
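A sketch with three made-up features (the column names and values are hypothetical): one feature goes on the rows and the other two form a multi-level column index, with cell values being frequency counts.

```python
import pandas as pd

df = pd.DataFrame(dict(airline=['Delta', 'Delta', 'AA', 'AA', 'Delta'],
                       departure=['SFO', 'SFO', 'LAX', 'SFO', 'LAX'],
                       on_time=[True, False, True, True, True]))

# Rows: airline; columns: (departure, on_time) pairs; cells: counts.
freq = pd.crosstab(index=df['airline'],
                   columns=[df['departure'], df['on_time']])
print(freq)
```

Passing a list of Series to either index or columns is what lets a single call handle three or more features without chaining groupby / unstack yourself.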
Thanks for reading! Comment below if you find bugs / better solutions.