Reshaping Pandas DataFrames

To do this, I'll take our DataFrame and make the following adjustments:Remove the extra columns.

Drop rows where language value is 0.

Perform a sum() aggregate.

meltDF.

drop(columns=['Respondent', 'Salary'], inplace=True)meltDF = meltDF.

loc[meltDF['value'] == 1]meltDF = meltDF.

groupby(['Language']).

sum()meltDF.

sort_values(by=['value'], inplace=True, ascending=False)print(meltDF.

head(10))Output:LanguageJavaScript 35736SQL 29382Java 21451Python 19182PHP 14644Ruby 5512Swift 4037Go 3817Objective-C 3536R 3036Pivot TablesPivot tables allow us to view aggregates across two dimensions.

In the aggregation we performed above, we found the overall popularity of programming languages (1-dimensional aggregate).

To image what a 2-dimensional aggregate looks like, we’ll expand on this example to split our programming language totals into a second dimension: popularity by age group.

Visualizing a Pivot Table.

On the left is our melted DataFrame reduced to three columns: Age, Language, and Value (I’ve also shown the Respondent ID column for reference).

When we create a Pivot table, we take the values in one of these two columns and declare those to be columns in our new table (notice how the values in Age on the left become columns on the right).

When we do this, the Language column becomes what Pandas calls the ‘id’ of the pivot (identifier by row).

Our pivot table only contains a single occurrence of the values we use in our melted DataFrame ( JavaScript appears many times on the left, but once on the right).

Every time we create a pivot table, we’re aggregating the values in two columns and splitting them out two-dimensionally.

The difference between a pivot table and a regular pivot is that pivot tables always perform an aggregate function, whereas plain pivots do not.

Our pivot table now shows language popularity by age range.

This can give us some interesting insights, like how Java is more popular in kids under the age of 25 than Python (somebody needs to set these kids straight).

Here’s how we do this in Pandas:pivotTableDF = stackoverflowDF.

filter(items=['Age', 'Language', 'value'])pivotTableDF = pd.

pivot_table(stackoverflowDF, index='Language', columns='Age', values='value', aggfunc=np.

sum, margins=True)pivotTableDF.

sort_values(by=['All'], inplace=True, ascending=False)print(pivotTableDF)pd.

pivot_table() is what we need to create a pivot table (notice how this is a Pandas function, not a DataFrame method).

The first thing we pass is the DataFrame we'd like to pivot.

Then are the keyword arguments:index: Determines the column to use as the row labels for our pivot table.

columns: The original column which contains the values which will make up new columns in our pivot table.

values: Data which will populate the cross-section of our index rows vs columns.

aggfunc: The type of aggregation to perform on the values we’ll show.

count would give us the number of occurrences, mean would take an average, and median would… well, you get it (for my own curiosity, I used median to generate some information about salary distribution… try it out).

margins: Creates a column of totals (named “All”).

TransposeNow this is the story all about how my life got flipped, turned upside down.

That’s what you’d be saying if you happened to be a DataFrame which just got transposed.

Transposing a DataFrame simply flips a table on its side, so that rows become columns and vice versa.

Here’s an awful idea: let’s try this on our raw data!:stackoverflowDF = stackoverflowDF.

filter(items=['Respondent', 'Country', 'OpenSource', 'Employment', 'HopeFiveYears', 'YearsCoding', 'CurrencySymbol', 'Salary', 'Age'])stackoverflowDF = stackoverflowDF.

transpose()Transposing a DataFrame is as simple as df.

transpose().

The outcome is exactly what we'd expect:Of course, those are just the first 5 columns.

The actual shape of our DataFrame is now [9 rows x 49236 columns].

Admittedly, this wasn't the best example.

Stack and UnstackHave you ever run into a mobile-friendly responsive table on the web?.When viewing tabular data on a mobile device, some sites present data in a compact “stacked” format, which fits nicely on screens with limited horizontal space.

That’s the best way I can describe what stack() does, but see for yourself:stackoverflowDF = stackoverflowDF.

stack()print(stackoverflowDF)Output:Our data is now “stacked” according to our index.

See what I mean?.Stacking supports multiple indexes as well, which can be passed using the level keyword parameter.

To undo a stack, simply use df.

unstack().

Anyway, I’m sure you’re eager to hit the beach and show off your shredded data.

I’ll let you get to it.

You’ve earned it.

Originally published at https://hackersandslackers.

com on May 20, 2019.

.. More details

Leave a Reply