Visual Storytelling with Seaborn

Using Seaborn to Improve Your Data Visualizations

Reilly Meinert, Jun 11

Introduction

Data visualizations are a great way to turn a dataset full of numbers into a story.

This article will focus on creating visualizations using Seaborn.

It assumes that you already have some experience creating visualizations in Python, so check out this article if you’ve never done it before or if you want to brush up on your skills.

This tutorial will look at the data used in FiveThirtyEight’s article The Economic Guide to Picking a College Major.

The data used can be found here.

As a girl studying two different STEM fields (Statistics & Computer Science), I’m especially interested in the breakdown of women in STEM.

The gender gap in computer science became especially obvious to me when I started taking higher-level CS classes and I realized that I was grossly outnumbered by the boys in my classes.

This article will look at the breakdown of men and women in STEM majors visually, as well as the associated salaries for those particular majors, using Seaborn in Python.

Plotting Numerical Variables

Prior to beginning, ensure that you have Seaborn 0.9.0 or later installed, as several of these plots cannot be constructed with earlier versions.
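The snippets in this article assume a pandas DataFrame named df that already holds the data. As a minimal sketch of that setup, the code below reads a CSV with pandas. The tiny in-memory CSV and all of its values are made up for illustration; only the column names (Major_category, Total, Women, ShareWomen, Median) are taken from the plotting code in this article. With the real file you would pass its path to pd.read_csv() instead.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: the rows below are invented for illustration.
# Only the column names come from the plotting code in this article.
csv_text = """Major_category,Total,Women,ShareWomen,Median
Engineering,2000,400,0.20,70000
Engineering,1500,450,0.30,60000
Health,5000,4000,0.80,35000
Biology & Life Science,8000,4800,0.60,33000
"""

# With the downloaded data you would pass the file path here instead.
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)
print(df.head())
```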

Let’s start by constructing a simple scatter plot of the percent of women in a major versus the median salary of that particular major.

    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.relplot(x='ShareWomen', y='Median', data=df)  # df: DataFrame of the majors data
    plt.show()

The relplot() call is pretty simple.

Passing data = df as one of the parameters to the relplot() function lets the x and y parameters be given simply as column names from that DataFrame.

If you’re familiar with Matplotlib, one of the first things that you’ll probably notice is that this graph was automatically given axis labels, which is a nice feature that Seaborn provides.

The graph doesn’t have a title, but adding a plt.title() call prior to showing the graph will add one.

Seaborn is built on top of Matplotlib, so a lot of graph aesthetics can be modified using Matplotlib commands.
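As a small sketch of that point, the Matplotlib calls below add a title and rename the axis labels on a Seaborn scatter plot. The DataFrame is a made-up stand-in that reuses the article's column names, and the label text is just an example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values; only the column names match the article's data.
df = pd.DataFrame({
    "ShareWomen": [0.15, 0.30, 0.55, 0.80],
    "Median": [70000, 60000, 40000, 35000],
})

g = sns.relplot(x="ShareWomen", y="Median", data=df)

# Plain Matplotlib calls act on the axes that Seaborn just drew.
plt.title("Median Salary vs Share of Women")
plt.xlabel("Share of women in the major")
plt.ylabel("Median salary (USD)")
```

The same pattern should carry over to most other Matplotlib adjustments, such as axis limits or tick formatting.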

When looking at the actual graph, there’s a pretty apparent trend that majors with higher percentages of women have lower median salaries.

We need to do more exploration before we can make any assumptions about this.

    sns.set_palette('Paired', 10)
    sns.relplot(x='ShareWomen', y='Median', hue='Major_category', data=df)
    plt.title('Median Salary vs Share of Women in a Particular Major')
    plt.show()

The first line of code changes the color palette used for this graph and for all graphs created after it is run.

I chose a ColorBrewer color palette.

More information can be found here.
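To see what a named palette actually contains, here is a short sketch that asks Seaborn for the same 'Paired' palette and inspects it before making it the default. Nothing here is specific to the majors data.

```python
import seaborn as sns

# 'Paired' is one of the ColorBrewer qualitative palettes that Seaborn knows by name.
colors = sns.color_palette("Paired", 10)
print(len(colors))  # number of colors requested
print(colors[0])    # each entry is an (r, g, b) tuple with values between 0 and 1

# set_palette() makes these the default colors for plots drawn afterwards.
sns.set_palette("Paired", 10)
```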

The second line, the relplot() call, is the same as in the first plot we created, with the addition of the hue parameter.

Adding this colors each point on the graph according to the category of the major that the point represents.

Looking at this graph, a few things become apparent that we could not see before.

Engineering majors end up in jobs that have the highest median salary, but also have the lowest percentage of women in those majors.

On the other hand, Health majors end up in jobs that have a much lower median salary, but have the highest percentage of women in those majors.

Already we can see that significantly higher percentages of women end up studying in fields that have significantly lower median salaries than the fields their male counterparts tend to study.

You can also adjust the shape of a point to further emphasize differences among the points, as shown here:

    sns.relplot(x='ShareWomen', y='Median', hue='Major_category', style='Major_category', data=df)

This line of code changes the shape as well as the color of each point to emphasize the different major categories.

If you were to pass a different column name to style instead, the colors would still distinguish the major categories, while the shapes would distinguish the values of that additional column.

An important thing to note is that the eye is more likely to notice different colors than different shapes, so the more important variable should be represented by different colors.
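A rough sketch of that two-variable case follows. The 'Popularity' column is hypothetical and invented for this example (it is not part of the article's data), as are all of the values; color carries the major category and shape carries the second variable.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import pandas as pd
import seaborn as sns

# Made-up values; only the column names match the article's data.
df = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Health", "Health"],
    "ShareWomen": [0.15, 0.30, 0.75, 0.85],
    "Median": [70000, 60000, 40000, 35000],
    "Total": [2000, 12000, 3000, 15000],
})

# Hypothetical extra column, invented for this sketch: flag majors with many students.
df["Popularity"] = ["Large" if t > 10000 else "Small" for t in df["Total"]]

# Color carries the more important variable (category); shape carries the second one.
g = sns.relplot(x="ShareWomen", y="Median",
                hue="Major_category", style="Popularity", data=df)
```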

We’ve used color as a way to represent a categorical variable, but what if we want it to represent a continuous variable, like the total number of people in a major?

    sns.relplot(x='ShareWomen', y='Median', hue='Total', data=df)
    plt.title('Median Salary vs Share of Women in a Particular Major')
    plt.show()

There are a few things to notice here.

When it comes to the graph itself, the colors used to represent different totals of people are not in the color palette that we set before.

If you want to change this, you need to manually set the color mapping of the continuous variable.

More information on that can be found here.
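One way to set that mapping, sketched here, is to pass the name of a Matplotlib colormap through the palette parameter. The choice of 'viridis' and the data values are illustrative only; any sequential colormap could be substituted.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import pandas as pd
import seaborn as sns

# Made-up values; only the column names match the article's data.
df = pd.DataFrame({
    "ShareWomen": [0.15, 0.30, 0.55, 0.80],
    "Median": [70000, 60000, 40000, 35000],
    "Total": [2000, 15000, 90000, 40000],
})

# For a numeric hue column, a colormap name controls how values map to colors.
g = sns.relplot(x="ShareWomen", y="Median", hue="Total",
                palette="viridis", data=df)
```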

In terms of what the graph tells us, there are a few things to take away.

The majors with the highest median salaries are some of the least popular majors overall, which would make sense.

There is probably a higher demand for people in that field, and with low numbers of them available, it is logical that employers would want to pay people more for their expertise.

Additionally, there are a few majors that are exceptionally popular.

The two darkest points both have a majority of women in those fields.

However, this information isn’t the most useful, as we don’t know what category those two majors fall into.

Another way to represent this information is to have the size of the point represent the total number of people in the major.

    sns.relplot(x='ShareWomen', y='Median', hue='Major_category', size='Total', sizes=(15, 200), data=df)
    plt.title('Median Salary vs Share of Women in a Particular Major')
    plt.show()

This code also specifies the range of sizes of the points.

The default sizes are smaller, so they may not convey the information you are trying to get across as clearly.

This is an easy parameter to adjust, so it is probably wise to try a few options of sizes before you choose what is best for your particular visualization.

Looking at this graph, we can now see that the most popular major is one that falls into the Biology & Life Science category, which as a whole appears to have the lowest overall median salary.

Another thing to note about this category is that most of these points are between 50 and 60% women, so there are also a large number of men in this category who are making a lower median salary than their counterparts who are studying engineering.

This data is not the most conducive to a line graph, but creating one is a relatively simple process, which is shown below.

    sns.relplot(x='ShareWomen', y='Women', kind='line', data=df)
    plt.title('Share of Women vs Total Women in a Particular Major')
    plt.show()

The only real difference between creating a line plot and a scatterplot in Seaborn is the addition of the kind parameter, which is set equal to “line” when you want a line graph instead of a scatterplot.

The resulting graph isn’t the most helpful, but we can see more clearly what the previous graph told us in terms of total number of people in a particular major.

There are two majors that are significantly more popular than the rest, both of which have a majority of women studying that particular major.

However, we can’t get any more information from this graph, like the major category or median salary from that major.

It is still good for exploratory purposes though, and might be able to tell us which things we should investigate more later on.

Matplotlib has no single built-in call that plots a regression line on top of a scatterplot, but there is a function in Seaborn that’ll do this for you.

    sns.lmplot(x='ShareWomen', y='Median', data=df)
    plt.show()

Here, we can see that there is a definite downward trend in the relationship between the median salary of a major and the share of women in that major.

However, it is important to note that this does not indicate causation.

A higher number of women in a major does not mean that jobs in that major automatically pay less.

It simply means that there are fewer women studying in fields that have higher salaries.
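To make concrete what lmplot() saves you, here is a rough sketch of the manual route: fit a straight line with NumPy's polyfit() and draw it over a Matplotlib scatter plot. The data values are invented, so the printed slope says nothing about the real dataset. lmplot() also draws a confidence band by default, which this sketch omits.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Invented values standing in for the ShareWomen and Median columns.
share_women = np.array([0.10, 0.25, 0.40, 0.55, 0.70, 0.85])
median = np.array([72000, 65000, 52000, 45000, 40000, 34000])

# Least-squares straight line: median is approximated by slope * share_women + intercept.
slope, intercept = np.polyfit(share_women, median, deg=1)

fig, ax = plt.subplots()
ax.scatter(share_women, median)
xs = np.linspace(share_women.min(), share_women.max(), 100)
ax.plot(xs, slope * xs + intercept)
ax.set_xlabel("ShareWomen")
ax.set_ylabel("Median")

print(f"slope: {slope:.0f} dollars per unit share of women")
```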

Plotting Categorical Variables

A dataset like this one has a lot of categorical variables that we can further investigate.

Just like with numerical data, categorical data is easy to plot on a scatter plot.

However, scatterplots of categorical data are not constructed the same way as scatterplots of numeric data.

As shown below, these plots are created using the catplot() function, rather than the relplot() function.

    sns.catplot(x='Major_category', y='Median', data=df)
    plt.title('Median Salary of Different Major Categories')
    plt.show()

This code seems simple enough.

Instead of using a numerical variable on the x-axis, we use a categorical one.

In the graph produced by running this code, the x-axis labels are an obvious problem.

They overlap and become almost completely unreadable.

Luckily, there are two easy ways to fix this.

    # Option 1
    sns.catplot(x='Median', y='Major_category', data=df)
    plt.title('Median Salary of Different Major Categories')
    plt.show()

    # Option 2
    sns.catplot(x='Major_category', y='Median', data=df)
    plt.xticks(rotation=90)
    plt.show()

Figure: Option 1 (left) and Option 2 (right)

The first option is a little bit easier, as all it requires is switching the variables that are plotted on the x- and y-axes.

However, it is usually standard to put categorical variables on the x-axis.

In most cases, the second option is the more appropriate fix for this problem.

The labels also do not have to be rotated the full 90 degrees.

If you want the labels at an angle rather than straight up and down, simply set rotation equal to a smaller number, like 30 or 45 rather than 90.
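A sketch of the angled variant is below. The category labels and salary values are illustrative stand-ins, and the ha='right' alignment is an extra choice added in this example (it is not in the original code) so that each slanted label ends at its tick.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values; the long labels are what make rotation necessary.
df = pd.DataFrame({
    "Major_category": ["Engineering", "Health", "Biology & Life Science",
                       "Physical Sciences", "Computers & Mathematics"],
    "Median": [65000, 36000, 35000, 42000, 48000],
})

g = sns.catplot(x="Major_category", y="Median", data=df)

# Tilt the labels 45 degrees and right-align them so each label ends at its tick.
plt.xticks(rotation=45, ha="right")
plt.tight_layout()  # make room for the slanted labels
```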

Let’s say we’re interested in the majors that have a majority of women in addition to the median salaries for majors in each category.

Passing one additional parameter while constructing the scatterplot, as shown below, will show us what we’re looking for.

    sns.catplot(x='Major_category', y='Median', hue='Gender Majority', data=df)
    plt.title('Median Salary of Different Major Categories')
    plt.xticks(rotation=90)
    plt.show()

We can immediately see that engineering majors, which have the highest median salary, all have a minority of women.

Health and biology/life science majors, on the other hand, have a majority of women in these majors and have the lowest median salary.

An interesting, and possibly unexpected, observation from this graph is that women make up the majority of several physical science majors and also hold some of the higher median salaries out of all the different majors behind engineering majors.
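The 'Gender Majority' column used for hue above is not created anywhere in the code shown in this article. One plausible way to derive it from ShareWomen is sketched below; the 0.5 threshold and the label strings 'Women' and 'Men' are assumptions, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names match the article's data.
df = pd.DataFrame({
    "Major_category": ["Engineering", "Health", "Physical Sciences"],
    "ShareWomen": [0.20, 0.85, 0.55],
})

# Assumed rule: label each major by whichever gender makes up more than half of it.
# The label strings 'Women' and 'Men' are also an assumption.
df["Gender Majority"] = np.where(df["ShareWomen"] > 0.5, "Women", "Men")
print(df)
```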

This scatterplot looks a little bit messy, with the different points appearing to be a bit randomly placed within their category.

There are a few ways to handle this.

    # Option 1
    sns.catplot(x='Major_category', y='Median', jitter=False, data=df)
    plt.xticks(rotation=90)
    plt.show()

    # Option 2
    sns.catplot(x='Major_category', y='Median', kind='swarm', data=df)
    plt.xticks(rotation=90)
    plt.show()

Figure: Option 1 (left) and Option 2 (right)

The first option, which has jitter = False as one of its parameters, shows all of the median salaries for each major in each major category in a single vertical line.

This graph looks much cleaner than our original scatter plot that displayed this information, but there is some overlap of points, so we cannot necessarily see all the information as clearly as we would like.

The second option, which has kind = “swarm” as one of its parameters, also lines the points up vertically, but has points with the same median salary lined up horizontally within their respective major category.

By looking at the second option, we can clearly see that there are 6 different engineering majors that result in a median salary of $60,000, which we could not see in any of the other scatterplots we have made so far.

How you choose to represent your points will rely heavily on what information you are trying to get across with your visualizations.

If we’d like to get a better idea of the distributions of our data, we have several options.

The first and most common is probably a boxplot.

Boxplots show the quartiles and median of the part of our data we are interested in.

Boxplots can be constructed as follows:

    sns.catplot(x='Major_category', y='Median', kind='box', data=df)
    plt.xticks(rotation=90)
    plt.show()

This visualization tells us some things that our scatterplots were not able to tell us just by looking at them.

We can now see the median of the median salaries.

The median salary for all health majors is lower than that for biology/life science majors, but the maximum salary for health majors is higher than the maximum for biology/life science majors.

Additionally, the lowest salary for engineering majors is a bit higher than the median for physical science majors.

Another interesting takeaway from this graph is that the top 75% of salaries for computer/mathematics majors are higher than 50% of physical science majors.

In the previous scatterplots, it seemed like the median salaries were much closer than they actually are.
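If you want the numbers behind the boxes rather than reading them off the plot, the quartiles can be computed directly with pandas. This is a sketch on invented salaries for two categories, not the article's actual figures.

```python
import pandas as pd

# Invented salaries; only the column names match the article's data.
df = pd.DataFrame({
    "Major_category": ["Engineering"] * 5 + ["Health"] * 5,
    "Median": [50000, 57000, 60000, 65000, 110000,
               30000, 33000, 35000, 40000, 70000],
})

# 25th, 50th and 75th percentiles per category: the edges and center line of each box.
quartiles = (df.groupby("Major_category")["Median"]
               .quantile([0.25, 0.5, 0.75])
               .unstack())
print(quartiles)
```

Comparing one category's 25th percentile with another category's 50th percentile is the numeric version of the box-to-box comparisons made above.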

If you want to get a better idea of the distributions of salaries within the box plots, there are a few different things you can do.

You’ll probably want to use a boxen plot or a violin plot, both of which show the distribution of values within a boxplot.

Both are easy to create in Seaborn.

All you have to do is set the kind parameter equal to either “boxen” or “violin”.

    # Boxen Plot
    sns.catplot(x='Major_category', y='Median', kind='boxen', data=df)
    plt.title('Boxen Plot')
    plt.xticks(rotation=90)
    plt.show()

    # Violin Plot
    sns.catplot(x='Major_category', y='Median', kind='violin', data=df)
    plt.title('Violin Plot')
    plt.xticks(rotation=90)
    plt.show()

Both these plots look really cool, but I would advise against using them unless your audience knows what they’re actually looking at.

They show the same information as a box plot, which most people have seen before, in a way that most people have never seen.

However, it might be a good idea to split the violin plot based on a category, such as which gender is the majority in a particular major.

    # split takes a boolean, so pass True rather than the string 'True'
    sns.catplot(x='Major_category', y='Median', kind='violin', hue='Gender Majority', split=True, data=df)
    plt.xticks(rotation=90)
    plt.show()

Here, we can again see that women are not the majority in any engineering major, but they do hold the majority in quite a few of the physical science majors.

While they do have the majority in the lowest paid physical science major, they also have it in the highest paid one, which you wouldn’t be able to tell from the previous violin plot.

If you are constructing plots like this one, it is essential that you understand the information it is portraying.

Being able to create a cool-looking visualization means nothing if you cannot understand what the visualization is telling you.

If you’ve used Matplotlib before, you probably know that it is a bit more difficult to create a bar graph than one would expect.

That is not the case when using Seaborn.

In your catplot() function, all you have to do is add kind = “count” and it’ll construct a simple bar graph for you.
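A minimal sketch of such a call is below, on made-up rows. Note that a count plot takes only one variable (here x), since the bar heights are the row counts rather than a column from the data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows: one row per major, labeled with its category.
df = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Engineering",
                       "Health", "Health", "Physical Sciences"],
})

# kind='count' needs only one variable: it counts the rows in each category.
g = sns.catplot(x="Major_category", kind="count", data=df)
plt.xticks(rotation=90)

# The same numbers, without a plot.
counts = df["Major_category"].value_counts()
print(counts)
```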

The engineering category contains the largest number of different majors, which is not something that we could tell from our box plots.

We could see that engineering majors had the widest range of salaries, but it is quite possible that the sheer number of engineering majors played a role in that.

It was much more difficult to see the number of different majors for the other categories, but now it is clearly laid out for us in this visualization.

The other four categories are all close in the number of majors in each category, but we had no way of easily telling which category had the most majors within it.

Conclusion

With a bit of experience, Seaborn is easy to use and very powerful in creating data visualizations.

Visualizations can be used to turn a dataset into a story by making it easier to understand, as well as providing the creator and their audience with real insights about the data.

There are several different types of visualizations one can create for several different types of data.

All the code used to create the visualizations shown here, as well as some additional visualizations created using Seaborn, can be found on my Github.
