Cleaning, Analyzing, and Visualizing Survey Data in Python

Let’s write another function to do it for us.

I believe it was Jenny Bryan, in her wonderful talk “Code Smells and Feels,” who first tipped me off to the following: if you find yourself copying and pasting code and just changing a few values, you really ought to just write a function.

This has been a great guide for me in deciding when it is and isn’t worth it to write a function for something.

A rule of thumb I like to use is that if I would be copying and pasting more than 3 times, I write a function.
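As a rough sketch of what that looks like in practice, here is the kind of helper I mean: one function that replaces several near-identical copy-and-paste blocks. The DataFrame, the column names, and the helper name count_by_group are all made up for illustration; swap in whatever your own survey export uses.

```python
import pandas as pd


def count_by_group(df, question_cols, group_col="age_group"):
    """Count non-null answers per group for each column of a question.

    Hypothetical helper: it assumes unanswered options are stored as NaN/None,
    so .count() only tallies the people who actually picked that option.
    """
    return df.groupby(group_col)[question_cols].count()


# Made-up survey data; None marks an option the respondent did not select.
survey = pd.DataFrame({
    "age_group": ["18-29", "18-29", "30-44", "30-44", "45-60"],
    "use_excel": ["Excel", None, "Excel", "Excel", None],
    "use_sql": [None, "SQL", "SQL", None, "SQL"],
})

# One call per question instead of one pasted block per question.
counts = count_by_group(survey, ["use_excel", "use_sql"])
print(counts)
```

Once the helper exists, the next question is just another call with a different list of columns.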

There are also benefits other than convenience to this approach, such as that it:

- reduces the possibility for error (when copying and pasting, it’s easy to accidentally forget to change a value)
- makes for more readable code
- builds up your personal toolbox of functions
- forces you to think at a higher level of abstraction

(All of which improve your programming skills and make the people who need to read your code happier!)

Hooray, laziness!

This is, of course, generated data from a uniform distribution, and we would thus not expect to see any significant differences between groups.

Hopefully your own survey data will be more interesting.

Next, let’s address another format of question.

In this one, we need to see how interested each age group is in a given benefit.

Happily, these questions are actually easier to deal with than the former type.

Let’s take a look:

And look, since this is a small DataFrame, age_group is appended already and we won't have to add it.

Cool.

Now we have the subsetted data, but we can’t just aggregate it by count this time like we could with the other question. The last question had NaNs that would be excluded to give the true count for that response, but with this one, we would just get the number of responses for each age group overall:

This is definitely not what we want! The point of the question is to understand how interested the different age groups are, and we need to preserve that information.

All this tells us is how many people in each age group responded to the question.
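To make the problem concrete, here is a tiny sketch with made-up data. The column name sql_tutorials and the three interest labels are my own placeholders, not the real survey's wording.

```python
import pandas as pd

# Made-up responses: every respondent answered, so there are no NaNs to exclude.
responses = pd.DataFrame({
    "age_group": ["18-29", "18-29", "30-44", "30-44", "30-44"],
    "sql_tutorials": [
        "Very interested",
        "Not interested",
        "Somewhat interested",
        "Very interested",
        "Very interested",
    ],
})

# Counting per age group only tells us how many people answered,
# not which interest level they chose.
per_group = responses.groupby("age_group")["sql_tutorials"].count()
print(per_group)
```

The interest levels have vanished from the result entirely, which is exactly the information we were after.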

So what do we do? One way to go would be to re-encode these responses numerically.

But what if we want to preserve the relationship on an even more granular level? If we encode numerically, we can take the median and average of each age group’s level of interest.
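For reference, the numeric route might look something like this. The 0-to-2 scale, the labels, and the column names are assumptions I've picked for the sketch; use whatever ordering matches your own answer options.

```python
import pandas as pd

# Hypothetical mapping from answer text to an ordinal score.
interest_scale = {
    "Not interested": 0,
    "Somewhat interested": 1,
    "Very interested": 2,
}

# Made-up responses for a single benefit.
responses = pd.DataFrame({
    "age_group": ["18-29", "18-29", "30-44", "30-44", "30-44"],
    "sql_tutorials": [
        "Very interested",
        "Not interested",
        "Somewhat interested",
        "Very interested",
        "Very interested",
    ],
})

# Encode the text answers, then summarize each age group.
responses["sql_score"] = responses["sql_tutorials"].map(interest_scale)
summary = responses.groupby("age_group")["sql_score"].agg(["median", "mean"])
print(summary)
```

That gives one or two numbers per age group, which is compact but hides how the answers are spread across the levels.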

But what if what we’re really interested in is the specific percentage of people per age group who chose each interest level? It’d be easier to convey that info in a barplot, with the text preserved.

That’s what we’re going to do next.

And — you guessed it — it’s time to write another function.
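Here's a minimal sketch of the calculation such a function could wrap, assuming the same kind of made-up data as before. The name interest_percentages and the column names are hypothetical; pd.crosstab with normalize="index" is one way to get row-wise shares, not necessarily the only or best one.

```python
import pandas as pd


def interest_percentages(df, benefit_col, group_col="age_group"):
    """Percentage of each group that chose each interest level (rows sum to 100)."""
    return pd.crosstab(df[group_col], df[benefit_col], normalize="index") * 100


# Made-up responses for a single benefit.
responses = pd.DataFrame({
    "age_group": ["18-29", "18-29", "30-44", "30-44", "30-44"],
    "sql_tutorials": [
        "Very interested",
        "Not interested",
        "Somewhat interested",
        "Very interested",
        "Very interested",
    ],
})

pct = interest_percentages(responses, "sql_tutorials")
print(pct.round(1))
```

Because the text labels stay as column names, the result can be handed to a bar plot without any decoding step.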

Quick note to new learners: Most people won’t say this explicitly, but let me be clear on how visualizations are often made.

Generally speaking, it is a highly iterative process.

Even the most experienced data scientists don’t just write up a plot with all of these specifications off the top of their head.

Generally, you start with .plot(kind='bar'), or similar depending on the plot you want, and then you adjust the size and color map, get the groups properly sorted using order=, specify whether the labels should be rotated, hide the x- or y-axis labels, and more, depending on what you think is best for whoever will be using the visualizations.

So don’t be intimidated by the long blocks of code you see when people are making plots.

They’re usually created over a span of minutes while testing out different specifications, not by writing perfect code from scratch in one go.
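To show what I mean by iterating, here is a sketch of a first pass and a later pass on the same plot. The percentages are invented, and the specific tweaks (figure size, color, label rotation, titles) are just examples of the knobs you might end up turning.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented percentages for one age group and one benefit.
pct = pd.Series({
    "Not interested": 20.0,
    "Somewhat interested": 45.0,
    "Very interested": 35.0,
})

# First pass: the bare default, just to see the shape of the data.
first_ax = pct.plot(kind="bar")
plt.close(first_ax.figure)

# A later pass, after a few rounds of looking and adjusting.
fig, ax = plt.subplots(figsize=(6, 4))
pct.plot(kind="bar", ax=ax, color="steelblue", rot=0)  # rot=0 keeps labels horizontal
ax.set_ylabel("Percent of age group")
ax.xaxis.label.set_visible(False)  # the tick labels already say what the bars are
ax.set_title("Interest in SQL tutorials, ages 18-29")
fig.tight_layout()
```

Nothing in the second version is clever; it's just the first version plus a handful of small decisions made one at a time.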

Now we can plot another 2×2 for each benefit broken out by age group.

But we’d have to do that for all 4 benefits! Again: who has time for that? Instead, we’ll loop over each benefit, and each age group within each benefit, using a couple of for loops.

But if you're interested, I'd challenge you to refactor this into a function if you happen to have many questions that are formatted like this.
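A sketch of the nested-loop idea, under some assumptions: four age groups (hence the 2×2 grid), four benefit columns with names I've made up, and a small deterministic fake dataset. Your own plotting details will almost certainly differ.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

levels = ["Not interested", "Somewhat interested", "Very interested"]
age_groups = ["18-29", "30-44", "45-60", "60+"]  # hypothetical groups
benefits = [  # hypothetical column names
    "sql_tutorials",
    "drag_and_drop_features",
    "ready_made_formulas",
    "custom_dashboard_templates",
]

# Small fake dataset: six respondents per age group, answers cycled deterministically.
rows = []
for i, age in enumerate(age_groups):
    for j in range(6):
        row = {"age_group": age}
        for k, benefit in enumerate(benefits):
            row[benefit] = levels[(i + j * (k + 1)) % 3]
        rows.append(row)
survey = pd.DataFrame(rows)

figures = {}
for benefit in benefits:  # outer loop: one figure per benefit
    fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharey=True)
    for ax, age in zip(axes.flat, age_groups):  # inner loop: one panel per age group
        answers = survey.loc[survey["age_group"] == age, benefit]
        # Share of this age group at each level; reindex keeps the bar order fixed.
        pct = answers.value_counts(normalize=True).reindex(levels, fill_value=0) * 100
        pct.plot(kind="bar", ax=ax, rot=0)
        ax.set_title(age)
        ax.set_xlabel("")
    fig.suptitle(benefit)
    fig.tight_layout()
    figures[benefit] = fig
```

Reindexing by levels matters more than it looks: without it, each panel would sort its bars by frequency and the panels would stop being comparable.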

Success!

And if we wanted to export each individual set of plots, we would simply add the line plt.savefig('{}_interest_by_age.png'.format(benefit)), and matplotlib would automatically save a beautifully sharp rendering of each set of plots.

This makes it especially easy for folks on other teams to use your findings; you can simply export them to a plots folder, and people can browse the images and be able to drag and drop them right into a PowerPoint presentation or other report.
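Here is roughly how the export step could be wired up. The placeholder bars, the benefit names, and the dpi and bbox_inches settings are my own choices for the sketch; the sketch also writes into a temporary directory so it doesn't litter your project, whereas in real use you'd point out_dir at your actual plots folder.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

benefits = ["sql_tutorials", "drag_and_drop_features"]  # hypothetical names

# A "plots" folder inside a temp dir, created if it doesn't exist yet.
out_dir = os.path.join(tempfile.mkdtemp(), "plots")
os.makedirs(out_dir, exist_ok=True)

saved = []
for benefit in benefits:
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(["18-29", "30-44"], [40, 60])  # placeholder bars standing in for the real plot
    ax.set_title(benefit)

    # Same filename pattern as above, just joined onto the output folder.
    path = os.path.join(out_dir, "{}_interest_by_age.png".format(benefit))
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # free the figure once it's on disk
    saved.append(path)

print(saved)
```

Closing each figure after saving keeps memory in check when you're exporting dozens of plots in one loop.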

These could use a tad more padding, so if I were to do this again, I would increase the allowed height for the figure slightly.

Let’s do one more example: numerically encoding the benefits, as we mentioned earlier.

Then we can generate a heatmap of the correlations between interest in different benefits.

And lastly, we’ll generate the correlation matrix and plot the correlations.
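As a sketch of those two steps (encode, then correlate), assuming a three-level interest scale and column names that I've invented. The data here is randomly generated with a fixed seed, in the same spirit as the rest of this tutorial; the plotting call itself is left out, since the correlation matrix is the part that carries over to any heatmap function.

```python
import numpy as np
import pandas as pd

levels = ["Not interested", "Somewhat interested", "Very interested"]
# Hypothetical ordinal encoding: 0, 1, 2 in order of increasing interest.
interest_scale = {level: score for score, level in enumerate(levels)}

benefits = [  # hypothetical column names
    "sql_tutorials",
    "drag_and_drop_features",
    "ready_made_formulas",
    "custom_dashboard_templates",
]

# Uniformly random answers for 500 fake respondents.
rng = np.random.default_rng(42)
survey = pd.DataFrame({b: rng.choice(levels, size=500) for b in benefits})

# Map every text answer to its score, column by column.
encoded = survey.apply(lambda col: col.map(interest_scale))

# Pairwise Pearson correlations between benefits; this is what the heatmap shows.
corr = encoded.corr()
print(corr.round(2))
```

Pearson correlation on a three-point scale is a blunt instrument, so treat the numbers as a rough guide to which benefits move together rather than a precise measurement.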

Again, since the data is randomly generated, we would expect there to be little to no correlation, and that is indeed what we find.

(It is funny to note that SQL tutorials are slightly negatively correlated with drag-and-drop features, which is actually what we might expect to see in real data!)

Let’s do one last type of plot, one that’s closely related to the heatmap: the clustermap.

Clustermaps make correlations especially informative in analyzing survey responses, because they use hierarchical clustering to (in this case) group benefits together by how closely related they are.

So instead of eyeballing the heatmap for which individual benefits are positively or negatively associated, which can get a little crazy when you have 10+ benefits, the plot will be segmented into clusters, which is a little easier to look at.

You can also easily change the linkage type used in the calculation, if you’re familiar with the mathematical details of hierarchical clustering.

Some of the available options are ‘single’, ‘average’, and ‘ward’ — I won’t get into the details, but ‘ward’ is generally a safe bet when starting out.
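If you'd like to see what changing the linkage does without drawing anything, here is a sketch that runs the hierarchical clustering step directly with SciPy on a correlation matrix built from random, made-up scores. The benefit names are placeholders. If you are plotting with seaborn, its clustermap takes the same choice through its method argument.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

benefits = [  # hypothetical column names
    "sql_tutorials",
    "drag_and_drop_features",
    "ready_made_formulas",
    "custom_dashboard_templates",
]

# Random 0-2 interest scores for 500 fake respondents.
rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.integers(0, 3, size=(500, 4)), columns=benefits)
corr = scores.corr()

results = {}
for method in ["single", "average", "ward"]:
    # Each row of the correlation matrix is treated as one benefit's "profile";
    # the default Euclidean metric is what 'ward' requires.
    Z = linkage(corr.values, method=method)
    # no_plot=True returns the leaf ordering without needing matplotlib.
    order = dendrogram(Z, no_plot=True, labels=benefits)["ivl"]
    results[method] = (Z, order)
    print(method, order, Z[:, 2].round(2))  # third column holds the merge heights
```

The third column of each linkage matrix is the height at which two clusters merge, which is the quantity the dendrogram branches on a clustermap are drawn from.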

Long labels often require a little tweaking, so I’d recommend renaming your benefits to shorter names prior to using a clustermap.

A quick assessment of this shows that the clustering algorithm believes drag-and-drop features and ready-made formulas cluster together, while custom dashboard templates and SQL tutorials form another cluster.

Since the correlations are so weak, you can see that the “height” at which the benefits link together to form a cluster is very large.

(This means you should probably not base any business decisions on this finding!) Hopefully the example is illustrative despite the weak relationships.

I hope you enjoyed this quick tutorial about working with survey data and writing functions to quickly generate visualizations of your findings! If you think you know an even more efficient way of doing things, feel free to let me know in the comments. This is just what I came up with when I needed to produce insights on individual questions as quickly as possible.
