Write Less, Explore More

To answer this, you need to balance the information density of your graphic against its meaning density.

The purpose of binning data is to create groups that will have significantly different means from each other (µ1 ≠ µ2 ≠ µ3 … ≠ µn).

In this case, the best graph overlays the distributions of the binned data so that you can quickly assess whether or not binning is appropriate.

That blue, green, red combination makes a gross color.

This is a reminder to yourself not to use those colors in the future.

It’s clear from this graph that binning this data won’t make it predictive of price, even if it’s binned at extreme quantiles.

I am confident in making this statement because there isn’t a significant difference in the distributions of the different groups, which is something I can tell just by looking at the graph.

Looking at the most meaningful graphic allowed me to understand the nuance in the data, which in this case means that I won’t end up with a meaningless predictor in our model.
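The overlay described above can be sketched in a few lines. This is only an illustration under stated assumptions: the DataFrame `df`, the columns `sqft` and `price`, the function name `overlay_binned_target`, and the quantile cut points are all hypothetical, and the data is synthetic noise rather than the dataset used in this post.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def overlay_binned_target(df, predictor, target, quantiles=(0.0, 0.1, 0.9, 1.0)):
    """Bin `predictor` at the given quantiles and overlay the target's
    distribution for each bin on a single set of axes."""
    bins = pd.qcut(df[predictor], q=list(quantiles), duplicates="drop")
    grouped = df.groupby(bins, observed=True)[target]

    fig, ax = plt.subplots()
    for label, values in grouped:
        # Unfilled outlines keep overlapping groups readable.
        ax.hist(values, bins=30, density=True, histtype="step", label=str(label))
    ax.set_xlabel(target)
    ax.set_ylabel("density")
    ax.legend(title=f"{predictor} bin")

    # Return the per-bin means too, so the picture can be checked against numbers.
    return ax, grouped.mean()


# Synthetic, made-up data: the predictor is unrelated to the target by construction.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft": rng.normal(1500, 300, size=1000),
    "price": rng.normal(200_000, 40_000, size=1000),
})

ax, means = overlay_binned_target(df, "sqft", "price")
print(means)
```

Because the synthetic predictor carries no information about the target, the outlines should sit almost on top of each other, which is the same visual cue that says binning will not help.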

Use simple graphics

The success of the previous examples lies in their simplicity.

At a glance you, and anyone else, are able to understand the relationships that are being described.

You should aim to simplify your graphics whenever possible, following the tenet of balancing information density to meaning density.

Automate your process

You’re probably not alone if you feel annoyed by the contradictory message so far.

I started this post by asking you to “write less” and “explore more”, but so far I’ve told you that you should write more and explore more.

The truth is that you can’t explore without some friction, which in our case is writing code.

So yes, you do have to write a little bit more code in order to explore more.

The key, however, is to increase the ratio of exploring to writing.

One of the best ways to do this is to reuse code effectively to make graphs automatically.

You should maximize the time that you spend exploring your data.

To do this, I employ ipywidgets.

Ipywidgets is a simple module that allows you to interact with the parameters of a function on-the-fly by embedding controls into your output.

It’s easier to explain with an example, so here’s a chunk of code and the resulting output.
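A cell of this kind might look roughly like the sketch below. It is not the original code from this post: the DataFrame `df`, its column names, and the helpers `violin` and `explore` are hypothetical, and the data is randomly generated. The only ipywidgets feature it relies on is `interact`, which turns a list of options into a dropdown.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: column names and values are made up for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "neighborhood": rng.choice(["north", "south", "east"], size=300),
    "condition": rng.choice(["poor", "fair", "good"], size=300),
    "price": rng.lognormal(mean=12, sigma=0.4, size=300),
    "area": rng.normal(1500, 300, size=300),
})

CATEGORICAL = ["neighborhood", "condition"]
NUMERIC = ["price", "area"]

# Every pairing of one option from each list is a plot you can reach.
N_PLOTS = len(CATEGORICAL) * len(NUMERIC)


def violin(x, y):
    """Draw one violin per category of `x`, showing the distribution of `y`."""
    labels, samples = [], []
    for name, values in df.groupby(x)[y]:
        labels.append(name)
        samples.append(values.to_numpy())

    fig, ax = plt.subplots()
    ax.violinplot(samples, showmedians=True)
    ax.set_xticks(range(1, len(labels) + 1))  # violins sit at positions 1..n
    ax.set_xticklabels(labels)
    ax.set_xlabel(x)
    ax.set_ylabel(y)


def explore():
    """Call from a Jupyter cell: one dropdown per argument, redrawn on change."""
    # Imported here so that violin() can also be used outside Jupyter.
    from ipywidgets import interact
    return interact(violin, x=CATEGORICAL, y=NUMERIC)
```

Calling `explore()` in a notebook cell renders the two dropdowns above the plot. Adding a column name to either list adds more reachable plots without touching the plotting code, which is where the ratio of exploring to writing starts to improve.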

If only this was interactive!

What’s not immediately apparent is that this chunk of code allows me to look at 60 different combinations of violin plots for this dataset! It’s a dizzying quantity, but the ability to choose the plot improves the likelihood that I will understand what I’m looking at.

The reason for this is simple: I’ve prepared myself to update my mental model, and I can browse through plots with purpose.

Now that I’ve shown you ipywidgets, I can come clean about the examples in the previous section.

Each one of those graphics was generated automatically using ipywidgets.

They were each part of a larger investigation: the first example examined discrete predictors and their relationships to the target; the second example examined data for neighbors.

If you’re interested in learning more about ipywidgets, I recommend Will Koehrsen’s Medium post titled “Interactive Controls in Jupyter Notebooks.”

He walks you through how to use ipywidgets step-by-step with loads of examples.

The documentation on ipywidgets is also very well written for those who want to go straight to the source.

Armed with ipywidgets, you should now feel comfortable exploring more of your data through simple graphics.

Good luck, and happy exploring!
