Data Science Code Refactoring Example

John DeJesus, Mar 10

When learning to code for data science, we don’t usually consider the idea of modifying our code to reap a particular benefit in terms of performance.

We code to modify our data, produce visualizations, and construct our ML models.

But if our code is going to be used in a dashboard or app, we have to consider whether it is optimal.

In this code example, we will make a small modification to an ecdf function for speed.

If you are not sure what an ECDF (empirical cumulative distribution function) is, you can check out my blog post on it for more details.
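As a quick numeric reminder, an ECDF pairs each sorted observation with the fraction of the data at or below it. The minimal sketch below computes those pairs for a small made-up sample; the array `sample` and the helper name `ecdf_values` are illustrative assumptions and do not come from the original post.

```python
import numpy as np

def ecdf_values(data):
    """Return the sorted values and the cumulative fraction at each one."""
    x = np.sort(data)
    n = len(x)
    y = np.arange(1, n + 1) / n
    return x, y

# Hypothetical sample, chosen only for illustration
sample = np.array([1.5, 0.9, 1.2, 1.1])
x, y = ecdf_values(sample)
print(x.tolist())  # [0.9, 1.1, 1.2, 1.5]
print(y.tolist())  # [0.25, 0.5, 0.75, 1.0]
```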

Here is a quick visual example for your convenience.

ECDF plot made with the function we will write below.

The data we will use to plot the ECDF above covers avocado prices from 2015 to 2018.

Wait, what exactly is code refactoring anyway?

Code refactoring is the modification of code to improve its readability and performance.

For the performance part, this means adjusting your code to either reduce memory usage or shorten run time.

First, let us get the imports and data loading done.

    # Load Libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import time

    # Load the data from data.world
    avocado = pd.read_csv('https://query.data.world/s/qou5hvocejsu4qt4qb2xlndg5ntzbm')

Seaborn is here by preference. When I create plots in a Jupyter notebook, I also use seaborn to set the plot backgrounds using sns.set().

Now that our imports and data are set up let's check out our ecdf plotting function.

    # Create a function for computing and plotting the ECDF with default parameters
    def plot_ecdf(data, title='ECDF Plot', xlabel='Data Values', ylabel='Percentage'):
        """
        Function to plot ecdf taking a column of data as input.
        """
        xaxis = np.sort(data)
        yaxis = np.arange(1, len(data)+1)/len(data)
        plt.plot(xaxis, yaxis, linestyle='none', marker='.')
        plt.title(title)
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.margins(0.02)

For the speed refactoring, we are going to focus on where the yaxis is defined.

    yaxis = np.arange(1, len(data)+1)/len(data)

Notice that we call len() twice to construct the yaxis.

This causes a small but unnecessary increase in run time.

To remedy this, we will refactor our yaxis code into the following:

    length = len(data)
    yaxis = np.arange(1, length+1)/length

In the above code, we:

- assigned the variable length to hold the data length value.
- replaced the calls to len() with the variable length we defined beforehand.
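A refactor should leave the result unchanged. As a quick sanity check, the sketch below compares both versions of the yaxis expression on made-up data. The array `data` here is a hypothetical stand-in for the avocado price column, not the real dataset.

```python
import numpy as np

# Hypothetical stand-in for a column of data
data = np.array([1.33, 0.93, 1.28, 1.08, 1.35])

# Original version: len() is called twice
yaxis_before = np.arange(1, len(data) + 1) / len(data)

# Refactored version: len() is called once and the result is reused
length = len(data)
yaxis_after = np.arange(1, length + 1) / length

# Both expressions should produce exactly the same values
print(np.array_equal(yaxis_before, yaxis_after))  # True
```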

Now to look at our function with these changes:

    # ECDF plot function with modifications
    def plot_ecdf_vtwo(data, title='ECDF Plot', xlabel='Data Values', ylabel='Percentage'):
        """
        Function to plot ecdf taking a column of data as input.
        """
        xaxis = np.sort(data)
        length = len(data)
        yaxis = np.arange(1, length+1)/length
        plt.plot(xaxis, yaxis, linestyle='none', marker='.')
        plt.title(title)
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.margins(0.02)

Pic from CollegeDegrees360 on Flickr

But now we have another line of code! How does this improve our run time?

Yes, we did use another line of code.

It also uses a bit more memory than our last version of the function.

But now our function will generate a plot faster than before.

To measure the improvement, we will use the time module that we imported earlier.

Let’s take a look.

    # Generating Run Time of plot_ecdf
    start1 = time.time()
    plot_ecdf(avocado['AveragePrice'])
    end1 = time.time()
    diff1 = end1-start1
    print(diff1)

    0.04869723320007324

So the first version clocks in at about 5 hundredths of a second.

Now to see if our refactored version is really an improvement.

    # Generating Run Time of plot_ecdf_vtwo
    start2 = time.time()
    plot_ecdf_vtwo(avocado['AveragePrice'])
    end2 = time.time()
    diff2 = end2-start2
    print(diff2)

    0.019404888153076172

Great! We improved the run time of our plot function by about 3 hundredths of a second, from roughly 0.05 to 0.02!

So how is this beneficial in the long run?

Suppose this was a plot function that was going into a dashboard, one that you are making for your employer. What if this was instead a function for an ML model that is going into an app?

These are a couple of cases when faster run time is important.

Hundredths of seconds may not seem like much.

But if you consider how often the function will be called, that adds up to a large amount of time saved for the users of the dashboard or app.

It will make your product feel as quick as lightning to them.
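Because the saving per call is small, a single time.time() reading can be swayed by one-off costs such as setting up the first figure. One way to get a steadier estimate is to repeat the measurement many times with the standard library's timeit module. The sketch below times only the yaxis computation on randomly generated data; the array size, the number of runs, and the function names are assumptions for illustration, and the numbers will vary from machine to machine.

```python
import timeit
import numpy as np

# Hypothetical data standing in for the avocado price column
rng = np.random.default_rng(0)
data = rng.random(18_000)

def yaxis_two_len(data):
    # Original expression: len() is called twice
    return np.arange(1, len(data) + 1) / len(data)

def yaxis_one_len(data):
    # Refactored expression: len() is called once and reused
    length = len(data)
    return np.arange(1, length + 1) / length

# Run each version 1,000 times and report the average time per call
runs = 1_000
t_two = timeit.timeit(lambda: yaxis_two_len(data), number=runs) / runs
t_one = timeit.timeit(lambda: yaxis_one_len(data), number=runs) / runs
print(f"two len() calls: {t_two:.2e} s per call")
print(f"one len() call:  {t_one:.2e} s per call")
```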

Pic from GoodFreePhotos

Cool! How can I learn more code refactoring?

If you need more examples, check out this awesome blog post by Julian Sequeira of PyBites.

Review your own code from old projects.

Find some of your code where you use a function more than once.

Then try to refactor the code so the function is only called once.

If you would like to see me go through this same example verbally, you can watch the short YouTube video version of this post.

Here is the notebook for this post and video as well.
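To make the exercise of finding a function that is called more than once concrete, here is a minimal sketch of the same idea applied to a different calculation. The `min_max_scale` helpers and the sample values are hypothetical and are not taken from the post or its notebook.

```python
import numpy as np

def min_max_scale(values):
    # Before: values.min() is computed twice
    return (values - values.min()) / (values.max() - values.min())

def min_max_scale_vtwo(values):
    # After: the minimum is computed once and reused
    low = values.min()
    return (values - low) / (values.max() - low)

# Hypothetical sample values
values = np.array([2.0, 4.0, 6.0, 10.0])
print(min_max_scale_vtwo(values).tolist())  # [0.0, 0.25, 0.5, 1.0]
```

As with the ECDF function, the output is the same before and after; only the repeated call is removed.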

Quick shoutout before we go.

Tony Fischetti is a data scientist at the NY Public Library.

He was the first mentor I had in the data science world.

He was the one who made me more aware of code refactoring.

Thanks for reading! I hope you enjoyed this example of code refactoring and that you will consider refactoring your own code as well.

What code do you think you will be refactoring?

Until next time,
John DeJesus
