4 Unique Methods to Optimize your Python Code for Data Science

You might be wondering how all of this applies to data science projects.

Well, you might have noticed that a lot of the time we have to execute the same query on a large number of data points.

This is especially true during the data pre-processing stage.

It’s essential that we use some optimized techniques instead of basic programming to get things done as quickly and efficiently as possible.

So, here I will share some of the best techniques that I use to improve and optimize my Python code.

1. Pandas.apply() – A Feature Engineering Gem

Pandas is already a highly optimized library, but most of us still do not make the best use of it.

Think about the common places in a data science project where you use it.

One place I can think of is feature engineering, where we create new features from existing ones.

One of the most effective ways to do this is using Pandas.apply().

Here, we can pass a user-defined function and apply it to every single data point of the Pandas series.

It is one of the most useful features of the Pandas library, because the function we pass can encode whatever conditions the task requires.

We can then efficiently use it for data manipulation tasks.

Let’s use the Twitter sentiment analysis data to calculate the word count for each tweet.

We will be using different methods, like the dataframe iterrows method, NumPy array, and the apply method.

We’ll then compare their running times.

You can download the data set from here.
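A minimal sketch of the three approaches, using a tiny made-up DataFrame in place of the Twitter data. The `tweet` column name and the `word_count` helper are assumptions for illustration, not taken from the original dataset:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Twitter data; the column name `tweet` is an assumption.
df = pd.DataFrame({"tweet": ["I love this phone", "worst service ever", "ok"]})


def word_count(text):
    """Return the number of whitespace-separated tokens in a string."""
    return len(text.split())


# 1. iterrows: an explicit Python loop that builds a Series object per row.
counts_iterrows = [word_count(row["tweet"]) for _, row in df.iterrows()]

# 2. NumPy array: loop over the raw values, avoiding the per-row Series overhead.
counts_numpy = np.array([word_count(t) for t in df["tweet"].to_numpy()])

# 3. Series.apply: hand the function to pandas and let it walk the column.
counts_apply = df["tweet"].apply(word_count)

# Store the result as a new feature.
df["word_count"] = counts_apply
```

To compare running times on a real dataset, each of the three statements can be wrapped in `timeit` (or `%timeit` in a notebook); the absolute numbers will depend on the data size and the machine.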

  You might have noticed that the apply function is much faster than the iterrows function.

Its performance is comparable to the NumPy array but the apply function provides much more flexibility.

You can read more about its documentation here.

2. Pandas.DataFrame.loc – A Brilliant Hack for Data Manipulation in Python

This is one of my favorite hacks of the Pandas library.

I feel this is a must-know method for data scientists who deal with data manipulation tasks (so almost everyone then!).

Most of the time we are required to update only some values of a particular column in a dataset based upon some condition.

Pandas.DataFrame.loc gives us the most optimized solution for these kinds of problems.

Let’s solve a problem using loc.

You can download the dataset we’ll be using here.


Check the value counts of the ‘City’ variable.

Now, let’s say we want to keep only the top 5 cities and replace the rest with ‘Others’.

So let’s do that.
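A minimal sketch of this update with loc, assuming a made-up DataFrame in place of the article's dataset. Only the ‘City’ column name comes from the text; the city names and counts are invented for illustration:

```python
import pandas as pd

# Made-up city frequencies standing in for the real dataset.
city_counts = {
    "Delhi": 7, "Mumbai": 6, "Pune": 5, "Chennai": 4,
    "Kolkata": 3, "Jaipur": 2, "Surat": 1,
}
df = pd.DataFrame(
    {"City": [name for name, n in city_counts.items() for _ in range(n)]}
)

# Check the value counts and keep the labels of the five most frequent cities.
top_cities = df["City"].value_counts().nlargest(5).index

# Boolean mask selects the rows to change; loc updates only those rows in place.
df.loc[~df["City"].isin(top_cities), "City"] = "Others"
```

The first argument to loc is the row condition and the second is the column to update, so the assignment touches only the matching cells without any explicit loop.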

See how easy it was to update the values? This is one of the most efficient ways to solve a data manipulation task of this kind.

3. Vectorize your Functions in Python

Another way to get rid of slow loops is by vectorizing the function.

This means that a newly created function will be applied on a list of inputs and will return an array of results.

Vectorizing in Python can speed up your computation considerably compared with row-by-row loops, though the exact gain depends on the function and the data.

Let’s verify this on the same Twitter sentiment analysis dataset.
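A minimal sketch of vectorizing a function, assuming `np.vectorize` as the tool and the same made-up `tweet` column and `word_count` helper as stand-ins for the real data. Note that the NumPy documentation describes `np.vectorize` as a convenience wrapper that is essentially a loop internally, so most of the gain over iterrows comes from avoiding per-row pandas overhead:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Twitter data; the column name `tweet` is an assumption.
df = pd.DataFrame({"tweet": ["I love this phone", "worst service ever", "ok"]})


def word_count(text):
    """Return the number of whitespace-separated tokens in a string."""
    return len(text.split())


# Wrap the scalar function so it accepts a whole array and returns an array.
vec_word_count = np.vectorize(word_count)

# One call processes the entire column; no explicit loop in user code.
df["word_count"] = vec_word_count(df["tweet"].to_numpy())
```

For purely numeric columns, operating on the arrays directly (for example `df["a"] * df["b"]`) is usually faster still, because the loop then runs in compiled code.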

Incredible, right? For the above example, vectorization is 80 times faster! This not only helps to speed up our code but also makes it cleaner.

4. Multiprocessing in Python

Multiprocessing is the ability of a system to support more than one processor at the same time.

Here, we break our process into multiple tasks and all of them run independently.

Even the apply function looks slow when we are working with huge datasets.

So, let’s see how we can make use of the multiprocessing library in Python and speed things up.

We will create one million points at random and calculate the number of divisors for each point.

We will compare the performance of the apply function and the multiprocessing method.
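A minimal sketch of that comparison, assuming `multiprocessing.Pool` and a simple trial-division divisor counter. The helper names, the four worker processes, and the reduced sample size (20,000 numbers instead of one million, to keep the run short) are all assumptions for illustration:

```python
import random
from multiprocessing import Pool

import pandas as pd


def count_divisors(n):
    """Count the positive divisors of n by trial division up to sqrt(n)."""
    count = 0
    i = 1
    while i * i <= n:
        if n % i == 0:
            # i and n // i are both divisors; count once when they coincide.
            count += 1 if i * i == n else 2
        i += 1
    return count


def divisor_counts_parallel(numbers, processes=4):
    """Spread the calls to count_divisors across worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(count_divisors, numbers)


if __name__ == "__main__":
    random.seed(0)
    df = pd.DataFrame(
        {"number": [random.randint(1, 10_000) for _ in range(20_000)]}
    )

    # Single-process baseline using apply.
    serial = df["number"].apply(count_divisors)

    # Same computation split across several processes.
    df["divisors"] = divisor_counts_parallel(df["number"].tolist())

    # Both approaches must agree; only the running time differs.
    assert serial.tolist() == df["divisors"].tolist()
```

The `if __name__ == "__main__":` guard matters here: on platforms that start workers by re-importing the main module, it prevents each worker from trying to create its own pool.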

Here, multiprocessing generates the output 13 times faster than the apply method.

The gain will vary with the hardware, especially the number of CPU cores, but for CPU-heavy work like this multiprocessing usually improves performance.

End Notes

This is by no means an exhaustive list.

There are many other methods and techniques to optimize Python code.

But I’ve found and used these four a LOT during my data science career and I believe you’ll find them useful too.

Are there any other methods you use to optimize your code? Do share those with us and the community in the comments section below!

And as I mentioned earlier, you should check out our popular courses if you’re new to Python and data science:

Python for Data Science Free Course

Introduction to Data Science (using Python)
