Make your Pandas apply functions faster using Parallel Processing

The parallelize_dataframe function breaks the dataframe into n_cores parts and spawns n_cores processes, each of which applies the function to one piece.

Once the function has been applied to all the splits, it concatenates them and returns the full dataframe to us.
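The full code is in the kernel linked below; a minimal sketch of such a function, assuming the standard multiprocessing.Pool pattern, would look like this:

from multiprocessing import Pool

import numpy as np
import pandas as pd

def parallelize_dataframe(df, func, n_cores=4):
    # Split the dataframe into n_cores roughly equal pieces
    df_split = np.array_split(df, n_cores)
    # Spawn n_cores processes and apply func to each piece in parallel;
    # func must be a top-level function so it can be pickled
    with Pool(n_cores) as pool:
        results = pool.map(func, df_split)
    # Concatenate the processed pieces back into the full dataframe
    return pd.concat(results)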

How can we use it?

It is pretty simple to use.
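You pass it your dataframe and the function you want to apply. The add_features below stands in for any function that takes a dataframe and returns a dataframe; this particular body (and the text column it assumes) is made up for illustration:

def add_features(df):
    # Hypothetical features; assumes the dataframe has a 'text' column
    df["text_len"] = df["text"].str.len()
    df["word_count"] = df["text"].str.split().str.len()
    return df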

train = parallelize_dataframe(train_df, add_features)

Does this work?

To check the performance of this parallelize function, I ran the %%timeit magic on it in my Jupyter notebook in a Kaggle kernel, versus just using the function as it is.
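The two timed cells amount to something like this (the actual timings from the kernel are not reproduced here):

%%timeit
train = parallelize_dataframe(train_df, add_features)

%%timeit
train = add_features(train_df)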

I gained some performance just by using the parallelize function.

And this was on a Kaggle kernel, which only has 2 CPUs.
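If you want the number of processes to match whatever machine you are on, you can detect it at runtime; cpu_count is from the standard library, and n_cores is the parameter from the sketch above:

from multiprocessing import cpu_count

train = parallelize_dataframe(train_df, add_features, n_cores=cpu_count())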

In the actual competition there was a lot of computation involved, and the add_features function I was using was much more complex.

And this parallelize function helped me immensely to reduce processing time and get a Silver medal.

Here is the kernel with the full code.

Conclusion

Parallelization is not a silver bullet; it is buckshot.

It won’t solve all your problems, and you will still have to work on optimizing your functions, but it is a great tool to have in your arsenal.

Time never comes back, and sometimes we are short of it.

At times like these, we should be able to reach for parallelization easily.

Also, if you want to learn more about Python 3, I would like to call out an excellent course on intermediate-level Python from the University of Michigan.

Do check it out.

I am going to be writing more beginner-friendly posts in the future too.

Let me know what you think about the series.

Follow me on Medium or subscribe to my blog to be informed about them.

As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.
