Become a Pro at Pandas, Python's Data Manipulation Library

Julien Kervizic · May 9

The pandas library is the most popular data manipulation library for Python.

It provides an easy way to manipulate data through its DataFrame API, inspired by R's data frames.

Understanding the pandas library

One of the keys to a good understanding of pandas is realizing that pandas is mostly a wrapper around a series of other Python libraries.

The main ones are NumPy, SQLAlchemy, Matplotlib, and openpyxl.

The core internal model of the DataFrame is a set of NumPy arrays; functions in pandas such as the now-deprecated as_matrix return results in that internal representation.

Pandas leverages other libraries to get data in and out of DataFrames: SQLAlchemy, for instance, is used through the read_sql and to_sql functions.

Openpyxl and XlsxWriter, in turn, are used for the read_excel and to_excel functions.

Matplotlib and Seaborn are instead used to provide an easy interface for plotting information available within a DataFrame, using commands such as df.plot().

Numpy’s Pandas — Efficient pandas

One of the complaints you often hear is that Python is slow, or that it is difficult to handle large amounts of data.

More often than not, this is due to poor efficiency in the code being written.

It is true that native Python code tends to be slower than compiled code, but libraries like pandas effectively provide an interface from Python code to compiled code.

Knowing how to interface with that compiled code properly lets us get the best out of pandas and Python.

APPLY & VECTORIZED OPERATIONS

Pandas, like its underlying library NumPy, performs vectorized operations more efficiently than loops.

This efficiency comes from vectorized operations being performed through compiled C code rather than native Python code, and from their ability to operate on entire datasets at once.

The apply interface allows you to gain some of this efficiency by using a CPython interface to do the looping:

df.apply(lambda x: x['col_a'] * x['col_b'], axis=1)

But most of the performance gain comes from the use of vectorized operations themselves, be it directly in pandas or by calling its internal NumPy arrays directly.

As you can see from the picture above, the difference in performance can be drastic between processing with a vectorized operation (3.53 ms) and looping with apply to do an addition (27.8 s).
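As a minimal sketch of the two approaches being compared (timings will vary by machine, and the column names here are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': np.random.rand(1_000_000),
                   'col_b': np.random.rand(1_000_000)})

# Vectorized: the addition runs over whole columns in compiled code
fast = df['col_a'] + df['col_b']

# Row-wise apply: loops over rows in Python, orders of magnitude slower
slow = df.apply(lambda x: x['col_a'] + x['col_b'], axis=1)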

Additional efficiencies can be obtained by invoking NumPy's arrays and API directly, as in the sketch below.

Swifter: swifter is a Python library that makes it easy to vectorize different types of operations on DataFrames; its API is fairly similar to that of the apply function.

Swiftapply — automatically efficient pandas apply operations: easily apply any function to a pandas dataframe in the fastest available manner (medium.com)
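A minimal sketch, reusing the same hypothetical columns, of dropping down to the underlying NumPy arrays:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': np.random.rand(1_000_000),
                   'col_b': np.random.rand(1_000_000)})

# .values exposes the underlying NumPy arrays, shaving off pandas overhead
fast_np = df['col_a'].values + df['col_b'].values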

EFFICIENT DATA STORING THROUGH DTYPES

When loading a DataFrame into memory, be it through read_csv, read_excel, or some other read function, pandas performs type inference, which can prove inefficient.

These APIs allow you to specify the type of each column explicitly.

This allows for more efficient storage of data in memory:

df.astype({'testColumn': str, 'testCountCol': float})

Dtypes are native NumPy objects, which allow you to define the exact type and number of bits used to store certain information.

NumPy's dtype np.dtype('int32'), for instance, represents a 32-bit integer. Pandas defaults integers to 64 bits; we can save half the space by using 32 bits instead. memory_usage() shows the number of bytes used by each of the columns: since there is only one entry (row) per column in the example, the size of each int64 column is 8 bytes and that of each int32 column is 4 bytes.
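A minimal sketch of this check, reusing the column names from the astype example above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'testColumn': [1], 'testCountCol': [2]})
print(df.memory_usage(index=False))  # each int64 column: 8 bytes

df = df.astype({'testColumn': np.int32, 'testCountCol': np.int32})
print(df.memory_usage(index=False))  # each int32 column: 4 bytes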

Pandas also introduces the categorical dtype, which allows for efficient memory utilization for frequently occurring values.

In the example below, we can see a 28x decrease in memory utilization for the field posting_date when we converted it to a categorical value.

In our example, the overall size of the DataFrame drops by more than 3x just by changing this data type.

Not only does using the right dtypes allow you to handle larger datasets in memory, it also makes some computations more efficient.

In the example below, we can see that using the categorical type brought a 3x speed improvement to the groupby/sum operation.
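A minimal sketch of the conversion and the aggregation (posting_date comes from the text; the amount column and the data are hypothetical):

import pandas as pd

df = pd.DataFrame({
    'posting_date': ['2019-01-01', '2019-01-02'] * 50_000,  # few distinct values
    'amount': range(100_000),
})

# Converting the frequently repeating field to the categorical dtype
df['posting_date'] = df['posting_date'].astype('category')
print(df.memory_usage(deep=True))  # memory for posting_date drops sharply

# Aggregations over the categorical column also run faster
df.groupby('posting_date')['amount'].sum()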

Within pandas, you can define the dtypes during the data load (read_*) or as a type conversion (astype).

CyberPandas: cyberpandas is one of several library extensions that enable a richer variety of datatypes, supporting IPv4 and IPv6 data types and storing them efficiently.

HANDLING LARGE DATASETS WITH CHUNKS

Pandas allows data to be loaded into a DataFrame in chunks; it is therefore possible to process DataFrames as iterators and to handle DataFrames larger than the available memory.

The combination of defining a chunksize when reading a data source and the get_chunk method allows pandas to process data as an iterator.

For instance, in the example shown above, the DataFrame is read two rows at a time.
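A minimal sketch of setting up such an iterator (the file name data.csv is hypothetical):

import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames
# rather than loading the whole file into memory
df_iter = pd.read_csv('data.csv', chunksize=2)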

These chunks can then be iterated through:

i = 0
for chunk in df_iter:
    # do some processing on each chunk
    i += 1
    new_chunk = chunk.apply(lambda x: do_something(x), axis=1)
    new_chunk.to_csv("chunk_output_%i.csv" % i)

The output can then be fed to a CSV file, pickled, exported to a database, etc. Setting up operations by chunks also allows certain operations to be performed through multi-processing.

Dask: Dask is, for instance, a framework built on top of pandas with multi-processing and distributed processing in mind.

It makes use of collections of chunks of pandas DataFrames, both in memory and on disk.
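A minimal sketch of that API (the file pattern and column names are hypothetical):

import dask.dataframe as dd

# Dask partitions the data into chunks of pandas DataFrames
ddf = dd.read_csv('data_*.csv')

# Operations build a lazy task graph; .compute() runs it, potentially in parallel
result = ddf.groupby('col_a')['col_b'].sum().compute()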

SQL Alchemy’s Pandas — Database Pandas

Pandas is also built on top of SQLAlchemy to interface with databases; as such, it is able to download datasets from diverse SQL databases as well as push records to them.

Using the SQLAlchemy interface rather than the pandas API directly allows us to do certain operations not natively supported within pandas, such as transactions or upserts.

SQL TRANSACTIONS

Pandas can also make use of SQL transactions, handling commits and rollbacks.

Pedro Capelastegui notably explained, in one of his blog posts, how pandas can take advantage of transactions through a SQLAlchemy context manager:

with engine.begin() as conn:
    df.to_sql(
        tableName,
        con=conn,
        ...
    )

The advantage of using a SQL transaction is that it will roll back should the data load fail.
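A fuller, runnable sketch of the same pattern, assuming a local SQLite database and a hypothetical table name:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///example.db')
df = pd.DataFrame({'col_a': [1, 2], 'col_b': [3, 4]})

# engine.begin() opens a transaction: it commits on success
# and rolls back automatically if to_sql raises an error
with engine.begin() as conn:
    df.to_sql('test_table', con=conn, if_exists='append', index=False)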

SQL EXTENSIONS

PandaSQL: Pandas has a few SQL extensions, such as pandasql, a library that allows you to perform SQL queries on top of DataFrames.

Through pandasql, DataFrame objects can be queried directly as if they were database tables.
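A minimal sketch of pandasql's sqldf entry point (the DataFrame and query are hypothetical):

import pandas as pd
from pandasql import sqldf

df = pd.DataFrame({'col_a': [1, 2, 3], 'col_b': ['x', 'y', 'x']})

# The DataFrame df is referenced in the query as if it were a table
result = sqldf("SELECT col_b, SUM(col_a) AS total FROM df GROUP BY col_b", locals())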

SQL UPSERTS

Pandas doesn't natively support upserts when exporting to SQL databases that support this operation.

Patches to pandas exist to enable this feature.

MatplotLib/Seaborn — Visual Pandas

Matplotlib and Seaborn visualizations are already integrated into some of the DataFrame APIs, such as through the .plot command.

There is fairly comprehensive documentation on how the interface works on pandas' website.
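A minimal sketch of the .plot interface (the data and output file name are hypothetical):

import pandas as pd

df = pd.DataFrame({'col_a': [1, 2, 3], 'col_b': [4, 2, 5]})

# .plot defers to Matplotlib under the hood and returns an Axes object
ax = df.plot(kind='line')
ax.figure.savefig('example_plot.png')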

Extensions: Different extensions exist, such as Bokeh and Plotly, to provide interactive visualizations within Jupyter notebooks, and it is also possible to extend Matplotlib to handle 3D graphs.

Other Extensions

Quite a few other extensions for pandas exist to handle non-core functionalities.

One of them is tqdm, which provides progress bar functionality for certain operations; another is PrettyPandas, which allows you to format DataFrames and add summary information.

tqdm

tqdm is a progress bar extension in Python that interacts with pandas; it allows users to see the progress of map and apply operations on a pandas DataFrame when using the relevant functions, progress_map and progress_apply (see the sketch below).

PrettyPandas

PrettyPandas is a library that provides an easy way to format DataFrames and to add table summaries to them.
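A minimal sketch of the tqdm integration (the DataFrame and the applied function are hypothetical):

import pandas as pd
from tqdm import tqdm

# Registers progress_apply / progress_map on pandas objects
tqdm.pandas()

df = pd.DataFrame({'col_a': range(10_000)})

# Behaves like .apply, but renders a progress bar while running
df['col_a'].progress_apply(lambda x: x * 2)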

More from me on Hacking Analytics:

- On the evolution of Data Engineering
- Airflow, the easy way
- E-commerce Analysis: Data-Structures and Applications
- Setting up Airflow on Azure & connecting to MS SQL Server
- 3 simple rules to build machine learning models that add value
