3 simple ways to handle large data with Pandas

Instead of trying to handle our data all at once, we’re going to do it in pieces.

Typically, these pieces are referred to as chunks.

A chunk is just a part of our dataset.

We can make that chunk as big or as small as we want.

It just depends on how much RAM we have.

The process then works as follows:

1. Read in a chunk
2. Process the chunk
3. Save the results of the chunk
4. Repeat steps 1 to 3 until we have all of the chunk results
5. Combine the chunk results

We can perform all of the above steps using a handy parameter of the read_csv() function called chunksize.

The chunksize refers to how many CSV rows pandas will read at a time.

This will of course depend on how much RAM you have and how big each row is.
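For example, here's a minimal sketch of reading in chunks. The file name data.csv and the chunk size of 100,000 rows are just placeholders to adjust for your own machine:

```python
import pandas as pd

# Read the CSV 100,000 rows at a time; each iteration yields a regular DataFrame
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    print(chunk.shape)
```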

If we think that our data follows a fairly easy-to-handle distribution, like a Gaussian, then we can perform our desired processing and visualisations on one chunk at a time without too much loss in accuracy.

If our distribution is a bit more complex, like a Poisson, then it’s best to filter each chunk and put all of the small pieces together before processing.

Most of the time, you’ll end up dropping many irrelevant columns or removing rows that have missing values.

We can do that for each chunk to make them smaller, then put them all together and perform our data analysis on the final dataframe.

The code below performs all of these steps.
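A minimal sketch of those steps follows. The file name data.csv and the irrelevant column names (name, account_number) are hypothetical placeholders:

```python
import pandas as pd

irrelevant_columns = ["name", "account_number"]  # hypothetical columns to drop
chunk_results = []

# Step 1: read in one chunk at a time
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    # Step 2: process the chunk -- drop irrelevant columns and rows with missing values
    chunk = chunk.drop(columns=irrelevant_columns)
    chunk = chunk.dropna()

    # Step 3: save the results of the chunk
    chunk_results.append(chunk)

# Steps 4 and 5: combine all of the chunk results into a final dataframe
df = pd.concat(chunk_results, ignore_index=True)
```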

Dropping data

Sometimes, we’ll know right off the bat which columns of our dataset we want to analyse.

In fact, it’s often the case that there are several columns we don’t care about, like names, account numbers, etc.

Skipping over those columns directly when reading in the data can save tons of memory.

Pandas allows us to specify the columns we would like to read in (see the example below). Throwing away the columns containing that useless miscellaneous information is going to be one of your biggest memory savings.
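A minimal sketch using the usecols argument of read_csv(); the file name and column names below are hypothetical placeholders:

```python
import pandas as pd

# Only read in the columns we actually care about
columns_to_keep = ["date", "price", "volume"]

df = pd.read_csv("data.csv", usecols=columns_to_keep)
```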

The other thing we can do is filter out any rows with missing or NA values.

This is easiest with the dropna() function (see the sketch after this list). There are a few really useful arguments that we can pass to dropna():

- how: lets you specify either “any” (drop a row if any of its columns are NA) or “all” (drop a row only if all of its columns are NA)
- thresh: sets a threshold of how many non-NA values a row needs in order to be kept
- subset: selects the subset of columns that will be checked for NA values

You can use those arguments, especially thresh and subset, to get really specific about which rows will be dropped.
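Here's a minimal sketch of those arguments in use, assuming a hypothetical data.csv with price and volume columns:

```python
import pandas as pd

df = pd.read_csv("data.csv")

# Drop a row only if all of its columns are NA
df = df.dropna(how="all")

# Keep only rows that have at least 3 non-NA values
df = df.dropna(thresh=3)

# Only check the "price" and "volume" columns for NA values
df = df.dropna(subset=["price", "volume"])
```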

Pandas doesn’t come with a way to do this at read time like with the columns, but we can always do it on each chunk as we did above.

Set specific data types for each column

For many beginner Data Scientists, data types aren’t given much thought.

But once you start dealing with very large datasets, dealing with data types becomes essential.

The standard practice tends to be to read in the dataframe and then convert the data type of a column as needed.

But with a big dataset, we really have to be memory-space conscious.

There may be columns in our CSV, such as floating point numbers, which will take up way more space than they need to.

For example, if we downloaded a dataset for predicting stock prices, our prices might be saved as 32-bit floating point numbers!

But do we really need 32-bit floats? Most of the time, stocks are bought at prices specified to two decimal places.

Even if we wanted to be really accurate, float16 is more than enough.

So instead of reading in our dataset with the columns’ original data types, we’re going to specify the data types we want pandas to use when reading in our columns.

That way, we never use up more memory than we actually need.

This is easily done using the dtype parameter in the read_csv() function.

We can specify a dictionary where each key is a column in our dataset and each value is the data type we want to use for that column.
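Here's what that looks like in pandas. It's a minimal sketch: the file stock_prices.csv and the column names are hypothetical placeholders, with float32 standing in as the smaller type:

```python
import pandas as pd
import numpy as np

# Map each column to the (smaller) data type pandas should use while reading
column_dtypes = {
    "open": np.float32,
    "close": np.float32,
    "volume": np.int32,
}

df = pd.read_csv("stock_prices.csv", dtype=column_dtypes)
print(df.dtypes)
```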

That concludes our tutorial! Hopefully those 3 tips can save you a lot of time and a lot of memory!

Like to learn? Follow me on twitter where I post all about the latest and greatest AI, Technology, and Science! Connect with me on LinkedIn too!

Recommended Reading

Want to learn more about Data Science? The Python Data Science Handbook is the best resource out there for learning how to do real Data Science with Python!

And just a heads up, I support this blog with Amazon affiliate links to great books, because sharing great books helps everyone! As an Amazon Associate I earn from qualifying purchases.
