The most (time) efficient ways to import CSV data in Python

Mihail Yanchev, Jul 13, 2018

At some point in my work experience in the commercial banking sector I faced the issue of importing somewhat big files in CSV or other text formats in R.

At the time I managed with the first methods I found on the internet and that was good enough.

However, later on I decided to explore the different ways to do that in R and Python and check how much time each of the methods I found takes depending on the size of the input files.

It turns out, as will become clear further down in this post, that if you face the issue of larger input files (like several hundred MB or more), it is probably best to use some sort of partitioning or parallel processing for the import.

This post is dedicated to Python only, as there is already a similar post about R by Olga Mierzwa-Sulima.

This post will not focus on super accurate benchmarking, but rather on how to perform each of these CSV import options and how to run some simple benchmarking ourselves.

We can draw some conclusions afterwards on the basis of this casual analysis.

This post focuses only on CSV files, as these are quite often the format of choice for tabular data.

The options that I will cover here are: csv.DictReader(), pandas.read_csv(), dask.dataframe.read_csv(), datatable.fread(), paratext.load_csv_to_dict() and paratext.load_csv_to_pandas().

This is by no means an exhaustive list of all methods for CSV import.

However, it is probably a somewhat exhaustive list of the most common methods to do that.

Just a sidenote: all of the code I use here is compatible with both Python 2.7 and Python 3.7.

Generating a CSV file with random data

First we will create a CSV file with some random data in it (integers, floats and strings).

I will not go too deep into the script used to create it.

It is probably one of many ways to do that.

The script uses numpy and pandas methods to create arrays of random numbers or sequences of characters.

I decided to go for 1 000 000 rows, which results in a CSV of around 100.7 MB.

This is a realistic size of a normal data file for many fields of work, but still very humble compared to the multi-gigabyte and terabyte-sized data sets being collected in a growing number of fields nowadays.

import pandas as pd
import numpy as np

# setting the number of rows for the CSV file
N = 1000000

# creating a pandas dataframe (df) with 7 columns and N rows of random integers
# between 999 and 999999, with column names from A to G
df = pd.DataFrame(np.random.randint(999, 999999, size=(N, 7)), columns=list('ABCDEFG'))

# creating one column 'H' of float type using the uniform distribution
df['H'] = np.random.rand(N)

# creating two additional columns with random strings
df['I'] = pd.util.testing.rands_array(10, N)
df['J'] = pd.util.testing.rands_array(10, N)

# print the dataframe to see what we have created
print(df)

# export the dataframe to csv using comma delimiting
df.to_csv("random.csv", sep=',')

Further down in this post we will generate another larger CSV file using the same code (by increasing N), in order to see whether some methods have a more apparent advantage when importing bigger files.
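If you want to double-check the size of the generated file, one quick way is os.path.getsize:

import os

# os.path.getsize returns the file size in bytes
print("random.csv is %.1f MB" % (os.path.getsize("random.csv") / 1e6))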

How to measure the time efficiency

Here I am interested in how much time it takes to import a CSV file using various methods.

This is pretty straightforward.

Get the time before and after a certain action and then get the difference in seconds:

import time

start_time = time.time()
do_something()
print("%s seconds" % (time.time() - start_time))

The results given by this method can be very volatile.

Every split second different processes run on the computer with different intensity and therefore they interfere with the results of the measurement.

For the purposes of this post I am willing to ignore this interference, since I am more interested in time efficiency in relative rather than absolute terms.

A somewhat important point here is that there is overhead associated with timing itself.

One way to see what the overhead is would be to run the script without anything to actually time.

The code below does exactly this and shows that the overhead is minuscule and can generally be disregarded.

However, if one is trying to benchmark some code which takes very little time (close to the time cost of timing itself) then probably one should be more worried about this.

start_time = time.time()
print("Nothing took %s seconds" % (time.time() - start_time))

Nothing took 9.53674316406e-07 seconds

Another more accurate way to measure elapsed time is to use the timeit package.

According to the package documentation: “It reduces the impact of startup or shutdown costs on the time calculation by executing the code repeatedly.” It can be used inside a script, from the terminal or from the IPython shell.

I decided not to use it in this post, as it did not always produce consistent and meaningful results for me.
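Still, for reference, it can be used from a script in a way similar to this (here timing five repeated pandas imports of the generated file and averaging):

import timeit

# run the import five times and report the average time per run
elapsed = timeit.timeit("pd.read_csv('random.csv')",
                        setup="import pandas as pd",
                        number=5)
print("pd.read_csv took %s seconds on average" % (elapsed / 5))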

Now, since we have the main tools to do an accurate comparison, let us move on to the main part of this post.

csv.DictReader()

DictReader is a Python class which maps the data it reads to a dictionary, whose keys, unless specified otherwise, come from the first row of the CSV.

All values in the subsequent rows will be dictionary values and can be accessed with the respective dictionary key.

However, these values will be imported as strings.

I personally do not find this method very useful for data analysis tasks, but it is valuable in certain other cases (working with the JSON format).

You would go about it in the following way:

import csv

with open("random.csv") as csvfile:
    data = csv.DictReader(csvfile)
    for row in data:
        print(row['A'])

Or alternatively with:

data = csv.DictReader(open("random.csv"))
for row in data:
    print(row['A'])

The two snippets above will create the CSV reader object (data) and print all the values from the column named ‘A’.

pandas.read_csv()

Pandas data frames are data structures tailored for data analysis and data science work.

They are suited for tabular data with heterogeneously-typed columns (integers, floats, strings like the random data we already created), as well as time series.

Many of us work with such data coming from SQL tables or Excel files on a daily basis.

Pandas data frames can also be used as vectors and matrices, for example if you want to perform linear algebra operations like matrix multiplication.

Pandas are my preferred data structure for data analysis tasks, as they are very easy to manipulate and transform.

In a sense they are very similar to data.frame and data.table in R.

import pandas as pd
data = pd.read_csv("random.csv")

csv”)According to many opinions I have come across the pandas package is very well optimized and thus very efficient in many tasks including basic ones as data import.

An important point here is that pandas.read_csv() can be run with the chunksize option.

This will break the input file into chunks instead of loading the whole file into memory.

This will reduce the pressure on memory for large input files and, given an optimal chunksize found through trial and error, there can be a significant increase in efficiency.

The code below will split the input file into chunks of 100 000 lines:

chunks = pd.read_csv(input_file, chunksize=100000)
data = pd.concat(chunks)

The difference from all the other methods is that, after reading the file chunk by chunk, one needs to concatenate the chunks afterwards.

I found this very useful post by Goutham Balaraman, where he also explains how splitting the input file into chunks can help speed up further processing.
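As an illustration of that idea, something like the following would process the file chunk by chunk without ever concatenating, computing the mean of the float column 'H' from the generated file:

import pandas as pd

# stream the file in chunks of 100 000 rows and aggregate without
# holding the whole dataset in memory
total, count = 0.0, 0
for chunk in pd.read_csv("random.csv", chunksize=100000):
    total += chunk['H'].sum()
    count += len(chunk)

print("mean of column H: %s" % (total / count))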

Later he edited his post saying that he finds dask.dataframe superior to this approach, which brings us to the next method.

dask.dataframe

A dask.dataframe is a collection of smaller pandas data frames split along the index (the row labels used for identification of the data), which can be processed in parallel on a single machine or on multiple machines in a cluster.

I learnt about dask from this useful post by Eric Brown.

You can use it with the following Python script:

import dask.dataframe
data = dask.dataframe.read_csv("random.csv")

Apparently, unlike with pandas, the data is not fully loaded into memory with dask, but is ready to be processed.

Also, certain operations can again be performed without loading the whole dataset into memory.

Another advantage is that most functions used with pandas can also be used with dask.

The differences arise from the parallel nature of dask.
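As a small illustration of that lazy behaviour, nothing is actually read until .compute() is called, which then runs the work in parallel over the underlying partitions:

import dask.dataframe as dd

# this only builds a task graph; the CSV is not loaded yet
data = dd.read_csv("random.csv")

# the mean of column 'H' is computed chunk by chunk when .compute() is called
print("mean of column H: %s" % data['H'].mean().compute())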

datatable

As Carlos Ortega pointed out to me in the comments section, data.table from R has been ported to Python.

data.table is known for being faster than the traditional R data frame both for data import and for subsequent manipulation.

The usage is quite similar to that in R. One can use it in the following way:

import datatable as dt
data = dt.fread("random.csv")
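Since the rest of this post works mostly with pandas, it is worth noting that the resulting datatable Frame can be converted to a pandas DataFrame with its to_pandas() method, for example:

import datatable as dt

# read the CSV into a datatable Frame, then hand it over to pandas
frame = dt.fread("random.csv")
df = frame.to_pandas()
print(df.head())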

Paratext

Paratext is a C++ library built to read text files using parallel processing.

From this paper on benchmarking paratext against other CSV import methods and this discussion on GitHub, it is evident that paratext is not really tailored for low-end consumer hardware.

The benchmarks were run in a server environment on machines with multiple SSDs and at least 16 cores.

It cannot be installed directly from Homebrew or pip, but there is a good guide on how to install it, along with a list of the dependencies, on the GitHub page of the library.

On my MacBook the installation was pretty straightforward, but for some reason I could never get it to work with larger CSV files: the import would just freeze.

This is probably due to a hardware bottleneck and the relatively big overhead of setting up parallel processing.

I spent several hours trying to install it on my Windows PC (as it has a much better processor and more memory), but I never managed to do it, as compiling Python packages under Windows can apparently be pure hell.

Anyway, I still want to try it, and my best bet is to find a Linux machine with more resources.

With paratext you can import a CSV either into pandas or into a dictionary, similarly to two of the other methods I described above.

In order to do either of those, one can use the following code snippets:

import paratext
data = paratext.load_csv_to_pandas("random.csv")

or

data = paratext.load_csv_to_dict("random.csv")

As you can see in this Reddit post, Damian Eads (who apparently is the man behind Paratext) explains that you need a CSV file of at least 100 MB in order to see the benefits of the library.

The overhead, or the initial cost, of setting up parallel processing is not negligible, and therefore the gains from using it will be more apparent in more formidable tasks like importing a multi-gigabyte data set.

But again, one needs the proper setup in order to reap the benefits of this library.

Due to my hardware/software problems with getting paratext to work properly, I excluded the method from the following comparison.

Simple Benchmarking

I perform the comparison for the Python methods with the following code, all of which was already covered above:

import pandas as pd
import time
import csv
import paratext
import dask.dataframe

input_file = "random.csv"

start_time = time.time()
data = csv.DictReader(open(input_file))
print("csv.DictReader took %s seconds" % (time.time() - start_time))

start_time = time.time()
data = pd.read_csv(input_file)
print("pd.read_csv took %s seconds" % (time.time() - start_time))

start_time = time.time()
data = pd.read_csv("random.csv", chunksize=100000)
print("pd.read_csv with chunksize took %s seconds" % (time.time() - start_time))

start_time = time.time()
data = dask.dataframe.read_csv(input_file)
print("dask.dataframe took %s seconds" % (time.time() - start_time))
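The script above does not time datatable, but its results are reported below as well; a block following the same pattern would look something like this:

import datatable as dt

# time the datatable import of the same file
start_time = time.time()
data = dt.fread(input_file)
print("datatable took %s seconds" % (time.time() - start_time))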

The results I get are the following:

csv.DictReader took 0.00013113021850585938 seconds
pd.read_csv took 1.9808268547058105 seconds
pd.read_csv with chunksize took 2.1136152744293213 seconds
dask.dataframe took 0.06910109519958496 seconds
datatable took 0.13840913772583008 seconds

csv.DictReader is by far the fastest, but as I pointed out earlier it imports everything as strings, while the other methods try to guess the data types of each column separately and possibly do multiple other validations upon import.

This means that if you use csv.DictReader for data analysis tasks you will possibly have to do a number of transformations on it before you can analyze the contents (e.g. change the data types to the ones you expect).

I tried to run the same code, but with a larger CSV file.

I generated one which is 5 times bigger than the previous one (with 5 000 000 rows and a size of around 487 MB).

I got the following results:

csv.DictReader took 9.799003601074219e-05 seconds
pd.read_csv took 11.01493215560913 seconds
pd.read_csv with chunksize took 11.402302026748657 seconds
dask.dataframe took 0.21671509742736816 seconds
datatable took 0.7201321125030518 seconds

Ignoring csv.DictReader, dask.dataframe is by far the quickest method.

This is understandable, since it does not load the whole data set into memory, as the pandas methods do.

I re-ran the test with a CSV file of 10 000 000 rows and a size of around 990 MB.

The results are the following:

csv.DictReader took 0.00013709068298339844 seconds
pd.read_csv took 23.0141019821167 seconds
pd.read_csv with chunksize took 24.249807119369507 seconds
dask.dataframe took 0.49848103523254395 seconds
datatable took 1.45100998878479 seconds

Again ignoring csv.DictReader, dask is by far the fastest.

However, datatable also performs pretty well.

Summary

Pandas is very popular for data science work and has integration with a lot of other libraries.

I would recommend looking into the chunksize option if you are dealing with a large dataset and you are low on RAM (there is an interesting story about that here).

Dask is an option which gives tremendous improvements in import speed.

It is the absolute leader in this simple comparison.

I have not used it very much for data analysis tasks yet, but from what I understand working with it should be quite the same as with pandas, which is promising.

Datatable seems to perform quite decently as well, but I still have not looked into how well it is integrated with other libraries.

Paratext is definitely another good option which works well with pandas, but as I mentioned earlier the setup you run it on is important, as it is really focused towards more formidable tasks like importing terabyte-sized datasets.

csv.DictReader remains the quickest option, but as I mentioned earlier I would not recommend it for data analysis tasks.

Working with the JSON format is possibly one area where this method will be quite useful.

I hope you found this overview useful.

I would be happy to answer any questions or discuss with you the topics I covered here.

EDIT: I did not use the chunksize option correctly.

If you want to process the dataset further on you need to concatenate the separate chunks.

This leads to a longer processing time, but might be more efficient if you are low on RAM.

As pointed out, there is now a version of R’s data.table package for Python.


I have included an overview of datatable for Python in the current post.

