Optimized I/O operations in Python

The pandas library provides a multitude of classes and methods to read and write files in a wide range of formats. We are going to study the following areas of data storage and retrieval here:

- Serialized storage using the pickle module
- I/O operations on textual data
- SQL databases
- I/O with PyTables

The two major factors taken into consideration when optimizing I/O operations in Python are efficiency (performance) and flexibility. Let's dive right into it.

Serialized storage using the pickle module

There are numerous modules in Python which can be easily used in a large-scale deployment setting. You need to store data on disk in order to share it, document it or use it later. The pickle module serializes a Python object into a byte stream, which makes read and write operations swift.
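The snippet being timed is not reproduced in the text; a minimal sketch of what it might have looked like, assuming an array of random floats sized to yield a roughly 9 MB file (the array length and file name are assumptions):

    import pickle
    import numpy as np

    # ~1.1 million float64 values, i.e. roughly 9 MB of raw data (assumed size)
    data = np.random.standard_normal(1_125_000)

    pkl_file = open('data.pkl', 'wb')
    %time pickle.dump(data, pkl_file)  # serialize the array to a byte stream on disk
    pkl_file.close()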
On running the above code snippet, you'll see something like:

    CPU times: user 40.9 ms, sys: 14 ms, total: 54.9 ms
    Wall time: 54.5 ms

The random floats build up a 9 MB file which is serialized to a byte stream and written to disk in about 55 ms.
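Reading the serialized object back from disk is just as straightforward; a sketch, assuming the file written above:

    import pickle

    pkl_file = open('data.pkl', 'rb')
    data = pickle.load(pkl_file)  # deserialize the byte stream back into the array
    pkl_file.close()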
I/O operations on textual data

There are several options for working with string objects and with text files in general. To write a CSV (comma-separated values) file, we can make use of the write and readlines methods:

    # time is a dummy array of time stamps, data a dummy (n, 5) NumPy array
    header = 'time,no1,no2,no3,no4,no5\n'
    csv_file = open('data.csv', 'w')
    csv_file.write(header)
    for t, (a, b, c, d, e) in zip(time, data):
        s = '%s,%f,%f,%f,%f,%f\n' % (t, a, b, c, d, e)
        csv_file.write(s)
    csv_file.close()

    # to read the file back, reopen it and use readlines
    csv_file = open('data.csv', 'r')
    content = csv_file.readlines()
    csv_file.close()

Though Python provides methods to process text files, the pandas library can read and write a variety of data formats and is far easier to work with, be it CSV (comma-separated values), SQL (Structured Query Language), XLS/XLSX (Microsoft Excel files), JSON (JavaScript Object Notation) or HTML (Hypertext Markup Language). pandas makes the entire process of reading and writing CSV files more convenient, concise and fast:

    # data is a pandas DataFrame here
    %time data.to_csv(filename + '.csv')
    # CPU times: user 5.59 s, sys: 137 ms, total: 5.69 s

    # and to read the file back from disk
    pd.read_csv('<path to the CSV file>')
SQL databases

Python comes with built-in support for the SQLite3 database via the sqlite3 module.
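The queries below assume a table named TODO_NUMBER that has already been created and filled with random integers; a minimal sketch of that setup (the database file name, column types and row count are assumptions):

    import sqlite3
    import numpy as np

    con = sqlite3.connect('numbers.db')
    con.execute('CREATE TABLE TODO_NUMBER (Num1 INTEGER, Num2 INTEGER)')

    # populate the table with random integers, one row per array row
    ran_int = np.random.randint(-100, 100, size=(1000, 2))
    con.executemany('INSERT INTO TODO_NUMBER VALUES (?, ?)', ran_int.tolist())
    con.commit()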
Reading the whole table back is much faster than writing it row by row:

    con.execute('SELECT * FROM TODO_NUMBER').fetchall()

If you are dealing with a lot of numbers and arrays in your database, you can use NumPy to read the query results directly into a NumPy ndarray:

    np_query = 'SELECT * FROM TODO_NUMBER WHERE Num1 > 0 AND Num2 < 0'
    res = np.array(con.execute(np_query).fetchall()).round(3)

This is a very handy trick for reading and plotting the results of a query without any hassle.
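For instance, the resulting array can be handed straight to matplotlib; a sketch (the plot style is arbitrary):

    import matplotlib.pyplot as plt

    # scatter Num1 against Num2 for the rows matched by the query
    plt.plot(res[:, 0], res[:, 1], 'ro')
    plt.show()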
I/O with PyTables

PyTables is a Python binding for the HDF5 database standard, designed for fast I/O on large datasets.
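The snippets below assume an open HDF5 file plus the row count and compression filters used when creating tables; a minimal sketch of that setup (the file name and settings are assumptions):

    import tables as tb

    rows = len(ran_int)                # reuse the random integers from above
    h5 = tb.open_file('data.h5', 'w')  # open an HDF5 file for writing
    filters = tb.Filters(complevel=0)  # no compression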
Now let's populate a table: we take the random values and write them to the table row by row, as sketched below.
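The original row-by-row snippet is not reproduced in the text; a sketch of what it might have looked like using PyTables' row pointer (the table description is an assumption):

    # describe a table with two 32-bit integer columns
    row_des = {'Num1': tb.Int32Col(pos=1), 'Num2': tb.Int32Col(pos=2)}
    tab = h5.create_table('/', 'ints_row_by_row', row_des,
                          title='Integers', expectedrows=rows, filters=filters)

    pointer = tab.row
    for n1, n2 in ran_int:
        pointer['Num1'] = n1
        pointer['Num2'] = n2
        pointer.append()  # write one row at a time
    tab.flush()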
But again, there is a more optimized and Pythonic way to achieve the same result, making use of NumPy structured arrays:

    dty = np.dtype([('Num1', '<i4'), ('Num2', '<i4')])
    sarray = np.zeros(len(ran_int), dtype=dty)
    sarray['Num1'] = ran_int[:, 0]  # fill the structured array column by column
    sarray['Num2'] = ran_int[:, 1]

Now that the structured array holds the complete data set, creating the table boils down to a single call:

    %%time
    h5.create_table('/', 'ints_from_array', sarray,
                    title='Integers', expectedrows=rows, filters=filters)

This approach is faster, and we accomplish the same result with fewer lines of code.
