Data Science approaches

Now we have a list of fields going down instead of across.

Next, add the SQL data type to an adjoining column (varchar, date, numeric).

Finally, write a small formula to combine the column name and column data type to create a DDL description.

The screenshot below shows how I did it.

https://github.com/mlexperience/nlp/blob/master/te_fields_loan.ods

https://gist.github.com/mlexperience/8c219015ef7acbc1d5140542bd2f329b

You can retrieve my workbook and DDL file from the links under the screenshots above.

This activity took about 30 minutes since I had to look up the data types for each column.

One hundred and forty-five columns, one by one.
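If you would rather script that step than maintain a spreadsheet formula, the same idea is easy to sketch in Python. The column names and types below are illustrative placeholders, not the real 145-column list from the workbook.

# Build a CREATE TABLE statement from (column name, SQL type) pairs.
# The pairs below are placeholders, not the real loans.csv columns.
columns = [
    ("id", "varchar(40)"),
    ("member_id", "varchar(40)"),
    ("loan_amnt", "numeric"),
    ("issue_d", "date"),
]

body = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
ddl = f"CREATE TABLE public.loan (\n{body}\n);"
print(ddl)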

Execute the DDL

Running the DDL is the easy part, so long as the DDL is technically correct.

Using DBeaver — executing the SQL command is straightforward

Go ahead and use DBeaver: drop the DDL command into an SQL Script window and hit Run.

We create an empty table.
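The same step can also be scripted rather than run through DBeaver. Here is a minimal sketch using psycopg2; the connection details and the trimmed-down DDL are placeholders, not the full 145-column statement.

import psycopg2

# Execute the DDL from Python instead of a DBeaver SQL Script window.
# Connection details and this shortened DDL are placeholders.
ddl = """
CREATE TABLE IF NOT EXISTS public.loan (
    id        varchar(40),
    member_id varchar(40),
    loan_amnt numeric,
    issue_d   date
);
"""

conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="secret")
with conn:                       # commits on success, rolls back on error
    with conn.cursor() as cur:
        cur.execute(ddl)
conn.close()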

Import the data file

Now that we have created a table matching the characteristics of our data, we are ready to import the file.

DBeaver provides a mechanism to import a CSV file into a table.

Using the import wizard

Since we created our DDL and table from the columns of our ‘loans.csv’ file, we know that all the columns in our loan table will match exactly with each column in our dataset.

That is an essential tip; otherwise, you will spend time changing the target and source mapping, and that can be painful.

Further options include truncating (clearing) the table.

Go ahead and press Next, and your data transfer will proceed in the background.
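While the wizard runs, it is worth noting that the same load can be scripted. Here is a sketch using Postgres COPY through psycopg2; the file path and connection details are placeholders, and it assumes the table columns match the CSV exactly, as above.

import psycopg2

# Scripted alternative to the DBeaver import wizard: stream the CSV into the
# table with Postgres COPY. Path and connection details are placeholders.
conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="secret")
with conn:
    with conn.cursor() as cur, open("loans.csv", encoding="utf-8") as f:
        cur.copy_expert(
            "COPY public.loan FROM STDIN WITH (FORMAT csv, HEADER true)", f
        )
conn.close()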

Eventually, we get this pleasant news!

Our data transfer took 8 minutes 32 seconds — a mere lifetime for most of us!

Check your work!

Always check your work.

If the table has loaded successfully, we should be able to verify that the row count in our Postgres database matches what Pandas reported.

Running an SQL query over our new table

Our SQL reports 2,260,668 rows versus the (2260668, 145) shape reported by Pandas.

The numbers agree, and it seems our data is now loaded.
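A compact way to run that check end to end might look like the sketch below; the connection details are placeholders, and the CSV is re-read here only so the example stands alone.

import pandas as pd
import psycopg2

# Row count from Postgres.
conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="secret")
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM public.loan;")
    print("Postgres rows:", cur.fetchone()[0])
conn.close()

# Shape from the in-memory DataFrame.
df = pd.read_csv("loans.csv", low_memory=False)
print("Pandas shape:", df.shape)     # (rows, columns)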

#SQL
1 row(s) fetched – 401ms

#Python Pandas
794 ns ± 3.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

You can see that DBeaver and Postgres took 401 ms to count the rows, while the in-memory approach of Python Pandas took a mere 794 ns, helping to demonstrate why we talk about in-memory and Pandas.

In-memory is rapid compared with persistent-storage-based approaches.

That whole exercise to load Postgres with the large table took about thirty minutes of work, which involved building the DDL, executing the SQL, importing the data, and checking the work.

There is now nothing in-memory as all our data is in the Database.

With SQL approaches, it is usually a never-ending story.

We need an index

CREATE INDEX loan_id_idx ON public.loan USING btree (id, member_id)

If you try to query a table without an index, it will take forever!

In-memory versus Database

So far, we did a quick load of the dataset using Pandas and noticed the initial load came in at 2.4 GB for a 1.1 GB disk file.

We expect that would break less powerful computers with an ‘out of memory’ error.

Next, we created a DDL and imported the data into Postgres.

The traditional SQL approach is slow compared to the in-memory approach, at least when the data fits in memory.

Let’s run a simple query and see how they compare.
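A minimal sketch of that comparison, assuming a local Postgres with the loan table loaded; the connection details are placeholders and the query is only an example, while the notebook linked below contains the real one.

import pandas as pd
import psycopg2

# Run a simple SQL statement through the Python wrapper and hand the result
# set to Pandas. Connection details and the query itself are placeholders.
conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="secret")
query = "SELECT id, member_id, loan_amnt FROM public.loan LIMIT 100000;"
result = pd.read_sql(query, conn)    # DataFrame built from the result set
conn.close()
print(result.shape)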

https://github.com/mlexperience/nlp/blob/master/postgres_article-%20phase%202.ipynb

Using psycopg2 to build a query and transfer the result set to Pandas

A small example running a simple SQL statement using the Python wrapper and from DBeaver shows pretty much the same execution time (2.409 s versus 2.209711 s).

Doing some summary statistics

The real strength of Pandas is the vectorised operations that we can perform quickly and efficiently.

Let’s do some statistics.

Hacking out some statistics in SQL

Doing some statistics with Pandas is much simpler

Notice that the SQL completes in 688 ms, whereas Pandas can compute many more statistics in 1/10th of the time.

Pandas is fast! And we don’t have to hack out SQL commands to get a measure of our column of data.
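To give a flavour of the difference, here is a hedged sketch of the two approaches side by side. The column name loan_amnt is an assumption about the dataset, and the connection details are placeholders.

import pandas as pd
import psycopg2

# Hand-written SQL aggregates. The column name loan_amnt is an assumption.
conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="secret")
sql_stats = pd.read_sql(
    "SELECT count(loan_amnt), avg(loan_amnt), stddev(loan_amnt), "
    "min(loan_amnt), max(loan_amnt) FROM public.loan;",
    conn,
)
conn.close()
print(sql_stats)

# With the data already in memory, one vectorised call gives count, mean,
# std, min, quartiles and max.
df = pd.read_csv("loans.csv", usecols=["loan_amnt"])
print(df["loan_amnt"].describe())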

What about getting the size down so we can load the entire dataset?

A trick

We can make Pandas work faster by cleaning up the data and assigning the correct type to each column.

The memory required goes from 2.4+ GB down to 2.0 GB by updating all 36 columns with categorical data.
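A minimal sketch of that kind of conversion follows; the 50% uniqueness threshold is an arbitrary illustration rather than a rule from this article.

import pandas as pd

# Convert low-cardinality object columns to the 'category' dtype and compare
# deep memory usage before and after. The 0.5 threshold is arbitrary.
df = pd.read_csv("loans.csv", low_memory=False)
print(round(df.memory_usage(deep=True).sum() / 1e9, 2), "GB before")

for col in df.select_dtypes(include="object"):
    if df[col].nunique(dropna=True) / len(df) < 0.5:
        df[col] = df[col].astype("category")

print(round(df.memory_usage(deep=True).sum() / 1e9, 2), "GB after")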

Naturally, you will find many articles on getting the size of a dataframe down.

‘Optimizing the size of a pandas dataframe for low memory environment’ by Vincent Teyssier, for instance, provides my trick and many more.

Closing

Our little voyage of discovery must draw to some conclusion.

Why is everyone talking about, writing about, and teaching Pandas and other in-memory techniques? Because it is fast, user-friendly, and powerful when compared to the traditional SQL approach with smaller datasets.

For big data, in-memory strategies can still work, but we should work with streams, chunks, or clusters such as Spark.
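As one example of the chunked style, Pandas can process the same CSV in slices without ever holding the whole table in memory; the chunk size below is arbitrary.

import pandas as pd

# Process the CSV in chunks so memory stays flat. The chunk size is arbitrary,
# and the per-chunk work here is just a row count.
total_rows = 0
for chunk in pd.read_csv("loans.csv", chunksize=100_000, low_memory=False):
    total_rows += len(chunk)

print(total_rows)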

So couldn’t we use SQL instead of Pandas? I believe we should use SQL where it makes sense, and that helps avoid risks such as:

- Moving large volumes of data around the network can be expensive, time-consuming, and wasteful of resources.
- Moving and storing data in silos (local PC, mobile) can lead to breaches of Data Protection Policy (GDPR in Europe).
- Moving data out of the database can change the protection of ‘Confidential’ data to unprotected and unencrypted, which could expose company secrets.

Each Data Science project requires careful planning.

Even though computer memory is cheaper than ever, we do see a considerable increase in data volumes.

Should you try to load large tables into memory? Could you work with a sample of the data? Your approach will be domain and project specific, and no template fits all.
