Why every Data Scientist should know SQL

But apparently that is exactly what a lot of data scientists actually do.

They “order” all of the items in the data warehouse and then use tools like Pandas dataframes to pick out the data they need and discard the rest.

A lot of data that data scientists work with comes from databases.

This is especially true in enterprise environments where business data resides in relational database systems, data marts, and data warehouses.

Shortly after our dashboard discussion, I met with a Database Administrator (DBA) at one of the big banks.

Their CEO was sold on the idea that data science could help transform the company, and data science teams had been cropping up all over the company in recent months. But that was when the DBA’s job had started to become “hell”.

DBAs run a tight ship.

They tune the system and queries to the umpteenth degree so the database can hum along fine responding to predictable queries efficiently.

And then along comes a hotshot data scientist running a data experiment, who has somehow gotten hold of the database credentials and runs a huge query like “SELECT * FROM ENROLLMENTS” against an operational database.

The database slows to a crawl, and the company’s clients on the website start seeing database errors and timeouts.

And the DBA responsible for the database gets called to the boss’s office.

I may have exaggerated a bit, but I think you get the point.

If you want to get some specific data from a relational database, it’s highly wasteful to run a query like “SELECT * FROM ENROLLMENTS”, especially if the table contains millions of rows.
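To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database as a stand-in for a real system. The ENROLLMENTS columns (student_id, course, term) and the sample rows are invented for illustration; only the table name comes from the example above.

```python
import sqlite3

# Hypothetical ENROLLMENTS table: the columns and rows are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ENROLLMENTS (student_id INTEGER, course TEXT, term TEXT)")
conn.executemany(
    "INSERT INTO ENROLLMENTS VALUES (?, ?, ?)",
    [
        (1, "SQL101", "2019A"),
        (2, "SQL101", "2019B"),
        (3, "ML201", "2019A"),
        (4, "SQL101", "2019A"),
    ],
)

# Wasteful: pull every row and every column to the client, then filter in Python.
all_rows = conn.execute("SELECT * FROM ENROLLMENTS").fetchall()
filtered_in_python = [r[0] for r in all_rows if r[1] == "SQL101" and r[2] == "2019A"]

# Better: let the database do the filtering and return only the column you need.
filtered_in_sql = [
    row[0]
    for row in conn.execute(
        "SELECT student_id FROM ENROLLMENTS "
        "WHERE course = ? AND term = ? ORDER BY student_id",
        ("SQL101", "2019A"),
    )
]

print(len(all_rows), "rows transferred by SELECT *")
print(len(filtered_in_sql), "rows transferred by the targeted query")
```

Both approaches end up with the same student IDs, but the second one moves far less data across the connection. On a table with millions of rows, that difference is what the DBA notices.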

Not long after this meeting with the bank DBA, I realized that Data Scientists needed help working with databases (and so did DBAs who now had to deal with Data Scientists).

By combining my well-honed database skills with my newly minted Data Science skills, I could help Data Scientists (and those aspiring to become one) work more efficiently with databases and SQL.

Working with my colleagues Hima Vasudevan and Raul Chong, we launched the course Databases and SQL for Data Science on Coursera.

It is an online self-study course that you can complete at your own pace.

This course introduces relational database concepts and helps you learn and apply knowledge of the SQL language.

It also shows you how to perform SQL access in a data science environment like Jupyter notebooks.

The course requires no prior knowledge of databases, SQL, Python, or programming.

It has four modules, and each requires 2 to 4 hours of effort to complete.

Topics covered include:

Module 1:
- Introduction to Databases
- How to Create a Database Instance on Cloud
- CREATE Table Statement
- SELECT Statement
- INSERT Statement
- UPDATE and DELETE Statements
- Optional: Relational Model Concepts

Module 2:
- Using String Patterns, Ranges
- Sorting Result Sets
- Grouping Result Sets
- Built-in Functions, Dates, Timestamps
- Sub-Queries and Nested Selects
- Working with Multiple Tables
- Optional: Relational Model Constraints

Module 3:
- How to access databases using Python
- Writing code by Using DB-API
- Connecting to a Database by Using ibm_db API
- Creating Tables, Loading Data, and Querying Data from Jupyter Notebooks
- Analyzing Data with SQL and Python
- Optional: INNER JOIN, LEFT, RIGHT OUTER JOIN

Module 4:
- Working with Real-world Data Sets
- Assignment: Analyzing Chicago Data Sets using SQL and Python

The emphasis in this course is hands-on and practical learning.

As such, you will work with real databases, real data science tools, and real-world datasets.

You will create a database instance in the cloud.

Through a series of hands-on labs, you will practice building and running SQL queries using cloud-based tools.

You will also learn how to access databases from Jupyter notebooks by using SQL and Python.
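As a rough idea of what database access from a notebook cell can look like, the sketch below follows the general Python DB-API pattern (connect, get a cursor, execute, fetch, close). It uses the standard-library sqlite3 module with an in-memory database purely as a stand-in; the course labs use a cloud database instead, and the table contents and the count_by_course helper here are hypothetical.

```python
import sqlite3


def count_by_course(conn):
    """Return (course, enrollment_count) pairs, aggregated by the database."""
    cur = conn.cursor()
    try:
        # The GROUP BY runs inside the database, so only the summary rows come back.
        cur.execute(
            "SELECT course, COUNT(*) FROM ENROLLMENTS "
            "GROUP BY course ORDER BY course"
        )
        return cur.fetchall()
    finally:
        cur.close()


# Stand-in database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ENROLLMENTS (student_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO ENROLLMENTS VALUES (?, ?)",
    [(1, "SQL101"), (2, "SQL101"), (3, "ML201")],
)
conn.commit()

summary = count_by_course(conn)
for course, n in summary:
    print(course, n)

conn.close()
```

Other DB-API drivers follow the same overall shape, though connection strings and parameter placeholder styles differ from driver to driver.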

Anyone can audit this course at no charge.

If you want a certificate and access to the graded components of the course, there is currently a limited-time price of $39 USD.

And if you are looking for a Professional Certificate in Data Science, this course is one of the 9 courses in the IBM Data Science Professional Certificate.

So if you are interested in learning SQL for Data Science, you can enroll now and audit for free.
