SQL and Pandas

Well, let me tell you.

Pandas

Unlike SQL, Pandas has built-in functions that help when you don’t even know what the data looks like.

This is especially useful when the data is already in a file format (.csv, .txt, .tsv, etc.).

Pandas also allows you to work on data sets without impacting database resources.

I’ll explain and show some examples of a few of the functions that I really like.

pandas.read_csv()

First you need to pull the data into a dataframe.

Once you’ve set it to a variable name (‘df’ below), you can use the other functions to analyze and manipulate the data.

I used the ‘index_col’ parameter when loading the data into a dataframe.

This parameter is setting the first column (index = 0) as the row labels for the dataframe.

You can find other helpful parameters here.

Sometimes you have to play around with the parameters before the data is in the correct format.

This function won’t return an output if it is set to a variable, but once set you can use the next function to view the data.

# Command to import the pandas library to your notebook
import pandas as pd

# Read data from Titanic dataset
df = pd.read_csv('titanic.csv', index_col=0)  # Location of file, can be a url or local folder structure
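For instance, here is a minimal sketch of a few of those parameters in action; the file name, separator, and missing-value marker below are hypothetical, just to show the shape of the call:

# Hypothetical tab-separated file where '?' marks missing values
df = pd.read_csv('data.tsv', sep='\t', na_values=['?'], nrows=1000)  # nrows limits how much is loaded for a first look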

pandas.DataFrame.head()

The head function is really helpful in just previewing what the dataframe looks like after you have loaded it up.

The default is to show the first 5 rows, but you can adjust that by typing .head(10).

df.head()

First 5 rows of dataframe

This is a great place to start.

We can see that there is a combination of strings, ints, floats, and that some columns have NaN values.

pandas.DataFrame.info()

The info function will give a breakdown of the dataframe columns and how many non-null entries each has.

It also tells you what the data type is for each column and how many total entries are in the dataframe.

df.info()

pandas.DataFrame.describe()

The describe function is really useful for seeing the distribution of your data, particularly numerical fields like ints and floats.

As you can see below, it returns a dataframe with the mean, min, max, standard deviation, etc. for each column.

df.describe()

In order to see all columns, not just numeric ones, you’ll have to use the include parameter shown below.

Notice that ‘unique’, ‘top’, and ‘freq’ have been added.

These are only shown for non-numeric data types and are NaN for numeric ones. Likewise, the numeric statistics from above are NaN for these non-numeric columns.

df.describe(include='all')

pandas.DataFrame.isna()

The isna function on its own isn’t particularly useful, since it returns the whole dataframe with either False if the field is populated or True if it is a NaN or NULL value.

If you include .sum() with isna(), then you’ll get an output like the one below with a count of NaN or NULL fields for each column.

df.isna().sum()
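If you’d rather see the share of missing values per column than raw counts, a quick variation on the same idea (a sketch, using the same dataframe):

# mean() of the boolean dataframe gives the fraction of NaN values per column
df.isna().mean() * 100  # percent missing per column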

pandas.DataFrame.plot()

The Pandas plot function is really useful for getting a quick visualization of your data.

This function uses matplotlib for visualizations, so if you are familiar with that library, this will be easy to understand.

You can find all the different parameters you can use for this function here.

df.plot(x='age', y='fare', kind='scatter')
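The same interface supports other chart types through the kind parameter; for example, a quick histogram of a single column (using the Titanic ‘age’ column from above, with a hypothetical bin count):

# Distribution of passenger ages; in a plain script, call matplotlib.pyplot.show() to display it
df['age'].plot(kind='hist', bins=20)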

These are just some of the useful Pandas functions I’ve used for initial data analysis. There are many other functions for data analysis and manipulation that you can find here.

When to use SQL vs. Pandas

Which tool to use depends on where your data is, what you want to do with it, and what your own strengths are.

If your data is already in a file format, there is no real need to use SQL for anything.

If your data is coming from a database, then you should go through the following questions to understand how much you should use SQL.

Questions to answer:

How much access do you have to the DB?

If you only have access to write a query and someone else runs it for you, you won’t really be able to look at your data.

This is a time when you should just pull all the data you think you might need and export it into a csv to use with pandas.

Another consideration: if a query you need to run is going to take up a lot of database resources, and you know your database admin wouldn’t allow or like it, then just pull the data and do the work outside of the database with pandas.

Avoid SELECT * in your queries, especially when you aren’t sure how much data could be in a table.

How do you want to transform/join your data?

If you already know some of the things you want to do with the data, like filtering out certain values, joining to another table, or combining certain fields in a calculation or concatenation, it’s going to be easier to run SQL to pull the data as you want it and then export it into a csv for any data analysis or data science work.
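To make that concrete, here is a minimal sketch of the workflow, filter and join in SQL, then export for pandas; the database file, table names, and columns are hypothetical, and SQLite stands in for whatever database you actually use:

import sqlite3
import pandas as pd

conn = sqlite3.connect('mydata.db')  # hypothetical database file
query = """
    SELECT p.passenger_id, p.age, t.fare
    FROM passengers p
    JOIN tickets t ON t.passenger_id = p.passenger_id
    WHERE p.age IS NOT NULL
"""
df = pd.read_sql_query(query, conn)  # works with any DB-API connection
conn.close()

df.to_csv('passengers_with_fares.csv', index=False)  # saved for later pandas work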

What are your strengths?

The biggest question is where your strengths lie. If you feel more comfortable in one or the other, then stick with that language for your data manipulation.

To Summarize:

Both of these tools are really useful.

I recommend learning both.

The combination will give you the ability to do a broad range of data analysis and manipulation efficiently.

Soon, you won’t have to deal with Excel crashing on you anymore.

*Note: If you are working with really large data sets, you can use Dask, which is built on Pandas specifically for big data.
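As a teaser, a tiny sketch of what that can look like, assuming Dask is installed; the file pattern and column name are hypothetical, and the dataframe API mirrors Pandas with computation deferred until .compute():

import dask.dataframe as dd

# Reads many csv files lazily as one logical dataframe
ddf = dd.read_csv('data/part-*.csv')
print(ddf['fare'].mean().compute())  # triggers the actual computation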

I may do a write-up on Dask basics in the future.
