Getting set up in PostgreSQL
Using PSQL and Python Pandas together
Sherry Yang · Jun 4

In the I SPY children’s series, the author asks us to search one big, busy image for the smaller images we want to find.
The search is not easy (by design), which is the opposite of what we hope for with our SQL queries.
One of the big challenges I faced after taking introductory tutorials in writing structured query language (SQL) queries was how to practically use these queries.
Tutorials give us a nice environment to start out in but don’t always provide an accessible bridge to working on our own.
Here’s a quick walkthrough of SQL, and a tutorial for querying with a PostgreSQL server.
First we need to understand why we need databases.
While learning as students, we often work with relatively small data sets that can be saved on our own machines.
However, in practice we need the ability to store and retrieve data that is too large for an individual machine’s memory.
The need for databases originally arose from handling what was considered big data at the time.
Databases organize information systematically to allow for storage, manipulation and management.
There are four main types of database management systems (DBMS): hierarchical, network, relational, and object-oriented relational DBMS.
SQL is used to deal with relational database management systems or servers.
These servers go by names such as SQL Server, MySQL, SQLite, and PostgreSQL, which can make the use of the word SQL as both a reference to the language and to the relational database confusing.
“Several database platforms use SQL, but a slight variation on it — each tend to have a slightly different syntax.” (Source) We can also use SQLAlchemy, a Python toolkit that lets us write and run SQL against a server from Python code.
We have both open source and licensed servers available to us.
Here’s a brief look at a few comparisons of database management systems: Microsoft SQL Server vs. MySQL, PostgreSQL vs. MySQL, and PostgreSQL vs. MongoDB, for instance.
PostgreSQL is a good choice because it has been in development for more than 30 years, is ACID-compliant (Atomicity, Consistency, Isolation, Durability, which loosely means protection against errors), and has become the open-source relational database of choice for many organizations. (psql, which we use below, is simply PostgreSQL’s interactive terminal.)
It is used to build applications and trusted to protect data integrity (Source).
To get started, we’ll work with the csv files from the Kaggle dataset Shopify App Store.
In the workplace we would likely not start with a csv file of data because, in principle, we are trying to work with data that is too large to be saved in a mere csv file.
However, because we are learning, we will use a small set of data that doesn’t require a server to view, but can be queried for practice.
We’ll take a look at the pricing data for Shopify’s app marketplace.
Shopify, a Canadian e-commerce site, competes against Amazon for sellers.
It “powers over 600,000 businesses” that have sold over $82bn through the platform.
In some cases, Shopify beats out Amazon as the choice e-commerce platform on the market.
We can see there are five tables in this dataset — apps, categories, plan_features, pricing_plans, reviews.
We will first create an Entity Relationship Diagram (ERD) to understand the relationship between columns of data in the tables.
That will allow us to:

- Determine foreign and primary keys
- Determine the data type of each column (a good rule of thumb: only columns with data that we intend to do mathematical calculations on should have the type integer)
- Understand the types of relationships between tables: one-to-many, many-to-one, etc.
Source: https://www.lucidchart.com/pages/ER-diagram-symbols-and-meaning

I set my primary keys on the tables that hold the parent records; the foreign key columns live in the tables that hold the child records.
(Source) This can be really tricky to figure out the more tables we have.
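To make the parent/child idea concrete, here is a simplified sketch in SQL of the one-to-many relationship between apps and their reviews. It is not the schema we create later (our create_tables.sql mostly skips explicit constraints); it just illustrates how the keys relate:

/* parent table: each app appears exactly once */
CREATE TABLE apps(
    url text PRIMARY KEY,
    title text
);

/* child table: many reviews can point back to one app */
CREATE TABLE reviews(
    app_url text REFERENCES apps (url),
    rating text,
    body text
);

With the REFERENCES constraint in place, PostgreSQL rejects any review whose app_url does not match an existing app.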
We know that a table can “only have one primary key” and that it’s “good practice to add a primary key to every table”.
“Primary keys do not accept NULL or duplicate values.” We may need to create a primary key that consists of two or more columns, which is known as a composite primary key.
That is explored further here.
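As a hypothetical example from this dataset: in the plan_features table, neither pricing_plan_id nor feature is unique on its own, but the pair plausibly is, so the two columns together could serve as the key. A sketch (the version of this table we actually create below skips the constraint for simplicity):

CREATE TABLE plan_features(
    pricing_plan_id text,
    app_url text,
    feature text,
    PRIMARY KEY (pricing_plan_id, feature) /* composite primary key */
);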
In order to get our data out of the csv files, we’ll start out in the command line of bash.
We can get into our PostgreSQL server from the command line with psql and create a database. Then we can navigate to our database with the \connect command.
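Concretely, the steps look something like this, assuming we name the database shopifydb_1 (the name used later in this post). From bash:

psql

Then, at the psql prompt:

CREATE DATABASE shopifydb_1;
\connect shopifydb_1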
We then create a file in our text editor and save it as a SQL file.
I am choosing to use the text editor from Jupyter Notebook, but we could also use Visual Studio Code.
In this case, I will call my file create_tables.sql. Below is the code I wrote in my text editor for the apps table.
/* Drop the apps table if it exists */
DROP TABLE IF EXISTS apps;

/* Create skeleton table that supplies the table name, column names, and column types */
CREATE TABLE apps(
    url text PRIMARY KEY,
    title text,
    developer text,
    developer_link text,
    icon text,
    rating text,
    reviews_count integer,
    description text,
    description_raw text,
    short_description text,
    short_description_raw text,
    key_benefits_raw text,
    key_benefits text,
    pricing_raw text,
    pricing text,
    pricing_hint text
);

/* Copy data from CSV file to the table */
COPY apps FROM '/Users/sherzyang/flatiron/mod_3/week_7/my_practice/shopify-app-store/apps.csv' WITH (FORMAT csv);

Now I will head back into bash and try to create my first table within my database.
I came up with the error “unquoted carriage return”, which I learned is a problem with the csv file itself, not typically a command from our end gone awry. It was caused by stray carriage-return characters appearing in my data, which likely happened because the file was produced in a Windows program. I used this fix on the file (source: http://bit.ly/2JsArGY) and, nice, it worked!
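I won’t reproduce the linked fix here, but a blunt Python equivalent for normalizing stray carriage returns might look like the sketch below. The file names are placeholders, and it assumes the bare carriage returns are noise rather than meaningful data:

# rewrite Windows-style and bare carriage returns as plain newlines
with open('apps.csv', 'rb') as f:
    raw = f.read()
with open('apps_clean.csv', 'wb') as f:
    f.write(raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n'))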
I was then able to move the other csv files into tables by editing my SQL file and running it in psql with \i create_tables.sql.
For other commands, I used this great tutorial on intermediate PSQL.
I used the following code for my other tables:

CREATE TABLE plan_features(
    pricing_plan_id text,
    app_url text,
    feature text
);

CREATE TABLE categories(
    app_url text,
    category text
);

CREATE TABLE reviews(
    app_url text,
    url text,
    author text,
    body text,
    rating text,
    helpful_count text,
    posted_at text
);

CREATE TABLE pricing_plans(
    id text PRIMARY KEY,
    app_url text,
    title text,
    price text,
    hint text
);

Once I have my data in tables in PostgreSQL, I can move to my Jupyter notebook and connect to my SQLAlchemy engine.
import psycopg2
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# connect to the database and make a cursor
conn = psycopg2.connect("dbname=shopifydb_1")
cur = conn.cursor()

# loop through our queries list (a list of SQL statements defined elsewhere)
# and execute them
for query in queries_list:
    cur.execute(query)
    conn.commit()

# set up a sqlalchemy engine
engine = create_engine("postgresql:///shopifydb_1", echo=True)

Now I can query through my tables.
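As a quick spot check (my own addition, not part of the original walkthrough), we can confirm the engine is wired up by pulling a few rows:

# peek at the apps table to confirm the connection works
pd.read_sql_query("SELECT * FROM apps LIMIT 5;", con=engine)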
What I want to see is reviews and the apps they belong to together in one DataFrame.
What’s interesting about connecting these tables is that we have a one-to-many join between one app and the many reviews for that app. Through this query we can also practice creating aliases, which give us nicknames for the full names of tables, and casting, which converts a column from one type to another.
q = """SELECT r.
rating, r.
body, r.
helpful_count, p_p.
price, a.
rating, a.
reviews_count, a.
key_benefits, a.
pricing, a.
url, a.
titleFROM apps AS aJOIN pricing_plans p_p ON cast(a.
url as text) = cast(p_p.
app_url as text) AND a.
title = p_p.
title JOIN reviews r ON a.
url = r.
app_url"""We can query through a very large database using SQL to isolate the data we need.
If you make a mistake and introduce an error, you can always use the following code to roll back the transaction, as long as you have not committed the previous action.

conn.rollback()

Now we can run the query and read the results as a pandas DataFrame.
We would want to do this to perform statistical tests, transformations, visualizations and more on the slice of data that we are examining.

# read our query as a DataFrame
df_join = pd.read_sql_query(q, con=engine)

(For calculations on larger amounts of data, we will need to use, say, Apache Spark instead of inserting the data into a pandas DataFrame; a rough sketch of that follows.)
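The post stops at pandas, but to give a sense of the Spark route, here is a minimal, hypothetical sketch. It assumes a local Spark installation plus the PostgreSQL JDBC driver, and the jar path is a placeholder:

# a rough sketch: run the same query q through Spark instead of pandas
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shopify")
         .config("spark.jars", "/path/to/postgresql.jar")  # placeholder JDBC driver path
         .getOrCreate())

df_big = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://localhost/shopifydb_1")
          .option("driver", "org.postgresql.Driver")
          .option("query", q)  # the same SQL string we built above
          .load())

df_big.show(5)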
Finally, we can save our query back into a csv file if needed.
# save our selected query back into a CSV file
df_join.to_csv('data.csv')

This was a quick look at how to use psql and its relationship to SQL and Python’s pandas library.
Hope it’s useful, and helps you play I Spy (figuratively) with your data a little more easily.