Data Engineering with Python, Django, and PostgreSQL

Data Engineering with Python, Django, and PostgreSQLSammy LeeBlockedUnblockFollowFollowingFeb 28Today’s post will deal with what may be one of the hardest aspects of data science which doesn’t involve analysis, but simply trying to make the backend of data science work.

By backend I mean the database systems most data scientists will be working with on the job.

I will go over the following:Build an absolute barebones Django app with a Relational Database Management System (RDBMS)Illustrate the use of a PostgresSQL database attached to the Django appHow to move data in and out between different formats and platformsWhile following this article doesn’t require any knowledge of Django, I think it’s important to appreciate the fact that a lot of data collection occurs through web apps.

For data scientists who are unfamiliar with Django, think of it as a framework for building web applications while adhering to the philosophy of “inversion of control”.

This means Django takes care of the skeleton of the web app, and you’re responsible for fleshing out the actual content on top of the skeleton.

For readers who don’t like Django you can skip to the section titled: “The Payoff: Django’s Object Relational Mapper” towards the end of this post.

Our Django App: “DoubleBagger”The app that I’m interested in creating is going to be called “DoubleBagger”, an investment blog where people self-publish their buy/sell opinions on public companies like Apple (ticker: AAPL) or Microsoft (ticker: MSFT).

And instead of firing up a Jupyter Notebook like my previous articles this time we’ll mainly be working with the command line + a text editor like Sublime Text.

And because this is aimed at data scientists, we’ll be using a conda environment:# I like to do things on my desktop# From the terminal:$ cd desktop && mkdir doublebagger && cd doublebagger$ conda create -n doublebagger$ conda activate doublebagger# You should now have the (doublebagger) conda environment activatedAnd now we install our two main packages: Django and psycopg2 for connecting to a PostgreSQL database.

Django already ships with SQLite which may actually be suitable for many organizations and for hobbyists, but we’re going to use Postgres instead.

Furthermore, we’ll be using an older version of Django (current version is Django 2.

1).

$ (doublebagger) conda install Django==1.

9.

6 psycopg2After verifying you have these packages along with their depencies, create a source directory where we put our entire source code having to do with “Doublebagger.

”$ (doublebagger) mkdir src && cd srcWe start every Django project in pretty much the same way with the same command:# Inside of src:# don't forget the space and period at the end$ (doublebagger) django-admin startproject doublebagger_blog .

The django-admin startproject command is what creates the skeleton or framework for our project and now if you check out what it’s inside of the src folder you should see:doublebagger_blog: contains the project configurations our project including the settings.

py file.

manage.

py: utility functionsNow we can open up our DoubleBagger project inside of Sublime Text or any other editor of your choice.

You should see the exact same directory structure:Assuming you have a postgres database already installed on your machine, we actually need to create a postgres database for our django app:# from the command line:$ psql -d postgrespostgres=# CREATE DATABASE doublebagger;# That's it!# quit by:postgres=# q*If you don’t have postgreSQL you can follow these instructions.

Then inside of settings.

py (using Sublime Text), we change the default configuration to account for the database we just created.

Change this:# settings.

pyDATABASES = { 'default': { 'ENGINE': 'django.

db.

backends.

sqlite3', 'NAME': os.

path.

join(BASE_DIR, 'db.

sqlite3'), }}To this:# Your 'User' is probably different depending on how you set up # postgres.

In many cases, it's just 'postgres'.

# Also depends if you set up a password with you postgres.

DATABASES = { 'default': { 'ENGINE': 'django.

db.

backends.

postgresql_psycopg2', 'NAME': 'doublebagger', 'USER': 'WhoeverOwnsTheDatabase', 'PASSWORD': '', 'HOST': '127.

0.

0.

1', 'PORT': '5432', }}*Make sure to save your changes within the text-editorNow if you go back to the command line, we can connect the app with the postgres database like so:# Still inside of your src where manage.

py lives:$ (doublebagger) python manage.

py migrateIf everything went okay, you should see something like this:Now from the same command line:$ (doublebagger) python manage.

py runserverAnd point your browser to:127.

0.

0.

1:8000You should see something like this:image sponsored by fakedoors.

comThis is essentially the homepage of your hypothetical web app displayed from Django’s local development sever which is meant to emulate a real web server.

You can quit the development server with <control-C>.

This is pretty much it for the app — not even bare bones, not even hello world.

So why did we go through the trouble of doing this?Because Django Models provides its users an object relational mapper (ORM) which allows us to manipulate model objects using python while connected to a postgresSQL database.

It’s just one more layer of sophistication and awareness of the absolute zoo that makes up the pythonic data science ecosystem.

At this point we need to think about the organization of our data.

Our web app will primarily have two main components or Model Classes:Post: Blog post about whether or not to invest in a particular companyCompany: Information about the companies the blog posts mentionPost will contain information about:title of the postslug: a unique identifier for the blog post based on title)text: the actual blog post textpub_date: when the post was publishedCompany will contain information about:name of the companyslug: a unique identifier for the companydescription: what the company doesPE Ratio: An indication of company’s valuation relative to market priceBefore we translate this to Django Models code, we need to create a Django app first.

# We're going to call our app 'post'# Inside of src from terminal:$ (doublebagger) python manage.

py startapp postYou should now see the post app show up from your text editor:After that we have to append our ‘post’ app under INSTALLED_APPS in settings.

py:# settings.

pyINSTALLED_APPS = [ 'django.

contrib.

admin', 'django.

contrib.

auth', 'django.

contrib.

contenttypes', 'django.

contrib.

sessions', 'django.

contrib.

messages', 'django.

contrib.

staticfiles', 'post', ]# Don't forget to save changesNow we can code up the organization for our data by going to post/models.

py:# post/models.

pyfrom __future__ import unicode_literalsfrom django.

db import models# Create your models here.

class Post(models.

Model): title = models.

CharField(max_length=100) slug = models.

SlugField() text = modelsle.

TextField() pub_date = models.

DateField()class Company(models.

Model): name = models.

CharField(max_length=100) slug = models.

SlugField() description = models.

TextField() pe_ratio = models.

DecimalField(max_digits=7, decimal_places=2)One of the most important aspects of the Models that we need to consider at this point is how these two things are related to each other.

If we look at what we did above, from the point of SQL, Post is a TABLE and Company is its own TABLE, with the fields underneath like title, slug, and pub_date representing columns of the two tables.

With SQL we need to consider if the relationship between our two models is a:One-to-Many or Many-to-OneMany-to-ManyIf you consider it, what we have here is a Many-to-One relationship between Post and Company:“One blog post can only be an investment thesis about one company, but one company can have many blog posts written about it.

”As a result our Post model will have a Foreign Key to Company.

The beauty of Django is that it takes care of all the hard work for us that would normally go with creating database schemas from scratch — no need to explicitly create primary keys, no need for indexes or junction tables for many-to-many relationships.

We can add our Foreign Key like so:# post/models.

pyclass Post(models.

Model): title = models.

CharField(max_length=100) slug = models.

SlugField() text = modelsle.

TextField() pub_date = models.

DateField() company = models.

ForeignKey('Company') # HereAdding some extra parameters and string methods that allow our Model objects to be referenced by string names, here is our final updated post/models.

py:# post/models.

pyfrom __future__ import unicode_literalsfrom django.

db import models# Create your models here.

class Post(models.

Model): title = models.

CharField(max_length=100) slug = models.

SlugField(max_length=50, unique_for_month='pub_date') text = models.

TextField() pub_date = models.

DateField('date_published', auto_now_add=True) company = models.

ForeignKey('Company')def __str__(self): return "{} on {}".

format( self.

title, self.

pub_date.

strftime('%Y-%m-%m'))class Meta: verbose_name = 'investment thesis' ordering = ['-pub_date', 'title'] get_latest_by = 'pub_date'class Company(models.

Model): name = models.

CharField(max_length=100, db_index=True) slug = models.

SlugField(max_length=50, unique=True) description = models.

TextField() pe_ratio = models.

DecimalField(max_digits=7, decimal_places=2)def __str__(self): return self.

nameclass Meta: ordering = ['name']# Don't forget to save your changesNow comes the magical part where Django creates a postgreSQL database for us.

Back on the command line:$ (doublebagger) python manage.

py makemigrations# Then $ (doublebagger) python manage.

py migrate# That's it!You should see something like this:The Payoff: Django’s Object Relational Mapper (ORM)Now we can interact with our PostgreSQL database as data scientists by using nothing but Python!From the command line:$ (doublebagger) python manage.

py shellYou should see something that reminds you of the Python interpreter, except this one allows you to play with Django’s database:>>> from datetime import date>>> from post.

models import Post, CompanyLet’s create a Company object:>>> Company.

objects.

create(name='Apple', slug='Apple-incorporated-nasdaq', description='One of the greatest companies in the world created by the amazing Steve Jobs.

', pe_ratio=14.

43)You should see something like this after:<Company: Apple>No need to:INSERT INTO company(name, slug, description, pe_ratio)VALUES('Apple', 'Apple-incorporated-nasdaq', description='.

', pe_ration=14.

43);Some more companies:>>> Company.

objects.

create(name='Amazon', slug='amzn-nasdaq', description='AWS baby!', pe_ratio=81.

48)>>> Company.

objects.

create(name='Microsoft', slug='msft-nasdaq', description='I love GitHub!', pe_ratio=26.

02)>>> Company.

objects.

create(name='MongoDB', slug='mdb-nasdaq', description='JSON in the cloud', pe_ratio=100)>>> Company.

objects.

all()# Output:[<Company: Amazon>, <Company: Apple>, <Company: Microsoft>, <Company: MongoDB>]We can also do some cool queries:>>> Company.

objects.

get(slug__contains='mdb')# Output:<Company: MongoDB>Who likes tuples:>>> Company.

objects.

values_list()# Output:[(3, u'Amazon', u'amzn-nasdaq', u'AWS baby!', Decimal('81.

48')), (1, u'Apple', u'Apple-incorporated-nasdaq', u'One of the greatest companies in the world created by the amazing Steve Jobs, not so currently under Tim Cook.

', Decimal('14.

43')), (2, u'Microsoft', u'msft-nasdaq', u'I love GitHub!', Decimal('26.

02')), (4, u'MongoDB', u'mdb-nasdaq', u'JSON in the cloud', Decimal('100.

00'))]We can also create a Post object:>>> Post.

objects.

create(title='I heart Apple', slug='i-heart-apple', text='I heart Apple', company_id=1)>>> Post.

objects.

create(title='Buy more Microsoft', slug='buy-more-microsoft', text="Nadella leads the way", company_id=2)>>> Post.

objects.

create(title='Buy more Amazon', slug='buy-more-amazon', text="Jeff Bezos acquires Mars from Elon Musk", company_id=3)>>> Post.

objects.

create(title='Time to sell MongoDB', slug='time-to-sell-mongdb', text="MongoDB seems a little overvalued", company_id=4)We can also verify that these objects actually exist in our postgreSQL database:$ psql -d postgrespostgres=# c doublebaggerdoublebagger=# dtdoublebagger=# SELECT * FROM post_company;You should see something like this:And an SQL join:doublebagger=# SELECT * FROM post_companydoublebagger=# JOIN post_post ONdoublebagger=# post_company.

id = post_post.

company_id;From postgreSQL to PandasUsing sqlalchemy we can directly get access to postgreSQL objects straight from pandas:If you fire up a Jupyter Notebook:import pandas as pdfrom sqlalchemy import create_engineengine = create_engine('postgresql://sammylee@localhost:5432/doublebagger')posts = pd.

read_sql('select * from post_post', engine)companies = pd.

read_sql('select * from post_company', engine)posts_companies = pd.

merge(companies, posts, left_on='id', right_on='company_id')posts_companiesAnd now we can use the full might and magic of pandas to do whatever we want as data scientists.

While knowledge of a web development framework isn’t necessary for data scientists, I think it’s really cool to take a step back and have a look at the Amazonian Forest that is data science by going from Django to postgreSQL to pandas.

One More ThingDjango has a “batteries included” philosophy, and provides an administration interface that basically serves as a GUI to your postgreSQL database.

All we have to do is add some lines to post/admin.

py# post/admin.

pyfrom django.

contrib import adminfrom django.

db import models# Register your models here.

from .

models import Post, Companyclass MyCompanyAdmin(admin.

ModelAdmin): model = Company list_display = ('name', 'slug', 'description', 'pe_ratio',)admin.

site.

register(Company, MyCompanyAdmin)And create a superuser on the command line:$ (doublebagger) python manage.

py createsuperuser# Then create your credentials# start up your local development server and head to your local host# and append "/admin" to itCode for DoubleBagger lives here.

.. More details

Leave a Reply