Using Docker and PySpark

Spark is already installed in the container.

You are all ready to open up a notebook and start writing some Spark code.

I will include a copy of the notebook, but I would recommend entering the code from this article into a new Jupyter notebook on your local computer.

This helps you to learn.

To stop the Docker container and Jupyter notebook server, simply press Ctrl+C in the terminal that is running it.

PySpark Basics

Spark is an open-source cluster computing framework written mostly in Scala, with APIs in R, Python, Scala, and Java.

It is designed primarily for large-scale data analysis and machine learning on data that cannot fit into local memory.

In this brief tutorial, I will not use a dataset that is too big to fit into memory.

This tutorial borrows from the official getting started guide: https://spark.apache.org/docs/latest/sql-getting-started.html.

Spark Datatypes:

There are two main datatypes in the Spark ecosystem: Resilient Distributed Datasets, or RDDs (which are something like a cross between a Python list and a dictionary), and dataframes (much like dataframes in R and Python).

Both datatypes in Spark are partitioned and immutable (which means you cannot change the object; a new one is returned instead).

In this tutorial I am going to focus on the dataframe datatype.
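To make immutability concrete, here is a minimal sketch of mine (not from the original article); it assumes a dataframe df like the one loaded later in this tutorial. Transformations such as withColumn return a new dataframe rather than modifying the original:

from pyspark.sql import functions as F

# withColumn returns a brand new dataframe; df itself is left unchanged
df_upper = df.withColumn('Department', F.upper(df['Department']))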

The Dataset:

The dataset that I will be using is a somewhat large Vermont vendor payments dataset from the Vermont Open Data Socrata portal.

It can be downloaded easily by following the link.

Setting up a Spark session:

This code snippet starts up the PySpark environment in the Docker container and imports basic libraries for numerical computing.

# import necessary libraries
import pandas as pd
import numpy
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

# create sparksession
spark = SparkSession \
    .builder \
    .appName("Pysparkexample") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Reading in a CSV:

I wanted to start by comparing reading in a CSV with pandas vs Spark.

Spark ends up reading in the CSV much faster than pandas.

This demonstrates how Spark dataframes can be much faster to work with than their pandas equivalents. (Note that each %%timeit snippet below needs to run in its own notebook cell.)

%%timeit
# time reading in data with spark
df = spark.read.csv('Vermont_Vendor_Payments (1).csv', header='true')

%%timeit
# time reading in data with pandas
df_pandas = pd.read_csv('Vermont_Vendor_Payments (1).csv', low_memory=False)
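As an aside of mine (not in the original article), part of the reason spark.read.csv returns so quickly is that Spark evaluates lazily: the file is not fully scanned until an action such as .count() or .show() is called. A small sketch:

# lazy evaluation: the read itself returns almost immediately
df = spark.read.csv('Vermont_Vendor_Payments (1).csv', header='true')
df.count()  # this action forces Spark to actually scan the file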

Basic Spark Methods:

Like with pandas, we access column names with the .columns attribute of the dataframe.

# we can use the columns attribute just like with pandas
columns = df.columns
print('The column Names are:')
for i in columns:
    print(i)

We can get the number of rows using the .count() method, and we can get the number of columns by taking the length of the column names.

print('The total number of rows is:', df.count(), '\nThe total number of columns is:', len(df.columns))
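If you want a pandas-style shape, a small helper of my own (not from the original article) wraps those same two calls:

# pandas-like (rows, columns) tuple for a Spark dataframe
def spark_shape(sdf):
    return (sdf.count(), len(sdf.columns))

print(spark_shape(df))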

The .show() method prints the first 20 rows of the dataframe by default.

I chose to only print 5 in this article.

# show first 5 rows
df.show(5)

The .head() method can also be used to display the first row.

This prints much more nicely in the notebook.

# show first row
df.head()
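Note (my addition, not from the original article): .head() returns a Row object rather than printing a table, and passing it a number returns a list of Rows. Fields can be accessed by name, for example:

# .head() returns a Row; fields are accessed by column name
first_row = df.head()
print(first_row['Department'])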

Like in pandas, we can call the describe method to get basic numerical summaries of the data. We need to use the .show() method to print it to the notebook. This does not print very nicely in the notebook.

df.describe().show()
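If the raw output is hard to read, one workaround (my aside, not from the original article) is to convert the summary to a pandas dataframe, which Jupyter renders as an HTML table:

# nicer rendering of the summary in a notebook
df.describe().toPandas()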

Querying the data:

One of the strengths of Spark is that it can be queried with each language’s respective Spark library or with Spark SQL.

I will demonstrate a few queries using both the pythonic and SQL options.

The following code registers a temporary table and selects a few columns using SQL syntax:

# I will start by creating a temporary table to query with SQL
df.createOrReplaceTempView('VermontVendor')
spark.sql('''
SELECT `Quarter Ending`, Department, Amount, State
FROM VermontVendor
LIMIT 10
''').show()

This code performs pretty much the same operation using pythonic syntax:

df.select('Quarter Ending', 'Department', 'Amount', 'State').show(10)

One thing to note is that the pythonic solution is significantly less code.

I like SQL and its syntax, so I prefer the SQL interface over the pythonic one.

I can filter the rows returned by my query using the SQL WHERE clause:

spark.sql('''
SELECT `Quarter Ending`, Department, Amount, State
FROM VermontVendor
WHERE Department = 'Education'
LIMIT 10
''').show()

A similar result can be achieved with the .filter() method in the Python API.

df.select('Quarter Ending', 'Department', 'Amount', 'State') \
  .filter(df['Department'] == 'Education') \
  .show(10)
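For completeness, the grouped aggregation that the plotting section below writes in SQL can also be expressed with the Python API. This is a sketch of mine rather than code from the original article:

# pythonic equivalent of SELECT Department, SUM(Amount) as Total ... GROUP BY Department
from pyspark.sql import functions as F

df.groupBy('Department') \
  .agg(F.sum('Amount').alias('Total')) \
  .orderBy(F.desc('Total')) \
  .show(10)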

Plotting

Unfortunately, one cannot directly create plots with a Spark dataframe.

The simplest solution is to use the .toPandas() method to convert the result of Spark computations to a pandas dataframe. I give a couple of examples below.

plot_df = spark.sql('''
SELECT Department, SUM(Amount) as Total
FROM VermontVendor
GROUP BY Department
ORDER BY Total DESC
LIMIT 10
''').toPandas()

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
plot_df.plot(x='Department', y='Total', kind='barh', color='C0', ax=ax, legend=False)
ax.set_xlabel('Department', size=16)
ax.set_ylabel('Total', size=16)
plt.savefig('barplot.png')
plt.show()

import numpy as np
import seaborn as sns

plot_df2 = spark.sql('''
SELECT Department, SUM(Amount) as Total
FROM VermontVendor
GROUP BY Department
''').toPandas()

plt.figure(figsize=(10, 6))
sns.distplot(np.log(plot_df2['Total']))
plt.title('Histogram of Log Totals for all Departments in Dataset', size=16)
plt.ylabel('Density', size=16)
plt.xlabel('Log Total', size=16)
plt.savefig('distplot.png')
plt.show()

Starting up your Docker container again:

Once you have started and exited your Docker container for the first time, you will start it differently for future uses, since the container has already been created.

Run the following command to list all containers:

docker ps -a

Get the container ID from the terminal output, then run docker start with the container ID to start the container:

docker start 903f152e92c5

Your Jupyter notebook server will then again be running on http://localhost:8888.

The full code with a few more examples can be found on my GitHub: https://github.com/crocker456/PlayingWithPyspark

Sources:

PySpark 2.0: The size or shape of a DataFrame (stackoverflow.com)

Getting Started – Spark 2.4.0 Documentation (spark.apache.org)

Learn Python – Best Python Tutorials (2019) | gitconnected (gitconnected.com)
