Using Bash for Data Pipelines

In this tutorial, I will show you how to create a data pipeline using bash and Unix commands, which you can then use in your work as a Data Scientist.

Prerequisites:

- Familiarity with the command line (how to make a directory, change directories, and create new files)
- A Linux or Apple computer
- An internet connection

Goals: create a bash script that does the following:

- Download data from an online source
- Count the number of rows in the data
- Record the names of the columns in the data
- Iterate through multiple files and report all of this information

Part 1: Downloading the data files

For the first part of this tutorial, let's say that we are working with the San Francisco Speed Limit data and that we want to create our entire pipeline through the command line. To do this, we will use the text editor called nano. To open our file in nano, execute the following command in the command line:

nano download_data.sh

Once you have this open, create your first bash script by pasting in the following code. Any line that leads with a # is a comment, except for #!/bin/bash, which is called a shebang and is required at the top of every bash script:

#!/bin/bash
# The command below downloads the data
curl -o speed.csv "https://data.sfgov.org/api/views/wytw-dqq4/rows.csv?accessType=DOWNLOAD"

To save: control + o, enter, then control + x.

Now that we have saved our file, let's explore the command we are using to download our data:

curl -o speed.csv "https://data.sfgov.org/api/views/wytw-dqq4/rows.csv?accessType=DOWNLOAD"

- curl is the download command
- -o is the flag for the output file
- speed.csv is the name of the output file
- https://data.sfgov.org/api/views/wytw-dqq4/rows.csv?accessType=DOWNLOAD is the URL of the data

Now, to download the data, we just need to run our bash script with the following command in the terminal:

bash download_data.sh

If you now look in that folder, you will see that you have downloaded the file speed.csv! Please ask if you have any questions about this, and you can always download a different file by replacing the URL.

Part 2: Processing the data

Putting all of this together (and a little more that I added so that the date auto-populates into the text file), we get the following script:

#!/bin/bash
echo "Data Processed by Elliott Saslow"
DATE=$(date +%Y-%m-%d)
echo "Date is: $DATE"
echo ""
echo "The name of the file is: $1"
echo ""
lines=$(wc -l < "$1")
echo "The file has $lines lines"
echo ""
colnames=$(head -n 1 "$1")
echo "Column names are:"
echo "$colnames"

Now, to run the script, we save it as process_data.sh with nano (just as we did above) and run it in the command line with the following command:

bash process_data.sh speed.csv > text.txt

This command does the following:

- Calls the script
- Passes in the file that we are looking at (speed.csv)
- Redirects the output to a text file called text.txt

If you run this and have done everything correctly, you will have a text file in your folder containing the beginning of the quality-control checks that you can use for your data pipeline!

Let me know in the comments where you get stuck and how I can help!

Cheers
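P.S. The goals above include iterating through multiple files, while process_data.sh handles a single file passed as $1. As a minimal sketch of that last step (assuming your data files end in .csv and sit in the current directory; the file pattern is my assumption, not part of the tutorial's data), you could loop over them like this:

```shell
#!/bin/bash
# Sketch: run the same row/column checks over every CSV in the current directory.
# The "*.csv" glob is an assumption about how the files are named.
for f in *.csv; do
    [ -e "$f" ] || continue            # skip cleanly if no .csv files match
    echo "The name of the file is: $f"
    lines=$(wc -l < "$f")
    echo "The file has $lines lines"
    echo "Column names are: $(head -n 1 "$f")"
    echo ""
done
```

Saved as, say, process_all.sh (a name I made up), you could run bash process_all.sh > text.txt to collect the report for every file in one place, just as we redirected the single-file output above.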
