Creating a data lake for GTFS Realtime data using AWS services: collection, storage and processing

Arash Kaviani

As part of a project, I was supposed to collect GTFS Realtime data for the city of Utopia (it is somewhere in the southern hemisphere) for later analysis.
GTFS Realtime is a “feed specification that allows public transportation companies to provide real-time updates about their fleet to application developers”.
Because of limited time and resources, I needed to implement a reasonably good, fast and cheap solution to collect, store and process the data. Achieving the triplet of “fast”, “cheap” and “good” may remain a dream for many projects, but you can get closer to it by finding the right solutions.
Figure 1: Fast, cheap and easy. Source: http://www.pyragraph.com

That said, I thought of a solution that involves collecting the data in a serverless environment, in which you can quickly develop, test and deploy services without much maintenance or cost.
AWS Lambda could provide me with this environment, together with an AWS S3 data lake where the GTFS Realtime JSON files are saved. In a nutshell, I used a Lambda function to create a GTFS Realtime data lake in an S3 bucket. I also used AWS Glue to extract, transform and load the data. After that, AWS Athena was used to query the data with standard SQL. This process is shown in Figure 2.
Figure 2: GTFS Realtime data collection, storage and processing using AWS services

While developing the Python script for the Lambda function, I found that some of the required Python libraries are not available on AWS Lambda.
Thus, you need to install them locally alongside your Python script. As a more straightforward approach, you may want to develop and test locally and deploy later; without this, developing and testing the Python script for the Lambda function is not a necessarily pleasant experience! A more appropriate solution is therefore to use the Serverless framework, which eases the development, testing and deployment of Lambda functions.
So, let’s say we have a Python script for GTFS data collection similar to the example you can find on the Google Transit API pages. This example needs the gtfs-realtime-bindings library installed; gtfs-realtime-bindings is a Python library for parsing GTFS Realtime feeds. The feed itself arrives as a Protocol Buffers message, which, in this pipeline, is converted to JSON for storage.
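As a rough illustration, here is a minimal sketch of such a collection script. The feed URL is a placeholder, and converting the parsed message to a dictionary via the protobuf json_format helpers is one assumed way of producing JSON:

# A minimal sketch of GTFS Realtime collection (Python 2.7).
import urllib

from google.transit import gtfs_realtime_pb2
from google.protobuf.json_format import MessageToDict

FEED_URL = 'https://example.com/gtfs-realtime/trip-updates'  # placeholder URL

def fetch_feed():
    # Download and parse the Protocol Buffers feed.
    feed = gtfs_realtime_pb2.FeedMessage()
    response = urllib.urlopen(FEED_URL)
    feed.ParseFromString(response.read())
    return feed

def feed_to_dict(feed):
    # Convert the parsed message to a plain dict so that it can be
    # serialised as JSON before being saved to S3.
    return MessageToDict(feed)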
These JSON files, after retrieval, can be saved to your S3 bucket using code similar to the below:

s3 = boto3.resource('s3')
obj = s3.Object(bucket, file_name)
obj.put(Body=json.dumps(data))
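Putting the pieces together, a Lambda handler along these lines could do the collection and the save in one go; the bucket name, the key pattern and the fetch_feed/feed_to_dict helpers (sketched above) are illustrative assumptions:

import datetime
import json

import boto3

BUCKET = 'my-gtfs-datalake'  # hypothetical bucket name

def data_retiever(event, context):
    # fetch_feed() and feed_to_dict() are the helpers sketched earlier.
    feed = fetch_feed()
    data = feed_to_dict(feed)
    # A timestamped key, so that every invocation writes a new JSON file.
    file_name = 'gtfs-rt/%s.json' % datetime.datetime.utcnow().isoformat()
    s3 = boto3.resource('s3')
    obj = s3.Object(BUCKET, file_name)
    obj.put(Body=json.dumps(data))
    return {'saved': file_name}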
Regarding the development of your Python script for the AWS Lambda function using the Serverless framework, you first need to install the required Python libraries locally. In doing so, in the local directory that contains your Python script, you can install this Python package with the following command:

pip install --upgrade gtfs-realtime-bindings -t .

The -t . option tells pip to install the package into the current directory, so the library files sit next to your script and can be bundled into the deployment package.
To be able to deploy your Python script as an AWS Lambda function using the Serverless framework, you first need to ensure that you have installed and configured the AWS CLI on your computer and have the right access permissions. Further, you need to install Serverless on your computer using the following command:

npm install -g serverless

Further details about the installation of Serverless can be found here.
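If the AWS CLI has not been configured yet, the usual way is to run aws configure, which prompts for your credentials and defaults:

aws configure
# Prompts for: AWS Access Key ID, AWS Secret Access Key,
# default region name (e.g. ap-southeast-2) and default output format.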
After that, you start a new Serverless project with the following command:

serverless create --template aws-python --path myService

This command creates a serverless.yml in the myService folder, which should be configured as below:

service: your_service_collector # NOTE: update this with your service name

provider:
  name: aws
  runtime: python2.7

functions:
  service_collector:
    handler: handler.data_retiever
    timeout: 900 # You can change the default timeout (6 seconds)

package:
  artifact: myService/package.zip

Remember that we needed to use Python 2.7, as it is the version of Python supported by the gtfs-realtime-bindings library.
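The package.zip artifact referenced above has to be built by you; a minimal sketch, assuming your handler and the locally installed libraries all sit in the myService folder:

cd myService
zip -r package.zip .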
Also, the Lambda timeout is 6 seconds by default. You can change it in serverless.yml based on how long your script may take to run; otherwise, you get a timeout error.
Following these configurations, you can deploy and test your Lambda function with the following commands:

serverless deploy --region ap-southeast-2
serverless invoke -f service_collector --region ap-southeast-2

Once your Lambda function has been deployed without error, you need to make sure that it runs at the desired time intervals.
In the AWS console, you can go to the AWS Lambda service, find the your_service_collector Lambda function and add a CloudWatch event to run it at a certain desired interval.
Figure 3: CloudWatch Event to run the Lambda function on a regular basis.
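Alternatively, the Serverless framework can create the schedule for you at deploy time; a minimal sketch of the relevant serverless.yml addition, with a hypothetical five-minute interval:

functions:
  service_collector:
    handler: handler.data_retiever
    events:
      - schedule: rate(5 minutes)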
Now that you have created a data pipeline that populates your GTFS Realtime data lake over time, you can employ AWS Glue for conducting ETL (Extract, Transform and Load) tasks. In ETL, your goal is to prepare the data for analysis. AWS Glue can be employed for queries against an Amazon S3 data lake, and it can make newly arrived data available for querying soon after it lands in the S3 data lake.
For this, you need to set up a crawler in AWS Glue to scan your data sets and thereby create a data catalogue for you automatically.
Later, various AWS services such as Amazon Redshift, EMR and Athena can use these Glue Data Catalogues to retrieve the data for processing and analysis. Figure 4 illustrates a use case in which AWS Glue is used to query a data lake in Amazon S3.
Figure 4: Amazon Glue use case for data lakes (Source: AWS)

When creating a crawler, we should make sure that the right classifier has already been created and assigned to the crawler.
A classifier specifies the schema of your data.
The crawler analyses your data and creates tables accordingly in your specified database.
Remember that these tables and databases are not real relational database management systems.
In fact, they are metadata that inform AWS services of the structure of your data.
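For reference, a crawler can also be set up programmatically rather than through the console; a minimal boto3 sketch, in which the crawler name, IAM role, database name and S3 path are all hypothetical:

import boto3

glue = boto3.client('glue')

# Create a crawler that scans the GTFS Realtime JSON files and
# populates a Glue Data Catalogue database with the inferred tables.
glue.create_crawler(
    Name='gtfs_rt_crawler',              # hypothetical crawler name
    Role='AWSGlueServiceRole-gtfs',      # hypothetical IAM role
    DatabaseName='gtfs_rt_db',           # catalogue database to populate
    Targets={'S3Targets': [{'Path': 's3://my-gtfs-datalake/gtfs-rt/'}]},
)
glue.start_crawler(Name='gtfs_rt_crawler')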
One way to access the created databases and tables is via AWS Athena.
AWS Athena is a query service that lets you easily analyse your data using standard SQL.
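For instance, once the crawler has populated the catalogue, a query along these lines could be run in Athena; the database, table and column names here are hypothetical, since they depend on what the crawler infers from your JSON:

SELECT entity_id, trip_id, delay
FROM gtfs_rt_db.trip_updates
WHERE delay > 300
LIMIT 100;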
Figure 5: AWS Athena used for retrieving data from the S3 data lake through the tables (data catalogues) created by AWS Glue

Moreover, you may need to relationalise your data before you use it in Athena.
GTFS Realtime data can be nested and provided in JSON format. You can relationalise it with the AWS Glue Relationalize transform. This can be done by creating a Dev Endpoint in AWS Glue. Using this Dev Endpoint, you can also create a SageMaker notebook in which you can access your AWS Glue components and thereby develop your code for relationalising the nested JSON data. You will find everything you need for this in this blog.
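To give a flavour of what that notebook code might look like, here is a minimal sketch of the Relationalize transform in a Glue (PySpark) context; the database, table and staging path are hypothetical:

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Load the crawled table as a DynamicFrame (names are hypothetical).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database='gtfs_rt_db',
    table_name='trip_updates',
)

# Flatten the nested JSON. Relationalize returns a collection of
# DynamicFrames, one per nested structure, keyed by name.
flattened = Relationalize.apply(
    frame=dyf,
    staging_path='s3://my-gtfs-datalake/tmp/',  # hypothetical staging path
    name='root',
)
print(flattened.keys())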