1.2 Billion Taxi Rides on AWS RDS running PostgreSQL

The entire import and statistics build was run and timed with the following:

$ time ( ./initialize_database.sh; ./import_trip_data.sh; ./import_uber_trip_data.sh; cat analysis/prepare_analysis.sql tlc_statistics/create_statistics_tables.sql | psql trips; cd tlc_statistics; ruby import_statistics_data.rb )

The following were the durations I observed:

The db.t2.large instance took 60 hours, 38 minutes and 42 seconds, costing $13.98 excluding EC2 costs.
The db.r3.large instance took 71 hours, 29 minutes and 27 seconds, costing $25.56 excluding EC2 costs.
The db.m4.large instance took 62 hours, 13 minutes and 47 seconds, costing $17.05 excluding EC2 costs.
The db.m4.xlarge instance took 51 hours, 43 minutes and 56 seconds, costing $24.98 excluding EC2 costs.

R Up & Running

The following was run to install R and the various other dependencies needed for the reports Todd Schneider wrote.

$ echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | sudo tee -a /etc/apt/sources.list
$ gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
$ gpg -a --export E084DAB9 | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install git libgdal-dev libpq-dev libproj-dev r-base r-base-dev
$ mkdir -p $HOME/.R_libs
$ export R_LIBS="$HOME/.R_libs"
$ echo 'requirements = c("ggplot2", "ggmap", "dplyr", "reshape2", "zoo", "scales", "extrafont", "grid", "RPostgreSQL", "rgdal", "maptools", "gpclib")
sapply(requirements, function(x) {
  if (!x %in% installed.packages()[,"Package"]) install.packages(x, repos="http://cran.r-project.org")
})' | R --no-save

Before running the analysis.R file I needed to patch the database connector to use the environment variables I set earlier rather than the hard-coded localhost setup (the environment variables and a quick connection check are sketched at the end of this section).

$ cd ~/nyc-taxi-data/analysis/
$ vi helpers.R

The following line in helpers.R:

con = dbConnect(dbDriver("PostgreSQL"), dbname = "nyc-taxi-data", host = "localhost")

Was replaced with the following:

con = dbConnect(dbDriver("PostgreSQL"), dbname = "trips", host = Sys.getenv('PGHOST'), user = Sys.getenv('PGUSER'), password = Sys.getenv('PGPASSWORD'))

Benchmarking RDS

I ran the analysis.R script three times from the EC2 instance paired with each RDS instance. Each of the results has been rounded to the nearest second.

$ time (cat analysis.R | R --no-save)

The db.t2.large instance reported the following times:

Run 1: 5 minutes 21 seconds
Run 2: 5 minutes 25 seconds
Run 3: 5 minutes 39 seconds

The db.r3.large instance reported the following times:

Run 1: 6 minutes
Run 2: 5 minutes 30 seconds
Run 3: 5 minutes 34 seconds

The db.m4.large instance reported the following times:

Run 1: 5 minutes 43 seconds
Run 2: 5 minutes 34 seconds
Run 3: 5 minutes 33 seconds

The db.m4.xlarge instance reported the following times:

Run 1: 5 minutes 34 seconds
Run 2: 5 minutes 26 seconds
Run 3: 5 minutes 41 seconds

For this workload the reporting speeds don't line up well with the price differences between the RDS instances. I suspect this workload is biased towards R's CPU consumption when generating PNGs rather than RDS performance when returning aggregate results. Each RDS instance was also provisioned with the same number of IOPS, which might erase any other performance advantage they could have over one another. As for the money spent importing the data into RDS, I suspect scaling up is more helpful when you have a number of concurrent users rather than a single, large job to execute.
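For reference, the following is a minimal sketch of the environment variables that both the import scripts and the patched helpers.R rely on. The endpoint, user and password shown are placeholders rather than values from this write-up; substitute your own RDS details.

# Hypothetical values; replace with the real RDS endpoint and credentials.
$ export PGHOST="trips.abcdefgh1234.us-east-1.rds.amazonaws.com"
$ export PGUSER="postgres"
$ export PGPASSWORD="replace_me"

# psql reads the PG* variables from the environment, so a quick
# connectivity check against the RDS instance is simply:
$ psql trips -c "SELECT 1;"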

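Before kicking off the timed analysis.R runs, a quick check along these lines (a sketch, not part of the original write-up) confirms that the patched connector reaches the RDS instance from R. Rather than scanning 1.2 billion rows, it lists the largest user tables by PostgreSQL's estimated live row counts, which is a cheap way to confirm the import landed.

# Mirrors the patched helpers.R: RPostgreSQL (which attaches DBI) plus
# the PG* environment variables exported above.
$ Rscript -e 'library(RPostgreSQL);
              con <- dbConnect(dbDriver("PostgreSQL"), dbname = "trips",
                               host = Sys.getenv("PGHOST"),
                               user = Sys.getenv("PGUSER"),
                               password = Sys.getenv("PGPASSWORD"));
              print(dbGetQuery(con, "SELECT relname, n_live_tup
                                     FROM pg_stat_user_tables
                                     ORDER BY n_live_tup DESC LIMIT 5"));
              dbDisconnect(con)'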