1.1 Billion Taxi Rides with MapD 3.0 & 2 GPU-Powered p2.8xlarge EC2 Instances

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:17.0     Off |                    0 |
| N/A   62C    P0    70W / 149W |   4163MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:00:18.0     Off |                    0 |
| N/A   54C    P0    82W / 149W |   4163MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:00:19.0     Off |                    0 |
| N/A   62C    P0    69W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:00:1A.0     Off |                    0 |
| N/A   51C    P0    76W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:00:1B.0     Off |                    0 |
| N/A   63C    P0    65W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:00:1C.0     Off |                    0 |
| N/A   54C    P0    83W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 0000:00:1D.0     Off |                    0 |
| N/A   63C    P0    71W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   55C    P0    88W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The first instance launched in this EC2 cluster will be both a MapD leaf and aggregator node. The second instance will be both a MapD leaf and string dictionary node. Both machines belong to a security group that allows them to communicate with one another on TCP port 19091 for the aggregator server, TCP port 9091 for leaf communication and TCP port 10301 for the string dictionary server. A sketch of how these rules could be created with the AWS CLI appears at the end of this section.

I'll be using the AMI image ami-4836a428 / amzn-ami-hvm-2017.03.0.20170417-x86_64-gp2 for both machines. Each instance has its own 1.1 TB EBS volume.

$ df -H

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        258G   62k  258G   1% /dev
tmpfs           258G     0  258G   0% /dev/shm
/dev/xvda1      1.1T  705G  352G  67% /

Downloading 1.1 Billion Taxi Journeys

On each EC2 instance I'll set the AWS CLI tool to use 100 concurrent requests so I can better saturate the network connection when downloading the taxi trips dataset off of S3.

$ aws configure set default.s3.max_concurrent_requests 100

I'll then download the 104 GB of CSV data I created in my Billion Taxi Rides in Redshift blog post onto each instance. This data sits across 56 GZIP files and decompresses into around 500 GB of raw CSV data. A quick integrity check and a parallel take on the decompression step are sketched at the end of this section.

$ mkdir ~/csvData
$ cd ~/csvData/
$ aws s3 sync s3://<s3_bucket>/csv/ ./
$ gunzip trips_x*.csv.gz

MapD 3.0 Up & Running

Everything below, unless otherwise noted, was run on both EC2 instances.

I'm going to install Nvidia's 375.51 driver along with two of its requirements: GCC and the kernel development package, which provides the headers needed to compile kernel modules.

$ sudo yum install -y gcc kernel-devel-`uname -r`
$ curl -O http://us.download.nvidia.com/XFree86/Linux-x86_64/375.51/NVIDIA-Linux-x86_64-375.51.run
$ sudo /bin/bash ./NVIDIA-Linux-x86_64-375.51.run

As per Amazon's recommendations, I'm going to switch the Nvidia driver into persistent mode, turn off auto-boost and set the GPUs' compute clock speed to 2,505 MHz and their memory clock speed to 875 MHz.

$ sudo nvidia-smi -pm 1
$ sudo nvidia-smi --auto-boost-default=0
$ sudo nvidia-smi -ac 2505,875

MapD is commercial software so I cannot disclose the full URL I've downloaded the self-extracting archive from.
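For reference, the security group rules described earlier (TCP 19091 for the aggregator, 9091 for leaf communication and 10301 for the string dictionary server) could be created with the AWS CLI along the following lines. This is only a sketch: the group ID sg-0123456789abcdef0 is a placeholder, and it assumes both instances share that single group, so the group references itself as the traffic source.

$ for port in 19091 9091 10301; do
      aws ec2 authorize-security-group-ingress \
          --group-id sg-0123456789abcdef0 \
          --protocol tcp \
          --port "$port" \
          --source-group sg-0123456789abcdef0
  done

Restricting the source to the group itself keeps these ports closed to the outside world while still letting the two nodes talk to one another.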
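Since the S3 sync is expected to bring down 56 GZIP files totalling roughly 104 GB, a quick count and size check is a cheap way to spot an incomplete download before spending time decompressing. The parallel gunzip below is an optional variation on the decompression step above; it assumes xargs is available (it ships with Amazon Linux) and that there are idle vCPUs to spare.

$ cd ~/csvData/
$ ls trips_x*.csv.gz | wc -l    # expect 56 files
$ du -sh .                      # roughly 104 GB of compressed CSVs
$ ls trips_x*.csv.gz | xargs -n 1 -P 16 gunzip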
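One caveat with the GPU tuning above: persistence mode, the auto-boost setting and the application clocks are all reset when the instance reboots. A minimal way to re-apply them at boot, assuming the stock /etc/rc.local on this Amazon Linux AMI is still executed during start-up, is to append the same three commands to it.

$ sudo tee -a /etc/rc.local <<'EOF'
# Re-apply the GPU settings above after every reboot.
nvidia-smi -pm 1
nvidia-smi --auto-boost-default=0
nvidia-smi -ac 2505,875
EOF

Alternatively, the nvidia-persistenced daemon can manage persistence mode as a service, but the rc.local approach keeps things simple for a benchmark box.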
