A Billion Taxi Rides in Elasticsearch

$ sudo /etc/init.d/elasticsearch restart

Importing a Billion Trips into Elasticsearch

The machine used in this blog post has two physical disk drives. The first is an SSD used for the operating system, applications and the data indexed by Elasticsearch. The second is a mechanical drive that holds the denormalised CSV data created in the Billion Taxi Rides in Redshift blog post. This second drive is mounted at /one_tb_drive on my machine.

I'll be using Logstash to import the data into Elasticsearch. The CSV data is 104 GB when gzip-compressed. Unfortunately, at the time of this writing, Logstash doesn't support reading CSV data from gzip files. To get around this I'll need to decompress all the gzip files and store them in their raw form. This raises the disk space requirement from 104 GB for the gzip data to ~500 GB for the raw CSV data.

$ cd /one_tb_drive/taxi-trips/
$ gunzip *.gz

The following will create the configuration file for Logstash.

$ vi ~/trips.conf

input {
    file {
        path => "/one_tb_drive/taxi-trips/*.csv"
        type => "trip"
        start_position => "beginning"
    }
}

filter {
    csv {
        columns => ["trip_id", "vendor_id", "pickup_datetime", "dropoff_datetime",
                    "store_and_fwd_flag", "rate_code_id", "pickup_longitude",
                    "pickup_latitude", "dropoff_longitude", "dropoff_latitude",
                    "passenger_count", "trip_distance", "fare_amount", "extra",
                    "mta_tax", "tip_amount", "tolls_amount", "ehail_fee",
                    "improvement_surcharge", "total_amount", "payment_type",
                    "trip_type", "pickup", "dropoff", "cab_type", "precipitation",
                    "snow_depth", "snowfall", "max_temperature", "min_temperature",
                    "average_wind_speed", "pickup_nyct2010_gid", "pickup_ctlabel",
                    "pickup_borocode", "pickup_boroname", "pickup_ct2010",
                    "pickup_boroct2010", "pickup_cdeligibil", "pickup_ntacode",
                    "pickup_ntaname", "pickup_puma", "dropoff_nyct2010_gid",
                    "dropoff_ctlabel", "dropoff_borocode", "dropoff_boroname",
                    "dropoff_ct2010", "dropoff_boroct2010", "dropoff_cdeligibil",
                    "dropoff_ntacode", "dropoff_ntaname", "dropoff_puma"]
        separator => ","
    }

    mutate {
        remove_field => ["average_wind_speed", "dropoff", "dropoff_borocode",
                         "dropoff_boroct2010", "dropoff_boroname",
                         "dropoff_cdeligibil", "dropoff_ct2010", "dropoff_ctlabel",
                         "dropoff_datetime", "dropoff_latitude", "dropoff_longitude",
                         "dropoff_ntacode", "dropoff_ntaname", "dropoff_nyct2010_gid",
                         "dropoff_puma", "ehail_fee", "extra", "fare_amount", "host",
                         "improvement_surcharge", "max_temperature", "message",
                         "min_temperature", "mta_tax", "path", "payment_type",
                         "pickup", "pickup_borocode", "pickup_boroct2010",
                         "pickup_boroname", "pickup_cdeligibil", "pickup_ct2010",
                         "pickup_ctlabel", "pickup_latitude", "pickup_longitude",
                         "pickup_ntacode", "pickup_ntaname", "pickup_nyct2010_gid",
                         "pickup_puma", "precipitation", "rate_code_id", "snow_depth",
                         "snowfall", "store_and_fwd_flag", "tip_amount",
                         "tolls_amount", "trip_id", "trip_type", "type", "vendor_id"]
    }

    date {
        match => ["pickup_datetime", "YYYY-MM-dd HH:mm:ss"]
        timezone => "America/New_York"
        target => "pickup_datetime"
    }

    mutate {
        convert => {
            "cab_type"        => "string"
            "passenger_count" => "integer"
            "total_amount"    => "float"
            "trip_distance"   => "float"
        }
    }
}

output {
    elasticsearch {
        action => "index"
        hosts  => "localhost:9200"
        index  => "trips"
    }
}

The following will launch the Logstash process in a screen. It will read the configuration file passed to it and begin the import process.
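A minimal sketch of that launch, assuming Logstash was unpacked to /opt/logstash; the binary's location will differ for package-based installs, so adjust the path to match your setup.

$ screen
$ /opt/logstash/bin/logstash -f ~/trips.conf  # path assumes a tarball install under /opt/logstash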

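While the import is running, progress can be checked by asking Elasticsearch for the document count in the trips index, the index named in the output block above. Both calls below use standard Elasticsearch APIs.

$ curl -s 'localhost:9200/trips/_count?pretty'     # number of documents indexed so far
$ curl -s 'localhost:9200/_cat/indices/trips?v'    # document count plus on-disk size of the index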