All 1.1 Billion Taxi Rides in Elasticsearch

$ vi ~/trips.conf

input {
    file {
        path => "/one_tb_drive/taxi-trips/*.csv"
        type => "trip"
        start_position => "beginning"
    }
}

filter {
    csv {
        columns => ["trip_id", "vendor_id", "pickup_datetime", "dropoff_datetime",
                    "store_and_fwd_flag", "rate_code_id", "pickup_longitude",
                    "pickup_latitude", "dropoff_longitude", "dropoff_latitude",
                    "passenger_count", "trip_distance", "fare_amount", "extra",
                    "mta_tax", "tip_amount", "tolls_amount", "ehail_fee",
                    "improvement_surcharge", "total_amount", "payment_type",
                    "trip_type", "pickup", "dropoff", "cab_type", "precipitation",
                    "snow_depth", "snowfall", "max_temperature", "min_temperature",
                    "average_wind_speed", "pickup_nyct2010_gid", "pickup_ctlabel",
                    "pickup_borocode", "pickup_boroname", "pickup_ct2010",
                    "pickup_boroct2010", "pickup_cdeligibil", "pickup_ntacode",
                    "pickup_ntaname", "pickup_puma", "dropoff_nyct2010_gid",
                    "dropoff_ctlabel", "dropoff_borocode", "dropoff_boroname",
                    "dropoff_ct2010", "dropoff_boroct2010", "dropoff_cdeligibil",
                    "dropoff_ntacode", "dropoff_ntaname", "dropoff_puma"]
        separator => ","
    }
    date {
        match => ["pickup_datetime", "YYYY-MM-dd HH:mm:ss"]
        timezone => "America/New_York"
        target => "pickup_datetime"
    }
    date {
        match => ["dropoff_datetime", "YYYY-MM-dd HH:mm:ss"]
        timezone => "America/New_York"
        target => "dropoff_datetime"
    }
    mutate {
        convert => {
            "trip_id"               => "integer"
            "vendor_id"             => "string"
            "store_and_fwd_flag"    => "string"
            "rate_code_id"          => "integer"
            "pickup_longitude"      => "float"
            "pickup_latitude"       => "float"
            "dropoff_longitude"     => "float"
            "dropoff_latitude"      => "float"
            "passenger_count"       => "integer"
            "trip_distance"         => "float"
            "fare_amount"           => "float"
            "extra"                 => "float"
            "mta_tax"               => "float"
            "tip_amount"            => "float"
            "tolls_amount"          => "float"
            "ehail_fee"             => "float"
            "improvement_surcharge" => "float"
            "total_amount"          => "float"
            "payment_type"          => "string"
            "trip_type"             => "integer"
            "pickup"                => "string"
            "dropoff"               => "string"
            "cab_type"              => "string"
            "precipitation"         => "integer"
            "snow_depth"            => "integer"
            "snowfall"              => "integer"
            "max_temperature"       => "integer"
            "min_temperature"       => "integer"
            "average_wind_speed"    => "integer"
            "pickup_nyct2010_gid"   => "integer"
            "pickup_ctlabel"        => "string"
            "pickup_borocode"       => "integer"
            "pickup_boroname"       => "string"
            "pickup_ct2010"         => "string"
            "pickup_boroct2010"     => "string"
            "pickup_cdeligibil"     => "string"
            "pickup_ntacode"        => "string"
            "pickup_ntaname"        => "string"
            "pickup_puma"           => "string"
            "dropoff_nyct2010_gid"  => "integer"
            "dropoff_ctlabel"       => "string"
            "dropoff_borocode"      => "integer"
            "dropoff_boroname"      => "string"
            "dropoff_ct2010"        => "string"
            "dropoff_boroct2010"    => "string"
            "dropoff_cdeligibil"    => "string"
            "dropoff_ntacode"       => "string"
            "dropoff_ntaname"       => "string"
            "dropoff_puma"          => "string"
        }
    }
}

output {
    elasticsearch {
        action     => "index"
        hosts      => "localhost:9200"
        index      => "trips"
        flush_size => 20000
    }
}

The following import took 4 days and 16 hours to complete. This is 1.6x longer than the 70 hours it took to import the 5-field dataset.

$ screen
$ /opt/logstash/bin/logstash -f ~/trips.conf
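While Logstash is running, the progress of the import can be followed with Elasticsearch's cat indices API, which reports both the document count and the on-disk size of the index. A minimal check against the trips index created by the output section above (the exact columns shown vary by Elasticsearch version):

$ curl 'localhost:9200/_cat/indices/trips?v'

The docs.count and store.size columns correspond to the record counts and disk usage figures discussed below.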
It Fits On The SSD!

The point of this exercise was to fit the ~500 GB of CSV data into Elasticsearch on a single SSD. In the previous exercise I called the _optimize endpoint a lot to try to save space during the import. I was advised against doing this, so I never called it during this import.

When I started the import, 17 GB of drive space on the SSD was being used (this included Ubuntu, Elasticsearch, etc.). By the time I had imported 200 million records only 134 GB of space was being used, and at 400 million records 263 GB of space was being used. Basically, the share of records imported was growing faster than the share of disk space consumed. When all 1.1 billion records were imported, 705 GB of drive space on the SSD was being used.

Benchmarking Queries in Elasticsearch

The following completed in 34.48 seconds (4.2x slower than in the previous benchmark).

SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type

The following completed in 63.3 seconds (3.5x slower than in the previous benchmark).
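Depending on the Elasticsearch version, SQL statements like the ones above are either run through an add-on such as the elasticsearch-sql plugin, which exposes an /_sql endpoint, or rewritten into the native query DSL. A hedged sketch of the first query via such a plugin, assuming it is installed on the node (the endpoint and URL-encoding behaviour depend on the plugin version):

$ curl -G 'localhost:9200/_sql' \
    --data-urlencode 'sql=SELECT cab_type, count(*) FROM trips GROUP BY cab_type'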

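For comparison, the same GROUP BY maps onto a terms aggregation in Elasticsearch's native query DSL. A minimal sketch against the trips index created by the Logstash config above; note that, depending on how the cab_type string field was mapped, the aggregation may need to target a not_analyzed or keyword variant of the field instead:

$ curl -s 'localhost:9200/trips/_search?size=0' \
    -H 'Content-Type: application/json' \
    -d '{
          "aggs": {
            "cab_types": {
              "terms": {"field": "cab_type"}
            }
          }
        }'

The doc_count of each returned bucket corresponds to the per-cab_type counts the SQL query produces.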