1.1 Billion Taxi Rides on ClickHouse & an Intel Core i5

$ clickhouse-client CREATE TABLE trips ( trip_id UInt32, vendor_id String, pickup_datetime DateTime, dropoff_datetime Nullable(DateTime), store_and_fwd_flag Nullable(FixedString(1)), rate_code_id Nullable(UInt8), pickup_longitude Nullable(Float64), pickup_latitude Nullable(Float64), dropoff_longitude Nullable(Float64), dropoff_latitude Nullable(Float64), passenger_count Nullable(UInt8), trip_distance Nullable(Float64), fare_amount Nullable(Float32), extra Nullable(Float32), mta_tax Nullable(Float32), tip_amount Nullable(Float32), tolls_amount Nullable(Float32), ehail_fee Nullable(Float32), improvement_surcharge Nullable(Float32), total_amount Nullable(Float32), payment_type Nullable(String), trip_type Nullable(UInt8), pickup Nullable(String), dropoff Nullable(String), cab_type Nullable(String), precipitation Nullable(Int8), snow_depth Nullable(Int8), snowfall Nullable(Int8), max_temperature Nullable(Int8), min_temperature Nullable(Int8), average_wind_speed Nullable(Int8), pickup_nyct2010_gid Nullable(Int8), pickup_ctlabel Nullable(String), pickup_borocode Nullable(Int8), pickup_boroname Nullable(String), pickup_ct2010 Nullable(String), pickup_boroct2010 Nullable(String), pickup_cdeligibil Nullable(FixedString(1)), pickup_ntacode Nullable(String), pickup_ntaname Nullable(String), pickup_puma Nullable(String), dropoff_nyct2010_gid Nullable(UInt8), dropoff_ctlabel Nullable(String), dropoff_borocode Nullable(UInt8), dropoff_boroname Nullable(String), dropoff_ct2010 Nullable(String), dropoff_boroct2010 Nullable(String), dropoff_cdeligibil Nullable(String), dropoff_ntacode Nullable(String), dropoff_ntaname Nullable(String), dropoff_puma Nullable(String) ) ENGINE = Log; The dataset itself uses commas for delimiting fields..None of the contents of the data contains any commas themselves so there is no quotations used to aid escaping data..NULL values are defined by the simple absence of any content between the comma delimiters..Normally this isnt an issue but with ClickHouse empty fields wont be treated as NULLs in order to avoid ambiguity with empty strings..For this reason I need to pipe the data through a transformation script that will replace all empty values with N..Below is the transformation script..$ cat trans.py import sys for line in sys.stdin: print ','.join([item if len(item.strip()) else 'N' for item in line.strip().split(',')]) This is the bash command I used to import the 1.1 billion records into the trips table..$ time (for filename in /home/mark/trips/trips_x*.csv.gz; do gunzip -c $filename | python trans.py | clickhouse-client –query="INSERT INTO trips FORMAT CSV" done) The above completed in 3 hours 27 minutes and 35 seconds..There is little doubt that if I didnt use a Python script for transforming the contents that this time would have been significantly reduced..The following shows the top command during the import and you can see the Python script is eating up a lot of resources……%CPU %MEM TIME+ COMMAND …. More details

Leave a Reply