Working with Data Feeds

```
$ java -jar ~/orc-tools-1.5.2-uber.jar meta metadata.orc
Processing data file metadata.orc [length: 341494]
Structure for metadata.orc
File Version: 0.12 with ORC_135
Rows: 19833
Compression: ZLIB
Compression size: 262144
Type: struct<a:int,b:int,c:string,d:timestamp,e:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 19833 hasNull: false
    Column 1: count: 19833 hasNull: false bytesOnDisk: 5511 min: 10 max: 30302 sum: 305772174
    Column 2: count: 17310 hasNull: true bytesOnDisk: 44996 min: 0 max: 34699348 sum: 204280488782
    Column 3: count: 17310 hasNull: true bytesOnDisk: 64840 min: "Serious" Sam Stone max: 칼빈500 sum: 173207
    Column 4: count: 19833 hasNull: false bytesOnDisk: 70303 min: 2002-02-25 15:43:11.0 max: 2018-09-20 11:47:36.0
    Column 5: count: 19833 hasNull: false bytesOnDisk: 154814 min: "Hello, World!" program max: ♯P sum: 312852

File Statistics:
  Column 0: count: 19833 hasNull: false
  Column 1: count: 19833 hasNull: false bytesOnDisk: 5511 min: 10 max: 30302 sum: 305772174
  Column 2: count: 17310 hasNull: true bytesOnDisk: 44996 min: 0 max: 34699348 sum: 204280488782
  Column 3: count: 17310 hasNull: true bytesOnDisk: 64840 min: "Serious" Sam Stone max: 칼빈500 sum: 173207
  Column 4: count: 19833 hasNull: false bytesOnDisk: 70303 min: 2002-02-25 15:43:11.0 max: 2018-09-20 11:47:36.0
  Column 5: count: 19833 hasNull: false bytesOnDisk: 154814 min: "Hello, World!" program max: ♯P sum: 312852

Stripes:
  Stripe: offset: 3 data: 340464 rows: 19833 tail: 144 index: 412
    Stream: column 0 section ROW_INDEX start: 3 length 21
    Stream: column 1 section ROW_INDEX start: 24 length 62
    Stream: column 2 section ROW_INDEX start: 86 length 62
    Stream: column 3 section ROW_INDEX start: 148 length 115
    Stream: column 4 section ROW_INDEX start: 263 length 62
    Stream: column 5 section ROW_INDEX start: 325 length 90
    Stream: column 1 section DATA start: 415 length 5511
    Stream: column 2 section PRESENT start: 5926 length 1686
    Stream: column 2 section DATA start: 7612 length 43310
    Stream: column 3 section PRESENT start: 50922 length 1686
    Stream: column 3 section DATA start: 52608 length 27735
    Stream: column 3 section LENGTH start: 80343 length 3229
    Stream: column 3 section DICTIONARY_DATA start: 83572 length 32190
    Stream: column 4 section DATA start: 115762 length 70288
    Stream: column 4 section SECONDARY start: 186050 length 15
    Stream: column 5 section DATA start: 186065 length 140495
    Stream: column 5 section LENGTH start: 326560 length 14319
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DICTIONARY_V2[5197]
    Encoding column 4: DIRECT_V2
    Encoding column 5: DIRECT_V2

File length: 341494 bytes
Padding length: 0 bytes
Padding ratio: 0%
```

I'll dump out the first ten rows of data from the ORC file and, as you can see, the data looks to be in place.

```
$ java -jar ~/orc-tools-1.5.2-uber.jar data metadata.orc | head
{"a":10,"b":23257138,"c":"Godsy","d":"2018-08-14 06:47:24.0","e":"AccessibleComputing"}
{"a":12,"b":null,"c":null,"d":"2018-09-19 12:07:26.0","e":"Anarchism"}
{"a":13,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:18:18.0","e":"AfghanistanHistory"}
{"a":14,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:18:23.0","e":"AfghanistanGeography"}
{"a":15,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:42.0","e":"AfghanistanPeople"}
{"a":18,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:45.0","e":"AfghanistanCommunications"}
{"a":19,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:42:11.0","e":"AfghanistanTransportations"}
{"a":20,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:43:11.0","e":"AfghanistanMilitary"}
{"a":21,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:43:14.0","e":"AfghanistanTransnationalIssues"}
{"a":23,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:50.0","e":"AssistiveTechnology"}
```
{"a":20,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:43:11.0","e":"AfghanistanMilitary"} {"a":21,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:43:14.0","e":"AfghanistanTransnationalIssues"} {"a":23,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:50.0","e":"AssistiveTechnology"} For comparison, this is the data for the first ten rows in the source CSV file..$ csvlook –no-header-row –no-inference –max-rows 10 metadata.noheader.csv | a | b | c | d | e | | — | ——– | ———- | ——————– | —————————— | | 10 | 23257138 | Godsy | 2018-08-14T06:47:24Z | AccessibleComputing | | 12 | | | 2018-09-19T12:07:26Z | Anarchism | | 13 | 9784415 | Tom.Reding | 2017-06-05T04:18:18Z | AfghanistanHistory | | 14 | 9784415 | Tom.Reding | 2017-06-05T04:18:23Z | AfghanistanGeography | | 15 | 9784415 | Tom.Reding | 2017-06-05T04:19:42Z | AfghanistanPeople | | 18 | 9784415 | Tom.Reding | 2017-06-05T04:19:45Z | AfghanistanCommunications | | 19 | 9784415 | Tom.Reding | 2017-06-04T21:42:11Z | AfghanistanTransportations | | 20 | 9784415 | Tom.Reding | 2017-06-04T21:43:11Z | AfghanistanMilitary | | 21 | 9784415 | Tom.Reding | 2017-06-04T21:43:14Z | AfghanistanTransnationalIssues | | 23 | 9784415 | Tom.Reding | 2017-06-05T04:19:50Z | AssistiveTechnology | Importing CSVs into ClickHouse Once Ive gotten a large sample of a data feed together, its helpful to build fluency in the dataset using an analytics-focused database..For this exercise Ill import the CSV data into ClickHouse..ClickHouse has done well in my 1.1 Billion Taxi Rides Benchmarks and even though it doesnt execute queries on GPUs, it is very performant, even on older Intel CPUs..The installation process is done via a simple apt install command, the default configuration works well and theres no license fee required for commercial purposes..Ill make sure the ClickHouse server is running and then connect to it with the client..$ sudo service clickhouse-server start $ clickhouse-client Ill first create a table called pages that will store the data in more or less the same form as it appears in the CSV file..Since ClickHouse wont interpret the Z in the timestamp strings properly Ill store the timestamp initially as a string and transform it later..The Z in the timestamps refers to Zulu time / UTC..CREATE TABLE pages ( page_id UInt32, revision_contributor_id UInt32, revision_contributor_username Nullable(String), revision_timestamp String, title Nullable(String) ) ENGINE=Log; The following will feed the decompressed CSV data into the table while skipping the header row..$ pigz -d -c metadata.csv.gz | tail -n +2 | clickhouse-client –query="INSERT INTO pages FORMAT CSV" Now I can create a new pages_mt table which will use the faster MergeTree engine.. More details
