Exporting Cassandra time series data to S3 for data analysis using Spark

We've been using this approach successfully over the last few months in order to get the best of both worlds for an early-stage platform such as 1200.aero: the cost effectiveness of on-premise hosting for a stable, live workload, and the on-demand scalability of AWS for data analysis and machine learning.

Our end goal was to upload data to S3 organized as follows:

    /adsb/messages/year=2018/month=11/day=06

To that effect, we performed the following steps, described in detail in this article:

1. Reorganize the data using Cassandra materialized views
2. Use Spark to read Cassandra data efficiently as a time series
3. Partition the Spark dataset as a time series
4. Save the dataset to S3 as Parquet
5. Analyze the data in AWS

For your reference, we used Cassandra 3.11 and Spark 2.3.1, both straight open source versions.

Step 1: Reorganize the data

The data we want to analyze consists of millions of low-altitude air traffic messages captured from ADS-B receivers at a number of regional airports. To treat this data as a time series, we needed to retrieve it efficiently by day. Since such an attribute did not originally exist, we had to:

1. Create and populate a new day column on the existing table.
2. Define the right Cassandra data structure to enable efficient data retrieval using the new day column.
3. Update our backend to populate the new field for incoming data.

Defining a new date column in Cassandra is straightforward:

    alter table messages add day date;

However, populating it proved more complex than anticipated: we had to update several hundred million records and did not find a suitable CQL UPDATE statement to do so in place (i.e. deriving day from the gentime timestamp). A sketch of one possible backfill approach appears at the end of this post.

To retrieve the data efficiently by the new day column, we exposed it through a materialized view, messages_by_day. Materialized views are not free, though: DataStax found that every materialized view takes away ~10% of the insert performance on a given table.

Step 2: Read Cassandra data efficiently as a time series using Spark

Once we had the means to retrieve the data by date, we had to ensure that the Spark Cassandra connector issued the right queries to retrieve the data efficiently.

Spark configuration parameters for the Cassandra connector (passed when creating a SparkConf, via spark-submit, sparkmagic in Jupyter, etc.):

    spark.cassandra.connection.host: "host1,host2"
    spark.cassandra.auth.username: "username"
    spark.cassandra.auth.password: "password"
    spark.cassandra.input.consistency.level: ALL

Import the right packages and define a dataset on the Cassandra table:

    import org.apache.spark.sql._
    import org.apache.spark.sql.cassandra._

    val messagesDf = spark.read.cassandraFormat("messages_by_day", "adsb").load()

Filter data for a specific date:

    import java.sql.Date

    val messagesLastDayDf = messagesDf.filter(messagesDf("day") === Date.valueOf("2018-11-06"))

…or for a number of dates (discrete dates must be specified in an IN condition):

    val messagesLastWeekDf = messagesDf.filter(messagesDf("day") isin (Date.valueOf("2018-11-06"), Date.valueOf("2018-11-05"), Date.valueOf("2018-11-04"), … ))

To verify that your filter will be applied to the Cassandra table, print the physical plan with df.explain().

Step 3: Partition the Spark dataset as a time series

Our target S3 layout (year=2018/month=11/day=06) calls for separate year, month and day columns. Hence, we had to first convert the day column into year, month and day columns:

    import org.apache.spark.sql.functions.{col, year, month, dayofmonth}

    val partMsgsLastWeekDf = messagesLastWeekDf
      .withColumnRenamed("day", "date")
      .withColumn("year", year(col("date")))
      .withColumn("month", month(col("date")))
      .withColumn("day", dayofmonth(col("date")))
      .drop("date")

At long last, we had our data available in Spark with the necessary attributes and partitioned by the criteria we needed.

Figure: Dataset with three new columns (year, month, day), ready to be partitioned and written as Parquet.

Step 4: Save your Spark dataset to S3 as Parquet

Setting up Spark to push data to S3 was pretty easy. The sketches below expand on several of the steps above with illustrative code.
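To make Step 1 concrete, here is a minimal sketch of what the messages_by_day materialized view could look like. The full schema of adsb.messages is not shown in this article, so the icao clustering column below is an assumption; the DDL is wrapped in the connector's CassandraConnector helper only so that everything stays in Scala, and the same statement can be run directly in cqlsh.

    import com.datastax.spark.connector.cql.CassandraConnector

    // Sketch only: icao is an assumed key column of adsb.messages; the view's
    // PRIMARY KEY must include every key column of the base table plus day.
    CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
      session.execute(
        """CREATE MATERIALIZED VIEW IF NOT EXISTS adsb.messages_by_day AS
          |SELECT * FROM adsb.messages
          |WHERE day IS NOT NULL AND icao IS NOT NULL AND gentime IS NOT NULL
          |PRIMARY KEY ((day), icao, gentime)""".stripMargin)
    }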
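The article does not spell out how the several hundred million existing rows were backfilled. One possibility, shown purely as a sketch rather than as the author's actual method, is to derive day from gentime in Spark and write just the key columns plus day back to the table, relying on Cassandra's upsert semantics; icao is the same assumed key column as above.

    import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql.functions.{col, to_date}

    // Read the existing table and derive day from the gentime timestamp.
    val messages = spark.read.cassandraFormat("messages", "adsb").load()

    messages
      .withColumn("day", to_date(col("gentime")))
      .select("icao", "gentime", "day")   // icao is an assumption, not from the article
      .write
      .cassandraFormat("messages", "adsb")
      .mode("append")                      // append acts as an upsert in Cassandra
      .save()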
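For Step 2, if you build the session programmatically instead of passing the settings through spark-submit or sparkmagic, the connector parameters listed above fit into a SparkSession builder like this (host names, credentials and the application name are placeholders):

    import org.apache.spark.sql.SparkSession

    // Placeholders: substitute real contact points and credentials.
    val spark = SparkSession.builder()
      .appName("cassandra-to-s3-export")
      .config("spark.cassandra.connection.host", "host1,host2")
      .config("spark.cassandra.auth.username", "username")
      .config("spark.cassandra.auth.password", "password")
      .config("spark.cassandra.input.consistency.level", "ALL")
      .getOrCreate()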
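Also for Step 2, verifying the pushdown is a one-liner: the day predicate should appear as a pushed filter on the Cassandra scan rather than as a Spark-side Filter over a full table scan.

    // Print the physical plans for the single-day and multi-day datasets; the
    // day restriction should be pushed down to the Cassandra data source.
    messagesLastDayDf.explain()
    messagesLastWeekDf.explain()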
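Finally, the details of Step 4 are cut off above, so here is a minimal sketch of what the write could look like, assuming the hadoop-aws and AWS SDK jars are on the Spark classpath; the bucket name and static credentials are placeholders (instance roles or environment variables work just as well).

    // Placeholders: neither the bucket name nor the credentials come from the article.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    // partitionBy produces the /adsb/messages/year=2018/month=11/day=06
    // layout described at the top of the article.
    partMsgsLastWeekDf.write
      .mode("append")
      .partitionBy("year", "month", "day")
      .parquet("s3a://my-bucket/adsb/messages")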
