Apache Avro as a Built-in Data Source in Apache Spark 2.4

The new built-in spark-avro module is originally from Databricks' open source project Avro Data Source for Apache Spark (referred to as spark-avro from now on). In addition, it provides:

- New functions from_avro() and to_avro() to read and write Avro data within a DataFrame instead of just files.
- Support for Avro logical types, including Decimal, Timestamp, and Date types. See the related schema conversions for details.
- A 2x improvement in read throughput and a 10% improvement in write throughput.

In this blog, we examine each of the above features through examples, giving you a flavor of its easy API usage, performance improvements, and merits.

Load and Save Functions

In Apache Spark 2.4, to load/save data in Avro format, you can simply specify the file format as "avro" in the DataFrameReader and DataFrameWriter. For consistency and familiarity, the usage is similar to other data sources.

val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")

Power of from_avro() and to_avro()

To further simplify your data transformation pipeline, we introduced two new built-in functions: from_avro() and to_avro(). Avro is commonly used to serialize/deserialize messages/data in Apache Kafka-based data pipelines. Using Avro records as columns is useful when reading from or writing to Kafka, because each Kafka key-value record is augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, and so on.
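To illustrate how the two functions compose, here is a minimal sketch of a Structured Streaming job that decodes Avro-encoded Kafka values with from_avro() and re-encodes a projection with to_avro(). It assumes an active SparkSession named spark with the spark-avro module on the classpath; the broker address, topic names, filter condition, and user.avsc schema path are all placeholders.

import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions.col

// from_avro() needs the Avro schema as a JSON-format string.
val jsonFormatSchema = new String(java.nio.file.Files.readAllBytes(
  java.nio.file.Paths.get("examples/src/main/resources/user.avsc")))

// Subscribe to a Kafka topic; the "value" column arrives as raw Avro bytes.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  .load()

val output = df
  // Decode the binary value column into a struct column named "user".
  .select(from_avro(col("value"), jsonFormatSchema) as "user")
  // Operate on the decoded fields like any other DataFrame columns.
  .where("user.favorite_color == \"red\"")
  // Re-encode the projection back to Avro bytes for the outgoing record.
  .select(to_avro(col("user.name")) as "value")

// Write the re-encoded records to another Kafka topic.
val query = output
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("topic", "topic2")
  .start()

Because from_avro() yields an ordinary struct column, everything between decode and encode is plain DataFrame code: filters, joins, and aggregations all work on the decoded fields directly.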
