How to Work with Avro, Kafka, and Schema Registry in Databricks

Schema Registry is the most popular solution for managing schemas in Kafka-based data pipelines.

Like an Apache Hive metastore, it records the schemas of all registered data streams, as well as their schema change history.
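As a quick illustration, that change history is exposed through Schema Registry's REST API; here is a minimal sketch, assuming schemaRegistryURL points at your registry and that the subject "t-value" from the examples below already exists:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// List every registered version of the "t-value" subject's schema.
val client = HttpClient.newHttpClient()
val request = HttpRequest
  .newBuilder(URI.create(s"$schemaRegistryURL/subjects/t-value/versions"))
  .build()
val versions = client.send(request, HttpResponse.BodyHandlers.ofString()).body()
// e.g. "[1,2,3]" -- one entry per schema version ever registered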

It also defines multiple compatibility levels.

For example, you can enforce that only backward-compatible schema changes are allowed.
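To make that concrete, here is a sketch of a backward-compatible change for a hypothetical "Payment" value schema; the new version adds a field with a default, so consumers on the new schema can still read records written with the old one:

import org.apache.avro.Schema

// Version 1 of the hypothetical "Payment" value schema.
val v1 = new Schema.Parser().parse(
  """{"type": "record", "name": "Payment", "fields": [
    |  {"name": "id",     "type": "string"},
    |  {"name": "amount", "type": "double"}
    |]}""".stripMargin)

// Version 2 adds a field WITH a default, so a reader using v2 can still
// decode records written with v1 -- a backward-compatible change.
// Dropping a field that has no default would NOT be backward compatible.
val v2 = new Schema.Parser().parse(
  """{"type": "record", "name": "Payment", "fields": [
    |  {"name": "id",       "type": "string"},
    |  {"name": "amount",   "type": "double"},
    |  {"name": "currency", "type": "string", "default": "USD"}
    |]}""".stripMargin)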

To support reading a data stream in a future-proof way, you need to embed schema information in each record.

Thus, a schema identifier, rather than the full schema, is part of each record.
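Concretely, Schema Registry's wire format prefixes every Avro payload with a magic byte (0) and a four-byte schema ID; a minimal sketch of decoding that header:

import java.nio.ByteBuffer

// Confluent wire format: 1 magic byte (0x0) + 4-byte big-endian schema ID,
// followed by the Avro-encoded body.
def schemaId(payload: Array[Byte]): Int = {
  val buf = ByteBuffer.wrap(payload)
  require(buf.get() == 0, "unexpected magic byte")
  buf.getInt() // the ID under which the writer's schema is stored in the registry
}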

Schema Registry provides a custom Avro encoder/decoder.

You can encode and decode Avro records using the schema identifiers.
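Outside Spark, this encoder is Confluent's KafkaAvroSerializer; here is a minimal producer sketch, assuming the Confluent client libraries are on the classpath and that kafkaURL and schemaRegistryURL are defined as in the examples below:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", kafkaURL)
props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("schema.registry.url", schemaRegistryURL)

// The serializer registers (or looks up) the schema for subjects "t-key" and
// "t-value" and embeds only the schema ID -- not the schema -- in each record.
val producer = new KafkaProducer[String, Integer](props)
producer.send(new ProducerRecord("t", "a-key", Int.box(42)))
producer.close()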

Databricks has integrated Schema Registry into the from_avro and to_avro functions.

You can easily migrate streaming pipelines built on Schema Registry to Spark Structured Streaming.

Furthermore, the from_avro and to_avro functions can be used in batch queries as well, because Structured Streaming unifies batch and streaming processing in the Spark SQL engine.

Sample Code for Using Schema Registry

You can import the notebook with the examples and play with it yourself, or preview it online.

Assume you have already deployed Kafka and Schema Registry in your cluster, and there is a Kafka topic “t”, whose key and value are registered in Schema Registry as subjects “t-key” and “t-value” of type string and int respectively.

The following code reads the topic “t” into a Spark DataFrame with schema <key: string, value: int>:

import spark.implicits._                     // enables the $"colName" column syntax
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.avro.functions._ // Schema Registry-aware from_avro / to_avro

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaURL)
  .option("subscribe", "t")
  .load()
  .select(
    from_avro($"key", "t-key", schemaRegistryURL).as("key"),
    from_avro($"value", "t-value", schemaRegistryURL).as("value"))
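Since from_avro works in batch queries as well, the same decoding can also run as a one-off batch read over the topic’s existing contents; a sketch under the same assumptions:

val batchDF = spark
  .read // batch source instead of readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaURL)
  .option("subscribe", "t")
  .option("startingOffsets", "earliest") // scan the topic from the beginning
  .load()
  .select(
    from_avro($"key", "t-key", schemaRegistryURL).as("key"),
    from_avro($"value", "t-value", schemaRegistryURL).as("value"))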

The following code writes the Spark DataFrame with schema <key: string, value: int> into the Kafka topic “t”:

dataDF
  .select(
    to_avro($"key", lit("t-key"), schemaRegistryURL).as("key"),
    to_avro($"value", lit("t-value"), schemaRegistryURL).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaURL)
  .option("topic", "t")
  // a streaming write also needs .option("checkpointLocation", <path>)
  .start() // streaming queries are launched with start(), not save()

Read More

Read more about Schema Registry for Azure Databricks and AWS.

Download the notebook or read it here.
