Publish Data Outside Your Data Lake with a Spark Connector

Tableau’s corporate solution, Tableau Server, for instance, uses either Tableau Data Extract (.tde) or, more recently, Hyper (.hyper) as the storage format for its data tables.

A connector divides into three parts:

Figure 1 — Spark connector to cloud vendor

Let’s dive into each part of the connector! (Illustrative code sketches for each part are gathered at the end of this post.)

1 — Convert a Spark dataframe to your target format

The proper way to convert a dataframe to your target format is to proceed partition by partition. RDDs are distributed in partitions, which are not directly accessible from the cluster’s driver where the Python code runs. For each partition of the RDD, collect the data within that partition to the driver. You can then go through the collected partition and insert it into your target file row by row.

The function convert_and_insert will be provided by data vendors. For Tableau formats, for instance, you can refer to the Tableau SDK (for .tde) or the Tableau Extract API 2.0 (for .hyper), both of which offer C++, Java and Python APIs.

This method supposes that you have full control:

- over the driver’s memory (which can be set for the current Spark session through spark.driver.memory), because every partition collected to the driver, one after another, must fit in it, and
- over the partitioning of the source dataframe, because no partition should exceed the driver’s memory; otherwise you will run into OutOfMemory errors.

2 — Export the source file to the cloud

Once the data is converted to the proper format, it can be exported to the cloud and made available to users. For instance, I use the Tableau REST API to publish Tableau files to Tableau Server. Every vendor provides developers with dedicated APIs to publish data to the cloud.

Not all APIs, however, are well documented, so my advice is to clone the project directly and dive into the code to see whether the one feature you need is already implemented. If features are still missing, you will even be able to submit a pull request.

3 — Make the connector easy to use for your users

I built a command-line interface in Python on top of my connector. The user chooses a source environment and a target environment (as represented in Figure 1 above). These options are parsed, mapped to a configuration file, and the two services described above in part 1 (convert) and part 2 (export) are triggered sequentially to publish the formatted data to the cloud.

Connectors are another piece in the puzzle of data lakes. They allow your users to visualize fresh data in real time without technical knowledge.
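To make part 1 more concrete, here is a minimal PySpark sketch of the per-partition loop, written as my own illustration rather than the connector’s actual code. The source path, the driver memory value, and the convert_and_insert / extract placeholders (standing in for the vendor-provided pieces such as the Tableau SDK) are all assumptions.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-to-tableau-connector")          # hypothetical app name
    .config("spark.driver.memory", "8g")            # every collected partition must fit here
    .getOrCreate()
)

df = spark.read.parquet("/data/lake/sales")         # hypothetical source table
rdd = df.rdd

def convert_and_insert(row, extract):
    """Placeholder for the vendor-provided call that appends one row to the target file."""
    ...

extract = None  # e.g. an extract handle opened with the vendor SDK (hypothetical)

for partition_id in range(rdd.getNumPartitions()):
    # runJob ships only the requested partition back to the driver,
    # so memory usage is bounded by the largest partition.
    rows = spark.sparkContext.runJob(rdd, lambda it: it, [partition_id])
    for row in rows:
        convert_and_insert(row, extract)
```

An alternative with the same memory profile is rdd.toLocalIterator(), which streams rows to the driver while holding at most one partition in memory at a time.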
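For part 2, the article only states that the Tableau REST API is used; the sketch below goes through tableauserverclient, the official Python wrapper around that API, which may differ from the original setup. The server URL, credentials, site name, project id and file name are placeholders.

```python
import tableauserverclient as TSC

# Placeholders: replace with your own server URL, site, credentials and project id.
tableau_auth = TSC.TableauAuth("publisher_user", "publisher_password", "my_site")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(tableau_auth):
    datasource = TSC.DatasourceItem(project_id="00000000-0000-0000-0000-000000000000")
    datasource = server.datasources.publish(
        datasource,
        "extract.hyper",                    # file produced in part 1
        TSC.Server.PublishMode.Overwrite,   # replace any previously published version
    )
    print(f"Published datasource {datasource.id}")
```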
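Finally, a rough sketch of the kind of command-line interface described in part 3. The option names, the connector_config.json layout and the convert / export stubs are hypothetical; they only illustrate how parsed options can be mapped to a configuration and how the two services can be chained.

```python
import argparse
import json

def convert(source_config: dict, table: str) -> str:
    """Part 1 (hypothetical stub): write the table to the target format, return the file path."""
    ...
    return f"{table}.hyper"

def export(target_config: dict, extract_path: str) -> None:
    """Part 2 (hypothetical stub): publish the generated file to the target environment."""
    ...

def main():
    parser = argparse.ArgumentParser(
        description="Publish a data lake table to a cloud visualization service"
    )
    parser.add_argument("--source-env", choices=["dev", "prod"], required=True)
    parser.add_argument("--target-env", choices=["dev", "prod"], required=True)
    parser.add_argument("--table", required=True, help="table to convert and publish")
    args = parser.parse_args()

    # Map the chosen environments to connection settings from a configuration file.
    with open("connector_config.json") as f:
        config = json.load(f)
    source = config["sources"][args.source_env]
    target = config["targets"][args.target_env]

    # Trigger part 1 (convert) then part 2 (export) sequentially.
    extract_path = convert(source, args.table)
    export(target, extract_path)

if __name__ == "__main__":
    main()
```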
