Advanced Analytics with Apache Spark

There is another, subtler, but no less important, benefit: we are no longer bound by the practical requirement to have estimation errors of 1% or more.

When pre-aggregation allows 1000x gains, we can easily build HLL sketches with very, very small estimation errors.

It’s rarely a problem for a pre-aggregation job to run 2-5x slower if there are 1000x gains at query time.

This is the closest to a free lunch we can get in the big data business: significant cost/performance improvements without a negative trade-off from a business standpoint for most use cases.

Introducing Spark-Alchemy: HLL Native Functions

Since Spark does not provide this functionality, Swoop open-sourced a rich suite of native (high-performance) HLL functions as part of the spark-alchemy library.

Take a look at the HLL docs, which have lots of examples.
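To give a flavor of the API, here is a minimal sketch of a single-pass approximate distinct count. The import path, function signatures, and example data below are our reading of the spark-alchemy HLL docs, so treat them as illustrative and check the docs for the authoritative versions:

```scala
import org.apache.spark.sql.SparkSession
import com.swoop.alchemy.spark.expressions.hll.functions._

val spark = SparkSession.builder.appName("hll-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Tiny illustrative dataset: one row per (date, user_id) event.
val events = Seq(
  ("2019-01-01", 1L), ("2019-01-01", 2L), ("2019-01-02", 1L)
).toDF("date", "user_id")

// Build a per-day sketch and turn it straight into an approximate count.
events
  .groupBy($"date")
  .agg(hll_cardinality(hll_init_agg($"user_id")).as("approx_distinct_users"))
  .show()
```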

To the best of our knowledge, this is the richest set of big data HyperLogLog processing capabilities, exceeding even BigQuery’s HLL support.

The following diagram demonstrates how spark-alchemy handles initial aggregation (via hll_init_agg), reaggregation (via hll_merge) and presentation (via hll_cardinality).
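Concretely, the three stages might look like the following sketch, continuing the hypothetical events DataFrame from above. The 2% target error, paths, and column names are illustrative assumptions, and the relativeSD overload of hll_init_agg is taken on the assumption it matches the docs:

```scala
import org.apache.spark.sql.functions.substring
import com.swoop.alchemy.spark.expressions.hll.functions._

// 1. Initial aggregation (hll_init_agg): collapse granular rows into one
//    sketch per day, here with an illustrative ~2% target relative error.
val daily = events
  .groupBy($"date")
  .agg(hll_init_agg($"user_id", 0.02).as("users_hll"))
daily.write.mode("overwrite").parquet("/tmp/daily_user_sketches")

// 2. Reaggregation (hll_merge): roll the daily sketches up to months;
//    the granular event data is never read again.
val monthly = spark.read.parquet("/tmp/daily_user_sketches")
  .groupBy(substring($"date", 1, 7).as("month"))
  .agg(hll_merge($"users_hll").as("users_hll"))

// 3. Presentation (hll_cardinality): convert sketches to approximate counts.
monthly
  .select($"month", hll_cardinality($"users_hll").as("approx_distinct_users"))
  .show()
```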

If you are wondering about the storage cost of HLL sketches, the simple rule of thumb is that a 2x increase in HLL cardinality estimation precision requires a 4x increase in the size of HLL sketches.
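That rule of thumb is the standard HyperLogLog error bound at work: the relative error of a sketch with m registers is roughly 1.04 / sqrt(m), so halving the error requires about 4x the registers, and sketch size grows roughly linearly with the register count. The table below bears this out: tightening the error from 0.02 to 0.01 grows the sketch from 2,741 to 10,933 bytes, almost exactly 4x.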

In most applications, the reduction in the number of rows far outweighs the increase in storage due to the HLL sketches.

error    sketch_size_in_bytes
0.005    43702
0.01     10933
0.02     2741
0.03     1377
0.04     693
0.05     353
0.06     353
0.07     181
0.08     181
0.09     181
0.1      96

HyperLogLog Interoperability

The switch from precise to approximate distinct counts and the ability to save HLL sketches as a column of data have eliminated the need to process every row of granular data at final query time, but we are still left with the implicit requirement that the system working with HLL data has to have access to all the granular data.

The reason is that there is no industry-standard representation for HLL data structure serialization.

Most implementations, such as BigQuery’s, use undocumented opaque binary data, which cannot be shared across systems.

This interoperability challenge significantly increases the cost and complexity of interactive analytics systems.

A key requirement for interactive analytics systems is very fast query response times.

This is not a core design goal for big data systems such as Spark or BigQuery, which is why interactive analytics queries are typically executed by some relational or, in some cases, NoSQL database.

Without HLL sketch interoperability at the data level, we’d be back to square one.

To address this issue, when implementing the HLL capabilities in spark-alchemy, we purposefully chose an HLL implementation with a published storage specification and [built-in support for Postgres-compatible databases](https://github.com/citusdata/postgresql-hll) and even JavaScript.

This allows Spark to serve as a universal data pre-processing platform for systems that require fast query turnaround times, such as portals & dashboards.
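As an illustration of that pre-processing role, a Spark job could land its sketches in Postgres via the built-in JDBC writer and let the postgresql-hll extension merge and count them at query time. Everything below (connection details, table and column names, and the bytea-to-hll cast) is a hedged assumption; consult the spark-alchemy and postgresql-hll docs for the exact interop recipe:

```scala
// Continuing the example above: land the Spark-built sketches in Postgres,
// where the citusdata/postgresql-hll extension understands the same storage
// format. Connection details and names here are hypothetical.
daily.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/analytics")
  .option("dbtable", "daily_user_sketches")
  .option("user", "analytics")
  .option("password", sys.env("PG_PASSWORD"))
  .mode("append")
  .save()

// A Postgres-backed dashboard can then merge and count the sketches at
// interactive speed, e.g. (after CREATE EXTENSION hll; the bytea-to-hll
// cast is an assumption -- check the postgresql-hll docs):
//   SELECT hll_cardinality(hll_union_agg(users_hll::hll))
//   FROM daily_user_sketches;
```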

The benefits of this architecture are significant:

- 99+% of the data is managed via Spark only, with no duplication
- 99+% of processing happens through Spark, during pre-aggregation
- Interactive queries run much, much faster and require far fewer resources

Summary

In summary, we have shown how the commonly-used technique of pre-aggregation can be efficiently extended to distinct counts using HyperLogLog data structures, which not only unlocks potential 1000x gains in processing speed but also gives us interoperability between Apache Spark, RDBMSs and even JavaScript.

It’s hard to believe, but we may have gotten very close to two free lunches in one big data blog post, all because of the power of HLL sketches and Spark’s powerful extensibility.

Advanced HLL processing is just one of the goodies in spark-alchemy.

Check out what’s coming and let us know which items on the list are important to you and what else you’d like to see there.

Last but most definitely not least, the data engineering and data science teams at Swoop would like to thank the engineering and support teams at Databricks for partnering with us to redefine what is possible with Apache Spark.

You rock!