Data Lake: an asset or a liability?

Written by Bachar Wehbi and Hiba Nehme — April 18, 2019

A Data Lake, as its name suggests, is a central repository of enterprise data that stores structured and unstructured data.

The promise of a Data Lake is “to gain more visibility or put an end to data silos”, thereby opening the door to a wide variety of use cases including reporting, business intelligence, data science and analytics.

Data Lakes are considered key assets at the centre of digital transformation and data-driven initiatives.

In the last few years, we have witnessed an accelerated move by companies of all sizes to build their own Data Lakes.

Although the first Data Lakes were built on top of the Hadoop Distributed File System (HDFS), today AWS, Microsoft, Google, IBM and others offer Data Lake as a Service (DLaaS) solutions, making it relatively easy to start building one.

However, despite the technological progress of Data Lake and Big Data solutions, building a Data Lake remains a challenging task.

Many Data Lake projects are reported to fail due to the inherent organisational and cultural changes required to build and operate business projects on top of Data Lakes.

Based on our experience, there are a number of points that should be carefully considered in order to avoid turning your Data Lake into a liability.

Build it, they will come

Building a Data Lake should not be an objective in itself, but should rather be a means to an end; the end being to address digital transformation and data-driven initiatives in a company.

Yet many IT departments started building Data Lakes because it’s cool and trendy, and because their competitors have built their own.

Then they went to business users, presenting the supposed benefits of their Data Lake and inviting them to chip in with some use cases.

The idea of first centralising all enterprise data in a Data Lake and then figuring out the use cases to benefit from this data is quite dangerous, to say the least.

A Data Lake must be built to address a given number of identified use cases, and involving the business from day one is a key success criterion.

Our recommendation is rather to design your Data Lake to scale, to build it gradually by including the data and metadata required by the identified use cases, and to enrich it as needed later on.

If your company has no use case requiring data from multiple sources, then you don’t need to build a Data Lake in the first place (or you need to change the company you work for).

Don’t forget about Conway’s Law

Conway’s law, named after computer programmer Melvin Conway, states that “organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations”.

Designing, building, operating, supporting and using a Data Lake are most often spread across multiple business units and departments.

This requires a close collaboration between different teams that happen to have different objectives and priorities.

Every single change or decision will thus require going through bureaucratic and time-consuming approval processes, and all your agile and DevOps initiatives will take a hit along the way.

To move quickly and efficiently, you need all the actors to be working towards a common goal.

You will need (a) to build small and agile teams with the required autonomy and accountability to deliver the different aspects of the Data Lake, and (b) to define interfaces or agreements between these teams.

Each team will then be free to make changes within its scope as long as it does not break the interface or agreement made with the other teams.
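As an illustration, such an interface can take the form of a small, versioned data contract that the producing team guarantees and the consuming teams build against. Below is a minimal sketch in Python; the “orders” dataset, the field names and the version number are hypothetical, not taken from any particular project.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical contract for an "orders" dataset, agreed between the
# ingestion team (producer) and the analytics team (consumer).
# Removing or renaming a field is a breaking change that requires a new
# major version and coordination with consumers; adding a field is not.
CONTRACT_VERSION = "1.2"

@dataclass(frozen=True)
class OrderRecord:
    order_id: str          # unique, never null
    customer_id: str       # foreign key to the customers dataset
    amount_eur: float      # always in EUR, converted at ingestion time
    created_at: datetime   # UTC timestamps only

def validate(record: OrderRecord) -> None:
    """Producer-side check that a record honours the contract."""
    if not record.order_id:
        raise ValueError("order_id must be non-empty")
    if record.amount_eur < 0:
        raise ValueError("amount_eur must be non-negative")
```

With such a contract in place, the ingestion team can change its internal pipelines freely, as long as every published record still validates.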

No one size fits all

Different use cases often have varying needs in terms of data consumption.

BI use cases require data aggregates as well as bulk data retrieval; AI and data science use cases are fed with the bulk of the data at hand; while analytics use cases require search engines, in-memory caches or indexed databases to optimise random access to data. There are awesome tools out there to address each one of these needs, and we bet you cannot find one tool that fits them all.

Thus, you will find yourself with different tools to cover your needs, therefore making your architecture a bit more complex.

But that’s really fine; in any case, you will have to live with it.

The last thing you want is to trade off user experience and satisfaction.

Our recommendation is to define acceptable performance indicators for the different use cases, to identify their priority from a business point of view along with the associated cost, and then to select the tools that can satisfy your pre-defined SLAs.
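For example, these indicators can be captured as an explicit, reviewable artefact rather than kept as tribal knowledge. The sketch below records per-use-case targets in plain Python; the use-case names and figures are purely illustrative and would come out of a discussion with business stakeholders.

```python
# Illustrative SLA targets per consumption pattern (all numbers made up).
SLAS = {
    "bi_reporting":   {"p95_query_latency_s": 30, "priority": 1},
    "data_science":   {"bulk_export_minutes": 60, "priority": 2},
    "user_analytics": {"p95_lookup_latency_ms": 200, "priority": 1},
}

def meets_sla(use_case: str, measured: dict) -> bool:
    """Compare measured indicators against the agreed targets."""
    targets = SLAS[use_case]
    return all(measured.get(name, float("inf")) <= limit
               for name, limit in targets.items()
               if name != "priority")

# Example: a 150 ms p95 lookup latency satisfies the analytics target.
assert meets_sla("user_analytics", {"p95_lookup_latency_ms": 150})
```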

The devil lies in the details

A Data Lake is usually designed in a layered architecture with:

- A Raw Data Layer: includes immutable raw data collected from different source systems. This layer provides input data for most of the processing pipelines in the Data Lake.
- A Transformed Data Layer: includes the output of the different transformation pipelines applied to raw data (cleansing, filtering, format harmonisation, etc.).
- A Serving Layer: includes the processing results of raw and transformed data. This layer exposes data in SQL or NoSQL databases to different applications.

Maintaining these layers involves a number of ingestion, transformation and processing pipelines, along with their monitoring, error handling, rollback procedures and data quality checks.
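To make this submerged part concrete, here is a minimal sketch of a single raw-to-transformed step, written with PySpark since Spark is a common choice on Data Lakes; the paths, dataset name and quality threshold are hypothetical, and a real pipeline would add retries, alerting and idempotent re-runs on top.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("orders-transform").getOrCreate()

# Hypothetical paths following the raw / transformed layering above.
RAW_PATH = "s3://lake/raw/orders/2019-04-18/"
TRANSFORMED_PATH = "s3://lake/transformed/orders/2019-04-18/"

raw = spark.read.parquet(RAW_PATH)  # the raw layer is read-only input

cleaned = (raw
           .dropDuplicates(["order_id"])            # cleansing
           .filter(col("amount_eur").isNotNull()))  # filtering

# Data quality gate: refuse to publish if too many rows were dropped.
raw_count, cleaned_count = raw.count(), cleaned.count()
if raw_count > 0 and cleaned_count / raw_count < 0.95:
    # A simple rollback strategy: fail before overwriting, so the
    # previously published partition stays untouched.
    raise RuntimeError(f"Quality gate failed: kept {cleaned_count}/{raw_count} rows")

cleaned.write.mode("overwrite").parquet(TRANSFORMED_PATH)
```

Multiply this by every dataset, every layer and every failure mode, and the size of the submerged iceberg becomes apparent.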

This submerged part of the Data Lake iceberg is where most of the technical work lies.

This hidden cost should be seriously taken into account from day one when building a Data Lake.

Otherwise, all cost estimation and planning will fall short.

Key takeaways

In order to best leverage the benefits of a Data Lake, try to consider the following points:

- Involve the business from day one and make sure that the Data Lake you design addresses the identified business use cases.
- Favour the creation of cross-functional teams guided by business capabilities to build your Data Lake. This will allow you to avoid process overhead and to produce applications with a greater capacity to evolve.
- Don’t trade off user satisfaction when choosing your Data Lake tools. Building user-centric solutions will require a myriad of tools that will inevitably increase the complexity of the system, but it is for the best.
- Don’t forget the hidden costs linked to data ingestion, transformation, error handling, etc. when estimating your projects.

