The Path to Data Centricity 3: Why a Data Lakehouse
You store all your raw data in your data lake, then you copy them in your enterprise data warehouse for your BI purposes, while other ETLs copy data for your data prep and ML pipelines or operational applications. However, raw data is hard to use and to understand. Your business users are asking for clean data because they want to choose between retrieving dimensional data or KPIs.
As the word suggests, data lakehouses refer to the merging of two worlds: data warehouses and governed data lakes. This approach proposes making the data warehouse the first-class citizen and implementing warehouse-like data structures and data management functions on cost-effective storage used for data lakes.
Its particularity lies in its robustness. The data lakehouse architecture not only solves the problems of the classic data lake by ensuring that the data is secure and governed. It also brings quality control to the storage phase to facilitate data search through indexing. Specifically, it brings different levels of quality to the data.
- “Bronze" data is raw and historical data.
- “Silver” data is validated, filtered and optimised for OLAP. Data is easily queriable and accessible (closer to a DWH)
- “Gold” data is aggregated for the business and IA results, aggregated at the company level
Overall, a data lakehouse holds four main advantages:
- Reliability: it enables ACID transactions, and guarantees to distribute updates of data to all consumers in real time.
- Performance: it indexes at storage and contains a distributed streaming engine.
- Governance: it enables level access control; not only resource-level access control.
- Quality: it enables the validation and expectation of the schema and of the curated data lake.
However, it should be noted that:
- Most companies already have data warehouses. They will have to migrate the data of the warehouse into the lake house; the warehouse will be replaced by the gold layer inside the lake house. It is important for companies to assess if the benefits justify the time and cost of migration.
- The monolith structure of the lakehouse can make it complex and fragile. At Euranova, we first want to decouple producers and consumers of data, and repatriate data warehouses as consumers of data, instead of merging them with data lakes.
A data lakehouse is recommended when there is a need to use data for both analytics and machine learning and the company does not have an existing data management solution. This is a great way to support a lot of data use cases with a single data platform.
While we are not a partner of Databricks, nor a lakehouse vendor, we have large expertise in Databricks software stack. Coupled with our expertise in data management, big data and data warehousing, we can design and implement a data lakehouse, or help you set yours.
Coming up: The Path to Data Centricity 4: Why a Data Hub
This article is the first in a series of blog posts in which we highlight the differences between storing solutions to support your data-driven strategy. In the next part, we will discuss data hubs.