The Path to Data Centricity 2: Why a Data Lake
Thanks to your data warehouse, you have customised reports and charts on the number of clients, the churn rate, the best-selling items, etc. Now let's say you would like to use the data from your Twitter account to better understand your customers. Unstructured data, such as the data coming from your social networks, is not supported by the data warehouse. Would you like to know how many clients will churn next month, and which ones? For what reasons? What other products could you recommend to your customers?
In the digital world, businesses can do much more with their data than reporting. To do so, they need to move beyond the predefined schemas of the data warehouse and use raw data. They also need to use new kinds of data, e.g. data from the internet, images, sensor readings, live events, etc.
That is when data lakes came to the rescue. A data lake stores a large amount of diverse raw data in a single repository, avoiding an explosion in the number of ETLs. Data is no longer formatted according to a specific schema but copied as-is into the data lake. The type of data stored in the lake does not matter: it can be unstructured, structured, or semi-structured. The fundamental idea of a data lake is to make any and all data from your applications available, so your data team can leverage it to provide insights on how to solve business problems or how to create value.
Data lakes answered companies' challenges by:
- Collecting all data (structured, semi-structured, unstructured) in a central repository
- Allowing the definition of data subsets for specific use cases
- Allowing the application of governance on data access (this requires adding a data management layer, as governance is not part of the lake by default)
- Extracting data sets quickly, by removing the T from ETL. As data lakes hold raw data, data is extracted from operational systems and loaded into the lake without any transformation.
- Allowing the application of data quality assessment on raw data.
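The "extract and load without transformation" idea above can be sketched in a few lines. This is a minimal, illustrative example, not a production ingestion job: the function name, the local file paths, and the date-based partitioning scheme are all assumptions for the sake of the sketch.

```python
import shutil
from datetime import date
from pathlib import Path


def extract_and_load(source_file: Path, lake_root: Path, dataset: str) -> Path:
    """Copy a raw export from an operational system into the lake.

    No transformation is applied: the file lands in the lake exactly
    as it left the source system (the "EL" of ELT), under a partition
    folder named after the ingestion date.
    """
    partition = lake_root / dataset / f"ingest_date={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / source_file.name
    shutil.copy2(source_file, target)  # byte-for-byte copy, schema untouched
    return target
```

Because no schema is imposed at load time, any interpretation of the data (parsing, cleaning, typing) is deferred to the consumers of the lake, which is exactly what makes ingestion fast and what later creates the governance challenges discussed below.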
However, the challenge begins when you want to make sense of your data:
- If you dump data into a data lake, how do you know which data you need and which you don't? How do you know where the data resides in the lake? How do you know whether data is privacy-sensitive and how to process it appropriately? If not managed correctly, the data lake can turn into a data swamp very quickly.
- Data can be of poor quality and formatted for operational needs, thus not suited for analysis.
- There is a risk that data processing and machine learning are performed on the data lake itself, making it a bottleneck: it then has to support diametrically opposed requirements such as historisation, operational load, reporting load, machine learning training, etc.
A data lake is a good start! For companies that want to colocate a lot of heterogeneous data in a single location to discover what to do with it, it is a must-have. However, it is important to think about how this data will be governed, accessed, and cleaned, and to impose some ground rules on sharing the space and computing power.
Euranova has a long history of working with Hadoop and big data technology. We provide you with a data management layer that helps with tasks such as data exploration, checking compliance with rules and regulations, and tracking personal data. We have expert data engineers who have built countless data lakes, data architects who have designed various data lakes and hold an opinionated stance on data governance, and data scientists accustomed to exploring various datasets in lakes.
Coming up: The Path to Data Centricity 3: Why a Data Lakehouse
This article is the second in a series of blog posts in which we highlight the differences between storage solutions that support your data-driven strategy. In the next part, we will discuss data lakehouses.