I’ve recently been working with a major online retailer that has a wealth of data at its disposal and wanted to capitalise on it by establishing a data lake to inform its business decisions.
A data lake is a centralised repository that allows businesses to store all their structured and unstructured data at any scale. Because little or no processing is performed on the way in, you can scale to any data size without first defining data structures, schemas and transformations. You can store data that is useful today, as well as data that may be useful in the future, without having to decide up front which questions you will want to ask of it. Data lakes also eliminate silos, giving businesses a single view of all their data in one place, from customer data to social media analytics.
Users across the business can have flexible access to the data lake and its content, increasing reuse of the data to drive business decisions and to power dashboards, visualisations, real-time analytics and machine learning.
Data lakes are without question a hugely valuable resource. However, having spent some time working with them, I have found there are certain things that need to be considered to ensure they deliver the value you are looking for as a business:
Display value and achieve business buy-in early
Internal teams need to be bought in to the idea of the data lake and understand the value it can bring as early as possible. Data lakes are completely reliant on the source platforms for providing source data. If the data owners don’t see the value of the data lake, they will not prioritise sending their data to it. Likewise, if the consuming teams don’t understand the value of this data for their reporting and analytics, they won’t use it.
The best way to achieve buy-in early is to identify and satisfy some strong use cases from the business teams. These use cases will prove the value of the data lake and gain support by showing decision makers how access to this sort of data can assist in their decisions and relieve pain points. After all, “Without data, you’re just another person with an opinion” – W. Edwards Deming.
Invest in big data training and products
As part of your data lake build, you will likely transition to big data formats (such as Parquet). However, data analysis is still traditionally SQL-based, so you’ll need to evaluate your existing analytics and reporting landscape and probably make changes to your current products. You’ll also need to upskill your data analysts to ensure they know how to interpret and examine these big data formats.
Playing piggy in the middle
A data lake sits between source platforms that create and provide the data, and the consuming teams who interpret and examine the data for analytics and reporting. The team managing the data lake can therefore become piggy in the middle and will need to ensure there is an effective relationship between the source teams (who know what their data includes and means) and the consuming teams (who know what they require).
As the piggy in the middle, you’ll also be subject to dependencies across the data flow. For example, if a consumer needs a particular dataset for a report, you will need to work with the source platform and agree timeframes for providing the lake with that data. Dependencies such as this will have a strong impact on your own roadmap and timelines, as will changes in priorities by either the source or consuming teams.
To validate or not to validate
It’s important that the data being held in the lake is “clean” and high quality. You don’t want to be making business decisions based on “bad” data. Ideally source teams should be responsible for validating their data before it reaches the data lake. However, this is not always pragmatic and it can be easier to centralise the validation in the data lake instead, which is the approach we decided to take on our most recent project.
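To illustrate what centralised validation can look like, here is a minimal sketch in Python. It is not the implementation from our project: the "orders" feed, its field names and the three rules are all hypothetical, chosen only to show the idea of checking each record against named rules and setting the failures aside rather than loading them.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Rule = Callable[[Record], bool]

# Hypothetical rules for an illustrative "orders" feed.
# The field names (order_id, quantity, currency) are assumptions.
RULES: Dict[str, Rule] = {
    "order_id is present": lambda r: bool(r.get("order_id")),
    "quantity is a positive integer": lambda r: isinstance(r.get("quantity"), int)
    and r["quantity"] > 0,
    "currency is a 3-letter code": lambda r: isinstance(r.get("currency"), str)
    and len(r["currency"]) == 3,
}


def validate(records: List[Record]) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into clean ones and rejected ones with their failed rules."""
    clean: List[Record] = []
    rejected: List[Tuple[Record, List[str]]] = []
    for record in records:
        # Collect the name of every rule this record fails.
        failures = [name for name, rule in RULES.items() if not rule(record)]
        if failures:
            rejected.append((record, failures))
        else:
            clean.append(record)
    return clean, rejected
```

Keeping the names of the failed rules alongside each rejected record gives you something concrete to take back to the source team, which matters when the lake team is doing the validation on their behalf.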
Catalogue your data
You need to ensure you have a well-structured, easily searchable catalogue for the data you hold in the lake. If the consuming teams are to make the best of the data lake they need to know what data is available, where to find it and how to access it. This is essential for reducing barriers to usage by the consuming teams.
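As a rough sketch of the minimum a catalogue needs to capture, the Python below models an entry that answers the three questions above (what the data is, where it lives and who to ask for access) and a simple keyword search over the entries. The class names, fields and example values are hypothetical; in practice you would more likely adopt an existing catalogue product than build your own.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogueEntry:
    name: str          # what the dataset is called
    description: str   # what the data includes and means
    location: str      # where to find it in the lake (hypothetical path)
    owner: str         # who to ask about access
    tags: List[str] = field(default_factory=list)


class Catalogue:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogueEntry] = {}

    def register(self, entry: CatalogueEntry) -> None:
        """Add or replace the entry for a dataset, keyed by its name."""
        self._entries[entry.name] = entry

    def search(self, term: str) -> List[CatalogueEntry]:
        """Case-insensitive keyword match on name, description and tags."""
        term = term.lower()
        return [
            e for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t.lower() for t in e.tags)
        ]


catalogue = Catalogue()
catalogue.register(CatalogueEntry(
    name="orders_daily",
    description="One row per customer order, loaded nightly",
    location="lake/sales/orders_daily/",
    owner="sales-platform-team",
    tags=["sales", "orders"],
))
matches = catalogue.search("customer")
```

Even a structure this small makes the point: if a dataset has no description, location or owner recorded, a consuming team has no realistic way of discovering or using it.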
We are made by our history
Looking for trends over time is one of the most common ways a business will use its data. However, this means you will need at least a few years’ data under your belt; pushing new data into the lake isn’t enough. If this history isn’t available, the business must accept that there will be a delay before the data lake provides value while the history is built up.
Whilst this is by no means an exhaustive list, these are some of the areas that need to be considered to ensure that a data lake can offer real value to a business. Even once the decision has been made to move forward, the value can be jeopardised if there is too much focus on getting data into the lake as quickly as possible. In my experience, the amount of data made available is not the most important thing when establishing a data lake. What matters is making sure the business knows how to use and access the data, that the data available is well understood and that the value of the data lake has been demonstrated.