While data lakes and data warehouses may at first seem to be competitors, in reality they solve very different problems.
For a long time, businesses have seen data warehouses as the go-to place for business intelligence (BI) reporting, with the primary driver being consistency and accuracy of data across all business functions. Limited by the technologies of the day, a data warehouse moves data through a series of specialised analytics, which predictably has an impact on performance, agility and cost.
The concept of a data lake surfaced in about 2010, underpinned by the growth of the Hadoop market which, according to a recent study from MarketWatch, is expected to exceed $50 billion by 2022.
In simple terms, why would I need one?
If the concept of data lakes had an FAQ section, “what’s the difference between a data lake and a data warehouse?” would be one of the first questions on the list. Data warehouses represent a top-down approach, whereas data lakes represent a bottom-up approach. ‘Top down’ means that in order to build and manage a data warehouse, stakeholders need to know up front all of the questions they want to ask of their data. The data is then optimised to suit specific reporting requirements, which means a lot of up-front design work, and any later change results in significant engineering and design rework.
A data lake gives you unrivalled flexibility over the data you ingest. It lets you ask different questions of your data now and in the future, and add new streams of data into the reporting mix without affecting your data architecture. To support business agility and decision making, it removes obstacles to expanding future reporting, which is especially relevant for businesses that are investing in IoT but lack the technology to capture the volume and velocity of the data it generates.
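To make the top-down and bottom-up distinction concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the sample events, the `orders` table, and the helper names `load_warehouse` and `query_lake` are hypothetical, and a real warehouse or lake would sit on very different technology. The warehouse side fixes its columns before loading (often called schema on write), while the lake side keeps the raw records and applies structure only when a question is asked (schema on read).

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a lake: no schema is enforced.
RAW_EVENTS = [
    '{"order_id": 1, "amount": 20.0, "region": "north"}',
    '{"order_id": 2, "amount": 35.5, "region": "south", "coupon": "SPRING"}',
    '{"device_id": "t-7", "temperature_c": 21.4}',
]


def load_warehouse(raw_lines):
    """Top down: only the fields chosen at design time are loaded."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, amount REAL, region TEXT)"
    )
    for line in raw_lines:
        record = json.loads(line)
        if "order_id" not in record:
            continue  # sensor readings do not fit the design and are left out
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (record["order_id"], record["amount"], record["region"]),
        )
    return conn


def query_lake(raw_lines, field):
    """Bottom up: read the raw records and pick out a field at query time."""
    return [r[field] for r in map(json.loads, raw_lines) if field in r]


warehouse = load_warehouse(RAW_EVENTS)

# A question the warehouse was designed for.
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Questions nobody planned for: the warehouse table has no such columns,
# but the raw data in the lake can still answer them.
coupons = query_lake(RAW_EVENTS, "coupon")
temperatures = query_lake(RAW_EVENTS, "temperature_c")
```

In this sketch, answering the coupon or temperature question from the warehouse would mean changing the table design and reloading the data, which is the rework described above. The lake answers both without any change to how the data was stored.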
A data lake can be characterised by the following attributes:
• Collect everything – Keep both raw and processed data, and never throw anything away.
• Democratise your data – Allow all business users access to the lake, to explore, enrich and draw insight from data.
• Make it fast – Optimise data to be quick to query and curate.
Can I have both?
One of the original reasons for data warehouses was to establish a single source of truth, which could be legally binding and accurate for reconciliation purposes. A data lake in its purest form represents a raw view of data, stored forever using a creative, experimental and free-flowing process.
To get to that legally binding and accurate view, the data still needs to be curated, and for many businesses this still means using a data warehouse that draws from the lake. Whether a data lake is right for your business doesn’t have to be an either-or decision.
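As a rough illustration of a warehouse drawing from a lake, the sketch below curates a handful of raw records into a small, validated table. The sample payments, the `curate` helper, the `payments` table and the rules applied (reject unparseable amounts, skip duplicate IDs) are all hypothetical and chosen only to show the shape of the idea, not a recommended pipeline.

```python
import json
import sqlite3
from decimal import Decimal, InvalidOperation

# Hypothetical raw payment events kept in the lake, including a duplicate
# and a record with an unusable amount.
RAW_PAYMENTS = [
    '{"payment_id": "p1", "amount": "10.00"}',
    '{"payment_id": "p1", "amount": "10.00"}',
    '{"payment_id": "p2", "amount": "oops"}',
    '{"payment_id": "p3", "amount": "4.50"}',
]


def curate(raw_lines):
    """Return (curated rows, rejected raw lines) without altering the input."""
    seen = set()
    curated, rejected = [], []
    for line in raw_lines:
        record = json.loads(line)
        try:
            payment_id = record["payment_id"]
            # Store money as whole minor units to avoid rounding surprises.
            minor_units = int(Decimal(record["amount"]) * 100)
        except (KeyError, InvalidOperation):
            rejected.append(line)  # kept for investigation, not loaded
            continue
        if payment_id in seen:
            continue  # duplicate delivery of the same payment
        seen.add(payment_id)
        curated.append((payment_id, minor_units))
    return curated, rejected


curated, rejected = curate(RAW_PAYMENTS)

# The curated rows go into a warehouse-style table with a fixed design.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE payments (payment_id TEXT PRIMARY KEY, minor_units INTEGER)"
)
warehouse.executemany("INSERT INTO payments VALUES (?, ?)", curated)
total = warehouse.execute("SELECT SUM(minor_units) FROM payments").fetchone()[0]
```

The raw records stay in the lake exactly as they arrived, so the curation rules can be revisited later, while the warehouse table holds the single reconciled view that reports are run against.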
Want to know more about data lakes?
Over the next few months we’ll be producing a series of blog posts, giving you the information you need to help determine the right approach for your business. Stay tuned!