Is the “Data Lake” the Best Architecture to Support Big Data?
In 2011, Forbes magazine published an article titled “Big Data Requires a Big, New Architecture”[1], which defined the concept of a “data lake” as follows:
“The difference between a data lake and a data warehouse is that in a data warehouse, the data is pre-categorized at the point of entry, which can dictate how it’s going to be analyzed.”

In the Forbes article, the justification for a data lake strategy centered on the question of how the data would ultimately be used, the idea being that the data lake would provide more analytical flexibility in the long term.
The problem is that, in the world of big data, we don’t really know what value the data has when it’s initially accepted from the array of sources available to us. We might know some questions we want to answer, but not to the extent that it makes sense to close off the ability to answer questions that materialize later. Therefore, storing data in some “optimal” form for later analysis doesn’t make any sense. Instead, what James Dixon (who coined the term “data lake”) suggests is storing the data in a massive, easily accessible repository based on the cheap storage that’s available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers.

While the cost of storage may be falling (over the last 30 years, space per unit cost has doubled roughly every 14 months[2]), is this reason enough to continue dumping all available data into a centralized repository? Isn’t the so-called “data lake” just the latest iteration of the same scale-up, data-hoarding, “technology will save us” strategies we have seen before?
Even if it is both technically and financially possible, operational implementation of “a massive, easily accessible repository based on the cheap storage” may raise more questions than it resolves.
What about Governance?
In a data lake architecture, how will accountability for data quality, and decision rights around data definition and change, be distributed? How do organizations ensure a consistent operational model for documenting, prioritizing, resolving, and communicating data issues? Is there a mechanism in the data lake architecture that provides line of sight into the data’s lineage, or into the System of Record for the data? How does the data lake architecture ensure that data definitions are consistent, that business rules for creation and usage are documented, and that metadata is published and available to all?
What about Information Lifecycle Management (ILM)?
Assuming storage costs continue to plummet, do we simply continue to dump more and more data into the data lake? Over time, data loses value and accumulates risk. It does not make sense to keep filling the lake without some plan to drain off the data that has become, à la Nate Silver, more noise than signal.
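To make the idea of “draining the lake” concrete, here is a minimal sketch of what such a retention rule might look like; the data classes, retention periods, and field names are illustrative assumptions, not a prescription.

    from datetime import datetime, timedelta

    # Hypothetical retention policy, in days, per class of data.
    RETENTION_DAYS = {
        "clickstream": 365,          # high-volume, low long-term value
        "customer_master": 365 * 7,  # regulated, retained far longer
    }

    def due_for_archival(obj, today):
        """Flag a lake object whose age exceeds its retention period,
        unless it is under legal hold."""
        limit = timedelta(days=RETENTION_DAYS.get(obj["data_class"], 365))
        return (today - obj["ingested_at"]) > limit and not obj.get("legal_hold", False)

    old_events = {"data_class": "clickstream",
                  "ingested_at": datetime(2012, 6, 1),
                  "legal_hold": False}
    print(due_for_archival(old_events, today=datetime(2014, 2, 1)))  # True

The point is not the code itself, but that some explicit, reviewable rule exists for deciding what leaves the lake and when.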
What about Security?
If the data lake consists of both transaction and reference data, if it consists of customer and product and sales data, if it consists of internally sourced as well as 3rd party data, how do we ensure that each data element has the appropriate level of security?
What about structured vs. unstructured data?
Do a million Tweets have the same value as a thousand customer records? Unstructured data fundamentally requires more metadata (When? Why? Who?) to understand its context and value. Are there linkages between the data lake and the metadata repository?
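As a rough sketch of the kind of context metadata an unstructured object needs before it earns its place in the lake, consider the record below; the fields and the storage path are hypothetical, chosen only to show the When/Why/Who linkage between a lake object and a metadata repository entry.

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class LakeObjectMetadata:
        object_uri: str        # where the raw object sits in the lake (hypothetical path)
        source_system: str     # Who produced it?
        captured_at: datetime  # When was it captured?
        business_purpose: str  # Why was it captured?
        owner: str             # Who is accountable for it?
        classification: str    # e.g. "public", "internal", "restricted"
        tags: list = field(default_factory=list)

    tweet_batch = LakeObjectMetadata(
        object_uri="lake://raw/social/tweets/2014-02-01.json",
        source_system="twitter-firehose",
        captured_at=datetime(2014, 2, 1, 12, 0),
        business_purpose="brand sentiment analysis",
        owner="marketing-analytics",
        classification="public",
        tags=["social", "unstructured"],
    )

Without at least this much context, a million Tweets are just bytes; with it, they can be weighed against that thousand customer records.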
What about internal data vs. 3rd party data?
Data and insights that are internally generated about our customers may be our greatest treasures. These are the differentiators with which businesses drive market share. Should this data be included in the data lake?
An Alternative to the Data Lake
At the end of the day, data consumers do not care where the data is housed. Their real concerns center on the availability of the data they need and on the fit-for-use quality of that data.
These needs can be met in a real way with a robust metadata repository.
For example, Google, vast as it may be, does not store the entire Internet on its servers in Mountain View. What it does store is metadata: keywords, page rank, page title, inbound links, URL/domain, and other Search Engine Optimization (SEO) metrics, which Google uses to serve up the available data that best fits the user’s stated need. In fact, Google’s ability to collect, analyze, and exploit metadata is the special sauce that makes Google the giant it is.
This example is repeated at Amazon. Although Amazon offers practically everything in the known universe for sale, its warehouses do not store all those goods. Amazon brings product descriptions, images, reviews, and prices together on its pages, but it uses a distributed supply chain (including partner vendors) to deliver those products.
A big advantage for Amazon, however, is that it manages and ships not only its own inventory, but also that of other retailers such as Eddie Bauer and Target, giving it an economy of scale that dwarfs its rivals. As it stands, Amazon can currently ship some 10 million products, compared with Walmart’s 500,000, according to Internet Retailer.[3]

These powerful lessons can be directly applied to how other enterprises manage their data.
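As a purely illustrative sketch of “store the metadata, not the data,” the toy catalog below holds only descriptions and pointers; the data itself stays in whichever system of record already owns it. All dataset names, locations, and fields are hypothetical.

    # A toy metadata catalog: entries describe data and point to it,
    # but the catalog never copies the data itself.
    catalog = [
        {
            "dataset": "customer_master",
            "description": "golden record of active customers",
            "keywords": {"customer", "crm", "master"},
            "system_of_record": "CRM",
            "location": "jdbc:oracle://crm-prod/customers",  # a pointer, not a copy
            "owner": "sales-ops",
        },
        {
            "dataset": "web_clickstream",
            "description": "raw site activity events",
            "keywords": {"web", "clickstream", "behavior"},
            "system_of_record": "web analytics platform",
            "location": "hdfs:///raw/clickstream/",
            "owner": "digital-marketing",
        },
    ]

    def find_datasets(term):
        """Return catalog entries whose description or keywords mention the term."""
        term = term.lower()
        return [e for e in catalog
                if term in e["description"] or term in e["keywords"]]

    for entry in find_datasets("customer"):
        print(entry["dataset"], "->", entry["location"], "owned by", entry["owner"])

Like Google’s index or Amazon’s product pages, the value is in the lookup: the consumer finds what exists, who owns it, and where it lives, without anyone having to pour it all into one lake first.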
Where the data lake architecture is in play, and as that lake grows wider and deeper, adding a metadata repository turns a collection of arbitrary data objects into something intelligible at human scale.
Where a data lake is not present, focus on the creation of a comprehensive metadata repository still has advantages to the enterprise in terms of increased standardization, data integrity, change management, and owner accountability.
Metadata has the power to make all data, regardless of source, available and understandable to all data consumers, regardless of role, and regardless of the scale of the data lake.
Conclusion
A well-constructed metadata repository will allow the enterprise to leap-frog over the data lake and empower the delivery of Data as a Service, Analytics as a Service, advanced analytics, self-service BI, self-service data provisioning and Data Science sandbox provisioning.
A maturing metadata program is also crucial to resolving the issues around governance, ILM, security, and the management of unstructured and 3rd party data.
Metadata transparency is basic to democratizing and extracting value from Big Data.
Combine this Transparency with the Trust engendered by a Data Quality program and the Discipline of Data Governance, and any enterprise can have a winning Big Data Management strategy.
For more on the importance of Transparency, Trust, and Discipline, come see Scott Lee and me at the Strata conference in Santa Clara on February 12, 2014.