Data Lake, Data Reservoir, Data Dump…Blah, Blah, Blah…
As is typical of many (but not all) technology vendors, analysts and analyst firms, there is a rush to come up with the “right” name for which they can claim origination honors. From the Gartner slide below (see Figure 1), it seems that Gartner is trying to coin the term “Data Reservoir” – instead of “Data Lake” – to describe this new big data architectural approach. My response: who cares?
Who cares what it’s called? The fact that every technology vendor and IT analyst is out there trying to coin their own favorite term only dooms us to delaying the most important discussion – how do we leverage this “data thingie” to uncover customer, product and operational insights that we can use to differentiate our customer engagements, optimize key business processes and uncover new monetization opportunities?
The Data Warehouse versus Data Marts Battle All Over Again
Several decades ago, a battle raged between Data Warehouse advocates (associated with Bill Inmon and the Corporate Information Factory) and Data Mart advocates (associated with Ralph Kimball and star schemas). Countless hours were wasted and lives lost at trade shows, seminars and in conference rooms across the world debating which approach was the “right” approach. As a reminder:
- Data Warehouse or Enterprise Data Warehouse (EDW) is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decisions. The enterprise data warehouse approach is often characterized as a top-down approach, more in alignment with the OLTP or transactional systems from which the data was sourced. The data warehouse typically has an enterprise-wide perspective.
- Data Mart is a subset of the data warehouse that is oriented to a specific business function or a single department. This enables each department to use, manipulate and develop its data any way it sees fit, without altering information inside other data marts or the enterprise data warehouse. Data marts use the concept of “conformed dimensions” to integrate data across business functions, replicating in many ways the same data that is captured in the enterprise data warehouse.
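To make “conformed dimensions” concrete, here is a minimal sketch using SQLite with hypothetical table and column names: two departmental fact tables share one product dimension, so their numbers can be compared side by side.

```python
import sqlite3

# Hypothetical schemas: two departmental marts (sales and returns) share
# one conformed product dimension, so results line up across marts.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE sales_fact (product_key INTEGER, revenue REAL);     -- sales mart
CREATE TABLE returns_fact (product_key INTEGER, units INTEGER);  -- support mart
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO sales_fact VALUES (1, 100.0), (2, 250.0);
INSERT INTO returns_fact VALUES (1, 3), (2, 1);
""")

# Because both facts reference the same conformed dimension, revenue and
# returns can be reported together at the category grain.
rows = con.execute("""
SELECT d.category, SUM(s.revenue) AS revenue, SUM(r.units) AS returned
FROM dim_product d
JOIN sales_fact s   ON s.product_key = d.product_key
JOIN returns_fact r ON r.product_key = d.product_key
GROUP BY d.category
""").fetchall()
print(rows)  # [('Hardware', 350.0, 4)]
```

The key design point is that neither mart redefines “product” – both borrow the shared dimension, which is what keeps cross-functional reporting honest.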
Interesting factoid: both worked!! There were certainly architectural and deployment differences between the two approaches, but the bottom line is that they both required the same key capabilities including:
- Capturing large amounts of historical data that can be used to analyze the performance of the key business entities (dimensions) and identify trends and patterns in the data
- Data governance procedures and policies to ensure that the data stored in the data warehouse and data marts was 100% accurate
- Master data management to ensure common definitions, terminology and nomenclature across the lines of business
- Ability to join or integrate data from different data sources coming from different business functions
- End-user query construction (using SQL and Business Intelligence tools) that supported 1) the generation of daily, weekly, monthly and quarterly reports and dashboards, as well as 2) the ad-hoc slicing and dicing of the data – drilling up, down and across different data sources – to identify areas of over- and under-performance.
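The “drill up, drill down” capability above is just re-aggregating the same facts at different grains. A minimal sketch, with hypothetical fact rows:

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, month, region, revenue)
facts = [
    (2014, 1, 1, "West", 120.0),
    (2014, 1, 2, "West", 80.0),
    (2014, 1, 1, "East", 95.0),
]

def rollup(rows, keyfunc):
    """Aggregate revenue at whatever grain keyfunc defines."""
    totals = defaultdict(float)
    for row in rows:
        totals[keyfunc(row)] += row[-1]  # revenue is the last field
    return dict(totals)

# Drill up to the quarter grain, drill down to the month grain.
by_quarter = rollup(facts, lambda r: (r[0], r[1]))
by_month = rollup(facts, lambda r: (r[0], r[1], r[2]))
print(by_quarter)  # {(2014, 1): 295.0}
print(by_month)    # {(2014, 1, 1): 215.0, (2014, 1, 2): 80.0}
```

BI tools hide this behind a point-and-click interface, but the underlying operation is the same grouping at a coarser or finer key.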
I first published on the opportunity for a Hadoop-based “data store” upon which an organization’s data warehouse and advanced analytics environments could be built in my August 2012 blog “The Data Warehouse Modernization Act”. As I wrote in that blog:
The Hadoop Distributed File System (HDFS) provides a powerful yet inexpensive option for modernizing Operational Data Store (ODS) and Data Staging areas. HDFS is a cost-effective large storage system with an intrinsic computing and analytical capability (MapReduce). Built on commodity clusters, HDFS simplifies the acquisition and storage of diverse data sources, whether structured, semi-structured (web logs, sensor feeds), or unstructured (social media, image, video, audio). Once in the Hadoop/HDFS system, MapReduce and commercial Hadoop-based tools are available to prepare the data for loading into your existing data warehouse. As I discussed previously in my “Understanding the Role of Hadoop In Your BI Environment” blog, the ability to “define schema on query” versus “define schema on load” simplifies amassing data from a variety of sources, even if you are not sure when and how you might use that data later (see Figure below).
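The schema-on-query idea can be sketched in a few lines, independent of Hadoop itself. In this toy example (field names are hypothetical), raw records land in the store untouched, and a schema is only imposed when someone actually queries them:

```python
import json

# Schema-on-read sketch: raw records are ingested as-is, junk included.
raw_store = [
    '{"user": "a", "clicks": "3"}',                  # semi-structured web log
    '{"user": "b", "clicks": "7", "ref": "email"}',  # extra field, no problem
    'not json at all',                               # junk survives ingestion too
]

def query(store, schema):
    """Apply a schema (field -> type) at read time, skipping bad rows."""
    for line in store:
        try:
            rec = json.loads(line)
            yield {field: cast(rec[field]) for field, cast in schema.items()}
        except (ValueError, KeyError):
            continue  # with schema-on-read, bad rows cost nothing until queried

rows = list(query(raw_store, {"user": str, "clicks": int}))
print(rows)  # [{'user': 'a', 'clicks': 3}, {'user': 'b', 'clicks': 7}]
```

Contrast this with schema-on-load, where the malformed third record would have been rejected (or would have broken the load) before anyone decided whether the data was worth keeping.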
I didn’t bother to create a new name for this “data store” because to me, it was very much like the data stores that we’ve historically used, except that it leveraged new technologies (Hadoop and HDFS) to handle massive data volumes from an incredibly wide variety of data sources (structured and unstructured data; internal and external data), and it was ridiculously inexpensive and almost infinitely scalable.
So instead of wasting time debating whether it’s a data lake or a data reservoir or a data dump or whatever, let’s focus the discussion and debate on the more important questions and challenges. To me, these are the three biggest challenges right now for the Data Lake/Reservoir/Dump/Hub/Store discussion:
- Data Governance
- Metadata Management and Enhancement
- Information Factory vs. Data Alchemy
Data Governance in a Big Data World
Traditional data governance approaches are not going to work in a world where data is being pulled (or merely indexed) from a multitude of sources by analysts and data scientists in an attempt to determine how valuable that data might be. Rachel Haines articulated a solution to this problem in the presentation she gave about data governance in a big data world (see Figure 2).
Rachel identified three classes of data, all of which have different data governance requirements:
- Governed Data – Key business data understood in regard to ownership, definition, business rules, lineage, quality target(s), and classification. Typically, governed data will be included in conformed data warehouses in addition to the data lake.
- Lightly governed Data – Data understood in regard to definition and lineage, but not necessarily controlled with respect to quality or usage. Data may, or may not, be included in conformed data warehouses.
- Ungoverned Data – Data only understood in regard to definition and location. Ungoverned data may or may not physically exist in the data lake and may exist only in the data catalog as metadata pointers to external data.
As data transitions up the value curve (i.e., as data is deemed more valuable or useful in its ability to monitor the business or predict key business performance), it moves from ungoverned data to lightly governed data to governed data. I think this is a very clean way to think about data governance in a big data world.
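The three tiers and the promotion up the value curve can be encoded directly. This is only a sketch of the idea, with my own hypothetical names for the levels:

```python
from enum import IntEnum

# Hypothetical encoding of the three governance tiers described above;
# data is promoted one tier at a time as it proves its value.
class Governance(IntEnum):
    UNGOVERNED = 0        # definition and location only
    LIGHTLY_GOVERNED = 1  # + lineage, but quality/usage not enforced
    GOVERNED = 2          # + ownership, business rules, quality targets

def promote(level: Governance) -> Governance:
    """Move one step up the value curve, capping at fully governed."""
    return Governance(min(level + 1, Governance.GOVERNED))

level = Governance.UNGOVERNED
level = promote(level)  # data proved useful -> lightly governed
level = promote(level)  # data feeds key decisions -> governed
print(level.name)  # GOVERNED
```

The point of the ordering is that governance effort follows demonstrated value: nothing forces the full governance regime onto data that may never be used.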
Critical Importance of Metadata
We just completed a project for a high-tech manufacturer in the area of improving quality, testing effectiveness and on-time shipments. As part of our envisioning process, our data science team came across a variable (PO_Type) that was highly predictive of on-time shipments. After spending time verifying and validating the predictive capabilities of this variable, we were told that this was a “non-revenue item” and therefore not important to the analysis.
Two things struck me at this moment:
- How come we didn’t have the metadata ahead of time that would have steered us away from this variable?
- But maybe more importantly, what if we had enhanced metadata on PO_Type? While PO_Type may not be a good predictor of immediate revenue, some values of that variable (such as evaluation, prototype and NPI test) may be highly important in longer-term strategic sales opportunities, or even design wins where our client is trying to get its products designed into the products of its key partners (see Figure 3).
In either case, well-understood metadata is good; missing metadata either wastes time or, maybe even more importantly, misses opportunities to find variables that may be highly predictive of key monetization opportunities.
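A minimal sketch of what such enhanced metadata might look like for PO_Type – the structure and the value-to-analysis mappings here are hypothetical illustrations, not the client’s actual metadata:

```python
# Hypothetical enhanced metadata for the PO_Type variable: beyond name and
# definition, record which analyses each value is relevant to, so a data
# scientist is steered toward (or away from) the variable up front.
po_type_metadata = {
    "name": "PO_Type",
    "definition": "Purchase order type code",
    "values": {
        "standard":   {"revenue_item": True,  "relevant_to": ["on-time shipments"]},
        "evaluation": {"revenue_item": False, "relevant_to": ["design wins"]},
        "prototype":  {"revenue_item": False, "relevant_to": ["design wins"]},
        "NPI test":   {"revenue_item": False, "relevant_to": ["design wins"]},
    },
}

def relevant_values(meta, analysis):
    """Return the values of this variable flagged as relevant to an analysis."""
    return [value for value, info in meta["values"].items()
            if analysis in info["relevant_to"]]

print(relevant_values(po_type_metadata, "design wins"))
# ['evaluation', 'prototype', 'NPI test']
```

With metadata like this in the catalog, the “non-revenue item” dead end is visible before weeks are spent validating the variable – and the design-win angle is visible too.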
Information Factory vs. Data Alchemy
For me, this is the most difficult challenge and something with which I continue to wrestle – how do the roles, responsibilities and expectations change in a big data world where data science is such a critical and scarce skill? I had first-hand experience on my most recent engagement with a “fake data scientist” – that is, someone who said all the right words and had all the right credentials on their resume, but really struggled in a world of “data alchemy” – a world where the data scientist wrestles with a multitude of different data sources, trying to tease out insights or “ah-has” buried in those data sources (see Figure 4).
I love the term “data alchemy” to describe the data science process – lots of exploration, testing, failure, more exploration, more testing and more failure, until they start uncovering something that might be interesting; then more exploration, more testing and more failure. It couldn’t be more different from the “Information Factory” approach that has been associated with the development of an enterprise data warehouse.
I’m going to be writing on this data science/data alchemy topic frequently over the next several months because it is a topic that I need to learn more about.
In the end, no one should care what we call this thing that some are calling a “data lake.” Instead of creating a false and distracting discussion that adds NO value to the big data and data science challenges, let’s just pick a name (I’ve come to terms with “Data Lake,” even though I find it not very useful) and move on to the bigger discussions, debates and industry advancement.
An information factory is a logical architecture that relies on a data warehouse linked to assorted other components that help a business use its data and optimize its use of internal resources (improve business monitoring and optimize decision making).