Why Do I Need A Data Lake?

By Bill Schmarzo, August 12, 2015

The data lake is gaining lots of momentum across the different customers to whom I talk.  Every, and I mean every, organization wants to learn why and how to implement a data lake.  But “because it is a cheaper way to store/manage data” is not a good reason to adopt a data lake.  The “Why do I need a data lake?” answer is much more powerful than just having the IT organization save some money.

The data lake is a powerful data architecture that leverages the economics of big data (where it is 20x to 50x cheaper to store, manage and analyze data as compared to traditional data warehouse technologies). And new big data processing and analytics capabilities help organizations address business and operational challenges that were difficult to address using conventional Business Intelligence and data warehousing technologies.

The data lake has the potential to transform the business by providing a singular repository of all the organization’s data (structured AND unstructured data; internal AND external data) that enables your business analysts and data science team to mine all of the organizational data that today is scattered across a multitude of operational systems, data warehouses, data marts and “spreadmarts”.

Analytics Hub and Spoke Service Architecture

The value and power of a data lake are often not fully realized until we get into our second or third analytics use case.  Why is that?  Because it is at that point where the organization needs the ability to self-provision an analytics environment (compute nodes, data, analytic tools, permissions, data masking) and share data across traditional line-of-business silos (one singular location for all the organization’s data) in order to support the rapid exploration and discovery processes that the data science team uses to uncover variables and metrics that are better predictors of business performance.  The data lake enables the data science team to build the predictive and prescriptive analytics necessary to support the organization’s different business use cases and key business initiatives.

Joe Dossantos, the head of the EMC Global Services Big Data Delivery team, termed this a “Hub and Spoke” analytics environment where the data lake is the “hub” that enables the data science teams to self-provision their own analytic sandboxes and facilitates the sharing of data, analytic tools and analytic best practices across the different parts of the organization (see figure 1).


Figure 1: Analytics Hub and Spoke Service Architecture

The hub of the “Hub and Spoke” architecture is the data lake.  The data lake has the following characteristics:

  • Centralized, singular, schema-less data store with raw (as-is) data as well as massaged data
  • Mechanism for rapid ingestion of data with appropriate latency
  • Ability to map data across sources and provide visibility and security to users
  • Catalog to find and retrieve data
  • Costing model of centralized service
  • Ability to manage security, permissions and data masking
  • Supports self-provisioning of compute nodes, data, and analytic tools without IT intervention
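
To make a couple of these characteristics concrete (the catalog to find and retrieve data, and the management of permissions and data masking), here is a minimal Python sketch. Every name in it (`Dataset`, `DataLakeCatalog`, `register`, `find`, `read`, the `pii_fields` attribute) is hypothetical and invented for this illustration; a real data lake would rely on dedicated catalog and security tooling rather than an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """One entry in the (hypothetical) data lake catalog."""
    name: str
    source: str
    tags: set
    pii_fields: set   # fields that must be masked for most users
    rows: list        # stand-in for the raw data itself


class DataLakeCatalog:
    """Toy catalog: register datasets, find them by tag, read them with masking."""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset):
        self._datasets[dataset.name] = dataset

    def find(self, tag):
        # Return the names of all datasets carrying the given tag.
        return sorted(d.name for d in self._datasets.values() if tag in d.tags)

    def read(self, name, can_see_pii=False):
        # Mask sensitive fields unless the caller is explicitly permitted to see them.
        ds = self._datasets[name]
        if can_see_pii:
            return [dict(row) for row in ds.rows]
        return [
            {key: ("***" if key in ds.pii_fields else value) for key, value in row.items()}
            for row in ds.rows
        ]


catalog = DataLakeCatalog()
catalog.register(Dataset(
    name="pos_sales",
    source="point-of-sale",
    tags={"sales", "internal"},
    pii_fields={"customer_email"},
    rows=[{"store": "S1", "amount": 19.99, "customer_email": "a@example.com"}],
))
catalog.register(Dataset(
    name="weather",
    source="external feed",
    tags={"external"},
    pii_fields=set(),
    rows=[{"city": "Berlin", "temp_c": 21}],
))

sales_datasets = catalog.find("sales")    # discover data by tag
masked_rows = catalog.read("pos_sales")   # default read hides customer_email
```

The point of the sketch is only that discovery and masking are handled by the hub itself, so each spoke does not have to reinvent them.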

The spokes of the “Hub and Spoke” architecture are the resulting analytic use cases that have the following characteristics:

  • Ability to perform analytics (data scientist)
    • Analytics sandbox (HDFS, Hadoop, Spark, Hive, HBase)
    • Data engineering tools (Elastic Search, MapReduce, YARN, HAWQ, SQL)
    • Analytical tools (SAS, R, Mahout, MADlib, H2O)
    • Visualization tools (Tableau, DataRPM, ggplot2)
  • Ability to exploit analytics (application development)
    • 3rd platform application (mobile app development, web site app development)
    • Analytics exposed as services to applications (APIs)
    • Integrate in-memory and/or in-database scoring and recommendations into business process and operational systems

The Analytics “Hub and Spoke” architecture enables the data science team to develop the predictive and prescriptive analytics that are necessary to optimize key business processes, provide a differentiated customer engagement and uncover new monetization opportunities.

Beware The False Prophets!

You know something must be on the right track when the market incumbents are working so hard to either discredit it or spread confusion about it.  And that seems to be the case for the data lake.  Lots of vendors, press and analysts are trying to position the data lake as just an extension to the data warehouse; as data warehouse 2.0.  And with that sort of thinking, we risk repeating many of the fatal mistakes we made with data warehousing.

Confusion #1:  “Feed the Data Lake from the Data Warehouse.”

That’s ridiculous and is being pushed by traditional data warehouse vendors as the most appropriate use of the data lake.  Sorry, but that’s like inventing the jet engine and then saying that you’re going to pull it with a horse and buggy.

Loading data into a data warehouse means that someone has already made assumptions about what data, level of granularity and amount of history is important.  You have to make those assumptions in order to pre-build the data warehouse schema.  And that means that the raw data has already gone through data transformations (and content elimination) in order to get the data to fit into the data warehouse schema.  Lots of assumptions being made a priori about what data, data granularity and data history is important when the only purpose of the data warehouse is to report on what happened!!  That’s like going wine tasting and swabbing Vaseline on your tongue!  Many of the valuable nuances in the data have been removed in order to aggregate the data to fit into a reporting-centric data schema.
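
A small Python sketch can illustrate the point about lost nuance. The transaction records and field names below are made up for the example. Once the data has been aggregated to a grain chosen in advance (here, day and store), a question at a finer grain (here, transactions per customer) can only be answered from the raw data.

```python
from collections import defaultdict

# Hypothetical raw point-of-sale transactions, as they would land in the data lake.
raw_transactions = [
    {"day": "2015-08-10", "store": "S1", "customer": "C1", "amount": 5.0},
    {"day": "2015-08-10", "store": "S1", "customer": "C1", "amount": 7.0},
    {"day": "2015-08-10", "store": "S1", "customer": "C2", "amount": 40.0},
    {"day": "2015-08-11", "store": "S1", "customer": "C2", "amount": 38.0},
]

# Warehouse-style load: the grain (day, store) is decided a priori,
# and everything below that grain is summed away.
daily_store_sales = defaultdict(float)
for t in raw_transactions:
    daily_store_sales[(t["day"], t["store"])] += t["amount"]

# A question the data science team asks later: how often does each customer buy?
# daily_store_sales no longer contains the customer field, so this can only
# be computed from the raw transactions.
transactions_per_customer = defaultdict(int)
for t in raw_transactions:
    transactions_per_customer[t["customer"]] += 1
```

The aggregated table is perfectly good for reporting what happened per day and store, but the customer-level behavior is gone from it.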


Figure 2: Data Lake Architecture

As you can see in figure 2, the data lake sits in front of the data warehouse to provide a data repository that can leverage the “economics of big data” (where it is 20x to 50x cheaper to store, manage and analyze data as compared to traditional data warehousing technologies) to store any and all data (structured AND unstructured; internal AND external) that the organization might want to leverage.  What are the benefits of having the data lake in front of the data warehouse?

  • Rapid ingest of data because the data lake captures data “as-is”; that is, it does not need to create a schema before capturing the data.
  • Un-handcuffing the data science team from having to try to do their analysis on the overly-expensive, overly-taxed data warehouse.
  • Supporting the data science team’s need for rapid exploration, discovery, testing, failing, learning and refining of the predictive and prescriptive analytics that power the organization’s key business processes and enable new business models.
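
The first benefit above, capturing data “as-is” without creating a schema first, is often called schema-on-read. Here is a minimal Python sketch of the idea, assuming JSON records and an in-memory list standing in for the lake’s storage; the function names `ingest` and `read_with_schema` are hypothetical and exist only for this illustration.

```python
import json

landing_zone = []  # stand-in for raw files stored in the data lake


def ingest(raw_line):
    """Capture the record as-is: no schema, no validation, no transformation."""
    landing_zone.append(raw_line)


def read_with_schema(schema):
    """Apply a schema at read time: keep the requested fields, cast them,
    and use None where a record does not carry a field."""
    for line in landing_zone:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


# Ingestion does not care that the second record has an extra field.
ingest('{"user": "u1", "clicks": "3"}')
ingest('{"user": "u2", "clicks": "5", "device": "mobile"}')

# Two different analyses impose two different schemas on the same raw data.
clicks_view = list(read_with_schema({"user": str, "clicks": int}))
device_view = list(read_with_schema({"user": str, "device": str}))
```

Because nothing is discarded at ingest time, a field that nobody planned for (here, `device`) is still available when a later analysis needs it.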

The additional benefits of this architecture:

  • Provides an analytics environment where the data science team is free to explore new data sources and new analytic techniques in search of those variables and metrics that may be better predictors of business performance
  • Frees up expensive data warehouse resources and opens up SLA windows by off-loading the ETL processes from the data warehouse and putting those processes into the natively parallel, scale-out, less expensive data lake

Clearly having the data lake in front of the data warehouse is a win-win for both the data warehouse administrators and the data science organization.

Confusion #2:  “Create multiple data lakes.”

Oh, the creation of multiple data warehouses and multiple supporting data marts has worked out soooo well for the world of data warehousing.  Disparate, duplicated data warehouses and data marts are a debilitating problem in the world of data warehouses. Not only does this hinder the sharing of data across departments and lines of business, but more importantly it causes confusion and a lack of confidence by senior management in the data.  How can senior management be confident that they are dealing with the “right” data when every business unit or business function has created their own data warehouse?

The result: silo’ed data and no easy way (or willingness) to share data across the business units.

For the data lake to be effective, an organization should deploy only ONE data lake; a singular repository where all of the organization’s data – whether the organization knows what to do with that data or not – can be made available.  Organizations such as EMC are leveraging technologies such as virtualization to ensure that a single data lake repository can scale out and meet the growing analytic needs of the different business units – all from a single data lake.

Do me a big data favor and scold anyone who starts talking about data lakes (plural) instead of a data lake.

Confusion #3:  Dependent upon IT to manually allocate analytic sandboxes.

Why insert a human-intensive IT intermediary into a process that can easily be managed, controlled and monitored by the system?  The data science team needs to be free to explore new data sources and new analytic techniques without adding a labor-intensive middle step to have someone allocate the analytic sandbox environment.  IT as a Service, baby!  This seems more like a control issue than a technology issue, so fight IT’s urge to control the data science creative process.


The data lake is a game-changer not because it saves IT a whole bunch of money, but because the data lake can help the business make a whole bunch of money!  Do not get caught up in the ability to build a data lake; instead, focus on how the data lake can “Make me more money.”


10 thoughts on “Why Do I Need A Data Lake?”

  1. Great points made in this article. Would love to see articles on actual structure and design ideas, this would help answer some of the many questions and concerns from some of those existing warehouse owners.

  2. There are a lot of thought-provoking points made in this article but also a lot of assumptions. For example, in para 3 you say the Data Science teams will be able to mine across the organisation’s whole data assets, but if you have ever tried data mining uncleaned raw data, including setting up links across datasets for the first time, you will know that can be a non-trivial task which costs time and money. Even if you have extremely skilled data modellers and manipulators you will be starting each exploration more or less from scratch. If the work results in new algorithms and reports that can be reused then it might pay, but unless you are MI5 the exploratory data science team in any organisation is likely to be a smaller subset of analytical staff who generally want speed and consistency in MI. The article has started me thinking and I might start a blog post to address some of the issues you raise and link back to you. Thanks.

  3. Matt, thanks for your feedback and observations. Totally agree, and in fact, your comments have also given me the incentive to write another blog to address many of the points you raised above. I’m probably guilty of over-hyping the business value of the data lake without providing details about the challenging and tedious role of things such as metadata management, data governance and security that are necessary to operationalize the data lake.

    Thanks again Matt for reading and commenting on the blogs!

  4. I absolutely love this one. There must be quite a few customers in HR approaching you for such lakes now that they see that their stockpile of resumes is just noise 🙂

  5. I have to respectfully disagree with this viewpoint. Surely the data lake is distributed and not centralised. It has to be managed as if it was centralised, which is a much tougher challenge. Equating a data lake to 1 Hadoop cluster is just not enough. I already have clients with multiple Hadoop systems. Data is on the cloud, on Hadoop, in DWs, in MDM systems, in NoSQL data stores. There are several reasons why data may be kept apart, including legal reasons, data being too big to move etc. So surely we must manage it across multiple data stores in a distributed data lake.

    • Mike, thanks for your comments. No, we don’t put the data lake on a single Hadoop cluster, but I think your point is that the data lake is more of a logical creation than a physical creation. The key point is that the data lake is a single place – logically or physically – to put the organization’s data.

      So if you have to leave your Point of Sales data for your Germany retail locations in Germany for legal reasons, then it is located in a single repository and it isn’t also stored across multiple other repositories in Germany. If a business then needs to create a European-wide view of sales performance, then they would access and analyze that data within the Germany data lake extension and then integrate/aggregate the Germany results with the results of the other European countries.

      Data is stored and accessed from a single (logical) point so that we can eliminate data silos and the proliferation of data that gets out of sync, two of the items that have greatly hindered the credibility of the data warehouse.

  6. Excellent read! However, if you google “why do we need a data lake”, you find another link (actually the results show them just below yours). I would highly recommend that you read the article for the blatant attempt at plagiarism by this individual. He’s literally lifted verses from this article and claimed them as his own (or at least he’s not put any references to this original one anywhere in his blog post).

    Here’s the link:

    • Nikhil, thanks for alerting me. I’ll send them a note to either reference me and my blog, or remove the material. Ugh.

      What I love about social communities is that we get a chance to build upon the works of others. I do it all the time, but I make sure to give credit and reference any one else’s material. It’s easy to do and a great way to build community.

      Thanks again!