Weaving Data Hay Into Business Gold

By Bill Schmarzo, March 19, 2015

The data lake is certainly becoming a hot discussion topic in most of my client meetings nowadays.  Just recently, I had several clients in a meeting where they raised the concern that adding more data to the data lake would just make it harder for them to “find the needle in the haystack.”  When one is building out the data lake, one does not want to just “dump” more data into the data lake.  That will lead to a data “swamp.”

There are a number of activities – many of which are activities that we’ve been doing for years, heck even decades, as part of a solid Enterprise Information Management (EIM) discipline – that need to take place in order to ensure that your data lake doesn’t become a data dump, or swamp.  Activities and disciplines such as data cataloging, metadata development and management, auditing / traceability / lineage and graduated levels of data governance (e.g., heavily governed, lightly governed, not governed) are just a few that need to be covered in order to ensure that the data in the data lake is discoverable, usable and (relatively) accurate for the analytic needs.

Figure 1 below lays out some of the components that we think need to be covered as an organization builds out their data lake.

Figure 1:  Data Lake Architecture

However, I want to challenge “find a needle in the haystack” as the wrong analogy for the data lake.  That is a data warehouse / Business Intelligence way of thinking about analysis: slicing and dicing the data haystack trying to find needles.  Instead, I want you to “think different” and contemplate the story of Rumpelstiltskin as a better analogy for uncovering the business value buried in the data lake.  Let me explain.

Rumpelstiltskin: A Big Data Lesson

The story of Rumpelstiltskin is about a miller who lies to the king, telling him that his daughter can spin straw (hay) into gold.  The daughter is forced to spin the straw into gold three times or the king will cut off her head; if she succeeds, the king will marry her instead.  Since she can’t really spin straw into gold, a strange imp-like creature offers to spin the hay into gold in exchange for something of value.  The first time he is paid with a necklace, the second time with a ring, but by the third time the girl has run out of items of value, so she is forced to promise the imp her first-born child.

When their first child is born, the imp returns to claim his payment.  The now-queen offers him all the wealth she has if she may keep the child, but the imp has no interest in her riches.  He finally consents to give up his claim to the child if the queen can guess his name within three days.  After failing for two days to guess his name, she wanders out into the woods and comes across the imp hopping around a fire and singing, “Tomorrow, tomorrow, tomorrow, I’ll go to the king’s house; nobody knows my name; I’m called Rumpelstiltskin.”

And, well, you can figure out the rest of the story.

Data discovery is like trying to “find a needle in a haystack”; however, data science with a data lake is more like trying to “weave data hay into business gold.”  So instead of thinking about the data lake as this haystack from which you are trying to find needles, think instead about the data lake as the loom for your data where you weave data hay into business gold.

Data Lake, Data Science and Scores

Let’s take this analogy one more step.  One of my favorite data science books, “Moneyball”, advocates that:

[Data Science] is about finding variables that are better predictors of performance

For many organizations, the data science team creates predictive “scores” that help them better predict what’s important to their business.  Probably the best example of a predictive “score” is the FICO Score (see Figure 2).


Figure 2:  FICO Score Example

The FICO score (FICO is an acronym for Fair Isaac Corporation) is a type of credit score that plays a substantial role in the credit reports lenders use to assess an applicant’s credit risk and decide whether to extend a loan.  Using mathematical models, the FICO score takes into account various factors including payment history, current level of indebtedness, types of credit used, length of credit history, and new credit.  A person’s FICO score will range between 300 and 850.  In general, a FICO score above 650 indicates that the individual has a very good credit history.  People with scores below 620 will often find it substantially more difficult to obtain financing at a favorable rate.
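To make the idea of a composite predictive score concrete, here is a minimal Python sketch of a FICO-style weighted score.  This is not the proprietary FICO model; the category weights follow FICO’s publicly described breakdown (35% payment history, 30% amounts owed, 15% length of credit history, 10% credit mix, 10% new credit), and the normalized 0–1 factor values for the applicant are hypothetical inputs.

```python
# Illustrative sketch of a FICO-style weighted score -- NOT the actual
# proprietary FICO model.  Category weights follow FICO's published
# breakdown; factor values (each normalized to [0, 1]) are hypothetical.

WEIGHTS = {
    "payment_history": 0.35,
    "amounts_owed": 0.30,
    "length_of_history": 0.15,
    "credit_mix": 0.10,
    "new_credit": 0.10,
}

def credit_style_score(factors, lo=300, hi=850):
    """Map weighted factor values (each in [0, 1]) onto a 300-850 range."""
    weighted = sum(WEIGHTS[name] * value for name, value in factors.items())
    return round(lo + (hi - lo) * weighted)

applicant = {
    "payment_history": 0.95,   # mostly on-time payments
    "amounts_owed": 0.70,      # moderate utilization
    "length_of_history": 0.60,
    "credit_mix": 0.80,
    "new_credit": 0.90,
}
print(credit_style_score(applicant))  # -> 741
```

The point of the sketch is the pattern, not the numbers: a predictive score compresses many raw variables into one business-friendly number on a fixed scale.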

There are opportunities for your data scientists to create these predictive “scores” across a number of different industries to support what’s important to your business.  For example, a financial services firm may want to create a “Retirement Readiness” score for each of its clients that takes into consideration current net worth, current and projected value of their home, current and projected annual income, savings rate, spending patterns, number of dependents (children and parents), etc.  The firm may want to balance this “Retirement Readiness” score with a “Risk Tolerance” score that measures how much financial and investment risk the client is willing to bear, which could draw on information such as age, years to retirement, number of dependents, years at current job, job title, location and behavioral classifications gleaned from on-line gambling and investment patterns.  The combination of the “Retirement Readiness” and “Risk Tolerance” scores gives the financial advisor the necessary insights at the individual customer level to make the most appropriate investment and budgeting decisions.  Figure 3 shows other potential scores across different industries.

Figure 3:  Potential Score Candidates by Industry
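As a hypothetical sketch of how two such scores could be combined into an action, the snippet below maps a “Retirement Readiness” score and a “Risk Tolerance” score (both assumed here to be on a 0–100 scale) onto a coarse equity-allocation suggestion.  The thresholds, the 0–100 scale, and the function name are illustrative assumptions, not part of the article.

```python
# Hypothetical sketch: combining a "Retirement Readiness" score and a
# "Risk Tolerance" score (both assumed on a 0-100 scale) into a coarse
# portfolio recommendation.  Thresholds and scales are illustrative.

def recommend_allocation(readiness, risk_tolerance):
    """Suggest an equity percentage from the two client scores."""
    if readiness >= 70:
        base_equity = 40          # on track: lean toward preserving capital
    elif readiness >= 40:
        base_equity = 60          # somewhat behind: needs some growth
    else:
        base_equity = 75          # far behind: growth-oriented
    # Cap the equity exposure by what the client can tolerate,
    # e.g. a tolerance of 50 allows at most 50% equity.
    return min(base_equity, risk_tolerance)

print(recommend_allocation(readiness=35, risk_tolerance=50))  # -> 50
```

The design choice the sketch illustrates is the one in the text: “Retirement Readiness” drives what the client needs, while “Risk Tolerance” bounds what the client can bear.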

Data Science and the Data Lake

One of the most important benefits of the data lake is the enablement of your data science team.  The data lake frees the data science team from being handcuffed by the limitations of the data warehouse.  The data in the data warehouse has been optimized for Business Intelligence reporting and dashboards – aggregate tables, indices and materialized views as part of a pre-defined data schema designed to address the business monitoring needs of the organization.  The Business Intelligence and data warehouse environment is focused on understanding “What happened?”

The data science team is trying to do something different; they are trying to predict what might happen and make evidence-based recommendations as to what actions or decisions the customers and front-line employees should take based upon those predictions of what might happen.

Figure 4:  Business Data Lake


The data lake is going to be a hot and critical topic over the next 18 to 24 months.  And there will be lots of temptation to let the conversation digress into a technology-only conversation.  However, one of the most important benefits of the data lake is enabling your data science team to mine and enrich the data looking for those better predictors of performance.  The primary goal of the data lake, from a business perspective, is to think different and think Rumpelstiltskin: to enable your data science team not just to find needles in haystacks, but to “weave the data hay into business gold.”



4 thoughts on “Weaving Data Hay Into Business Gold”

  1. Hi Bill,

Thanks for your excellent post, it’s always very insightful to read!

In your definition of a data lake it sounds like the lake is largely an analytics sandbox for running experiments to “mine the gold”. In my client discussions I have seen a lot of different facets of how people define a lake, ranging from a simple ETL replacement, to a trusted source of certain data domains (largely an ODS replacement), to a full-blown big data production environment for building scaled applications on top of the lake. What is your view on the different areas of application of the lake? What is allowed vs. what should we avoid because the lake concept is not designed for it?

  2. Matthias, thanks for the response. What I am seeing as Data Lake use case #1 is setting up the data lake to free the data science team from being dependent upon the data warehouse for its data. The natural curiosity-driven, iterative, fail-fast cycle that the data science team follows to uncover insights in the data is hindered by the data warehouse’s production schedules and SLAs. The data science team needs its own data environment, but needs to use much of the same data that is already stored in the data warehouse (though at a lower level of granularity in many cases). The solution: store ALL of that data in the data lake first, and then feed the data warehouse from the data lake (instead of from the source systems). The data science team can then pull the data it needs from the data lake instead of depending upon the good graces of the data warehouse manager.

    Use case #2 is ETL off-loading, like you discussed. That’s typically a big win for the data warehouse team, as it frees up expensive data warehouse cycles; ETL can consume anywhere from 40% to 60% of the data warehouse’s processing capacity.

    Use case #3? Don’t know…yet. But I wouldn’t be surprised to see companies start re-platforming their data warehouses to Hadoop and the data lake. Relational database management systems still have advantages over the data lake in supporting data warehouse processing requirements, but those advantages are quickly disappearing. And the inherent ability of the data lake (Hadoop) to process blocks of records instead of a single record at a time (which RDBMSs do to perfection) means to me that the data lake is the ultimate data processing platform for both advanced analytics and data warehousing alike.

    But more on that topic later…

  3. Bill – Great post, wonderful insight, a pleasure to read. Good to refresh the architecture, BDL, etc. Do we have an enterprise usage case to demonstrate to our customers during an EBC?

    Very Best
    MURALE Narayanan