Data Warehousing Lessons for a Data Lake World
Over the past two decades, we have spent considerable time and effort trying to perfect the world of data warehousing. We took the technology we were given and the data that would fit into that technology, and tried to provide our business constituents with the reports and dashboards necessary to run their businesses.
It was a lot of hard work, and we had to perform many “unnatural” acts to get these OLTP (Online Transaction Processing)-centric technologies to work: aggregated tables, a plethora of indices, user-defined functions (UDFs) in PL/SQL, and materialized views, just to name a few. Kudos to us!!
Now as we get ready for the full onslaught of the data lake, what lessons can we take away from our data warehousing experiences? I don’t have all the insights, but I offer this blog in hopes that others will comment and contribute. In the end, we want to learn from our data warehousing mistakes, but we don’t want to throw out those valuable learnings.
Why Did Data Warehousing Fail?
Below is the list of areas where data warehousing struggled or outright failed. Again, this list is not comprehensive, and I encourage your contributions.
- Adding New Data Takes Too Long. Loading new data into the data warehouse simply took too long; the rule of thumb was three months and $1 million per new data source. Because a schema had to be pre-built before data could be loaded, adding a new source was a major effort. We had to conduct weeks of interviews with every potential user to capture every question they might ever want to ask, just to build a schema that handled all of their query and reporting requirements. This greatly hindered our ability to quickly explore new data sources, so organizations resorted to other options, which leads to…
- Data Silos. Because it took so long to add new data sources to the data warehouse, organizations found it more expedient to build their own data marts, spreadmarts, or Access databases. Very quickly there was a widespread proliferation of these purpose-built data stores across the organization. The result: no single version of the truth and lots of executive meetings wasting time debating whose version of the data was most accurate, which leads to…
- Lack of Business Confidence. Because there was this proliferation of data across the organization and the resulting executive debates around whose data was most accurate, business leaders’ confidence in the data (and the data warehouse) quickly faded. This became especially true when the data being used to run a business unit was redefined for corporate use in such a way that it was no longer useful to the business. Take, for instance, a sales manager looking to assign a quota to his rep who manages the GE account and wants a report of historical sales. For him, sales might be Gross and GE might include Synchrony, whereas the corporate division might look at sales as Net or Adjusted and GE as its legal entities. It’s not so much a question of right and wrong as it is the enterprise introducing definitions that undermine confidence, which leads to…
- Underinvestment In Metadata. No business leader had the time to verify the accuracy of the data, and no IT person knew the business well enough to make those data accuracy decisions. Plus, spending the money to hire consultants to do our job for us was always a hard internal sell, which leads to the metadata management denial cycle:
- IT: “You business users need to own the data.”
- Business: “We don’t have time to do that.”
- IT: “Okay, let’s hire consultants.”
- Business: “Shouldn’t we know our data better than consultants?”
- IT: “Okay, you business users need to own the data.”
- And so forth…
- Inability to Easily Share Data. The data warehouse lacked the ability to quickly ingest and consequently easily share data across different business functions and use cases. The data warehouse failed to become that single repository for the organization’s data assets because of the complexity, difficulty, and slowness of adding new data, which leads to…
- Shadow IT Spend. Nothing confirms the failure of the data warehouse more than shadow IT spend. Business users did not have confidence in how the data warehouse could help them address urgent business needs. Consequently, many line of business leaders pursued their own one-off IT initiatives (call center operations, sales force automation, campaign marketing, logistics planning, financial planning, etc.), which also further contributed to the unmanageable proliferation of data across the organizational data silos.
- Inability to Handle Unstructured Data. Data warehouses cannot handle unstructured data. Unfortunately, the bulk of the world’s data is now found in semi-structured data (log files, sensors, beacons, routers, MAC addresses) and unstructured data (text files, social media postings, audio files, photos, video files). Organizations that wanted a holistic view of the business had to make do with only 10 to 20% of the available organizational data. It’s hard to provide a holistic view with an 80% to 90% hole in it.
- No Predictive Analytic Capabilities. Business Intelligence solutions provide the summarized data necessary to support the organization’s operational and management reporting needs (descriptive analytics). However, most data warehouses lacked the detailed data across a wide variety of structured and unstructured data sources to support the organization’s predictive and prescriptive analytic needs.
- Too Damned Expensive. Data science is about creating behavioral analytics at the individual level (customers, employees, jet engines, locomotives, trucks, wind turbines, etc.). To uncover these behavioral analytics at the individual level, data scientists need the complete history of detailed transactional, operational, and engagement data. The data scientists don’t want 13 months of aggregated data; they want 17 years of detailed transactions, even if that data is now located on mag tape. Trying to gather all of that voluminous data in a data warehouse is a recipe for organizational bankruptcy.
- Inadequate Processing Power. Let’s face it; data warehouses lacked the economical processing power necessary to analyze petabytes of customer and machine data to uncover behavioral patterns and propensities. The data lake, built on modern, big data scale-out environments using open-source software on commodity servers, is a game changer, allowing organizations to store and analyze data volumes magnitudes bigger than one could ever economically fit into a data warehouse.
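The schema-on-write bottleneck behind several of the failures above (and the schema-on-read alternative the data lake offers) can be sketched with a short, hypothetical example: raw semi-structured records land as-is, and structure is applied only at query time. The field and device names below are invented for illustration.

```python
import json

# Hypothetical raw log lines landed in the data lake as-is (schema-on-read):
# no weeks of interviews or up-front schema design before ingestion.
raw_events = [
    '{"ts": "2016-03-01T10:00:00", "device": "router-7", "status": "up"}',
    '{"ts": "2016-03-01T10:05:00", "device": "sensor-3", "temp_c": 21.4}',
    '{"ts": "2016-03-01T10:07:00", "device": "router-7", "status": "down"}',
]

def query_status(events, device):
    """Apply structure at read time: pull only the fields this question needs."""
    return [
        (rec["ts"], rec["status"])
        for rec in map(json.loads, events)
        if rec.get("device") == device and "status" in rec
    ]

print(query_status(raw_events, "router-7"))
# [('2016-03-01T10:00:00', 'up'), ('2016-03-01T10:07:00', 'down')]
```

Note that the sensor record with a different shape coexists harmlessly alongside the router records; in a schema-on-write warehouse, that mixed shape would have forced a new table design before anything could be loaded.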
What Did Data Warehousing Get Right?
Okay, I was pretty harsh on the data warehouse world in which I grew up. But again, it was amazing what we were able to do with technology designed to deal with single records (insert, update, delete). I have never constructed analytics that use only a single record. Analytics requires a massive number of records in order to uncover individual behaviors, propensities, tendencies, patterns, etc.
So what did we get right, and what should we preserve as we move into the modern data lake world?
- Data Governance. Data governance, into which I also group things like data accuracy, data lineage and data traceability, is as important now as it was in the data warehouse world. Having a process that allows the data science team to quickly ingest and explore the data unencumbered by data governance is a good practice. However you will need data governance rules, policies and procedures once you have determined that there is value in that data to support key decisions. If the business users do not have confidence in the data, then all is lost.
- Metadata Management. The importance of metadata only becomes clearer as we begin to integrate data and analytics into the organization’s key business processes. The more metadata we have about the data, the easier it is to get value from that data. Investing in the associated metadata carries the same economic value as investing in the data itself, IMHO. We want to enrich the data as much as possible, and a solid metadata management strategy is key to making that happen.
- Conformed Dimensions. Having a single master file – or conformed dimension – for key business entities (e.g., products, customers, employees, physicians, teachers, stores, jet engines, locomotives, delivery trucks, etc.) is critical. It is these conformed dimensions that allow the data science team to tie together the wide variety of data sources to create the detailed analytic and behavioral profiles. Maintaining these conformed dimensions is hard work, but without them, there is no way to turn all this valuable data (and metadata) into actionable insights.
- Single Version of The Truth. While I have always hated the term “single version of the truth,” operationally it is important to have all the data about your key business entities in a single (physical or logical) location. Also, in the Big Data world, the notion of data that is fit for purpose becomes critical. There may not be one truth, but there should be clarity as to how numbers are produced to provide transparency and trust.
- Analytics Self-service. The idea of creating a self-service environment around analytics is very powerful. How do I pull IT out of the middle of the analytics request and provisioning process? If I truly want to create an environment where analysts can quickly spin up an analytics sandbox and populate it with data, I can’t have heavy manual steps in the middle of that workflow.
- Reports Starting Point. The many reports and dashboards that have been built upon your data warehouse are a great starting point for your data lake journey. Business users requested those reports for a reason. Instead of spending time and effort creating yet more reports, first try to understand what questions and decisions the business users hoped to address with those reports, and what additional predictive and prescriptive insights they need beyond them.
- SQL Accessibility. Yeah, SQL is still the query language of choice, and we need to embrace helping SQL-trained analysts use that skill on the data lake. Open-source tools like Hive, Impala, and HAWQ are all designed to give that army of SQL-trained business users and analysts access to the wealth of data in the data lake.
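To make the conformed-dimension idea above concrete, here is a minimal, hypothetical sketch (table and column names are invented) showing how a single customer dimension lets plain SQL tie together fact data from two different business functions. The example uses Python’s built-in sqlite3 so it is runnable anywhere, but the same SQL pattern applies on Hive or HAWQ over the data lake.

```python
import sqlite3

# Hypothetical conformed customer dimension shared by two fact sources:
# sales transactions and support-call counts.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales   (customer_key INTEGER, amount REAL);
CREATE TABLE fact_support (customer_key INTEGER, calls INTEGER);
INSERT INTO dim_customer VALUES (1, 'GE'), (2, 'Acme');
INSERT INTO fact_sales   VALUES (1, 500.0), (1, 250.0), (2, 100.0);
INSERT INTO fact_support VALUES (1, 3), (2, 7);
""")

# One conformed dimension joins both fact tables into a single
# per-customer behavioral view.
rows = cur.execute("""
    SELECT d.name,
           SUM(s.amount) AS total_sales,
           MAX(c.calls)  AS support_calls
    FROM dim_customer d
    JOIN fact_sales   s ON s.customer_key = d.customer_key
    JOIN fact_support c ON c.customer_key = d.customer_key
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(rows)
# [('Acme', 100.0, 7), ('GE', 750.0, 3)]
```

Without the shared `customer_key`, each fact source would define “customer” its own way, and this cross-functional profile (the raw material for behavioral analytics) could not be assembled.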
There is much that can be learned from our data warehousing experiences. The key is to understand what to keep and what to throw out. That means a single data lake (not multiple data lakes). That means data governance. That means metadata management, and much more that we learned along the way. We must learn from our experiences, otherwise…
“Those who do not learn history are doomed to repeat it.”
Note: Spreadmart (short for “spreadsheet data mart”) is a business intelligence term that refers to the propensity of some organizations, or departments within organizations, to use individual, desktop-based databases like spreadsheets as a primary means of data organization, storage, and dissemination.