Don’t Build a Data Lake on Data Warehouse Infrastructure
In my last blog, I introduced a framework for understanding the dimensions of change to address as part of a big data transformational journey. While the first blog addressed the importance of aligning to business strategy and implementing meaningful changes to drive business processes and financial outcomes, I would now like to shift my attention to the technology capabilities that are new and different in a big data world. Specifically, I would like to focus on the elements of Technical Infrastructure required to support a big data environment.
Why the paradigm of the data warehouse no longer works
The data warehouse revolutionized information management in the 1990s through maniacal standardization. The most mature IT organizations boasted one brand of BI tool targeted at one brand of relational database fed by one brand of ETL. This enabled infrastructure standardization as well, usually resulting in a single on-premises Storage Area Network powered by increasingly powerful compute capabilities. While the most modern of these solutions have given way to in-memory solutions, the basic tenets have remained: the criticality of standardization, the perpetual quest to increase query performance, and the fundamental desire to slow the onboarding of extraneous data sets, which take time to standardize, adversely affect system performance, and drive up the cost of storage.
While this paradigm will remain central to management reporting, where quality and accuracy remain paramount, it has not proven effective for more dynamic predictive modeling environments. Complementary big data technologies look to turn this model on its head in several ways, each of which has significant impacts on the physical architecture:
Integration without Standardization
One of the key principles of big data is the concept of schema on read rather than schema on write, or as my colleague Bill Schmarzo calls it, schema on load vs. schema on query. The implication is that many data schemas may be loaded side by side rather than transformed; one often hears this referred to as ELT rather than ETL. This is an important principle in that it allows very simple introduction of new data sets, so you can experiment with queries across them using dynamic foreign key relationships before investing the time to formally integrate them. The implication from an infrastructure point of view is a change in the paradigm of compute power required to query across these data schemas. Power and performance shift from the nightly batch window to the minute-by-minute queries interrogating the system, so compute must be significantly more robust and scalable based on system activity. Also, whereas in the data warehousing world indexing could greatly improve the performance of frequently used queries, the nature of predictive analytics queries is (ironically) much less predictable; consequently, performance improvements are generally achieved by increasing parallelization and possibly storing information in multiple physical locations. As a general rule, Hadoop stores three copies of any piece of data, although I would be remiss if I did not take this opportunity to plug EMC’s Isilon product, which reduces this footprint to only two copies.
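The schema-on-read idea can be sketched with nothing more than the Python standard library (the data sets and field names below are hypothetical): two raw feeds are loaded side by side exactly as they arrive, and the foreign key relationship between them is declared only at query time, not at load time.

```python
import json

# Two raw data sets loaded side by side, as-is -- no upfront schema design.
# In a real lake these would be files in HDFS; here they are JSON lines.
orders_raw = [
    '{"order_id": 1, "cust": "C100", "amount": 250.0}',
    '{"order_id": 2, "cust": "C200", "amount": 75.5}',
]
clickstream_raw = [
    '{"customer": "C100", "page": "/pricing", "ms_on_page": 5400}',
    '{"customer": "C300", "page": "/home", "ms_on_page": 1200}',
]

# Schema on read: structure is only interpreted when we query.
orders = [json.loads(line) for line in orders_raw]
clicks = [json.loads(line) for line in clickstream_raw]

# A "dynamic" foreign key: orders.cust joins clickstream.customer,
# a relationship declared at query time rather than load time.
joined = [
    {**o, **c}
    for o in orders
    for c in clicks
    if o["cust"] == c["customer"]
]
print(joined)
```

The point of the sketch is that neither data set was transformed to fit a target schema before loading; the cost of that integration is deferred until the experiment proves the join is worth keeping.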
Inclusion of Data that “might” be important
Because of the cost of data integration and storage in conventional data warehouses, the business had to prove a data set's value before it could be included in scope. In a big data world, there is a strong desire to test the value of data sets through iterative analysis. Along with the schema-on-read capability outlined above, this creates a new ability to flexibly gain insight from previously unavailable data sets. It will drive up the total amount of data collected, increasing the storage footprint by orders of magnitude, especially as these data elements begin to include machine data and other information from the Internet of Things (IoT). In the golden age of social media, third-party data sets, wearables, and sensors, the available data sets are growing exponentially, increasing the likelihood that a business will want to try out a data set that did not exist only months earlier. The implication for infrastructure is a strong desire to decrease the cost per byte of data, either by moving to lower-cost (often commodity) gear or by evaluating cloud providers. Also, query speed, so critical in the data warehouse world, may matter less for experimental analyses than inclusiveness of data sets. In short, IT organizations want more data, but it must come at a much lower price point.
Meaning from Ambiguity
While ambiguity is unacceptable in the data warehouse world, the big data world recognizes that the world is a messy place and answers are not always so clear. This ambiguity might appear in the form of text. Consider the following two sentences:
“Acme Corp treated me like a child. They yelled at me like I was a Kindergartener.”
“Acme Corp made me feel like a kid again. We were screaming like Kindergarteners.”
While these sentences are similar in the words they use, one is a scathing criticism while the other is a ringing endorsement. Big data analytics demands that you understand the difference, relying on tools well beyond SQL to do so. The same challenge exists in the realm of pictures and videos. For example, what might this picture convey?
…where one could reasonably conclude that when people wear baseball caps, it increases the likelihood of a specific action. Analyses like this require photo recognition tools, as well as skills that may not be resident in the current data management organization.
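Returning to the two sentences above: a naive bag-of-words sentiment scorer (the tiny lexicon below is purely illustrative) assigns both sentences the exact same score, which is precisely why big data analytics needs tools that understand context, not just word counts and SQL.

```python
import re

# Purely illustrative sentiment lexicon; real tools use far richer models.
LEXICON = {"yelled": -1, "screaming": -1, "treated": 0, "feel": 0}

def naive_score(text):
    """Bag-of-words scoring: sum lexicon weights, ignoring all context."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

criticism = ("Acme Corp treated me like a child. "
             "They yelled at me like I was a Kindergartener.")
endorsement = ("Acme Corp made me feel like a kid again. "
               "We were screaming like Kindergarteners.")

# Both sentences score identically: word counts alone cannot separate
# a scathing criticism from a ringing endorsement.
print(naive_score(criticism), naive_score(endorsement))
```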
With the new ability to interpret unstructured data comes the need to ingest these types of data, which often require a significantly larger footprint than the rows and columns of a relational database. Along with this need will come a storage footprint that resembles current content and knowledge management systems rather than data warehouses: mostly file based rather than relational. This creates a corresponding need for file-based technical infrastructure as well. You will, of course, still require in-memory and NoSQL technologies to act upon your insights, but these file-based systems will become the new standard for analytic discovery.
Implications for Your Technical Architecture
- Scale out compute alongside your scale out storage
Be prepared to have elastic compute that can scale up and down independently of your storage footprint. This elasticity is paramount to query performance in a world of unpredictable query patterns.
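As a sketch of what "performance through parallelization" means in practice, here is a toy Python example (the partition contents are made up) that answers an ad hoc filter by fanning a full scan out across partitions and merging the partial results, with no precomputed index in sight:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition, predicate):
    """Full scan of one partition -- no index to lean on for ad hoc queries."""
    return [row for row in partition if predicate(row)]

# Toy "distributed" data set: three partitions of sensor readings.
partitions = [
    [{"sensor": "a", "temp": 71}, {"sensor": "b", "temp": 99}],
    [{"sensor": "c", "temp": 64}, {"sensor": "d", "temp": 102}],
    [{"sensor": "e", "temp": 88}],
]

def is_hot(row):
    # An unpredictable, ad hoc predicate that arrives at query time.
    return row["temp"] > 90

# Fan the scan out across partitions, then merge the partial results.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = pool.map(scan_partition, partitions, [is_hot] * len(partitions))
hot_readings = [row for part in partials for row in part]
print(hot_readings)
```

Because the query pattern cannot be anticipated, throughput comes from adding workers (compute) rather than from tuning an index, which is why compute must scale independently of storage.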
- File based storage vs. Relational Database Storage
Modern data analytics tools allow for the creation of meaning from ambiguity. Facebook can identify your friends in pictures. Companies can derive sentiment from Twitter. The implication is a greater need to store high-volume, file-based information for analysis. This, coupled with modern tools that allow structured data to be queried in unstructured data stores, such as Hive, Drill, HAWQ, Impala, Spark SQL, and, most recently, Presto, means that file storage can now handle both structured and unstructured data, whereas the reverse is not true for relational databases, even high-performing MPP or in-memory solutions.
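To make the "file storage handles both" claim concrete, here is a small stand-in (the file name and records are hypothetical) where a single file-based store of JSON lines serves both a structured, GROUP-BY-style aggregate and a free-text scan over the same records; engines like Hive or Spark SQL provide this dual access at cluster scale.

```python
import json
import os
import tempfile

# Hypothetical support tickets: structured fields and free text side by side.
records = [
    {"ticket_id": 1, "priority": "high", "note": "App crashes on login"},
    {"ticket_id": 2, "priority": "low", "note": "Feature request: dark mode"},
    {"ticket_id": 3, "priority": "high", "note": "Login page crashes after update"},
]

# One file-based store (JSON lines), standing in for files in a data lake.
path = os.path.join(tempfile.mkdtemp(), "tickets.jsonl")
with open(path, "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in records)

with open(path) as f:
    rows = [json.loads(line) for line in f]

# Structured access: the kind of aggregate a SQL-on-files engine would run.
high_priority = sum(1 for r in rows if r["priority"] == "high")

# Unstructured access over the very same file: a free-text scan of the notes.
crash_tickets = [r["ticket_id"] for r in rows if "crash" in r["note"].lower()]

print(high_priority, crash_tickets)
```

A relational database could hold the structured fields, but the reverse direction, pouring arbitrary files into rows and columns, is where the relational model breaks down.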
- Inexpensive storage footprint for infrequently used data
High-end gear will continue to be paramount in the data warehouse world. The mission-critical nature of these environments implies risk and possible loss of revenue should they slow down, lose key information, or fail. The past two years have seen strong interest in in-memory solutions and caching to boost performance, as well as advances in high availability, including real-time replication. These come at a cost, but are critical to meet demanding business needs.
Hadoop data lakes operate under a different principle: bringing in any potentially useful information implies a lower-cost storage footprint, such as scale-out NAS that can store massive amounts of file data at very low cost. As outlined above, performance will be achieved through a focus on compute rather than on expensive in-memory solutions. I admit that I am oversimplifying a complex topic, especially with the emergence of technologies like Spark, which aim to put Hadoop in memory. In my next blog, I will explore this point a bit further, going into more detail on the latency requirements of a lake; suffice it to say that some of the vanguard Hadoop distributions have been built on low-cost, even commodity hardware to enable low-cost analytic discovery. These principles help establish the infrastructure required for data analytics. My next blog will continue this exploration of technology, looking at other workloads required in big data and their implications for latency requirements.