Big Data Management – The Lab and the Factory
Big Data management requires at least two very distinct data environments, and possibly two distinct organizational structures, to serve the different participants.
A few months ago I saw a tweet that clarified this for me in the context of architecting Big Data solutions. Interestingly, I somewhat misunderstood the tweet, but both the original article and my own reading of the tweet make for a worthwhile Big Data management discussion.
The Tweet – The Lab and the Factory
“To Work with Data, You Need a Lab and a Factory” read the tweet from @CraigMilroy. The body of the tweet doesn’t even refer to “Big” Data, but it links to his blog, and from there to the Harvard Business Review article discussed below, and it carries the hashtags #bigdata, #analytics, and #DataScientist.
This started me thinking that in “Big Data” management we need a “Lab” or “Sandbox” environment that is very dynamic and can be used by the Data Scientists to throw in or throw away massive amounts of structured and unstructured data against which to do analysis, find patterns and insights, and develop models.
But then, we need to take those models and create an operational “Information Factory” with all the good production processes we’ve learned around data access security and high volume efficiency to produce insight and trigger action on an on-going basis. This “Factory” also needs to be able to process structured data, unstructured data, and data streams, thus requiring a Big Data architecture that will include, among other things: relational and NoSQL databases, unstructured data stores, and in-memory databases, as well as the ability to process and trigger action.
The need for these very distinct environments had never been clear to me previously, even though I had worked in financial services risk management environments that do this type of work.
Article by Thomas C. Redman and Bill Sweeney in Harvard Business Review
The actual article referred to in this tweet was more focused on organizational structures, as Harvard Business Review articles tend to be, than on data architecture solutions.
“Companies that aim to score big over the long term with big data must do two very different things well. They must find interesting, novel, and useful insights about the real world in the data. And they must turn those insights into products and services, and deliver those products and services at a profit.”
The article discusses the need to create separate departments for the Data Scientists (the Lab) and for the Factory, each with very different goals, management, and processes, while keeping good communication between the two.
“In their search for new insights, data scientists write enormous quantities of code. But it is not designed to meet commercial standards for scalability, security, and stability. You create and support commercial-grade code in the factory.”
“The [Factory] requires many more people with a wider variety of skill sets, a more rigid environment, and different sorts of metrics…. To be clear, creativity and experimentation are important in the factory, but you must not expect more than incremental thinking and production-oriented solutions.”
The Lab and the Factory
It had never really dawned on me that managing Big Data effectively requires both two distinct data management environments and two distinct groups in the organization. Attempting to support both kinds of work in the same environment will lead to either insufficient controls or stifled creativity. Attempting to manage both types of people with the same style, goals, and metrics will similarly do a disservice to one or both groups.
Thus, we need a “Lab” environment and organization for Data Scientists trying to identify new patterns and create new models, and a “Factory” environment and organization in order to turn those discoveries into operational insight and action.
Extreme Transaction Processing
The ability to trigger action based on the processing of incoming information is called “Complex Event Processing” or even, depending on the speed of turnaround, “Extreme Transaction Processing.” Examples include credit card companies attempting to identify fraudulent use of credit and energy utilities trying to manage fluctuations in power usage.
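To make the idea of triggering action on incoming events a little more concrete, here is a minimal Python sketch of one very simple rule a credit card "Factory" might run: flag a card when too many transactions arrive within a short time window. Everything in it is an assumption made for illustration (the names CardEvent and VelocityRule, the thresholds, and the sample events); real fraud detection combines many rules and models, and this is not how any particular product works.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class CardEvent:
    """Hypothetical incoming transaction event."""
    card_id: str
    timestamp: float  # seconds, from any consistent clock
    amount: float


class VelocityRule:
    """Flags a card when more than max_events arrive within window_seconds."""

    def __init__(self, max_events: int = 3, window_seconds: float = 60.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        # card_id -> timestamps of events still inside the window
        self._recent = defaultdict(deque)

    def process(self, event: CardEvent) -> bool:
        """Return True if this event should trigger an action (e.g. an alert)."""
        window = self._recent[event.card_id]
        window.append(event.timestamp)
        # Discard timestamps that have aged out of the window.
        while window and event.timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_events


def run_demo():
    """Feed a small, made-up stream through the rule and collect the alerts."""
    rule = VelocityRule(max_events=3, window_seconds=60.0)
    events = [
        CardEvent("card-A", 0.0, 20.0),
        CardEvent("card-A", 10.0, 35.0),
        CardEvent("card-B", 12.0, 5.0),
        CardEvent("card-A", 20.0, 500.0),
        CardEvent("card-A", 25.0, 750.0),  # fourth card-A event within 60 seconds
        CardEvent("card-A", 200.0, 15.0),  # earlier events have expired by now
    ]
    return [(e.card_id, e.timestamp) for e in events if rule.process(e)]


if __name__ == "__main__":
    print(run_demo())
```

With the sample stream above, only the fourth card-A transaction (at 25 seconds) is flagged. In Lab-and-Factory terms, the thresholds are the sort of thing Data Scientists would discover in the Lab, while the loop that applies them to every incoming event belongs in the Factory.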
For more on “Managing Data In Motion” and “Big Data Integration,” check out my new book.