Big Data Management – The Lab and the Factory

By April Reeve July 31, 2013

Big Data management requires availability of at least two very distinct data environments, and possibly organizational structures, for the different participants.

A few months ago I saw a tweet that clarified this for me in the context of architecting Big Data solutions. Interestingly, I somewhat misunderstood the tweet, but both the original article and my own thoughts on the tweet make for a worthwhile Big Data management discussion.

The Tweet – The Lab and the Factory

“To Work with Data, You Need a Lab and a Factory” read the tweet from @CraigMilroy. The body of the tweet doesn’t even refer to “Big” Data, but points to his blog and then to the Harvard Business Review article referenced below and the hashtags #bigdata, #analytics, and #DataScientist.

This started me thinking that in “Big Data” management we need a “Lab” or “Sandbox” environment that is very dynamic and can be used by the Data Scientists to throw in or throw away massive amounts of structured and unstructured data against which to do analysis, find patterns and insights, and develop models.

But then, we need to take those models and create an operational “Information Factory” with all the good production processes we’ve learned around data access security and high-volume efficiency to produce insight and trigger action on an ongoing basis. This “Factory” also needs to be able to process structured data, unstructured data, and data streams, thus requiring a Big Data architecture that will include, among other things: relational and NoSQL databases, unstructured data stores, and in-memory databases, as well as the ability to process and trigger action.

The need for these very distinct environments had never been clear to me previously, even though I had worked in financial services risk management environments that do this type of work.

Article by Thomas C. Redman and Bill Sweeney in Harvard Business Review

The actual article referred to in this tweet was more focused on organizational structures, as Harvard Business Review tends to be, than on data architecture solutions.

“Companies that aim to score big over the long term with big data must do two very different things well. They must find interesting, novel, and useful insights about the real world in the data. And they must turn those insights into products and services, and deliver those products and services at a profit.”

The article discusses the need to create separate departments for the Data Scientists (the Lab) and the Factory, each with very different goals, management, and processes, but with good communication established between them.


“In their search for new insights, data scientists write enormous quantities of code. But it is not designed to meet commercial standards for scalability, security, and stability. You create and support commercial-grade code in the factory.”

“The [Factory] requires many more people with a wider variety of skill sets, a more rigid environment, and different sorts of metrics…. To be clear, creativity and experimentation are important in the factory, but you must not expect more than incremental thinking and production-oriented solutions.”

The Lab and the Factory

It had never really dawned on me previously that effectively managing Big Data requires two distinct data management environments as well as two distinct groups in the organization. Attempting to manage both types of data in the same environment will lead to either insufficient controls or stifled creativity. Attempting to manage both types of people with the same style, goals, and metrics will similarly do a disservice to one or both groups.

Thus, we need a “Lab” environment and organization for Data Scientists trying to identify new patterns and create new models, and a “Factory” environment and organization in order to turn those discoveries into operational insight and action.
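To make the hand-off from Lab to Factory a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the scoring formula, the names lab_risk_score, harden, and factory_risk_score, and the particular checks are all hypothetical. The idea is simply that the Lab's quick prototype stays untouched while the Factory wraps it with input validation and an audit trail.

```python
import logging
import math
from typing import Callable

logger = logging.getLogger("factory")


def lab_risk_score(amount: float, prior_defaults: int) -> float:
    """Hypothetical Lab prototype: quick to write, no input checks."""
    return min(1.0, 0.1 * prior_defaults + amount / 100_000)


def harden(model: Callable[[float, int], float]) -> Callable[[float, int], float]:
    """Wrap a Lab model with the kind of checks a Factory would expect."""

    def scored(amount: float, prior_defaults: int) -> float:
        # Reject inputs the prototype never had to handle.
        if not math.isfinite(amount) or amount < 0:
            raise ValueError(f"invalid amount: {amount!r}")
        if prior_defaults < 0:
            raise ValueError(f"invalid prior_defaults: {prior_defaults!r}")
        score = model(amount, prior_defaults)
        # Leave an audit trail for every score produced.
        logger.info(
            "scored amount=%s prior_defaults=%s score=%s",
            amount, prior_defaults, score,
        )
        return score

    return scored


# The Factory version reuses the Lab logic unchanged.
factory_risk_score = harden(lab_risk_score)
```

In this sketch the Data Scientists can keep changing lab_risk_score freely, while the production controls live in one place on the Factory side.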

Extreme Transaction Processing

The ability to trigger action based on the processing of incoming information is called “Complex Event Processing” or even, depending on the speed of turnaround, “Extreme Transaction Processing.” Examples of this include credit card companies attempting to identify fraudulent use of credit and energy utilities trying to manage power usage fluctuations.
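As a rough illustration of the credit card example, the sketch below applies one simple rule to a stream of events: raise an alert when a card is used more than a set number of times within a short window. This is only a toy version of the idea, not how any particular product or card issuer does it. The names CardEvent, VelocityRule, and run, and the thresholds, are hypothetical, and the code assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CardEvent:
    """A single card transaction; all fields are hypothetical."""
    card_id: str
    amount: float
    timestamp: float  # seconds since some reference point


class VelocityRule:
    """Flags a card when too many transactions arrive within a short window."""

    def __init__(self, max_events: int = 3, window_seconds: float = 60.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        # card_id -> timestamps still inside the sliding window
        self._recent = defaultdict(deque)

    def process(self, event: CardEvent) -> bool:
        """Return True when this event pushes the card over the threshold."""
        window = self._recent[event.card_id]
        window.append(event.timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and event.timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_events


def run(events, rule, on_alert):
    """Feed events through the rule and trigger the action for each match."""
    for event in events:
        if rule.process(event):
            on_alert(event)
```

For example, calling run with four events for the same card at 0, 10, 20, and 30 seconds, a VelocityRule(max_events=3, window_seconds=60.0), and a list's append method as on_alert would record a single alert, on the fourth event.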

Check out my new book for more on “Managing Data In Motion” and “Big Data Integration.”

About April Reeve

With 25 years of experience as an enterprise architect and program manager, April fully deserves her Twitter handle: @Datagrrl.

She knows data extremely well, having spent more than a decade in the financial services industry where she managed implementations of very large application systems.

April is a Data Management Specialist as part of EMC Global Services, with expertise in Data Governance, Master Data Management, Business Intelligence, Data Warehousing Conversion, Data Integration and Data Quality. All of these add up to one simple statement: April is very good at helping large companies organize their data and capture value from it. April works for EMC Consulting as a Business Consultant in the Enterprise Information Management practice.

2 thoughts on “Big Data Management – The Lab and the Factory”

  1. Hello April,
    Good insights. The concept of lab and factory makes a lot of sense and was actually practiced by financial services on a smaller scale for many decades right under everyone’s nose. Spreadsheets were the lab for all manner of experiments in creating new instruments and models. Only when these models were proven worthy were some attempts made to extract the logic out of the spreadsheets into some kind of robust programming model. In many cases the conversion efforts were not too successful, because the models were still a moving target. The key is to recognize the state of maturity as it relates to the solution being pursued. Some models will be in flux for a long period of time and need the analyst to make radical changes quickly. Hopefully new PaaS offerings can give data scientists newer platforms that provide the ease of using the fantastic modeling ability of a spreadsheet, but have the robustness of an enterprise app under the covers as default.

  2. I think that this style of development resonates well with me, but there are indubitably some issues. As the previous commenter points out, you are trying to hit a moving target while searching for the right models.

    This is fundamental to the way that Google-esque companies tie together research and engineering within the same positions; they do as Facebook does: “move fast and break things.”

    You want to be able to iterate quickly, not necessarily building “commercial-grade” code… as things change and ideas need to be tested out in “the field” forthwith, to really know if they work. A strong engineer will put the product out as soon as possible if it solves 80% and worry about the remaining 20% later.