The Dirty Little Secret of Big Data Projects
Over the past year, I’ve had the privilege of being involved with bigdata@csail, an initiative run out of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
Massachusetts Governor Deval Patrick kicked off this initiative in May 2012 to show the state’s interest in partnering with academic institutions to drive innovation in technology, specifically around Big Data. The idea is that Big Data offers many opportunities to drive change, so why not bring together experts to share ideas, research, and problems, and find ways to use Big Data technologies to improve what we do?
About ten companies, including EMC, are participating in this consortium, which is focused on advancing Big Data research and helping each other learn and solve problems in this area. (BT also recently joined the consortium.)
There have been some outstanding sessions at bigdata@csail, covering everything from faster and more efficient scientific database architectures based on matrix-style table structures (SciDB), to sampling techniques for lightning-fast approximate queries (BlinkDB), to a new open source language for collaborative technical computing (Julia), to the use of machine learning and Big Data to interpret body language and non-verbal communication between people (see Sandy Pentland’s work on sociometric badging and Living Labs).
Although I find these projects innovative and exciting, it strikes me that people often push the boundaries of things we are already pretty good at, while spending less time on areas that are important but difficult or less glamorous.
The Data Analytics Lifecycle that we developed and teach in EMC’s data science classes is an example of this:
Although the third (Model Planning) and fourth (Model Execution) phases tend to get most of the attention, since these are where algorithms and predictive models come into play, the phase where people spend the most time by far is Phase 2, Data Prep. In my experience, and from input I’ve received from other experienced Data Scientists, Data Prep can easily absorb 80% of a project’s time, yet there has been a real lag in the development of tools for it. Many times I see leaders who want to get their data science projects going quickly, so their teams jump straight into building models, only to slide back a few phases because they are dealing with messy or dirty data, and must regroup before they can return to predictive modeling.
Data cleansing and conditioning can be a very unsexy part of a project. It can be painful, tedious, time consuming, and sometimes thankless to clean, integrate, and normalize data sets into a shape and structure you can analyze later. Rarely do people pound their chests at the end of a project and talk about all of the fabulous data transformations they performed to get the data into the right structure and format. This is not where the sizzle is, but, like many things, it’s what separates the novices from the masters. In fact, because of the amount of thought and decision-making involved in how data is merged, integrated, and filtered, I increasingly believe that data prep cannot be separated from the analytics; it is intrinsically part of the Data Analytics Lifecycle and process. The reality is, if you give Data Prep short shrift, everything that comes after it is a waste of time.
From my perspective, many new tools have emerged to help simplify analytics, dashboards, reporting, and the handling of streaming data, and even to improve database architectures, but I’ve seen very little in my career that truly improves the Data Prep phase of a project. It seems to be the dirty little secret of every data science or analytical project that Data Prep demands so much time and attention. Because it is tedious and labor intensive, most organizations use only a handful of their datasets. Research presented during a recent talk at MIT indicated that large organizations have roughly 5,000 data sources, but only about 1-2% of these make it into their Enterprise Data Warehouse. This means that some 98-99% of an organization’s data sources may be unused, inaccessible, or never cleaned up and made useful for people to analyze and use to make better business decisions.
For these reasons, I’m glad to see that people are starting to create tools to address this need. As all of the marketing hype, newspapers, media, and legitimate research tell us, 80% of new data growth is unstructured. To take advantage of it, we need to get a lot better at preparing, conditioning and integrating data.
The most recent bigdata@csail session, in early April, focused on Data Integration, and many researchers presented their projects and research on how they are trying to solve these problems. If data integration has been a problem for years, shouldn’t it become an even bigger problem at Big Data scale?
The thesis is that rather than using brute-force techniques to merge data together, we can use more intelligent techniques to make inferences about different kinds of data and automate some of the decision making. Take a project like Data Tamer, which strives to inject algorithmic intelligence into the data cleaning stages of the Data Analytics Lifecycle to make our lives easier. To give a very simple example, if Data Tamer detects that two columns are named differently but contain data whose similarity exceeds a certain threshold, it can infer that they likely hold the same data and that someone has renamed the column. Data Tamer will suggest likely columns to combine, and ask the human to choose how and when to merge the data. This means some of the brute-force work can happen with machine learning, while the more difficult merge decisions are left to humans, who can exercise higher-level judgment based on their deep domain knowledge and experience.
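To make the idea concrete, here is a toy sketch of similarity-based column matching. This is not Data Tamer’s actual algorithm (its matching is far more sophisticated); it simply illustrates the "suggest, then let a human decide" pattern, using made-up table and column names and plain Jaccard overlap as the similarity measure.

```python
def jaccard(a, b):
    """Jaccard similarity of two value sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest_merges(table_a, table_b, threshold=0.8):
    """Compare every column of table_a against every column of table_b
    and return differently-named pairs whose values overlap enough that
    they probably hold the same data. Tables are dicts mapping
    column name -> list of values."""
    suggestions = []
    for name_a, vals_a in table_a.items():
        for name_b, vals_b in table_b.items():
            score = jaccard(vals_a, vals_b)
            if name_a != name_b and score >= threshold:
                # The tool only *suggests*; a human still decides
                # whether and how to merge these columns.
                suggestions.append((name_a, name_b, round(score, 2)))
    return suggestions

# Hypothetical example: two systems name the customer key differently.
crm = {"cust_id": [101, 102, 103, 104], "region": ["NE", "SW", "NE", "MW"]}
billing = {"customer_number": [101, 102, 103, 105], "amount": [20, 35, 50, 10]}

print(suggest_merges(crm, billing, threshold=0.6))
# → [('cust_id', 'customer_number', 0.6)]
```

The point of the pattern is the division of labor: cheap, automatable comparisons run across every pair of columns, and only the plausible matches are surfaced for human judgment.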
Instead of using only 1-2% of the data in an organization, what decisions could we improve if we could access 10% or 20% of an organization’s datasets? MIT is doing good things to advance data integration and conditioning, but there are also other tools emerging in this area (even if they are less sophisticated) to make the data conditioning and prep easier for most people. Here are a few free tools:
1) Open Refine (formerly Google Refine) has a simple user interface to help people clean up and manipulate datasets.
2) Similarly, Data Wrangler, which emerged from Stanford, does some of the same things. Both of these are great tools with graphical user interfaces.
3) Certainly, you can also use R if you are feeling a bit more ambitious; try learning the reshape2 and plyr packages. These packages enable a wide range of data transformations useful in data science projects (though R has more of a learning curve).
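To give a flavor of the kind of transformation these packages handle, here is a rough pure-Python analogue of the "wide to long" reshape that reshape2’s melt() performs in R. The function and data below are illustrative only, not any package’s real API.

```python
def melt(rows, id_vars, value_vars):
    """Turn wide rows (dicts) into long (id, variable, value) records,
    the classic reshape needed before many analyses and plots."""
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for var in value_vars:
            long_rows.append({**ids, "variable": var, "value": row[var]})
    return long_rows

# Hypothetical wide-format data: one column per quarter.
wide = [
    {"store": "Boston", "q1_sales": 120, "q2_sales": 150},
    {"store": "Austin", "q1_sales": 90, "q2_sales": 110},
]

for rec in melt(wide, id_vars=["store"], value_vars=["q1_sales", "q2_sales"]):
    print(rec)
```

Mundane as it looks, this is exactly the sort of restructuring that consumes so much of the Data Prep phase, and why tooling for it matters.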
I encourage you to explore these tools. Getting even a little more conversant with data management and handling greatly expands the universe of data available to you for analysis. This will only become more critical as Big Data continues to evolve.
Please add your comments and feedback on your favorite tools and methods for data preparation. Also, for those in the Boston area, I’d like to invite you to a session I’m presenting at MIT Sloan on Big Data on April 29. Hope to see you there.