Three Big Misconceptions About Big Data
As a result of the industry's growing interest in Big Data, my favorite topic, I did more public speaking in 2013 than in any other year of my career. I delivered 14 talks at industry conferences and events, at universities, and within EMC. Over the course of delivering these talks, a number of comments, questions, and misconceptions about Big Data came up again and again. I felt it would be useful to share some of what I heard, so here are three big misconceptions about Big Data:
1. The most important thing about Big Data is its size
Big Data is mainly about the size of the data because Big Data is big, right? Well, not exactly, says Gary King of Harvard's Institute for Quantitative Social Science. Certainly there is more data to work with than in the past (this is the Volume of the "3 Vs": Volume, Variety, and Velocity), but people who focus mainly on gigabytes, terabytes, and petabytes are treating Big Data primarily as a storage and technology problem. Although that is definitely important, the more salient aspects of Big Data are typically the other two Vs: Variety and Velocity. Velocity refers to streaming and fast-moving data: low-latency data accumulating in, or entering, a data repository in a way that enables people to make faster (or even automated) decisions. Streaming data is a big issue, but to me, the Variety piece is the most interesting of the 3 Vs.
The sheer range of sources that generate Big Data illustrates a philosophical issue: it is not just that the data has changed, it is that the definition of what counts as data has changed. Most people think of data as rows and columns of numbers, such as Excel spreadsheets, RDBMSs, and data warehouses storing terabytes of structured data. That is true as far as it goes, but Big Data is predominantly semi-structured or unstructured. It encompasses all of the things most people don't think about when they consider data: RFID chips, geospatial sensors in smartphones, images, video files, clickstreams, voice recognition data, and the metadata about these data. Certainly we need to find efficient ways to store these volumes of data, but I find that once people grasp the variety and velocity of the data, they begin to find more innovative ways to use it.
2. It’s just fine to bring a knife to a gun fight
“OK, but why do I need new tools? Can’t I just analyze Big Data with my existing software?” During a panel discussion about using Hadoop to parallelize hundreds or thousands of unstructured data feeds, an audience member asked why he couldn’t simply analyze a large text corpus with SPSS. The reality is that once you grok #1 above, you realize you need new tools that can understand, store, and analyze many different kinds of data inputs (images, clickstreams, video, voice prints, metadata, XML, and so on) and process them in parallel. This is why desktop tools that were adequate for local, in-memory analytics (SPSS, R, WEKA, etc.) will buckle under the weight and variety of Big Data sources, and why we now need new technologies that can manage these disparate data sources and deal with them in a parallel manner.
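To make the parallel pattern concrete, here is a toy sketch of the map-and-reduce idea that Hadoop popularized, written in plain Python over a handful of in-memory strings. The feeds and word counts are illustrative stand-ins; real Big Data systems apply the same shape to files, clickstreams, or message queues distributed across many machines.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

# Hypothetical unstructured "feeds"; in practice these would be files,
# clickstreams, or queue messages rather than in-memory strings.
feeds = [
    "big data is big",
    "data variety matters more than data volume",
    "velocity means streaming data",
]

def map_count(text):
    # "Map" step: each feed is counted independently, so the work
    # can be spread across processes (or, at scale, across machines).
    return Counter(text.split())

def reduce_counts(a, b):
    # "Reduce" step: merge the partial counts into one result.
    return a + b

if __name__ == "__main__":
    with Pool() as pool:
        partials = pool.map(map_count, feeds)
    totals = reduce(reduce_counts, partials, Counter())
    print(totals.most_common(3))
```

The point is not the word count itself but the shape: independent per-feed work followed by a merge, which is what lets the computation scale horizontally in a way a single desktop tool cannot.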
3. Imperfect data quality must mean that Big Data is worthless
“Yes, but with Big Data, what about data quality? Isn’t it just ‘garbage in, garbage out’ (GIGO) on a larger scale?”
Big Data can certainly be messy, and data quality matters to any analysis. The key thing to remember, though, is that the data will be inherently noisy: there will be distractions, anomalies of different kinds, and inconsistencies. What matters is that the sheer amount and variety of data can be pruned down to something valuable. In other words, find the signal within all of the noise. In some cases organizations will want to parse and clean large data sources; in other cases this will be less important. Consider Google Trends.
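As a small sketch of what "pruning the noise" can look like in practice, the following passes over a hypothetical raw feed twice: first dropping records that are not numbers at all, then dropping readings far from the median. The data and the threshold are illustrative assumptions, not a prescribed cleaning pipeline.

```python
import statistics

# Hypothetical noisy feed: valid readings mixed with malformed entries
# and an obvious anomaly, as is typical of raw Big Data sources.
raw = ["21.5", "22.0", "n/a", "21.8", "9999", "", "22.3"]

def parse(values):
    # First pass: drop records that cannot be parsed as numbers.
    out = []
    for v in values:
        try:
            out.append(float(v))
        except ValueError:
            pass
    return out

def prune_outliers(values, k=3.0):
    # Second pass: drop readings far from the median, using the median
    # absolute deviation as a crude, robust yardstick for "far".
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1.0
    return [v for v in values if abs(v - med) <= k * mad]

signal = prune_outliers(parse(raw))
print(signal)
```

Running this keeps the plausible readings and discards the `"n/a"`, empty, and `9999` entries: the signal survives even though the feed was never perfectly clean, which is the point of the GIGO discussion above.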
Google Trends ranks what people are searching for, such as the most popular Google searches of 2013. Producing those rankings requires a massive amount of storage, processing power, and robust analytical techniques to sift through the searches. This is an example of using Big Data where GIGO is less of the focus.
By this point, many people say something like, “Aha! This sounds like a big change.” Yes! As a colleague of mine puts it, people think of Big Data either as a noun or as a verb. Thinking of Big Data as a noun treats it as “just more stuff” that needs to be stored and accommodated. Treating Big Data as a verb implies action: people in this camp view Big Data as a disruptive force and an impetus to change the way they operate. Use Big Data to test ideas in creative ways and approach business problems analytically, such as performing A/B testing; consider Google testing 41 shades of blue to find the one users would click on most, rather than having marketing managers simply guess. Or find ways to measure what would seem to be unmeasurable, as companies and universities are doing with automated image classification. Explore ideas in new ways, using data to answer the “what if…” questions.
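An A/B test of the kind just described ultimately reduces to a simple question: is the difference in click-through rates between two variants larger than chance would explain? Here is a minimal sketch using a pooled two-proportion z-test; the click and view counts are made-up numbers for illustration, not Google's actual data.

```python
import math

# Hypothetical click counts for two variants of a page element.
clicks_a, views_a = 540, 10_000   # variant A
clicks_b, views_b = 610, 10_000   # variant B

def two_proportion_z(c1, n1, c2, n2):
    # Pooled two-proportion z-test: how many standard errors apart
    # are the two observed click-through rates?
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

z = two_proportion_z(clicks_a, views_a, clicks_b, views_b)
print(f"z = {z:.2f}")  # |z| > 1.96 suggests significance at the 5% level
```

With these illustrative numbers the difference clears the conventional 5% threshold, which is exactly the kind of evidence that replaces a marketing manager's guess: the "verb" mindset in action.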
The organizations that view Big Data as a verb will be the winners in this race.