A Data Scientist View of the World, or, the World is Your Petri Dish
In an earlier InFocus post, I discussed five attributes of Data Scientists.
In developing the EMC Data Science & Big Data Analytics course, we collaborated with the Greenplum Data Scientist team. One member of that team, Kaushik Das, likened a Data Scientist to a sculptor, in that master sculptors see the world differently than most people. Where most people would just see a block of marble, a master sculptor can see a statue hiding within the raw material, and views the job as chiseling away the excess marble to reveal the work of art.
Likewise, Data Scientists have the ability to see hidden possibilities. Where most people look at data and see unrelated information, it is the job of the Data Scientist to look for the insights lurking within the data. Like the example of chiseling away excess marble to reveal art, data must sometimes be reshaped, cleaned, or formatted in the right ways in order to produce unexpected insights.
More and more, I’m encountering researchers and Data Scientists who are not just doing data science projects in controlled ways, but are using the world around them as a sandbox in which to experiment with data and test ideas on a large scale. Recently, I had the privilege of attending an MIT Computer Science & Artificial Intelligence Lab (CSAIL) lecture as a result of EMC’s participation in the bigdata@csail initiative. The lecture was delivered by Jeffrey Dean and Sanjay Ghemawat, who together developed the MapReduce computational framework, and are co-designers and co-implementers of heavily used distributed storage systems, including Bigtable and Spanner at Google.
Dean spoke of using MapReduce not in a controlled environment within a small IT shop, but on a very large scale, on publicly available data. One example he discussed was using MapReduce to tag and classify sets of images. By running data experiments on the world around them, the team taps into large-scale, publicly available data stores to test their algorithms. Most of their R&D work is about finding ways to develop algorithms that can mimic human thinking, but with data.
For example, to test an image classification algorithm, they performed unsupervised learning, which tries to find hidden structure in unlabeled data, on one frame from each of 10 million YouTube videos. Rather than create a labeled training set to train the algorithm (the more traditional supervised learning approach in this case), they ran the experiment at large scale to test how well neural networks could learn to classify video frames as images on their own. When they tested the result against ImageNet, which contains 16 million images in 21,000 categories, the classifier was much better than most existing methods. This is an example of testing robust algorithms in the wild, and using the world as a petri dish in which to test hypotheses.
Another great example is the project EMC is sponsoring, called The Human Face of Big Data.
Rick Smolan, a former Time, Life, and National Geographic photographer, has authored numerous books, and is perhaps best known for his “Day in the Life” book series. He is spearheading The Human Face of Big Data project, which is designed to demonstrate “how real-time sensing and visualization of data has the potential to change every aspect of life on earth. It may represent one of the most powerful toolsets humanity has ever created in addressing some of our biggest challenges.”
The project is part of the Quantified Self movement, and its intent is to show how Big Data is touching our lives and those of our families. As part of this, Smolan encouraged people worldwide to download an app that, for one week, turned their smartphones into sensors to record and share data. In other words, the app anonymously tracked and shared users’ habits and preferences, and compared them to those of others around the world. These images are taken from Rick Smolan’s video of the project.
Here is a list of the industries he touched…basically most major areas of people’s lives:
One of the great things about Smolan’s project is that he highlights some of the success stories and the benefits of becoming a data-driven society. For instance, he tells the story of people who developed cheap early warning systems, based on sensors in a common laptop computer, to sense earthquakes in Japan. As a result, one minute before the 2011 Japan earthquake hit, all of the bullet trains and transport systems were halted, which prevented further casualties.
Another example cited by Smolan involved crime data in New York City. Rather than simply plot crimes on a map, someone analyzed all the data and found the home addresses of convicts before they went to jail. They then used this information to target locations for career counseling services and crime prevention and education programs.
One final example is a Prius dashboard. Not only does it track gas mileage, but it provides feedback to the driver, who then adjusts their driving habits to be more fuel efficient based on the feedback.
These are common-sense examples, but they show how Big Data gives us the ability to analyze everyday problems in innovative ways. I would encourage you to consider what you do each day that you could optimize by having more data, or at least try to think a little bit like a Data Scientist and test ideas more quantitatively. For further reading:
- If you want to learn how to reshape data, check out Professor Hadley Wickham’s paper, “Tidy Data,” which teaches people straightforward techniques for reshaping data in the R programming language, and uses his very popular libraries for this, such as Reshape.
- Tom Davenport & DJ Patil recently published an excellent article about Data Scientists in Harvard Business Review, “Data Scientist: The Sexiest Job of the 21st Century,” highlighting Data Scientist skills, the future of this profession, and how universities and companies such as EMC fit into this mix.
- For more examples of work that Data Scientists are doing, see these videos from the May 2012 Data Science Summit.
- To learn more about the Human Face of Big Data project, see this short video by Rick Smolan.
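
To make the idea of reshaping data a little more concrete, here is a minimal sketch of one common reshaping step: turning a "wide" table, where each measurement sits in its own column, into a "long" table with one observation per row. It is written in plain Python rather than R, it does not use any of the libraries mentioned above, and the `melt` helper, the column names, and the mileage values are all made up for illustration.

```python
# Hypothetical "wide" table: one row per driver, one column per week.
# The drivers and mileage figures are invented for this sketch.
wide_rows = [
    {"driver": "A", "week1": 44.0, "week2": 47.5},
    {"driver": "B", "week1": 39.2, "week2": 41.0},
]


def melt(rows, id_key, var_name, value_name):
    """Return one output row per (input row, non-id column) pair."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is copied into each output row
            long_rows.append(
                {id_key: row[id_key], var_name: key, value_name: value}
            )
    return long_rows


# Each (driver, week) pair becomes its own observation.
long_rows = melt(wide_rows, id_key="driver", var_name="week", value_name="mpg")

for r in long_rows:
    print(r)
```

In the long form, every row is a single observation (one driver, one week, one mileage reading), which tends to make grouping, filtering, and plotting simpler than when the weeks are spread across columns.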