Lessons in Becoming an Effective Data Scientist
I was recently a guest lecturer at the University of California Berkeley Extension in San Francisco. On a lovely Saturday afternoon, the classroom was crowded with students of all ages learning the tools of the modern economy. The craftspeople of the “Analytics Revolution” were busy learning new skills and tools that will prepare them for this Brave New World of analytics. I was blown away by their dedication!
As we teach the next generation, it’s important that we focus more on capabilities and less on skills. What I mean is that “learning TensorFlow” isn’t nearly as important as “learning how to learn TensorFlow.”
We need to make sure that we teach concepts and methodologies along with the tools. We should teach the “What” and “Why” as well as the “How” so we don’t put our students in a situation where they “can’t see the forest for the trees.”
This brings me to a recent article, “What IBM Looks for in a Data Scientist.” Its list of skills is very useful, especially for someone pursuing such a career:
- Training as a scientist with an MS or PhD.
- Expertise in machine learning and statistics with an emphasis on decision optimization.
- Expertise in R, Python or Scala.
- Ability to transform and manage large data sets.
- Proven ability to apply the skills above to real-world business problems.
- Ability to evaluate model performance and tune it accordingly.
Unfortunately, this is a tactical list, not a strategic list. In fact, some of the points are too granular and too focused on “how” versus “why.” For example, on point #3, it’s more important to know how to program than it is to know a specific language. It’s more important to learn the concepts and approaches of effective programming than it is to learn the tools themselves. The minute you think you’re an expert in R or Python or Scala, along comes Julia. It’s important to develop transferable skills rather than having to re-educate yourself each time a new tool arrives.
In a world driven by the rapid introduction and adoption of open source tools and frameworks (like TensorFlow for machine learning), expertise in a tool is fleeting. However, mastery of the concepts and approaches for which those tools are used is critical because being a data scientist is more than just a bag of skills. The best data scientists are about outcomes and results.
Data Science DEPP Engagement Process
Our data science team at Dell EMC uses a methodology called DEPP that guides the collaboration with the business stakeholders through the following stages:
- Descriptive Analytics to clearly understand what happened and how the business is measuring success.
- Exploratory Analytics to understand the financial, business and operational drivers behind what happened.
- Predictive Analytics to transition the business stakeholder mindset to focus on predicting what is likely to happen.
- Prescriptive Analytics to identify actions or recommendations based upon the measures of business success and the Predictive Analytics.
The DEPP Methodology is an agile and iterative process that continues to evolve in scope and complexity as our clients mature in their advanced analytics capabilities (see Figure 1).
Importance of Humility
The first skill that I look for when engaging with or hiring a data scientist is humility. I look for the ability to listen to and engage with others who may not seem as smart as they are. And as you can see from our DEPP methodology, humility is the key to driving collaboration between the business stakeholders (who will never understand data science to the level that a data scientist does) and the data scientist (who will never understand the business to the level that the business stakeholders do).
Humility is critical to our DEPP methodology because you can’t learn what’s important for the business if you aren’t willing to acknowledge that you might not know everything.
Humility is one of the secrets to effective collaboration. Nowhere does the importance of the business/data science collaboration play a more important role than in hypothesis development.
A hypothesis is a formal statement that presents the expected relationship between an independent and dependent variable (Creswell, 1994).
If you get the hypothesis, or the metrics against which you are going to measure success, wrong, then everything the data scientist does to support that hypothesis doesn’t matter. Not only are you likely to achieve suboptimal results, but you could actually achieve the wrong results altogether.
For example, in the healthcare industry, we are seeing the disastrous effects of the wrong metrics (see the blog “Unintended Consequences of the Wrong Measures” for more details). Instead of using “Patient Satisfaction” as the metric against which to measure doctor and hospital effectiveness (which is leading to unintended consequences), the healthcare industry may benefit from a more holistic metric against which to measure success. One example is a “Quality and Effectiveness of Care” score combined with a “Readmissions” score and a “Hospital Acquired Infections” score.
Being off in your hypothesis by just one degree can be disastrous. For example, if you were flying from San Francisco to Washington, D.C. and were off by a mere one degree upon takeoff, you’d end up on the other side of Baltimore, 42.6 miles away (“Impact of A Mere One-Degree Difference”).
Get the hypothesis wrong, even by one degree, and the results could be wrong or even disastrous (if you have tickets to watch the Washington Redskins play football and not the Baltimore Ravens).
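The arithmetic behind the one-degree example can be checked with a few lines of Python. This is only a rough sketch: it treats the flight as a straight line on a flat plane, and the 2,440-mile San Francisco to Washington, D.C. distance is my own assumed figure, not one taken from the cited article. The function name is hypothetical as well. Under those assumptions, the result lands close to the 42.6 miles quoted above.

```python
import math


def cross_track_offset(distance_miles: float, heading_error_deg: float) -> float:
    """Lateral miss distance after flying distance_miles with a constant
    heading error, using a simple flat-plane (right-triangle) approximation."""
    return distance_miles * math.tan(math.radians(heading_error_deg))


# Assumed straight-line distance, San Francisco to Washington, D.C.
# (an illustrative figure, not from the original article).
ASSUMED_SFO_TO_DC_MILES = 2440.0

offset = cross_track_offset(ASSUMED_SFO_TO_DC_MILES, 1.0)
print(f"{offset:.1f} miles off course")
```

The point is how a small error compounds with distance: double the trip length and the miss distance doubles too.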
Type I / Type II Errors
Being humble also means conceding when you may be wrong, particularly with analytic models that may not always deliver the right predictions or outcomes. In that case, a solid understanding of the business or organizational costs of Type I (False Positive) and Type II (False Negative) errors is important. Understanding the business and organizational ramifications of such errors requires close collaboration with the business stakeholders (see Figure 3).
See the blog “Understanding Type I and Type II Errors” for more details.
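To make the cost of Type I and Type II errors concrete, here is a minimal sketch of how those costs might feed into a modeling decision. The labels, scores, thresholds and cost figures are all hypothetical, and the function names are my own; in practice the costs would come from conversations with the business stakeholders.

```python
def error_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total business cost of the errors made at a given decision threshold."""
    # Type I error (false positive): predicted positive, actually negative.
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    # Type II error (false negative): predicted negative, actually positive.
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(y_true, scores, thresholds, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total error cost."""
    return min(
        thresholds,
        key=lambda t: error_cost(y_true, scores, t, cost_fp, cost_fn),
    )


# Hypothetical labels (1 = positive) and model scores.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
thresholds = [0.3, 0.5, 0.7]

# When a missed positive is expensive, a low threshold wins.
print(best_threshold(y_true, scores, thresholds, cost_fp=1, cost_fn=10))
# When a false alarm is expensive, a high threshold wins.
print(best_threshold(y_true, scores, thresholds, cost_fp=10, cost_fn=1))
```

The same model and the same data lead to different decisions depending on which error hurts the business more, which is why the cost figures cannot be set by the data scientist alone.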
In my classes, I focus on the “What” and “Why” versus spending too much time on the “How”. I want my students to have a framework that enables them to understand how the different technologies, techniques and tools can be more effectively used.
I’m not teaching my students data science; I’m teaching them how to learn data science. It is an important distinction that can be humbling, but it results in a more detail-oriented student who wishes not only to become a data scientist, but to become an effective data scientist. As teachers, it is important that we know the difference.