Go Big Data Lake or Go Home
Remember, when it comes to Data Lakes, I’m not just an EMC employee, I’m also a client.
I was recently interviewed about analytics at EMC and was asked what I would say to other companies thinking about a Data Lake. That’s when Sy Sperling and his old Hair Club for Men commercial from 1986 popped into my head. See, I’m an EMC employee, but I’m also a client since I use EMC’s Data Lake.
The decision to build a Data Lake isn’t just an IT decision. It’s a C-Level decision that answers the real question: “Do you want to have advanced analytics / Data Science capabilities that don’t force users to jump through hoops to get at the data?” The biggest challenge is getting at all the data sets required to do your analysis.
You can do Data Science without a Data Lake, but it’s very time-consuming. If you want to enable a culture of analytics, a Data Lake is what you want. Bill Schmarzo’s recent blog, Why Do I Need A Data Lake, is a great starting point. He always does a great job talking about data ingestion and the hub & spoke service architecture that enables analytics, so be sure to check it out.
What other issues is the Data Lake addressing?
- BI used for ETL – Traditional IT delivers data to the business through BI tools. These tools rarely meet our business needs because they are typically single-topic data visualizations. Many people, including me back in the day, resort to Shadow IT solutions to meet business requirements. We often have to blend several data sets together and provide simple models or basic reporting. The IT-native data alone may not be enough to address the issue at hand.
- Reducing risk of Shadow IT – There are no extra copies of the data, there is visibility into what is done with the data, and sharing and scale are no longer challenges. With a Data Lake, we can see and listen to how the data is being used as opposed to being told how to consume it.
- Enabling “Sharing” by feeding Analytical Sandboxes – This enables rapid data discovery and the ability to share findings with other groups. Without this “sharing” environment, much of what gets discovered remains with the team that found it. Teams are typically reluctant to share for fear of taking down their environment or being held accountable for providing copies of their data to other groups. Providing those copies is boring work and not of value to the team that created the insight.
- A Cloud Feed – Many companies are moving to cloud-based solutions. Where is your data going? Just like Traditional IT BI, these cloud-based solutions have point data sets. For deep analytics you will need to merge this data with other data. If your cloud solutions aren’t already feeding a Data Lake, make sure you are capturing their data in yours. Many cloud providers offer BI & analytic solutions, so it’s not in their interest to feed you the data for use outside of their solution. Now you are back to BI used for ETL.
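To make the “blend several data sets” point concrete, here is a minimal sketch of the kind of merge I have in mind: a point data set exported from a cloud application joined with an internal one. Everything in it is made up for illustration (the CSV contents, the column names, and the `load_rows` and `blend` helpers are all hypothetical), and in a real Data Lake the data would come from the lake rather than from inline strings.

```python
import csv
import io

# Hypothetical extract from a cloud CRM (a "point" data set).
CLOUD_CRM_CSV = """customer_id,region,open_opportunities
C001,EMEA,3
C002,APAC,1
C003,AMER,0
"""

# Hypothetical extract from an internal billing system.
BILLING_CSV = """customer_id,invoiced_usd
C001,12000
C001,8000
C003,5000
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def blend(crm_rows, billing_rows):
    """Left-join CRM rows to total invoiced amount per customer."""
    totals = {}
    for row in billing_rows:
        cid = row["customer_id"]
        totals[cid] = totals.get(cid, 0) + int(row["invoiced_usd"])

    blended = []
    for row in crm_rows:
        blended.append({
            "customer_id": row["customer_id"],
            "region": row["region"],
            "open_opportunities": int(row["open_opportunities"]),
            # Customers with no invoices get 0 rather than being dropped.
            "invoiced_usd": totals.get(row["customer_id"], 0),
        })
    return blended


report = blend(load_rows(CLOUD_CRM_CSV), load_rows(BILLING_CSV))
for line in report:
    print(line)
```

The join itself is trivial. The hard part, and the reason for the Data Lake, is having both data sets in one place to begin with.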
Are you thinking about a Data Lake, or do you already have one? Do you see the same issues being solved? If you are struggling to make the business case, I think the strongest one is around Cloud Solutions. It’s your data, so make sure you are maximizing the use of it.