4 Key Steps on Your Journey to the Data Lake
In my last blog I talked about why you need a Data Lake. Now I’m going to share a few helpful steps on this journey and highlight some “gotchas” to avoid.
Step 1 – Feed the Lake
Understand all the data needs of your company/customer. If you don’t have the data, you are dead in the… wait for it (yes, I’m going there, since we’re talking about a data lake)… water.
I can’t count the number of times I’ve requested data only to find it was missing an integral column in: Quotes, Billings, Bookings, Install Base, Contracts, Logistics, Case Management, Headcount, Expenses, Web traffic, Mobile, Telephony, Training, Industry data, DUNS….
With the data lake I finally have a place to ask questions about my business where I can see what is happening end to end without having to use 10 different BI tools.
- Mistake #1: Don’t structure the data into what you think your customers need. Feed all the data, structured and unstructured. Otherwise you will always be asked to feed more, and paying by the drip is expensive. Your customers will also stay unhappy and work around your expensive data ingest model by using shadow solutions.
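To make "feed all the data" a bit more concrete, here is a minimal sketch of applying structure when you read the data rather than when you ingest it. Everything in it is hypothetical and only for illustration: the `RAW_FEED` records, the field names, and the `read_with_schema` helper are my assumptions, not a description of any particular product.

```python
import json

# Hypothetical raw feed: records land exactly as the source systems sent them.
RAW_FEED = [
    '{"source": "bookings", "customer": "Acme", "amount": 1200, "region": "EMEA"}',
    '{"source": "cases", "customer": "Acme", "severity": 2, "notes": "disk failure"}',
    '{"source": "bookings", "customer": "Globex", "amount": 300}',
]


def read_with_schema(raw_lines, source, fields):
    """Apply structure at read time: pick one source and only the fields this question needs."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if record.get("source") == source:
            # A field the source never sent shows up as None; nothing was thrown away at ingest.
            rows.append({field: record.get(field) for field in fields})
    return rows


bookings = read_with_schema(RAW_FEED, "bookings", ["customer", "amount", "region"])
case_notes = read_with_schema(RAW_FEED, "cases", ["customer", "notes"])
```

Because nothing was filtered on the way in, a new question (say, the free-text case notes) needs only a new read, not a new ingest request.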
Step 2 – Care for the Lake
The initial feed to the data lake is awesome for asking questions and data discovery. But what happens when you discover something? What if you already knew something and now 10 other groups are reinventing the wheel? Or asking the same question but getting different answers?
This is where I use my Lego example. Let’s say you just dumped a bag of Legos on my desk, showed me a picture of the Death Star with no instructions or pre-packaged bags, and said, “Build it, you have everything you need”.
- Mistake #2: Oops – you may have a skill gap. I hope your data lake is sitting on one of EMC’s solutions. Sorry, I couldn’t help shamelessly plugging our EMC equipment. All kidding aside, this is what we use at EMC. If your team doesn’t know Hadoop, you probably need to learn it or pay for it “as a Service” to get at the unstructured data.
- Mistake #3: Create a community, not a competition. Data SMEs are your friends, not the enemy. Everyone wants to say they are in an advanced analytics / data science team, which is great, but you don’t need to rediscover that the earth is round. Take advantage of data SMEs’ learnings and share yours. This can later feed into data governance.
Step 3 – Use the Lake with Analytical Sandboxes
Analytical sandboxes are provisioned spaces for users to discover and build new insight. This is finally a place where you can access the data without a BI layer in the way. You can build and merge many different data sets in ways that were previously impossible.
- Mistake #4: Where is that column? A huge frustration I’ve run into is only knowing our data through traditional BI tools. These BI tools often transform the data or create custom calculations that don’t exist natively. Understanding what data exists and how the BI data was created can save you an enormous amount of frustration and time.
Heads up: You will run into resistance from people saying you are duplicating efforts and now have to maintain two copies of the logic. Forcing BI tools to do ETL is a huge mistake, as the BI tool will hold you hostage. If you build the logic into the data layer and then drop it into BI, it will run faster and take better advantage of your infrastructure.
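Here is a minimal sketch of what "building it into the data layer" can look like, using Python's built-in `sqlite3` module as a stand-in for whatever database sits under your lake. The `bookings` table, the `net_bookings` view, and the discount calculation are all hypothetical; the point is only that the calculation is defined once and every consumer reads the same result.

```python
import sqlite3

# An in-memory database stands in for the real data layer (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (customer TEXT, list_price REAL, discount REAL);
    INSERT INTO bookings VALUES ('Acme', 1000.0, 0.25), ('Globex', 500.0, 0.0);

    -- The custom calculation lives here, once, instead of inside a BI tool.
    CREATE VIEW net_bookings AS
    SELECT customer, list_price * (1 - discount) AS net_amount
    FROM bookings;
""")


def net_amounts(connection):
    """The query a BI dashboard and a sandbox user would both run against the shared view."""
    rows = connection.execute(
        "SELECT customer, net_amount FROM net_bookings ORDER BY customer"
    ).fetchall()
    return dict(rows)


result = net_amounts(conn)
```

If the discount rule changes, you change the view and both the report and the sandbox pick it up, so there is only one copy of the logic to maintain.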
Step 4 – Feed Back into the Lake
Again, you want to be part of a community. Once you discover or create a model that adds value, feed it back. If you only keep it in your sandbox, you limit the amount of value this insight can produce. Many groups in my company look at similar information through very different lenses, and sharing our findings helps reduce duplicated efforts or even dueling data. To me, success and value come when we operationalize our findings and build them into workflows and applications, or change the way we work, rather than just producing a report with cool visualizations. Your initial discovery or model may not be the final mile. Getting it fed back can help it feed other solutions or use cases. As an example, the data science model you created in your sandbox most likely isn’t going to feed a production app.
- Mistake #5: I’m taking my data and going home. Creating insight or a fantastic data science model in your sandbox is awesome, but if it stays there you are limiting its value. Many of your initial sandbox users may be former shadow IT groupies, for whom sharing data is not natural or encouraged. Create incentives and reward sharing.
I hope you found these steps useful and learned how to avoid some land mines. If your experience is different or you have other suggestions, please comment below.