Don’t Get Rid of DataStage Quite Yet…My Take on ETL vs. Hadoop
I got the following question on one of my blog posts, and since I’ve gotten similar questions in the past, I thought it would be useful to write a short post addressing it. The answer is not simple, and I had to bring in a couple of our data scientists (Dr. Pedro DeSouza and Dr. Wei Lin) to help me construct the most appropriate answers. And, as seems to be typical in most of our big data discussions, the answer really is “it depends.”
I am looking at Hadoop in comparison to traditional ETL (DataStage) for managing mostly structured data. The intent is to feed both data warehouse environments as well as batch integration between systems.
Hadoop doesn’t seem to have some of the more advanced mapping tools that DataStage has, so we require more low-level coding to parse the incoming files.
Am I trying to make Hadoop do the wrong things? Should we choose DataStage instead for the structured data workloads?
I don’t think Hadoop (with MapReduce) will replace your traditional ETL (extract, transform, load) tools like DataStage any time soon. Hadoop is a powerful data management framework with scale-out capacity, and MapReduce provides parsing and aggregating capabilities across massive structured and unstructured data sets. It can perform ETL via custom coding, but it does not easily replace a traditional ETL tool. In fact, I think you’re better off treating Hadoop/MapReduce as a complement to your existing ETL tools. It gives you a new tool in your kitbag!
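To make the “custom coding” point concrete, here is a minimal sketch of what a hand-coded MapReduce-style job looks like: a mapper that parses delimited records and a reducer that aggregates by key. Plain Python functions stand in for a Hadoop Streaming mapper/reducer pair, and the record layout (a customer ID and an amount) is a hypothetical example, not anything from the question above.

```python
# A minimal MapReduce-style sketch: parse raw records in the mapper,
# aggregate per key in the reducer. Plain Python stands in for Hadoop
# Streaming; the (customer_id, amount) layout is a made-up example.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Parse one raw comma-delimited record into a (key, value) pair."""
    customer_id, amount = line.strip().split(",")
    return customer_id, float(amount)

def reducer(key, values):
    """Aggregate all values seen for one key."""
    return key, sum(values)

def run_job(lines):
    """Simulate the shuffle/sort phase Hadoop performs between map and reduce."""
    mapped = sorted(mapper(line) for line in lines)
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

raw = ["c1,10.0", "c2,5.5", "c1,2.5"]
print(run_job(raw))  # → [('c1', 12.5), ('c2', 5.5)]
```

Even this toy job shows the trade-off: everything an ETL tool gives you out of the box — field mapping, type handling, error records — has to be written and maintained by hand.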
The ETL tools have lots of built-in functionality for data cleansing, alignment, modeling, and transformation that today would need to be hand-coded in Hadoop. It is better to buy that functionality than to build it.
However, a conventional ETL tool would struggle to process even a few hundred GB in a single file. Today’s standard ETL tools are best suited to files of up to 50GB or so. For files larger than that, your best strategy is to hand-code using Hadoop/MapReduce.
For small files you are better off using conventional ETL than MapReduce, so that you can take advantage of the mature functionality already built into ETL tools. Save the hardcore Hadoop/MapReduce jobs for large files, unstructured data, or complex processing.
Also, there may be select data management and transformation processes that could benefit from the newer ELT (extract, load, transform) development paradigm, which leverages the parallel processing and more procedural capabilities of Hadoop/MapReduce. For example, you could use an ELT process on Hadoop with MapReduce to create advanced data metrics such as frequency, recency, and sequencing.
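As a hypothetical sketch of that ELT-style transform: raw event records are loaded as-is, and a single parallel-friendly pass then derives per-customer frequency (event count) and recency (days since the most recent event). The field names and the reference date here are made up for illustration.

```python
# ELT-style transform sketch: load raw events untouched, then derive
# frequency and recency metrics in one pass over the data. The
# (customer, event_date) shape and the as-of date are hypothetical.
from datetime import date

def frequency_recency(events, as_of):
    """Compute per-customer event count and days since last event."""
    metrics = {}
    for customer, event_date in events:
        count, latest = metrics.get(customer, (0, date.min))
        metrics[customer] = (count + 1, max(latest, event_date))
    # Convert the latest event date into recency in days as of `as_of`.
    return {c: {"frequency": f, "recency_days": (as_of - latest).days}
            for c, (f, latest) in metrics.items()}

events = [
    ("c1", date(2014, 3, 1)),
    ("c1", date(2014, 3, 9)),
    ("c2", date(2014, 2, 20)),
]
print(frequency_recency(events, as_of=date(2014, 3, 10)))
# → {'c1': {'frequency': 2, 'recency_days': 1},
#    'c2': {'frequency': 1, 'recency_days': 18}}
```

In a real ELT pipeline this per-customer aggregation would run as a reduce step over data already loaded into the Hadoop cluster, rather than being transformed before loading as in classic ETL.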
One last point: many of the traditional ETL vendors are porting their tools to run on Hadoop in order to take advantage of the processing and cost benefits of the natively parallel Hadoop environment. For example, Talend, Pentaho, and Informatica offer “ETL-like” functionality based on MapReduce jobs, which you can use for large files processed on the Hadoop cluster.
All of this seems to indicate that most advanced data management shops are going to need both traditional ETL tools and Hadoop/MapReduce – the right tool for the right job.