Big Data Architectures – NoSQL Use Cases for Hadoop
Everyone’s talking about Hadoop. With all the fanfare surrounding Hadoop, what’s it really good for and what … not so much? Hadoop is not just one tool, but a framework of solutions that was developed to handle challenging Web-scale data problems that were not well served by traditional development tools and databases. The Hadoop framework was inspired by the distributed file system and MapReduce technologies Google developed for its Web search engine. The open source implementation in standard use today was created in 2006 by Doug Cutting and funded by Yahoo! It became “production strength” (or “Web scale”) in 2008 and is now hosted by the Apache Software Foundation.
The core of the Hadoop framework consists of the Hadoop Distributed File System (HDFS) and the MapReduce programming model. HDFS can store all types of data, structured and unstructured, in nodes across thousands of servers. The structure of the stored data doesn’t have to be known until the data is read. This “schema on read” capability allows Hadoop to support a very flexible and dynamic environment for data of varying structure. The key strength of the MapReduce model is that it can be used to write programs that run simultaneously against thousands of nodes or servers containing data and bring back answers from the distributed servers to a central consolidation point. This capability is called massively parallel processing, or “bringing the code to the data.”
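The MapReduce model described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the actual Hadoop API (real jobs are typically written against the Java MapReduce API or Hadoop Streaming); the word-count task and sample documents are invented for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document.
    In Hadoop, each node runs this over its local block of data."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework
    does automatically between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all counts for one word into a total."""
    return (key, sum(values))

# Simulate distributed input as an in-memory list of documents.
documents = ["Hadoop stores data", "Hadoop processes data in parallel"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["hadoop"])  # 2
print(counts["data"])    # 2
```

The point of the pattern is that `map_phase` needs no coordination, so it can run on thousands of nodes at once against local data, with only the small intermediate pairs moving across the network.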
The original use case for which Hadoop was developed was Web indexing: crawl through servers across the World Wide Web and bring back a list of links and tags to the information existing on each server. This distribution and collection of information happens in batch mode. The list of links collected can then be used to respond to real-time searches for information. So Hadoop is very good for managing large amounts of information across many servers and supporting the need for massively parallel processing of queries against that data. Queries or processing that can be broken up and run simultaneously against thousands of servers, or nodes, can be completed very quickly compared to the traditional approach of running each transaction or query with only one processor.
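The batch-then-real-time split in the search use case can be sketched with a toy inverted index. This is a hypothetical illustration, not how any real search engine is implemented; the page names and contents are invented.

```python
from collections import defaultdict

# Invented sample "crawled" pages standing in for the Web.
pages = {
    "page_a.html": "big data processing with hadoop",
    "page_b.html": "relational database processing",
}

def build_index(pages):
    """Batch phase: scan every page once and record which
    pages each word appears on (an inverted index)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.split():
            index[word].add(url)
    return index

# Slow, run periodically in batch -- analogous to the MapReduce
# jobs that collected links and tags from servers.
index = build_index(pages)

def search(index, word):
    """Real-time phase: answering a query is just a cheap lookup
    against the prebuilt index, not a rescan of the data."""
    return sorted(index.get(word, set()))

print(search(index, "processing"))  # ['page_a.html', 'page_b.html']
print(search(index, "hadoop"))      # ['page_a.html']
```

This captures why a batch-oriented platform can still power an interactive service: the expensive distributed work happens ahead of time, and queries only touch its output.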
These basic strengths of Hadoop lend themselves to ancillary uses around data warehousing in an organization. In transporting data from operational processing systems to data warehousing and business intelligence systems, data is frequently held in a temporary format, or “staged,” prior to being loaded into its final data store. HDFS can be a good solution for staging data because it can handle whatever the original structure or format may be, and it is significantly less expensive to store and maintain than the complex and expensive “smart disk” and database tables on which most relational databases reside. Similarly, data being brought into the organization from an external source may be “staged” in its original format in HDFS much less expensively than in a traditional relational database, and Hadoop can more readily handle changes in structure.
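The “schema on read” idea behind this staging use case can be sketched briefly: records are staged exactly as they arrive, with no table definition up front, and structure is imposed only when a consumer reads them. The record formats and field names below are invented for illustration.

```python
import csv
import io
import json

# Raw staged records, kept as-is -- one source sends JSON,
# another sends CSV, and no schema was declared at load time.
raw_staged = [
    '{"id": 1, "amount": "19.99"}',
    "2,24.50",
]

def read_with_schema(line):
    """Apply a schema at read time, whatever format
    the record happened to arrive in."""
    if line.lstrip().startswith("{"):
        rec = json.loads(line)
        return {"id": int(rec["id"]), "amount": float(rec["amount"])}
    row = next(csv.reader(io.StringIO(line)))
    return {"id": int(row[0]), "amount": float(row[1])}

records = [read_with_schema(line) for line in raw_staged]
total = round(records[0]["amount"] + records[1]["amount"], 2)
print(total)  # 44.49
```

Contrast this with a relational staging table, where both feeds would have to be mapped to declared columns before loading, and a format change upstream would break the load rather than the (easily updated) reader.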
Archiving data is another strong use case for a Hadoop solution: it is relatively inexpensive compared to a relational database solution, and it can store data in various formats and structures, so schema changes in the original source systems don’t affect the data already in the archive.
The Hadoop framework is being used for some very interesting and complex data analysis involving very large amounts of data distributed over many nodes and servers. These solutions usually depend on the ability to break the analysis into many parts rather than running it as one long job or query. They involve more pieces of the Hadoop framework than just the file system (HDFS) and MapReduce: a database that supports this type of distributed processing, resource management, messaging support, and other components of the Hadoop framework are usually involved, plus additional solutions that support and enhance Big Data analytics, such as those available in the Pivotal Hadoop Architecture. Some of these Hadoop analytical solutions have been able to take jobs that ran for many hours or even days and run them in minutes instead.
When is Hadoop not appropriate? Online, real-time update transaction processing, especially operations requiring complex two-phase commits (such as financial transactions), is a relational database management system “sweet spot” and is not appropriate for Hadoop. Processing solutions that require extremely fast response times, such as those best performed with in-memory databases, are also not a good fit. Remember that Hadoop was originally a batch-oriented platform, so it generally doesn’t lend itself to real-time transaction processing.