Hadoop Data Modeling Lessons – by Vin Diesel

By Bill Schmarzo September 23, 2014

As my friends know…okay, my nerdy friends… there are 3 big data topics that give me pause.  These topics are critical from a standpoint of operationalizing big data, but I still have not gotten my head completely around them:

  1. What skills, training, capabilities and attitudes does someone with a Business Intelligence/SQL/Statistics background need to learn in order to become a data scientist?
  2. What is the economic value of data; that is, how do I price data that I might want to buy and/or sell?
  3. In a world of “schema on query”, what tools, skills and design techniques does one need to simplify and standardize the ad hoc schema definition process in order to operationalize the resulting schema and analytic results?

Today’s blog is about that third topic. Which brings us to Xander Cage?  Who is Xander Cage?  One of the original Hadoop developers?  The code name for a super-secret big data project happening within EMC Global Services?  Nah, Xander Cage is a character played by Vin Diesel in the movie “XXX.”  He’s a bad-ass that’s out to save the world from a deadly virus.  A great B-rated shoot ’em up, blow ’em up movie.

movie XXX

Source: Columbia Pictures

There is a scene in the movie where one of the bad guys has the Prague police pinned down from a sniper’s position in a large industrial storage room.  Xander comes into the scene, surveys the situation, grabs a bazooka and utters:

You have a bazooka! Dude, stop thinking Prague Police and start thinking Playstation. Blow [schtuff] up!!!”

This quote is the ultimate in “think differently” messaging, which is exactly what we need to do when we think about data modeling in a big data world.

History of Data Modeling By Bill Schmarzo

The world of data modeling (at least as it pertained to Bill Schmarzo) started with third normal form (3NF) and E. F. Codd.  E. F. Codd defined third normal form data modeling in 1971 as a database normalization technique to improve data base processing while minimizing storage costs. 3NF data modeling was ideal for online transaction processing (OLTP) applications with heavy order entry type of needs.

When I was coding in the early 1980’s, disk space and processing power were extremely expensive. 3NF was designed to minimize the amount of data that we needed to store by ensuring that we eliminated data redundancy in storage. Heck, the entire “Y2K” panic was caused by programmers like me hardcoding “19” into the date (year) field so that we could save two bytes each time that the “year” field was used in calculations or reporting. When we were writing programs in the 1980’s, no one dreamed that our programs would still be running 20+ years later (I wonder if anyone ever found all the Easter eggs that I buried in my code, he-he-he).

As a result, we ended up with OLTP data models that looked like Figure 1.

Third Normal Form Sample Data Model

Figure 1: Third Normal Form Sample Data Model

Data Modeling For Business Users

While 3NF was ideal for machine processing, the spaghetti nature of the data model was uninterpretable by a human user.  The world of analysis (a.k.a. query, reporting, dashboards) required a different type of data model that supported analysis such as trend lines, period-to-date calculations (month-to-date, quarter-to-date, year-to-date), cumulative calculations, basic statistics (average, standard deviation, moving averages) and previous period comparisons (year ago, month ago, week ago).

Ralph Kimball, while at Metaphor Computer Systems, pioneered dimensional modeling, or star schemas, in the early 1980s (see Figure 2).  The dimensional model was designed to accommodate the analysis approach of the business users via two important design concepts:

  • Fact tables (populated with metrics, measures or scores) that correspond to transactional systems such as orders, shipments, sales, returns, premiums, claims, accounts receivable, and accounts payable.  Facts are typically numeric values that can be aggregated (e.g., averaged, counted, summed).
  • Dimension tables (populated with attributes about that dimension) that represent the “nouns” of that particular transactional system such as products, markets, stores, employees, customers, and different variations of time.  Dimensions are groups of hierarchies and descriptors that describe the facts.  It is these dimensional attributes that enable analytic exploration, attributes such as size, weight, location (street, city, state, zip), age, gender, tenure, etc.
Dimensional Model (Star Schema)

Figure 2: Dimensional Model (Star Schema)

Dimensional modeling was ideal for business users because it was designed with their analytic thinking in mind.  What do I mean by that?  For example, “By” analysis is a common way for capturing the reporting and analysis needs of the business users during the design phase of an analytic project.  The “By” analysis is consistent with the way the business users tend to frame their data requests such as:

  • I want to see sales and returns by month, store, geography, product category, product type and customer segment
  • I want to compare this month and last month’s claims by policy type, policy rider, zip code, city, state, region, customer tenure and customer age
  • I want to see social sentiment scores trended by social media source, subject, and day of week

Today, all Business Intelligence (BI) tools use dimensional modeling as the standard way for interacting with the data warehouse.

Data Modeling For Data Science

In the world of data science using Hadoop, we again need to think differently about how we do data modeling.  Hadoop was originally designed by Google and Yahoo to deal with very long, flat web logs (see Figure 3).  Heck, Google called it “Big Table[1]” since it was an uber large table, not a series of smaller tables tied together with joins – it was just designed differently.

Sample Log file

Figure 3: Sample Log File

For example, Hadoop accesses data in very large blocks – the default block size is 64MB to 128MB versus relational database accesses block sizes are typically 32Kb or less.  To optimize this block size advantage, the data science team wants very long, flat records.

For example, some of our data scientists prefer to “flatten” a star schema by collapsing or integrating the dimensional tables that surround the fact table into a single, flat record in order to construct and execute more complex data queries without having to use joins (see Figure 4).

Flattening the Star Schema for Data Science work on  Hadoop

Figure 4: Flattening the Star Schema for Data Science Work on Hadoop

Taking this to the next level, the data science team will create an uber long record for each different business functions that can be more easily analyzed using Hadoop (see Figure 5).

Data Science Data Modeling on Hadoop

Figure 5: Using Large Flat Files To Eliminate/Reduce Joins On Hadoop

For example, we could have the following massively long records for an individual for whom we want to analyze:

  • Customer demographics (age, gender, current and previous home addresses, value of current and previous home, history of marital status, kids and their ages and genders, current and previous income, etc.)
  • Customer purchase history (annual purchases including items purchased, returns, prices paid, discounts, coupons, location, day of week, time of day, weather condition, temperatures)
  • Customer social activities (entire history of social media posts, likes, shares, tweets, favorites, retweets, etc.)

One technique that Dr. Pedro Desouza, the head of our data science practice at EMC Global Services, uses in order to avoid too many and frequent joins is to replicate just the key dimensional attributes into the fact table. In this way, he keeps the clear representation of the star schema but eliminates the joins by performing the heavy lifting analysis just on the flat file. The tradeoff is a little bit of data redundancy to keep clarity, but it takes advantage of the Hadoop performance characteristics.

What Does This Mean?

It means is that the way that we designed data models for OLTP applications (using third normal form) and for data warehousing (using dimensional modeling) needs to change to take advantage of the inherent architecture and processing advantages offered by Hadoop.  Data scientists will create flat data models that take advantage of the “big table” nature of Hadoop to handle massive volumes of raw, as-is data.

9 11 14 Bill Image 2My friend (and data scientist) Wei Lin calls this data modeling transition from relational (third normal form) to dimensional (star schema) to universal “The Good, The Bad and The Ugly” (I always had a preference for “Angel Eyes” in that movie, if you know which character that is).  But maybe that’s the wrong title? Maybe the name should be “The Complex, The Elegant and The Simple” that reflects the changing nature of data modeling.

As Xander Cage says, we need to think differently about how we approach data modeling in a big data world using Hadoop.  Old data modeling techniques that may have been optimal in a world of constrained storage and processing power, are no longer the optimal data modeling techniques in a big data world.

We need to embrace the “think differently” mentality as we continue to explore and learn from this brave new world.  It may be a painful transition, but we made the successful transition from third normal form to dimensional modeling…eventually.




[1] In 2004 a Google in-house team of scientists built a new type of database called “Big Table”.  Big Table successfully broke through the barriers presented with a traditional RDBMS to handle massive volumes of semi-structured data (log files). Google scientists published two scholarly research papers on their project, the first describing the Google File System (GFS) and Google MapReduce.  Doug Cutting and Mike Cafarella, two independent developers, convinced Yahoo that this new structure was the solution to their search and indexing challenges, discovered that work.  By 2006 Yahoo had the first prototype called ‘Hadoop’ and in 2008 Yahoo announced the first commercial implementation of Hadoop.  Facebook was a fast follower and soon Twitter, eBay, and other major players were also adopting the technology. (

About Bill Schmarzo

Read More

Share this Story
Join the Conversation

Our Team becomes stronger with every person who adds to the conversation. So please join the conversation. Comment on our posts and share!

Leave a Reply

Your email address will not be published. Required fields are marked *

18 thoughts on “Hadoop Data Modeling Lessons – by Vin Diesel

  1. Bill – right direction… but I think we can take this further with some good abstractions. In Hadoop, we have the freedom – even moreso than we did in the RDBMS – to have data tables yes, but also Data Services. Because it is a app server and runtime as well as a data tier, Hadoop lets us define algorithms (APIs) that accept say, a customer name, and provide out as a return parameter a universal customer ID, a matched-enhanced-standardized revision of the customer name given in, and a full customer data object as well in the case of a match.

    This is a pretty simple transactional MDM concept, but in the big data world we can use such an algorithm (which might be backed by an MDM server like Siperian, or by some iterative Map-reduce code leveraging public-domain fuzzy matching logic) in the app layer AND when doing data joins for analysis.

    This concept is the bridge I believe between the technology-silo’d data management world today and the integrated big data ecosystem many of us are hoping for in the future.

  2. Thoughts from an X-Metaphorer NonHadooper

    You never had the Metaphor star schema until you first did your data management. Yes Cobol and Mainframes and Files and Logic.

    Today you have your Distributed File Systems containing your Lists/Sets/EmbeddedTuples/Structs {{{Persons:’Mary’,’Jack’,’Steve’},{Education:’CollegeEducated’,’Iowa’},{Likes:’LikesHighEndBicyles’}}}. Yes these datasets are large. Larger than a sharded MySql implementation.

    You still have to build them. Most likely from streams. Streams of data processed by parsing, filtering, more parsing more filtering … then abstracting and more abstracting …. then later combining and excluding and recombining and re-excluding and summarizing and more summarizing ….. Then analyze.

    To me Hadoop is many batch processes piped together. The scale is beyond huge. To me the special sauce is not scale …. it is what to parse/filter for and how to best parse/filter (said stream) and how to best (abstract,combine,exclude,re-combine,re-exclude,summarize,re-summarize) resulting in a dataset that makes a Bill Schmarzo drool at all the analytic possibilities.

  3. Hey Rob! Good to hear from you! Excellent assessment of what needs to be done in order to derive value out of these Lists/Sets/EmbeddedTuples/Structs/ and to put the information into a form that is usable by the business. Lots of hand-crafting and lots of data contextual knowledge required in order to do that. Not an easy task, but hopefully something that gets easier as the tools evolve to support that.

    Thanks again, and good to hear from you Rob!

  4. Bill – I think you are too tightly coupling the ERD data model to a type of DBMS. Due to the cost of storage and need to reduce data redundency, RDBMS databases do tend to look a lot like the related ERD. However, an ERD is a powerful tool for understanding business data, independent of any database, because of its ability to capture both data and business rules. An ERD is only confusing to the business due to lack of familiarity, not that it is inherently incomprehensible. Our experience is once the business overcomes the incorrect impression that data models are only used to design databases, learns how to read a data model and we organize them in a way that is easily consumable, the business becomes enthusiastic users of ERD data models. Where there is a big disconnect between ERDs and big data is the traditional practice of fully modeling ALL data PRIOR to putting it in the database. An ERD is fully capable of modeling the data as it comes out of the big data data store. Having an ERD (logical data model) doesn’t preclude also modeling the data in a manner more suitable to the means of physical data storage (physical data model).

  5. Bruce, thanks for the comments. Great point about the importance of ERD (Entity Relationship Modeling) in a Big Data world. ERD remains an important process/tool, no matter where the data is ultimately physically stored (RDBMS or HDFS).

    I still think an ERD is a bit hard for the average business person to read. We find that it’s easier to engage the business people when we focus on the metrics and scores that might predict a certain business outcome (e.g., brainstorming the variables that might predict when a customer is likely to leave, or their readiness to retire).

  6. I agree with Bruce on the importance of ERDs and how they are not just an artifact of traditional OLTP systems. I’m also a bit hesitant when I see the motto “think differently” used as an excuse to throw out 30 or 40 years of work and principles on how to model data.

    I have spent most of my career in traditional relational database systems doing Data Warehousing, and am more recently exploring Hadoop and its associated tool suite, so I am aware that I have some bias :). Hadoop has some clear use cases in areas where traditional DBMSes don’t perform well — complex text analysis, streaming, thousands (or more) of columns. I’ve personally recently run into issues with Oracle supporting only ~1000 columns in a table while modeling some census data. I can see where schema-on-read could be useful in certain situations as well.

    Here’s the problem: I see people saying all the place how “traditional RDBMSes don’t scale.” Well, maybe MySQL has some issues, but I have seen both Oracle and Teradata handle massive amounts of data gracefully. It’s a matter of implementing the system properly and having qualified people, not so much the technology. You need the same thing on Hadoop to make it all work well. And losing ACID, etc., has a LOT of consequences that, frankly, 90% to 95% of developers are not qualified or competent to deal with. They don’t know what they don’t know. The 5 or 10% who really do understand what they are losing are probably qualified to make a decision to go with Hadoop or similar tools, but I also see a lot of cases where people just want to make end runs around existing IT process in place and therefore download some “free” (to acquire, NOT free to maintain!) software to have their own shadow data mart where they don’t have to follow appropriate process in their organization.

    Anyway, to return to the original topic, if you don’t do traditional data modeling, how do you prevent all of the issues of data duplication and inconsistency that the normal forms were invented to deal with? Or are you assuming that someone “upstream” has done this, and that all you have to do is consume and analyze data without having to worry about maintaining it?

  7. Josh, thanks for the thoughtful reply. However, I’m confused as to why you think that I’m saying that data modeling isn’t necessary. What I am advocating is that the data models you build from a data scientist perspective are very different than the data models you build from OLTP or Business Intelligence perspective; that a data scientist mentality requires a “think differently” approach from your traditional data warehouse/star schema data models. As I stated in the blog, our data scientists prefer to “flatten” the data model. Given the superior cost and massive flat table processing effectiveness of Hadoop versus a traditional RDBMS, our data scientists are willing to repeat fields in the long flat records because the cost of storage is so much cheaper. Let me put this another way, our data scientists are willing to trade off data storage efficiencies for analytic processing efficiencies, especially the elimination of joins. For many BI and data warehouse folks looking to make the jump to data science, it is truly a “think differently” moment.

  8. Hi Bill —

    Thanks for the opportunity to engage in a good discussion!

    My concern is that in flattening the data model, you lose a lot of the benefits of 3NF or even of dimensions/facts as per traditional DW modeling. My question was “How do you prevent all of the issues of data duplication and inconsistency that the normal forms were invented to deal with? Or are you assuming that someone “upstream” has done this, and that all you have to do is consume and analyze data without having to worry about maintaining it?”

    I am not asking about data efficiency. I understand how flattening data models reduces the number of joins and therefore causes less work for the DB/query optimizer. I am asking about issues with data cleanliness, duplication/etc. Do you have specific steps and approaches that help deal with losing the built-in safeguards of the various normal forms?

  9. Two issues to clarify with a highly de-normalized data store. 1. Is the user concerned with “AS WAS” versus “AS IS” dimensional data (such as where did a person live at the time of the event versus where does the person live now)? 2. Will the user need to match this up with any other data? How is that to be done?

    Typically the first analytical step involves looking at the data in a silo, but it soon grows into making links and correlations to other data sources. When is there a need for a bus matrix?

    My 2 cents: Hadoop is not just for data scientists anymore, watch the different use cases grow.

  10. @JoshAndrews: My take is that – putting on the Data Scientist / “Explorer” hat for a moment – the issues of data duplication and inconsistency are just not germane in that context. The data ‘model’ a DS uses is typically very rough: an input vector pushed through statistical algorithm(s). There is no write-back, there is no atomic transactions, none of the stuff that 3NF/app database structures are designed to prevent/enforce against.

    Your most relevant point, I believe, is around data quality / cleanliness – I would make it even more specific myself: Master Data Management is the key. If DS disregards or throws out gold-copy records in her search for “raw” or “unprocessed” data – which is a mantra I hear a lot from stats folk – there is a lot of potential for waste.

    I envision a solution to this problem and am using the marketing meme “Data-as-a-Service” for it (I know, no points for originality). The key idea is that BOTH “raw” data – e.g. customer table from SAP, marketing contact list, leads, etc. AND processed high-quality data (customer master file from MDM, product attribution from PIM system, etc) are all in a metadata-fueled Data Catalog, with the assets themselves (or copies, or federated-style links to them) in an underlying Data Lake.

    Quality metadata from profiling, technical metadata from the source, and steward/librarian-provided metadata (name, description, categories, #hashtag keywords, etc) all sit together in this Catalog. In a mature implementation the Portal site that allows human interaction with the Catalog might provide for user ratings and reviews, and other customer-provided metadata (think: 5 star widget, Amazon customer video review). The basic workflow has a user logging in – say when they are starting a data research project – and searching / browsing for the Data Asset(s) they need, adding them to a Project Sandbox, and clicking a “Provision Now!” button. The system knows which data can be provided immediately, which can only be accessed at certain times of the day, which require some higher-level authorization in order to be accessed, etc. – that metadata bloc is called the Data Lease.

    Technically on the back-end, I (1) provision a sandbox [Pivotal HD] through a PaaS [CloudFoundry] template, (2) use federated query [Attivio, Whamtech] to translate the Data Catalog entries into HDFS, DDL, and SQL commands, and (3) employ workflow BPM engine [Activiti] to coordinate the approvals and other Data Lease actions.

    DaaS solves many pressing Big Data problems, but relative to your points: it’s ability to provision high quality, pre-vetted Data Assets alongside that rough, raw, “just happened” event / transaction data that fuels Data Science – is a very powerful combination.

    Putting all that metadata into the DaaS Catalog in the first place… well maybe that is a story for another blog comment. 🙂

  11. Scott and Josh, thanks for engaging in this discussion. I was going to say debate, but I really don’t think this is a debate because I think we agree on some very key aspects around data quality, data governance and metadata management. As Scott says, the holy grail for many organizations is this “Data-as-a-Service” (Daas) nirvana. That’s an admirable goal, but without the disciplines of data quality, data governance and metadata management, we’ll never realize the business potential of Daas.

    Regarding data modeling, as Scott says, the data scientists do “think differently” about building data models. Unlike data models that we built for Business Intelligence (retrospective reporting and dashboards) which took months to capture all the requirements and months more to build, test and refine those data warehouse/dimensional models, the data scientist is building data models on the fly to address a specific question, prediction or hypotheses. Those data models don’t need to be as comprehensive and bullet proof as the data warehouse model, until something valuable is uncovered and those models need to be moved into production (operationalized).

    It’s really all about the right tools and techniques for the right jobs. BI retrospective, 100% accurate reporting requires different tools and techniques than the forward-looking, best estimates generated by the data scientists.

  12. I think you are all correct!
    These techniques exist as constructs to help us solve a business question, either retrospective or prospective in nature.
    I remember creating a factless fact table in order to make some analysis work the way we needed it to. IIRC, it helped us determine exactly which elements of a little-known French bar soap most impacted consumer purchase. That idea earned a write-up in the back of Kimball’s book.
    Our poor data management team once went through twelve iterations of a star schema and associated aggregates within a two week period while we tried to create a database that could support an iterative regression analysis in order to isolate key cost and volume drivers. Until they did those iterations, it ‘had never been done before.’
    I don’t know squat about Hadoop…but I’d bet both analyses would be much easier in today’s environment using DS constructs instead of BI constructs.
    In my humble opinion its about speed and flexibility. And using the right tool for the job.

  13. Hello,
    It was really nice to read your comments. New to hadoop and dataware house, One thing I wanted to ask was, how does hadoop and star schema things work together.

    Initially every one was talking about denormalizing data, now I see each of the major vendor of hadoop (cloudera, hortonworks etc) showing tpc-ds utilizing star schema. Why is that

    Also, if you have multiple dimensions each one having large number of attributes, denormalizing to what extent is ok.

      • Loki, yes you can build a normalized data model (data warehouse) on Hadoop. Per the head of our data engineering consulting team (Siddharth Tiwari), all the data lands into HBase, is externalized over Hive and finally exposed via HAWQ ( as it supports true ANSI SQL standards).

        We have a couple of clients who have ported their data warehouse from a RDBMS to Hadoop. There are many pros including cost, agility and scale out processing capabilities. The negatives include the lack of maturity in the tools when compared to traditional RDBMS tools, but that is changing very quickly as new tools are coming to market and are maturing to simplify this transition (AtScale is one of my favorites).

  14. Great point about the importance of ERD (Entity Relationship Modeling) in a Big Data world. ERD remains an important process/tool, no matter where the data is ultimately physically stored (RDBMS or HDFS).

    I still think an ERD is a bit hard for the average business person to read. We find that it’s easier to engage the business people when we focus on the metrics and scores that might predict a certain business outcome

    • Totally agree with your comment that it’s easier to get the business people to focus on the metrics and scores. That comes much more natural than trying to get the business people to understand the spaghetti ERD models!!