Historical Master Data Attributes

Historical master data attributes are often specially handled when designing an Enterprise Data Warehouse (EDW).   On most BI projects I’ve been on, there is an exercise of sitting with the business in order to determine which attributes are required to be stored historically.  Once this is complete, the requirements are then sent off to the Data Warehouse team for design and development.  How this solution gets architected may be one of the determining factors on whether your project is successful or not, and as a BI team member, sometimes getting your two cents can make your life that much easier.

In the past, I’ve come across three different designs for handling historical master data attributes, with one of them standing out above the rest.    The first design involves placing historical attribute fields directly on to fact tables.  Please remember we are talking about historical attributes that are master data related, and not historical attributes that are related directly with transactions.  The major issue with this comes about during validation and warehouse maintenance if data problems exist.   In order to validate that ETL processes are working properly, one now has to validate every row in fact table to make sure that the  historical attribute is correctly derived.   Also, because this attibute is stored at the fact level, we now have to add this attribute to any new fact tables when needed.  If, for any reason, problems are found in the historical attribute derivation, then the fact tables have to be dropped and reloaded, then re-validated.  Simply put, this solution should not be used if it can be avoided.

Another design which is quite popular is the use of time based, type 2 dimensions.  This solution involves storing attributes, along with the date range for which those values are valid.   So for example:

fig21a

This solution works just fine in conjunction with a fact table.  To pull the current value of an attributes, we only need to pull the latest record, usually marked with an end date of ‘12/31/9999′.   The only issues that come about with this approach is one, if you store more than one historic attribute on a dimension, then reporting history of those attributes without joining to a fact table cannot be accomplished.    From our example diagram above, notice that the Price Type field is accurately showing history, but Placement Type contains an extra record showing Aisle 5 starting on both 6/1/2001 and 7/1/2001.   One solution for avoiding this, is to store the Previous values for every historic attribute, and then only pull historic data for an attribute if the Prev value <> Current value.  The second issue often comes about when ETL is used to create history in the warehouse.   This may be the only possible solution, but allowing this can ’open up a can of worms’, and now makes validation a much more difficult process.

The recommended solution here actually involves setting rules to the type 2 dimension just discussed.   First off, only track the history of one historic attribute per dimension table.

fig21d

This simplifies the ETL and data storage designs, and also eliminates the confusion when you have more than one historical attribute on a table.    Storing the data in this way not only seems more natural, but also simplifies the validation and maintenance of the historical data.   If a problem is found, then quite simply, fix that problem in one place rather than dealing with more complex ETL procedure used to track multiple historic fields on one table.  Second, don’t use ETL to create history if it can be helped.   ETL created history only starts at the time that the ETL script is implemented, so you never get a full view of history unless it was implemented from the beginning.   Also, ETL created history is difficult to validate, and in the case that problems arise, fixing that history can also be a problem.   Of course, every implementation is different, and you may not have a choice.  All of the information in this article has been tested thoroughly through implementation via MicroStrategy 8.1.

About Ronald Park

Ronald has been working on Information Systems projects for just over 15 years now. He started out building custom Client-Server solutions at Andersen Consulting and soon moved into the world of packaged ERP systems specializing in SAP R/3. For the last 8 years, Ronald has been focused on delivering enterprise business intelligence solutions via packages like MicroStrategy, Cognos, and SAP BI.


Notice: compact(): Undefined variable: limits in /home1/microoat/public_html/blog/wp-includes/class-wp-comment-query.php on line 821

Notice: compact(): Undefined variable: groupby in /home1/microoat/public_html/blog/wp-includes/class-wp-comment-query.php on line 821

Notice: compact(): Undefined variable: groupby in /home1/microoat/public_html/blog/wp-includes/class-wp-comment-query.php on line 821

Warning: count(): Parameter must be an array or an object that implements Countable in /home1/microoat/public_html/blog/wp-includes/class-wp-comment-query.php on line 399

4 thoughts on “Historical Master Data Attributes

  1. This is nice…ETL work seems to be neglected, or adjusted, too many times after-the-fact.  I’m a big fan of “best practices” and learning from other people’s mistakes.

  2. Greetings,

    I certainly agree that recording history for dimensions is often overused really isn’t required based on real requirements (it however, often gets requested by people that think it’s better to ask for everything up front).  Having said that, if a customer requirement truly dictates type 2 SCD, then enforcing/suggesting that only a single attribute should be historically captured seems like an over simplification of real life.  I can’t think of many cases where if we had to capture type 2 SCD, that we only needed to do it for 1 attribute.  Once you have to do it, the work to capture others (depending on how you approach it) is not much extra work processing wise.  Not that I’m saying you need to capture every attribute, only one doesn’t see relaistic to me.

    Like I said, unless the user requires call for it, it’s best to avoid since in the end I have found it just makes reporting more difficult for  end users (who now have to then qualify what version to associate with related facts….this is of course if you allow them to link any version of a dimension to facts.  If you embed the surrogate key of the version into the fact, then there is no issue there).

    My two cents

    Cheers

    1. Tim, I agree it might be too much to ask for to have only one type 2 SCD per table. Good point. Just to be a little more specific, the situation that I was writing about was trying to report master data historically without any reference to a fact table. Probably not the most common situation, but at my current company, we wanted a data model that could handle this situation as well.

      One other idea is to store as many type 2 SCD’s as you want on the table, but also store for each field, the date that the type 2 SCD changed. If the type 2 SCD field does not change, then leave the date field blank. You can then create a view for each type 2 SCD field which will allow you to filter out only changes for that specific field.

      However, with all that said, I completely agree with you. If you only need to report the type 2 SCD’s in conjunction with a fact table, then store as many type 2 SCD’s on a table as you want, and don’t worry about what is written in this article.

  3. Hi Ronald,

    It’s a good suggestion and the scenerios vary by company to company.  I agree what ever you say.
    Do you mind to provide the solution basing on the above example please?

Comments are closed.