Monday, May 3, 2010

Slowly changing dimension

As part of the data mart design process, you should identify which of your dimensions will be changing over time, and which will be static. One you have identified the changing dimensions, the next design decision centers on the tracking of history in dimensions. You have two options; you can decide to store history about changes in a dimension, or to simply keep the most recent information, and not store the history. The exact decision you make really depends on the nature of the change, and how the data mart should reflect the change. Sometimes storing history is appropriate, in other cases it's not necessary. No matter which decision you do make, it shouldn't be influenced by your ability to create the necessary mappings.

There are five basic implementations for dimension tables in a data mart, and each has a corresponding mapping model in PowerCenter. This table contains a summary of the characteristics of these basic types.
Dimension Type
Change Processing
History Tracking
Notes

A.
Static
None
None

B.
Slowly Growing
New records added
None

C.
Slowly Changing, Update with Changes
Content updated
None
commonly known as a "Type 1" slowly changing dimension

D.
Slowly Changing, Versioning of Changes
Content updated, new records added
Full
Commonly known as a "Type 2" slowly changing dimension

E.
Slowly Changing, Keep Current and Previous values
Content updated
Partial
Commonly known as a "Type 3" slowly changing dimension


Usually, a dimension like time will be static (although it may need to be refreshed occasionally to extend it with new entries). Examples of some other common static dimensions are transaction types, shipping method, and status dimensions of various types. These dimensions fall under type A in the table above.

Examples of dimensions that change over time are the customer dimension, product dimensions, and many dimensions that include organizational hierarchy information. This last group is important, since companies regularly reorganize themselves. In the event of a reorganization, it's necessary to update the dimension to reflect the new organization, sometimes the old one as well.

Changes in dimensions fall into two subcategories. The first type of change is growth - simply adding new dimension records over time, like new products or customers. This is type B in the table. The second type of change is a change in the content of the dimension records, and this is the case where a decision needs to be made about tracking history. Types C, D, and E in the table all deal with changes in the content of the dimension, and provide different options for tracking history.

To demonstrate the mapping logic used in these situations, this article will cover examples of types B, C, D, and E. In this case, they will all be variations of a customer dimension, since it's probably the most easily understood example, allowing us to focus on the implementation. In addition, a customer dimension is often required in a data mart project, so it provides a real world reference.

Basic Building Blocks for Dimensions in PowerCenter
The next issue we have to deal with is how to implement these mappings in PowerCenter. Let's take a look at the same dimension types again, but this time in terms of the per row operations required at transformation time for each case.

Dimension Type
Row Level Operation
Change Detection
Notes

A.
Static
Insert

B.
Slowly Growing
Insert if new
None

C.
Slowly Changing, Update with Changes
Insert if new, update if changed
Optional
Can update entire record instead of detecting changes.

D.
Slowly Changing, Versioning of Changes
Insert if new, Update and Insert if changed
Required
Requires up to two operations at the target per source row

E.
Slowly Changing, Keep Current and Previous values
Insert if new, Update if changed
Required


All these dimensions have some characteristics in common, that provide clues about the general structure of the mappings. First of all, on a row-by-row basis, the mapping needs to decide the appropriate operation at the target, either insert or update. This means the mapping needs to take advantage of the per row operation control available in PowerCenter, using an update strategy transformation. In addition, when updates of an existing row are involved, the primary key for the row that is already in the target needs to be determined. Normally, you use a Lookup transformation to find out whether a record already exists in a target.

One of the trickier issues to deal with is identifying whether or not there has been a change since the last refresh. If the source actually provides a change indicator, this is simple. It's also rare - in most situations, it will be necessary to compare the new source with the existing dimension data, to determine if and where a change has occurred. We can do this with a lookup transformation that retrieves the 'old' data from the target, and an expression transformation to compare the current and old values. As a general rule, not all changes in the source are relevant, only some, so 'change' is an application specific thing. For example, a customer note field in the source might be irrelevant, but an address change is important.

Finally, for type D in the table, we may need to perform two operations on the target for a single source row. This means bringing two instances of the same target into the mapping, and splitting the data flow so it goes to both instances. Filter transformations will provide control over which paths the data takes.

This gives us our basic building blocks for dimensional mappings:

Update Strategy transformation to decide insert or update
Lookup transformation to determine whether records already exist in the target
Lookup transformation to get the previous values from the targets
Filter transformations to control the routing of the data to the target
Expression transformations to implement decision logic like comparisons

GEN AI

  Stay Tuned....