Working Through Some SCD 2 and 3 Examples Using OWB10gR2
A couple of days ago I mentioned that I wanted to take a look at SCD2 and 3 handling using OWB10gR2. This was prompted by the fact that up until now I hadn't really looked in detail at this feature, and I'd been reading a few articles and blog postings that suggested the way it works is a bit odd in the first release. In particular, this blog posting by Roelant Vos suggests that it doesn't update non-trigger columns in the way you might expect, and corresponding separately with Roelant there does appear to be an issue around how it handles changes to parent-level attributes.
In this example then, I first of all create a new user called GLOBAL_SCD and register it with OWB as a target schema. I then create a target module in OWB and point it at this new user. The data I'm going to use as my test case is based on the PRODUCT_DIM table in the Global Sample Schema, and I start off then by dropping in to SQL*Plus and creating a copy of the table, leaving a few rows out for insertion later on.
SQL> create table product_dim 2 as 3 select * 4 from global.product_dim 5 where item_id select count(*) from product_dim; COUNT(*) ---------- 30
This leaves me with 5 more rows left to insert, which I can use later on to test how the SCD mechanism works with new as well as changed data.
Next I go into Warehouse Builder 10gR2 and define two dimensions:
- PRODUCT_DIM_TYPE_2, defined as a SCD2 dimension, and
- PRODUCT_DIM_TYPE_3, defined as a SCD3 dimension.
For anyone now familiar with the PRODUCT_DIM table in the Global Sample Schema, it contains data corresponding to four levels (item, item family, item class and total product), with the bottom item level having three additional attributes (item package id, item buyer and item marketing manager). In my scenario, I want to check out how both the SCD2 and SCD3-handling works, but currently you can't mix and match type 2 and type 3 attributes in the same dimension, and so I'm going to set up PRODUCT_DIM_TYPE_2 to record all history for the product family and product class levels, together with the item marketing manager attribute; for the PRODUCT_DIM_TYPE_3 dimension, I'll be recording the previous and current values held in the item buyer field
Once I've created the basic definition for each dimension, creating the dimension attributes, levels, hierarchy and so on the same for both dimension versions, I then go into the Data Object Editor to fine-tune the SCD handling.
Starting with the Type 2 dimension, I click on the SCD tab and then click on the "Settings..." tab next to the SCD2 radio button.
Then, when I view the settings, I can designate the DSC attributes from the Family and Class levels, together with the Item Marketing Manager attribute, as ones that trigger new rows to be written. Through a process of trial and error I established that, if I designate an attribute to trigger history in a level - class or family, in my example - you need to have effective and expiration dates on every level concerned, having these dates just once on each row isn't enough. This is because of the way that OWB creates dimensions such that they can be easily snowflaked - by having creation and expiry dates for each level, you can easily separate them off in to separate tables. When you do this through you'll have to re-bind the dimension to the table to create the additional columns.
Notice also that I've left item buyer and item package ID blank - what will happen with these attributes if new values come in? - as I have also with the ID columns for the class and family levels - these cannot be designated as trigger columns as they are already business keys, the change in hierarchy can only be detected through the description value changing.
I then go to the Type 3 dimension and edit the SCD settings for this one. In this case all I do is designate the effective date column, and pick "Previous Item Buyer" as the new attribute that old copies of Item Buyer will be moved in to.
Notice how the SCD wizard has added in columns for all the other attributes that I might want to track history on? To remove these, I go to the Attributes tab for the dimension, right-click on the ones I don't want, and remove them from the attribute list.
Then, when I re-bind the dimension and view it again in the SCD settings dialog, the additional, redundant PREVIOUS_ attributes have disappeared.
Now I switch over to the mapping editor and build mappings to load my two dimensions. For the Type 2 dimension, I map the columns from the PRODUCT_DIM table to the relevant dimension attributes, and map a constant using the expression SYSDATE to the effective date column - this provides a value for 'starting off' dimension rows if a trigger attribute changes. Warehouse Builder takes care of the effective date attribute automatically, when it closes off an expired row.
Notice how there's nothing Type 2-specific about the attributes, except of course there's now these two new date attributes, effective and expired date, and we just map to the dimension as normal. If you select the dimension operator though, and click on the "Expand Child Graph" button on the toolbar, you'll find that the dimension operator is actually a pluggable mapping that implements the splitting, merging, inserting and so on that's needed to implement this feature behind the scenes.
This is effectively what you'd have had to have built by hand with earlier versions of OWB; now it's just wrapped up as a pluggable mapping that presents a dimensional interface to the rest of the mapping.
It's similar with the Type 3 mapping. In this case however, there's a new PREVIOUS_ITEM_BUYER attribute that Warehouse Builder maintains automatically (you can't map to it) and an effective date attribute, which you populate as with the Type 2 mapping.
Selecting the dimension operator and clicking on "Expand Child Graph" shows a similar mapping that implements the Type 3 load.
So now we can perform the first load. After running the Type 2 and then Type 3 mappings from the Control Center, I bring up the dimension viewer and take a look at the first run of the Type 2 dimension.
Looking good. Checking the Type 3 dimension gives me the same view. Now if I use the data viewer against the underlying table, I see that it's created a "solved" dimension, where there are additional rows in the table, with negative DIMENSION KEYs, to represent the different higher levels in each hierarchy. This allows Warehouse Builder to create fact tables against any level in the dimension hierarchy, not just the bottom level, which was a restriction in earlier versions of the product.
Notice that the effective date attributes for the higher level rows (with a negative DIMENSION_KEY) are only populated for that level and higher, so that for example the rows for the class level only have the CLASS_EFFECTIVE_DATE column filled in, with the FAMILY_EFFECTIVE date and ITEM_EFFECTIVE left blank.
Now comes the interesting bit. I'll go back to the PRODUCT_DIM table and make a few changes:
-
I'll pull across two more rows, with ITEM_IDs of 43 and 44, from the source table in the Global Sample Schema.
-
I'll amend the row with ITEM_ID = 20, to change the ITEM_MARKETING_MANAGER to "Rittman", the ITEM_PACKAGE_ID to "Home Multimedia", the FAMILY_ID to 13 and the FAMILY_DSC to "LCD Monitors", which should create a new entry in the Product hierarchy.
-
I'll amend the row with ITEM_ID = 21, to change the ITEM_BUYER to "Berbatov", the ITEM_PACKAGE_ID to "Home Multimedia", and the FAMILY_ID and FAMILY_DSC to the same new values that I've used for ITEM_ID = 20.
Now what I'd expect here is the following.
-
There should be two new entries in the dimension for the two new ITEM_IDs.
-
The row that currently houses the dimension member with ITEM_ID = 20 should be closed off, and a new row written with the new ITEM_MARKETING_MANAGER and the new values for FAMILY_ID and FAMILY_DSC. I'd also expect the FAMILY_ID and FAMILY_DSC changes to propagate through the "solved" rows in the underlying table for the Family, Class and Total Product levels.
-
For the changes to the row with ITEM_ID = 21, I'd expect one row still to remain with the existing value for ITEM_BUYER moved to the PREVIOUS_ITEM_BUYER column, and the new value in it's place. I don't know what will happen with the other columns, we'll have to wait and see.
I drop into SQL*Plus and make the changes to the PRODUCT_DIM table.
SQL> update product_dim 2 set item_marketing_manager='Rittman' 3 , item_package_id = 'Home Multimedia' 4 , family_id = 13 5 , family_dsc = 'LCD Monitors' 6 where item_id = 20 7 / 1 row updated. SQL> update product_dim 2 set item_buyer = 'Berbatov' 3 , item_package_id = 'Home Multimedia' 4 , family_id = 13 5 , family_dsc = 'LCD Monitors' 6 where item_id = 21 7 / 1 row updated. SQL> insert into product_dim 2 select * 3 from global.product_dim 4 where item_id in (43,44) 5 / 2 rows created. SQL> commit; Commit complete.
Now I run the two mappings again to load in this new and updated data, and then check out the results.
Starting off with the Type 2 dimension, I first check out the dimension object viewer and notice that the entry with Monitor - 17" Super VGA appears in both the new LCD Monitors family and also the old, now redundant, Monitors family. The 19" Monitors one is OK though. What's happened?
What's different about the two rows? The row with a 17" Super VGA monitor has an ITEM_ID of 20, the 19" Super VGA monitor is ITEM_ID = 21. Looking back at the update statement, we changed the Marketing Manager (the trigger column) and Package ID, as well as the Family ID and Description, whilst for the row with an Item ID of 21 we changed just the Item Buyer and Package ID, neither of which were trigger columns for the Item level, plus the Family ID and Description. Presumably a new row has been written for the first item as the trigger has 'fired', but it didn't for the second, even though the family ID and description changed. Let's look at the data and find out.
There you go. None of the changes propagated through for ITEM_ID = 21, as none of the ITEM-level attributes that were triggers were changed. With the row with ITEM_ID = 20, a new row has been written with the old one expired off with an ITEM_EXPIRATION_DATE set. The dimension viewer still displays the expired row though, it seems to ignore the expiry date for the level. But what about the rows also in the dimension that are specifically for the FAMILY level?
Well the new member of the Family level has come through ok. So, we've got two issues so far - as Roelant found out, attributes that aren't marked as trigger columns don't have their values updated unless a trigger attribute also in that level changes. Secondly, the dimension viewer in OWB still seems to show members that have expired off, leading to duplicates when working with SCD 2 handing.
Looking at the Type 3 dimension though, the two items we just amended are now within the LCD Monitor family, and the Monitor family has now disappeared.
That's strange though as only Item Buyer is a trigger column, and only the row with ITEM_ID = 21 had a change of Item Buyer. I wonder what happened? Let's take a look at the data again.
So with the row with the Item ID of 20, it's still picked up on the change of Family Description and Item Package ID, even though the trigger column didn't change. That's what I'd want though so no issue there.
Now this first test is complete, I want to carry out some other tests on issues that I'd heard mentioned:
-
If you make a change to a higher level description, say FAMILY_DSC in my case, OWB doesn't close off the previous versions of the levels below this to reflect that they once were once children of a differently named dimension member, and
-
If you don't mark an attribute as triggering history, changes to it won't get reflected in the dimension entry.
Starting with the first observation, I go back to the PRODUCT_DIM table and amend all the PRODUCT_DSC values that are currently "Modems/Fax" to just "Modems". This should affect two rows, with an ITEM_IDs of 24 and 33.
SQL> update product_dim 2 set family_dsc = 'Modems' 3 where family_dsc = 'Modems/Fax' 4 / 2 rows updated. SQL> commit; Commit complete. SQL> select distinct(family_dsc) 2 from product_dim 3 / FAMILY_DSC -------------------- CD/DVD Documentation Accessories Desktop PCs Memory Modems Operating Systems Portable PCs LCD Monitors 9 rows selected.
Now when I load the Type 2 dimension, I would expect the the ITEM level rows to be closed off and new ones written with the new FAMILY_DSC, and the "solved" FAMILY, CLASS and TOTAL_PRODUCT levels to be similarly affected. I run the load and see what's happened.
First of all I check the dimension viewer, and this has worked fine - the old family description of "Modems/Fax" has disappeared to be replaced with just "Modems". So far so good.
Checking out the table underneath the dimension and firstly just checking the rows with ITEM_ID in (24,33), I see that Roelant is correct - the record for this item just shows the new family description, losing it's earlier historic value.
However, if I expand the view to look at all rows with FAMILY_ID = 9, i.e. bring in the solved rows for the higher level, you can see the row with "Modems/Fax" closed off and a new row written.
So functionally, whilst the user will see the correct new value in the dimension viewer, the historical old value for the family description is lost when only referring to the bottom level row, which is what most query tools would only refer to.
On to the second issue, where attributes that aren't marked as trigger columns aren't updated in the dimension - this is potentially quite a serious issue, as you'd need to split these cases off from the SCD 2 and 3 load and process them separately, or lose the changes. In my earlier example, this indeed did seem to be the case with a Type 2 load, but the values did come through with the Type 3 load.
To create a new test case, I go back to the PRODUCT_DIM table and set the ITEM_MARKETING_MANAGER to "Robinson" (which should get picked up by the type 2 load, as marketing manager is a trigger column) and the ITEM_BUYER to "King" for the row where ITEM_ID = 31.
SQL> update product_dim 2 set item_marketing_manager = 'Robinson' 3 , item_buyer = 'King' 4 where item_id = 31 5 / 1 row updated. SQL> commit; Commit complete.
Let's see what happens. I run the Type 2 load and then check out the relevant rows using the data viewer.
That's good - the new row, triggered by the change in item marketing manager, also contains the new item buyer. But what if we just change the item buyer for another row, and not a value that's in one of the trigger columns?
SQL> update product_dim 2 set item_buyer = 'Chimbonda' 3 where item_id = 33 4 / 1 row updated. SQL> commit; Commit complete.
What happens now when I run the mapping?
Roelent's right - the change in Item Buyer doesn't get picked up on, as it's not a trigger column. What this means is that if you want the new values for an attribute to be recorded at all, they've got to be a trigger column - if they're not they will get forgotten. Roelent suggests a method of splitting off rows with changed non-trigger columns and updating them separately, this would make sense but it's a bit of an annoying workaround.
Let's do a similar test with the Type 3 dimension. Item Buyer is the trigger column here and so should trigger a change for ITEM_ID = 31, but let's create a new example with ITEM_ID = 34 where we just change the marketing manager on it's own.
SQL> update product_dim 2 set item_marketing_manager = 'Mido' 3 where item_id = 34 4 / 1 row updated. SQL> commit; Commit complete.
Running the Type 3 load gives me the following rows in the data viewer.
In this case, the Item Marketing Manager change got pulled through OK, even though it's not a trigger column and even when, in the case of ITEM_ID = 34, the trigger column didn't even change.
To sum up then, what I think I've found as follows:
-
For Type 2 loads, changes to non-trigger columns don't get propagated through to the dimension rows, unless the trigger column for that level changed.
-
However for Type 3 loads, changes to non-trigger columns always get propagated, even if the trigger column doesn't change.
-
For Type 2 loads, if attributes of a higher level change in a row, this doesn't trigger a new entry for that particular row, instead the new higher-level attribute value just over-writes the old one. When you find the row elsewhere in the solved dimension specifically for that higher level, that row gets expired and a new one written, but this history is lost to the child row below it.
-
When a detail-level, child row gets expired off by the Type 2 load process, the dimension viewer still displays it, creating duplicates in the dimension hierarchy when viewed through the dimension viewer. The data in the underlying table is still OK, but the viewer displays the hierarchy wrong.
So how do you work around this? Well as Roelant suggests here, if you're doing a type 2 load and the possibility exists that a non-trigger column can change without a trigger column changing in the same level, you'll have to update the dimension separately to handle these cases; alternatively, make all the attributes in the level trigger attributes, which is probably what Oracle would suggest you do.
With the issues around new rows not being written for child dimension members when the parent level attribute changes - well that's just a shortcoming of the first implementation, as is the quirk with the dimension viewer. I expect Oracle development are aware of this and it'll be addressed in a forthcoming patch release of OWB 10gR2. I've also uploaded the MDL file here if you're interested in working through the examples.