DW Design (2) – staging data

As mentioned yesterday, the staging area of the data warehouse has three functional uses:

  • It is the initial target for data loads from source systems
  • It validates the incoming data for integrity
  • It is the data source for information to be published to the ‘user visible’ layers of the data warehouse

Optionally, it may also be where the logic to transform incoming data is applied.

Data loading: Early Oracle data warehouses used one of two techniques to populate the staging layer: reading in flat files and loading them with, say, SQL*Loader, or querying source systems directly over database links. Although these methods are still widely used, subsequent versions of Oracle have offered more sophisticated approaches. SQL*Loader gained direct path load options, and other techniques became available. From Oracle 9i onwards it became possible to query flat files directly from the database as external tables, and technologies developed (or enhanced) for database replication and recovery added some very useful techniques for capturing incoming data; in particular, log mining and transportable tablespaces offer some powerful features within data warehouse data loads.
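
As a sketch of the external table approach (the directory, file, table and column names are purely illustrative), a flat file is exposed to the database and then loaded into the stage table with plain SQL, transforming on the way in if need be:

    CREATE OR REPLACE DIRECTORY stage_dir AS '/u01/feeds/incoming';

    CREATE TABLE ext_sales_feed (
      txn_id       NUMBER,
      customer_id  VARCHAR2(20),
      sale_date    VARCHAR2(10),   -- converted with TO_DATE on the way in
      amount       NUMBER
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY stage_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
      )
      LOCATION ('sales_feed.csv')
    )
    REJECT LIMIT UNLIMITED;

    -- Direct path insert into the stage table, with a simple transformation
    INSERT /*+ APPEND */ INTO stage_sales (txn_id, customer_id, sale_date, amount)
    SELECT txn_id, customer_id, TO_DATE(sale_date, 'YYYY-MM-DD'), amount
    FROM   ext_sales_feed;

Loading from the external table with a direct path insert keeps the load fast while still allowing SQL transformations as the data arrives.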

Once loaded into staging tables it is a good idea to check the validity of the data. Minimally, we check for duplicated data and that all of the foreign keys have parents; this can be further extended to range-check attribute values. For reference data I favour checking that the whole of the loaded reference set is self-consistent; that is, all hierarchical relationships (such as a state having a country as a parent) are present. For fact data I favour validating against the data already published in the raw data layer. Using enabled foreign keys for referential integrity checking may not be the best approach for performance. One technique I have used in the past is to load the stage table with all of its constraints disabled and then to enable the constraints with an EXCEPTIONS INTO clause; this captures the rowids of the rows that fail validation. Simplistically, we then copy the rows that fail validation into a ‘rejects’ table and delete them from the stage table. Remember, though, that it is quite possible for a row to be rejected for more than one reason. It is also good practice to datestamp the rejected rows and record a reason for rejection, if only to help support staff find out what is wrong with the data! Another technique to validate data is to outer join the stage table to its foreign key parents and then use a multi-table insert to separate rejected data (which has null values in the joined key columns) from the valid. Sketches of both techniques follow below.
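
A sketch of the EXCEPTIONS INTO approach, assuming illustrative table, constraint and column names: the constraint is enabled after the load, the failing rowids land in the exceptions table, and the offending rows are copied to a rejects table and removed.

    -- The default EXCEPTIONS table is created by the Oracle-supplied script utlexcpt.sql
    ALTER TABLE stage_sales
      ENABLE VALIDATE CONSTRAINT fk_stage_sales_customer
      EXCEPTIONS INTO exceptions;

    -- The ALTER fails if there are violations, but EXCEPTIONS now holds the rowids.
    -- Copy the offending rows to a rejects table, datestamped and with a reason...
    INSERT INTO stage_sales_rejects (txn_id, customer_id, amount, rejected_on, reject_reason)
    SELECT txn_id, customer_id, amount, SYSDATE, 'No parent row in dim_customer'
    FROM   stage_sales
    WHERE  rowid IN (SELECT row_id FROM exceptions);

    -- ...then remove them from the stage table (a row can appear more than once
    -- in EXCEPTIONS if it violates several constraints, hence the IN subquery)
    DELETE FROM stage_sales
    WHERE  rowid IN (SELECT row_id FROM exceptions);

The outer join alternative can be written as a single multi-table insert; again the names are assumptions for illustration:

    INSERT FIRST
      WHEN customer_key IS NULL THEN
        INTO stage_sales_rejects (txn_id, customer_id, amount, rejected_on, reject_reason)
        VALUES (txn_id, customer_id, amount, SYSDATE, 'No parent row in dim_customer')
      ELSE
        INTO stage_sales_valid (txn_id, customer_key, amount)
        VALUES (txn_id, customer_key, amount)
    SELECT s.txn_id, s.customer_id, s.amount, d.customer_key
    FROM   stage_sales s
    LEFT OUTER JOIN dim_customer d ON d.customer_id = s.customer_id;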

Normally we only publish data that does not already exist in the data warehouse. For reference data this implies that we look for changes between the already published data and the newly staged data. This can be achieved using the SQL MINUS operator followed by an ‘upsert’ MERGE of the resulting set of changed and new records. If we are tracking slowly changing dimensions we would insert both the new and the changed records with some form of date stamp to mark the effective date. Most DW systems would not delete reference items when they no longer exist in the incoming feed, simply because there may be historical fact data that refers to them.
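
A minimal sketch of the MINUS-then-MERGE pattern, assuming a simple country dimension (the table and column names are illustrative):

    MERGE INTO dim_country d
    USING (
      SELECT country_code, country_name, region
      FROM   stage_country
      MINUS
      SELECT country_code, country_name, region
      FROM   dim_country
    ) s
    ON (d.country_code = s.country_code)
    WHEN MATCHED THEN
      UPDATE SET d.country_name = s.country_name,
                 d.region       = s.region
    WHEN NOT MATCHED THEN
      INSERT (country_code, country_name, region)
      VALUES (s.country_code, s.country_name, s.region);

For a slowly changing dimension the matched branch would instead close off the current row and a new effective-dated version would be inserted, rather than updating in place.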

In the case of fact data we would generally insert it into the historic fact tables. There is a wealth of techniques we can use to do this, and they often depend on the design we have adopted for the raw fact layer. I’ll look at this next time.