Partitioning Fact Tables, Part 1
I'm dogmatic about certain aspects of data warehousing. For instance, fact tables should be range partitioned by DATE. I tell my clients all the time: you will have a very difficult time persuading me otherwise. But they always try: they argue about all the attributes that are more pervasive than DATE: customer classes, transaction types, etc., etc. But I'm just not buying it. We are building data warehouses, and the third rail of the Soul of the Data Warehouse is how it handles time.
If you agree with me about this precept (and I really think you should), this is still not the end of the story. We must charge ahead into the lion's den of a debate that has been raging in the Oracle data warehousing world for years: do we make the surrogate key of our date dimension a NUMBER, or do we make it a DATE? It's funny... I remember this being the first question I ever posed to Mark years and years ago, and he did a blog entry that evolved out of our email communication. I don't see the entry on the blog any more... it must have been lost in The Great Blog Disaster. Pity.
The choice between NUMBER and DATE bubbles up from the two streams at work in the Oracle Data Warehousing community: the data warehousing folks, and the Oracle folks. Ralph Kimball argues that the surrogate key of the date dimension should be numeric. In the Data Warehouse Lifecycle Toolkit book (or at least, in my edition of it), Kimball basically makes the argument that numbers require less space than dates. That one never did too much for me. However, in his Latest Thinking on Time Dimension Tables design tip, he makes a better argument: if our surrogate key is a DATE, then how do we handle "Not Applicable" type rows? This one has teeth, and I think that most designers who struggle with this decision point to this issue. If we use an actual DATE as our surrogate key, then what value can we use that actually means "no date at all"?
Oracle experts like Tom Kyte argue that "dates belong in DATES". (If you look really hard at this post, you can see a younger and more naive version of myself weighing in on the debate... and also, apparently, not knowing how to gather histograms with DBMS_STATS. Oh well.) As Tom demonstrates on that post, the optimizer just plain works better when dates are stored in DATE datatypes.
I've typically been on Kyte's side in this debate, both from a performance and a maintenance perspective. I've parted ways with Kimball on this point and urged my clients to build date dimensions with DATE surrogate keys, calling the column something like DATE_KEY. For the 'NA' types of dimension records, I use a wacky DATE value for DATE_KEY, such as '12/31/9999' or '01/01/0001'. Think of this as the equivalent of -1 if the surrogate key were actually numeric. Being a surrogate key... it really doesn't matter what value it contains: we just need to know the column name so we can construct the correct JOIN syntax. Then, I'll build another DATE column in the table called SQL DATE, and this is the one that I expose to the reporting layer. Since SQL DATE does not have to serve as the primary key, it's fine for it to be a NULL if desired.
In subsequents posts, I'll examine new partitioning features in 11g, including interval partitioning (which Pete Scott recently blogged about), and also reference partitioning, and whether these enhancements provide more options to this historically binary choice.