Partitions
Yesterday I rambled on about tables, indexes and tablespaces. In passing I mentioned changes to partitioning schemes would be another posting.... so here it is.
One of my customers operates a chain of outlets across the UK. Most outlets trade 7 days a week. In the original data warehouse sales records were partitioned by week (7 days) and 4-way hash sub-partitioned on item sold; remember that in 8.1.4 range/hash was one of the few sub-partitioning options available. The original developers chose sub-partitions to improve the manageability of the DWH - smaller objects, easier recovery etc. I think they also had some expectation of improved performance through partition elimination.
As part of the project to migrate and enhance the DWH we needed to add newly available fact data to the sales feed and hence redefine the table structure. This, to me, was the ideal time to revisit the partitioning scheme. Over the four years the old DWH had been running it struck me that the sub-partitioning scheme only addressed the issue of keeping objects 'small' and hence manageable. It did nothing for query performance and for the lowest level aggregate table, the batch build time.
Batch build time
Every day we receive a feed of sales by customer, location, product. etc - most records are for the day just ended but on occasion we get data for earlier days (problems polling the stores). After cleansing this data is appended into the base fact table and then the summary tables updated by truncating the whole week's partition (and sub-partitions) and appending the new aggregations. If data was received for a day in the previous week we had to truncate two weeks. The big performance hit here was that at the end of the week we needed build 7 days of summary data even if only one day had changed.
Partition elimination
The original customer requirement was that date structure was day/week/4 weeks/year. This fits well with the weekly partitions and can lead to partition elimination. But subsequently part of the business needed to report by calendar month; as calendar months do not align to weeks (ok, some Februarys do) We often found that we needed to read at least one extra week of data to just get the two or three missing days. The second 'problem' was the use of product as the sub-partition key. Most users were interested in multiple products, and the chances of all of the products of interest being in a single sub-partition was remote.
Mindful of the above I proposed that we moved to a single partition per day with no sub-partitions. True we gain three partitions extra per week, but we also get the following improvements:
Simplified batch summarisation - - For each day of data received in the batch: truncate the relevant partition and rebuild as a append parallel. No more building 7 days when we only need to do one.
Better query performance on calendar month queries
Better partition elimination on multiple product based queries