Summary table design

Most data warehouses used for reporting and built on conventional database technology such as SQL Server, Oracle and DB2 use some form of summarisation to boost query performance. In principle we are doing once, in the batch load and refresh process, what would otherwise need to be done for each query at run time if that summary did not exist. The key benefit of the summary is the saving of the IO required to read the detailed information, and possibly the IO associated with sorting a large volume of data as part of the aggregation.
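
To make that concrete, a summary built once in the batch window might look something like the sketch below; the SALES_FACT table and its columns are invented for the example rather than taken from any particular schema.

    -- Built once per load/refresh cycle; month-by-product queries can then
    -- read this small table instead of scanning the detailed fact rows.
    CREATE TABLE sales_month_product_sum AS
    SELECT product_key,
           TRUNC(sale_date, 'MM')  AS sale_month,
           SUM(sale_amount)        AS sale_amount,
           SUM(sale_quantity)      AS sale_quantity,
           COUNT(*)                AS detail_row_count
    FROM   sales_fact
    GROUP  BY product_key, TRUNC(sale_date, 'MM');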

But how do you choose the summary tables you need? Obviously, if the business only ever runs one report, then a single specific summary to drive it would be ideal. But most DW systems carry a significant ad-hoc query load and choosing the best aggregates is not quite so simple; building a summary for every possible access path is not going to be viable.

We are fortunate that modern technology can help us mitigate the need for an excessive number of aggregates. Many query tools have 'aggregate awareness', that is, the ability to map a query to a different source table if that would result in a more efficient extraction of data from the database. And Oracle has the ability to transparently rewrite queries so that they access more efficient summaries; in Oracle 8 and 9 this was limited to Materialized Views, but since Oracle 10g other rewrites can also be defined. With either approach we use the best available summary, which is not the same thing as having an exact-match summary.
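
On the Oracle side, the materialized view route might look like this minimal sketch (the table and column names are again only illustrative); with query rewrite enabled, a query written against the detailed fact table at month and product level can be redirected to the smaller object without the report or user knowing.

    CREATE MATERIALIZED VIEW sales_month_product_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
      ENABLE QUERY REWRITE
    AS
    SELECT product_key,
           TRUNC(sale_date, 'MM')  AS sale_month,
           SUM(sale_amount)        AS sale_amount,
           COUNT(*)                AS row_count
    FROM   sales_fact
    GROUP  BY product_key, TRUNC(sale_date, 'MM');

    -- Queries against SALES_FACT that aggregate to month and product can now
    -- be rewritten to use this MV when query rewrite is enabled for the session.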

So back to our question: how do we choose the summary tables we need? The first thing to consider is what users are likely to want to know; users often have specific business interests which make certain combinations of dimension level unlikely. For example, in retail not many people interested in product supplier also have a great interest in specific named customers, but they might well have an interest in customer demographics. This type of analysis allows us to discount the more implausible combinations, but we will still arrive at an overly large list of summary candidates.

Here a good approach is to estimate the size reduction between the aggregate and its source table and, perhaps, weight this by a predicted frequency of use. If the summary is not appreciably smaller than its source it is probably not worth building; that is, the IO from hitting the summary is not that different from doing the calculation from source. I think that a size ratio of 10:1 is borderline and good candidates should be 50:1 or better. The only real way to calculate the size ratio is to use queries against the data set; remember that the source data is probably sparse along several dimensions and aggregation reduces that sparseness. The third thing I consider is that choosing an aggregate that can answer three queries with simple further aggregation is probably better than building a separate aggregate for each of those three queries.
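
One simple way to get that ratio is to compare the detail row count with the number of distinct grouping combinations the aggregate would hold; a sketch using the same invented names as above:

    -- Rough size-ratio check: detail rows versus the rows the aggregate would contain.
    SELECT detail_rows,
           summary_rows,
           ROUND(detail_rows / summary_rows, 1) AS size_ratio
    FROM  (SELECT COUNT(*) AS detail_rows FROM sales_fact),
          (SELECT COUNT(*) AS summary_rows
           FROM  (SELECT DISTINCT product_key, TRUNC(sale_date, 'MM')
                  FROM   sales_fact));

By the rule of thumb above, a result of around 10 is borderline and 50 or more marks a good candidate.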

So in summary, good aggregate tables should be considerably smaller than their source fact tables. If the size reduction is not significant then consider the use of suitable indexes instead.
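
As one example of that alternative, on Oracle a bitmap index on a low-cardinality dimension key of the detail fact table is a common choice; the index name and column below are, as before, only illustrative.

    -- Indexing the detail fact table instead of building a marginal aggregate.
    CREATE BITMAP INDEX sales_fact_prod_bix
      ON sales_fact (product_key);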