Time series revisted (2)
Yesterday I wrote about the challenge of calculating YTD on a sparse dataset. Even in these days of low cost storage I can not justify the vast amount of additional storage required to persist a YDT figure for each sale of a product + customer pair for all available days; true we would only need YTD figures from the first sale of a pair, missing records implying a zero YTD value. Simplistically, we need a query of the form:
SELECT D.YTD_DATE, SUM(S.SALES_VALUE) YTD_SALES_VALUE FROM SALES_SUMMARY S, DATE_DIMENSION D WHERE S.SALES_DATE between D.FIRST_DATE_OF_YEAR and D.YTD_DATE GROUP BY D.YTD_DATEThe BETWEEN predicate could be replaced by an in list or we could replace the dimension table altogether with a pipelined function to generate the dates to be aggregated. But between is simpler to code and performs well. However this query is only applicable for total sales of "all products" to "all customers", adding extra dimensions increases the number of columns in the select and the group by. Adding the extra dimensions has two significant effects: it increases the amount of data that Oracle needs to manipulate in the query, and, because we are now referencing all of the dimensions at the lowest child levels, it inhibits query rewrite.
Another approach for YTD calculation is to calculate the sum of the individual dates' sales. This approach is especially viable in OLAP environments that have access to 'moving total' functions (I have used this approach with Oracle Sales Analyzer and that functionality is still available in Oracle OLAP). The downside of the 'sum of sums' method is that it can be hard to set-up in user query tools such as Business Objects, especially if you are using Oracle query rewrite to control summary usage.
For a recent customer demonstration I had to set up YTD functionality in Oracle BI suite EE. The underlying database exposed a single low level table for the sales summary and used query rewrite to select the best match summary for the query. At first, setting up YTD measures that still allowed query rewrite seemed too much of challenge. But in the end I implemented a very simple solution. I created a new physical table in the Siebel Administration tool as a select statement; this is functionally similar to a database view except it does not exist in the database! The definition joined my SALES data source and my date source and included all of the dimension keys in the select but only the time series key in the group by. This is then mapped into the logical layer as a further source to the sales fact table and new measure columns created for the YTD totals. These YTD measures are in turn is published to the presentation layer. This is surprising performant in the Answers tool, high level queries such as "what is my total sales value this year hit" suitable aggregates and when people drill down to how many video cameras were sold in Slough in the first 11 weeks of the year then the bitmap indexing reduces the result set size to manageable proportions.