Time series problems
A while back David Aldridge and I both blogged about the joys of time series analysis on sparse data; that is, the sort of data where maintaining and storing moving totals within the data does not make sense in terms of space usage. One common technique to resolve this type of problem is to have some form of helper table to generate a list a dates - effectively changing a single date to a range of dates.
Recently, I have been looking at performance problems in a customer's data warehouse where single day selections are performing particularly poorly. At the time of implementation they were having problems with year-to-date (YTD) queries and their BI consultant suggested building a table to contain lists of dates - so if the first day of the year was 1 January then there would be one row for YTD the next day would have two rows, the 1st and the second of January and the 31st of December would have one row for each day of the year. But the customer decided to extend this idea to include YTD (last year) and at other levels of date aggregation such as week and period. So they added in a column to indicate the level of aggregation (day week etc) and another to to indicate the type of time series (YTD etc) being used. So what was basically a join on a date to bring back one or more days became a lot more complex. And then the master stroke, why not also include the current date and same date last year and then we could answer any query by joining to our time series table and supplying the series type (perhaps from a drop down in the query tool). So now we supply three predicates: the date, the time series and the date level. And we now expect the query to return either exactly one row, or a number of rows between 1 and 365 or a number of rows between 1 and 52 or 1 and 13 or 1 and 12 or 1 and 4 rows - but how can we tell the optimiser that.
My problem is the case where we return exactly one row from the time series table. I could look at some fancy indexing and associated optimiser stats or I could go with simple way out... just don't use the time series table for current date or same date last year queries