Time series

Of all dimensions, time is perhaps the most common. In fact I would be hard pressed to think of a data warehouse that did not encompass time. Inventory, sales, phone calls, web click-throughs, energy usage or whatever, they all hold measures (quantities) along a time line. So you’d think that time series aggregates (such as year to date or moving quarterly total) would come naturally to a data warehouse, sadly this is not always the case.

Calculating total sales for the current year (year-to-date, YTD) is (conceptually) not particularly hard – just add the total sales for each day.

There are two fundamental approaches to providing such aggregates over time: pre-calculate the aggregate or do the sum on-the-fly. On the face of it pre-calculation looks very attractive; expensive summations are done just once, in the batch window. For high levels summaries (all products, all customers) this approach works well, but there is a problem when you try to calculate year-to-date by customer, by product. At low levels of aggregation, data warehouse data is sparse: if you have 200 stores, 60,000 products and a million customers you will find a) most customers only visit one or two stores, b) they don’t shop every day and c) they only buy a few of the products. Clearly, it would be nonsensical to store an YTD figure for each available combination (that is 200 X 60,000 X 1 million values) as most values would be zero. Calculating the values for the combinations that occur looks a better bet but sparsity again causes us a problem. Suppose that on January 10 I sold item A to customer X for $10 (my first sale of the year!) – so, my YTD for customer X and product A is $10. The next day customer X does not shop but I still need to include that sale in my YTD calculation. That is once a sale is made I need to maintain an YTD value for the reminder of the year. To make maters worse we don’t need just a current YTD figure but one for each time slice in the data warehouse; without this it would be impossible to provide year on year comparison reports. And even worse, what happens when a store has a communications failure and sends it data to the data warehouse 10 days late and we need to revisit previous YTD values?

Calculating YTD on-the-fly has the problem that (potentially) a large amount of data needs aggregating – worst case you need to total a whole year. But you do not have problems with sparsity loss and storing masses of seldom used data. The other downside here is that many of the common reporting tools expect pre-calculated values since time series aggregate functionality is lacking in the products. This lack can be circumvented by joining queries to ‘list of dates’ style tables or using functions to return date lists. Oracle OLAP and Oracle Express both have the functionality to calculate time series aggregates, this is an option worth considering if you need time such aggregates and find the brute-force method of reading most of the database unappealing