Tales of woe: Query performance
The nuts and bolts of putting together a decent data warehouse are not that hard. You just need to get the I/O right ;-)
OK, in reality there is a little more to it: you need to make sure you store the correct granularity of data, have all of the right constraints in place (plus all those other helpful bits of metadata that the query optimizer uses), have the right backup and recovery regime, some useful pre-built summaries, a blisteringly fast (preferably set-based) ETL load process, and enough grunt to deliver the data to the users. Not rocket science (but perhaps hard enough for people to want to employ me).
But the hard thing is to plan for those ad-hoc user queries created in a drag-and-drop query tool. Typically, the user picks columns from a palette of available data objects and drops them onto a canvas representing a report, graph, cross-tab or whatever. Sometimes, though, I have the sneaking impression that the intellect of the mouse they are using to create these reports somehow transfers to the user's brain. Just because it is easy to create a report on the screen does not mean it is simple to get the information back from the data warehouse.
Recently I have seen users looking for all of the customers that spent more than £1000 on a single item by looking at every item sold across 200 stores for a whole year, then wondering why the query takes more than a few hours to run. Looking at the SQL running against the database, we are hitting the base fact table, finding about a quarter of a billion sales lines and then using a HAVING clause to isolate the results we need. Given a little thought about what the data means, it is quite possible to do a few smart things: there is only one row per item per customer per store per day, so we don't really need to aggregate and use a HAVING clause at all, and it would be much better to filter the items first to include only those that cost more than the threshold value. That way we just do a bitmap index join on a big data set; still a lot of work, still not quick, but achievable.
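To make that concrete, here is a rough sketch of the before and after. The table and column names (SALES_FACT, ITEM_DIM, SALE_AMOUNT and so on) are purely illustrative rather than the real schema, and the date range is just a placeholder for "a whole year":

    -- The shape of the query the tool generated: scan every sales line for
    -- the year, aggregate, then throw most of the work away with HAVING
    SELECT s.customer_id, s.item_key, SUM(s.sale_amount) AS spend
    FROM   sales_fact s
    WHERE  s.sale_date BETWEEN DATE '2007-01-01' AND DATE '2007-12-31'
    GROUP  BY s.customer_id, s.item_key
    HAVING SUM(s.sale_amount) > 1000;

    -- The rethought version: one row per item per customer per store per day
    -- means no aggregation is needed, and restricting the item dimension to
    -- items above the threshold gives the optimizer a chance to drive a
    -- bitmap index join into the fact table instead of grinding through
    -- every sales line
    SELECT s.customer_id, s.item_key, s.sale_amount
    FROM   sales_fact s
    JOIN   item_dim i ON i.item_key = s.item_key
    WHERE  i.unit_price > 1000
    AND    s.sale_amount > 1000
    AND    s.sale_date BETWEEN DATE '2007-01-01' AND DATE '2007-12-31';

Whether the optimizer actually takes the bitmap/star plan depends on the usual suspects (bitmap indexes on the fact foreign keys, STAR_TRANSFORMATION_ENABLED, decent statistics), but the point is that the rewritten query at least gives it something sensible to work with.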
But it is not always the user's fault. I saw one slow query today where the user concatenated a store code and its name to create a meaningful long description and then used that in a query predicate. The query was, of course, slow, with a full table scan of the dimension table; change the predicate to a single-column match and it runs in less than five seconds using a bitmap index. Our fault was in not predicting that users would want to filter on long descriptions, and in not creating the required column and a suitable index in advance.
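Again, a sketch rather than the actual schema: STORE_DIM, STORE_CODE and STORE_NAME are made-up names and the literal values are invented.

    -- What the tool generated: wrapping the indexed column in an expression
    -- means the bitmap index on STORE_CODE cannot be used, so the dimension
    -- gets full scanned
    SELECT d.store_key
    FROM   store_dim d
    WHERE  d.store_code || ' ' || d.store_name = '123 Anytown High Street';

    -- The quick fix: predicate on the single indexed column
    SELECT d.store_key
    FROM   store_dim d
    WHERE  d.store_code = '123';

    -- The proper fix on our side: publish the long description as a real
    -- column with its own bitmap index so users can filter on it directly
    ALTER TABLE store_dim ADD (store_long_desc VARCHAR2(120));
    UPDATE store_dim SET store_long_desc = store_code || ' ' || store_name;
    CREATE BITMAP INDEX store_dim_ldesc_bix ON store_dim (store_long_desc);

On a recent release a virtual column, or a function-based index on the concatenation, would do much the same job without storing the description twice, but the lesson is the same either way: if users are going to filter on it, give them a column and an index for it.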