Query tuning

Yesterday I mentioned a query that would not eliminate unused partitions and resulted in a somewhat large nested loop join plan that would take hours to return. A kind of "find all the rows that match predicate B and then filter the result set on a range of 7 days (out of 800)" type of query.

I have a naive approach to query tuning - I assume that hints do not exist and the performance problem is solely caused by the inadequacy of my optimiser statistics. That is not to say that I never hint queries - I do as an aid to proving that things can be improved - but the endgame is to present a plain, unhinted query that can be generated by a query tool on someone's web browser. True, plan stability is an option, but in a BI context the number of unique queries being asked dilutes the value of this method.

The problem I was dealing with was basically simple. The query optimiser had made an outrageous (for the data) assumption on cardinality. In this case we looking for an attribute on customer that can have one of 40 values, including "Not Applicable" for the majority. If the attribute values was evenly distributed we would expect each attribute to appear 20,000 times in our 800,000 row reference data table; but in reality most values only occur a few thousand times at most, and some attributes appear less than ten times. The statistics collected at the column level gave the number of distinct attributes and the range of values stored. By changing the way statistics are collected we can model the data distribution more exactly through use of a series of histogram buckets. In this case the simple change of one argument in a call to DBMS_STATS.GATHER_TABLE_STATS resulted in the expected partition elimination to occur with the query dropping to sub-minute times. The other change I introduced for this customer was to use the feature in DBMS_STATS where the existing stats can be written to a stats table, invaluable if you need a quick way to get out of trouble if the newly gatherer stats makes things worse - why more people don't use this feature I don't know.

Straw poll question - is it worth collecting histograms on partitions as they only would be used by queries that access a single partition by partition key - the global histograms would be more useful as they will be used by queries that access multiple partitions or single partitions where the key can not be resolved at parse time. I'll comment on that another day.