Fixing up performance
Last week I mentioned one of the reasons why DW performance drops: users start to use different queries.
One of my customers had a design principle that 'raw data' should look like its source. So where we had a feed of invoice headers and invoice lines we kept the classic master / detail, two-table approach: the who and where in the header table and the what in the line table (of course there was a bit more to it than that). The other requirement was that there were no indexes on the raw data, as people don't need to see it (most of the time). Well, that was a couple of years back. Since then the individual outlets have started looking at their most profitable customers on a rolling-quarter basis, and we end up with hundreds of queries per day hitting two massive tables (the line table has upward of 500 million rows) with no keys to link master to detail, or even to search on customer. At least both tables are partitioned by day, so as long as a query supplies a date to both invoice tables we get some partition pruning.
I have a few days spare and a 1.5 TB test system, so I am taking some time to play with options to improve things. The current problem query takes a single customer and a range of 12 weeks, identifies the invoice numbers, and then finds the items sold and their value. I am not going to rewrite the user's query - it came from a reporting tool, and besides, fixing individual queries (unless they are obviously broken) is not always a sound strategy: other users will come up with their own queries, and helping just one user may not be the best way to spend time.
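To make the discussion concrete, the shape of the query is roughly the sketch below. The table and column names are invented for illustration (the real schema is different), but the important features are there: the join on invoice number, the filter on customer, and the date predicates on both tables that buy us partition pruning.

```sql
-- Rough shape of the problem query (hypothetical table and column names).
SELECT l.product_id,
       SUM(l.qty)        AS units_sold,
       SUM(l.line_value) AS sales_value
FROM   invoice_header h
JOIN   invoice_line   l
       ON l.invoice_no = h.invoice_no        -- the killer join: no index on either side
WHERE  h.customer_id   = :cust               -- no index on customer either
AND    h.invoice_date >= :start_date         -- date predicates on BOTH tables
AND    h.invoice_date <  :start_date + 84    -- are needed for partition pruning
AND    l.invoice_date >= :start_date         -- (12 weeks = 84 days)
AND    l.invoice_date <  :start_date + 84
GROUP  BY l.product_id;
```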
The two big tables are partitioned and are part of the batch data load, with partition exchange loading and partition rolling going on in the ETL process. They are also the source tables for the summarised sales data. So what options will I be trying (rough SQL sketches of each follow the table):
| Option | Notes |
| --- | --- |
| Create a single denormalised version of the two tables | Slightly more storage required, as we need to duplicate the header data for each line. Massive changes required to the ETL process, the base summary build code and user queries, but it does remove the killer join on invoice number. |
| Leave the existing table structures but add single-column indexes to aid common queries | I hate global indexes on partitioned tables, so the indexes will be locally partitioned. I may need to add some more index maintenance into the batch to avoid any ETL-induced problems. |
| Convert both raw tables to partitioned index organized tables (IOTs) | Although the natural keys are document number and document number plus line number, I'm going to build a composite key with (say, for the header table) customer and store as the leading edge and document number last, and compress away the leading edge to minimise space. This approach leaves the same table structures in place for queries and the summary build, but needs a bit of work around partition exchange as we are now dealing with an 'index' and not a table. |
| Leave the existing table structures but add multi-column indexes to aid common queries | I think I will also be trying some alternative partitioning strategies; document number by hash of customer might be a good choice. |
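For the first option, the denormalised table would look roughly like the sketch below. All names are made up and the real column list would be much longer, but the point is that every header column is repeated on every line row, so the join on invoice number simply disappears.

```sql
-- Option 1 sketch: one denormalised table, header columns duplicated on
-- every line row (illustrative names only).
CREATE TABLE invoice_denorm (
  invoice_date  DATE    NOT NULL,
  invoice_no    NUMBER  NOT NULL,
  customer_id   NUMBER  NOT NULL,   -- header columns ...
  store_id      NUMBER  NOT NULL,   -- ... repeated per line
  line_no       NUMBER  NOT NULL,
  product_id    NUMBER  NOT NULL,
  qty           NUMBER,
  line_value    NUMBER
)
PARTITION BY RANGE (invoice_date)
  (PARTITION p20120101 VALUES LESS THAN (DATE '2012-01-02'));
```

The query becomes a single-table scan of the pruned partitions, but every piece of ETL and summary-build code that knows about two tables has to change.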
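The second option is mostly just DDL, something along these lines (index, partition and staging table names are hypothetical), plus the extra care in the ETL so the exchange table's indexes stay in step with the data.

```sql
-- Option 2 sketch: single-column, locally partitioned indexes.
CREATE INDEX invoice_header_cust_ix ON invoice_header (customer_id) LOCAL;
CREATE INDEX invoice_line_inv_ix    ON invoice_line   (invoice_no)  LOCAL;

-- The partition exchange load then needs matching indexes on the exchange
-- (staging) table so the index segments swap in along with the data.
ALTER TABLE invoice_header
  EXCHANGE PARTITION p20120101 WITH TABLE invoice_header_stage
  INCLUDING INDEXES WITHOUT VALIDATION;
```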
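For the third option, the header table might be recreated along these lines (again, names are illustrative); the line table would get the same treatment with line number added to the end of the key.

```sql
-- Option 3 sketch: the header table as a partitioned IOT with customer and
-- store as the leading edge of the key, document number last, and the
-- leading columns prefix-compressed.
CREATE TABLE invoice_header_iot (
  customer_id   NUMBER NOT NULL,
  store_id      NUMBER NOT NULL,
  invoice_date  DATE   NOT NULL,
  invoice_no    NUMBER NOT NULL,
  CONSTRAINT invoice_header_iot_pk
    PRIMARY KEY (customer_id, store_id, invoice_date, invoice_no)
)
ORGANIZATION INDEX
COMPRESS 2                           -- compress away customer_id and store_id
PARTITION BY RANGE (invoice_date)
  (PARTITION p20120101 VALUES LESS THAN (DATE '2012-01-02'));
```

The catch is the partition exchange: the exchange table now has to be an IOT with an identical primary key, not a plain heap table.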
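One way to try the fourth option is to keep the daily range partitions (so the partition rolling in the ETL carries on working) and add hash sub-partitions on customer, together with a multi-column local index. Treat the sketch below as one possible reading rather than a final design; names are made up.

```sql
-- Option 4 sketch: composite range-hash partitioning (day, then hash of
-- customer) plus a multi-column local index. Hypothetical names.
CREATE TABLE invoice_header_rh (
  invoice_date DATE   NOT NULL,
  invoice_no   NUMBER NOT NULL,
  customer_id  NUMBER NOT NULL,
  store_id     NUMBER NOT NULL
)
PARTITION BY RANGE (invoice_date)
SUBPARTITION BY HASH (customer_id) SUBPARTITIONS 16
  (PARTITION p20120101 VALUES LESS THAN (DATE '2012-01-02'));

CREATE INDEX invoice_header_rh_ix
  ON invoice_header_rh (customer_id, invoice_date, invoice_no) LOCAL;
```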
I'll write again on this when I have some results.