Reasons to review data warehouse health
A few weeks ago I was chatting to someone about 'health checks' for data warehouses and the sort of reason that data warehouse owners might have for commissioning them. We came up with three fundamental reasons:
- doubts about the current system
- fitness for future enhancement
- as a check on the developers currently building a data warehouse.
In my experience the most common reason to have a health check carried out is that something is perceived to be wrong with the current system.
Sometimes the ETL processes just take too long for the batch window and availability gets squeezed, but the usual reason to suspect something is amiss is that the users (that is, the guys who paid for the data warehouse) are complaining that the system is too slow or that the results are wrong.
Systems that were once speedy can become slow with increasing data volumes and numbers of users; what was right-sized five years ago may not be so good now that more people use the system and we have ten years of history rather than five. Likewise, the things that people ask of the system change. We have a customer where once just six users needed access to one 'raw' data table, and then only to run a few simple queries per week against it; now 200 users each roll up six months of sales for individual customers, hitting the table many times every day, and performance suffers because the design compromises that balanced storage against performance were weighted by the old usage pattern.

Older data warehouses are often locked into older design principles. Even if the DBA team keeps on top of patching and database versions, the actual code behind the data warehouse may well be locked into the features that existed when the system was originally developed; old 8i data warehouses, even when upgraded to 11g, may not exploit features such as multi-table inserts, index organised tables and the whizzy partitioning options we have now, because nobody has revisited the code that maintains the DW (see the sketch below).

Sometimes, of course, the data warehouse was never speedy, perhaps because the hardware is physically under-specified for the job in hand or the design wastes the potential of the system.
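To make that point about unexploited features concrete, here is a minimal sketch of a multi-table insert, a feature introduced in 9i that an unrevisited 8i-era load cannot use; the staging and target table names are hypothetical:

```sql
-- One pass over the (hypothetical) staging table replaces the two
-- separate INSERT ... SELECT statements an 8i-era load would need.
INSERT ALL
  WHEN amount >= 0 THEN
    INTO fact_sales   (sale_id,   cust_id, sale_date,   amount)
    VALUES (txn_id, cust_id, txn_date, amount)
  WHEN amount < 0 THEN
    INTO fact_refunds (refund_id, cust_id, refund_date, amount)
    VALUES (txn_id, cust_id, txn_date, amount)
SELECT txn_id, cust_id, txn_date, amount
FROM   stg_transactions;
```

A single scan of the staging data feeding both targets is exactly the sort of win that stays on the table when nobody revisits the load code after an upgrade.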
Wrong results, or more likely untrusted results, are in some ways more of a challenge to investigate: what is the basis of the doubt, is there scope in the process to omit or double-count data, and do our data cleansing routines stand up to scrutiny?
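Reconciliation queries are often the cheapest way to ground or dispel that doubt. For instance, a sketch (table and column names are hypothetical) that looks for source rows loaded into a fact table more than once:

```sql
-- Any natural key appearing more than once in the (hypothetical)
-- fact table suggests the load process can double-count.
SELECT   txn_id,
         COUNT(*) AS copies
FROM     fact_sales
GROUP BY txn_id
HAVING   COUNT(*) > 1;
```

A companion count against the staging table would catch the opposite failure: rows silently omitted from the load.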
Fitness for future enhancement is self-evident. Successful data warehouses have enthusiastic users who want to know more; a fundamental question is whether the 'more' will fit.
To be honest, I rarely get invited to review someone else's data warehouse development whilst it is being built; a sort of "police the developers" role. I get asked to review plans and finished systems, but work in progress seems off limits to many organisations; perhaps it is just too political a topic.