DW Wisdom

Over the years there has been a collective set of received wisdom about Business Intelligence in general and about data warehouses in particular. Some of these ideas are lead by the technology involved:

Use as many small disks as possible – a 1TB disk would be a bad idea for a system that inherently reads large volumes of data, everything would go through a single IO point.
Keep all the OLTP tables separate from the DW systems; OLTP has lots of small, fast transactions, DW has slower, big reads. DW loves bitmap indexes, OLTP hates them.
Use high degrees of parallel processing

Others stem from business needs of the project sponsors:

Consolidate all of your data in one place
Cleanse your data

Only recently has a holistic view a of corporate data warehouse taken hold:

Consistent data definitions across the whole business
Consistent data across the whole business, that is a single view of the truth

But are these truths still valid?

Technology has certainly moved on since I first became involved in data warehouses (and that was sometime in the last century!) With SAN and NAS technology is possible to manage all of the data storage of a company as single unit and technologies such as Oracle RAC means that processing power can be added in by simply bolting another commodity processor to the storage network. Monolithic, multiprocessor, dedicated disk DW servers are now less likely to be specified for new DW developments than the inherently more scaleable RAC systems. A possible exception to this is the Data Warehouse appliance approach where a lot of the predicate processing of the query engine is pushed out to processors attached directly to the individual disk drives; an interesting trend in this type of machine is to use low cost, low power consumption devices (perhaps those developed for the game console industry).

Use small disks: DW systems are IO intensive. Very rarely do we target a single row from a fact table, we tend to read large numbers of rows and aggregate the results, even when we used indexed reads we tend to retrieve many rows, and often (because of the time based batch nature of our dataload process) they are clustered together in adjacent blocks. This means that our IO throughput depends on speed of the disk and the number of IO channels available to us. This is not necessarily an indicator that small is good but, rather, effective data striping across multiple drives (either on the storage array or by using Oracle ASM) is good. Of course, a single disk can only physically read from one location at a time and if too many reads need to access the same disk at the same time there is potential for an IO bottleneck. So, there is some truth in “small is good” but perhaps more truth in enough disks are good.

Next time I’ll write on OLTP versus DW and parallel processing. I may even stray into RAID5 as there is a popular misconception about that too!