My data warehouse is bigger than yours!

Sadly, it seems ingrained into Western culture that having more of something than someone else is "good". This idea that "my dad's tougher / bigger / richer / cleverer (insert your choice here) than yours" has been the stuff of schoolyard taunts down the ages. But is bigness a desirable attribute for a data warehouse? Vendors seem keen to quote the size of the biggest systems running on their hardware, operating systems or databases (perhaps it is the kid in them), but for normal folk like us, does size matter that much?

Size brings problems

Not problems that can't be solved, but the sort that needs thought before design commitments are made. Sadly, these problems are not always helped by vendor marketing hype such as:
  • "You can buy big disks now, one disk per terabyte is a good idea!"
  • "Modern RAID5 SAN hardware is good for 'read only' databases such as data warehouses; the caching can keep up with the occasional disk writes"
Think about what a data warehouse is designed to do: most queries run against a DW return multiple rows as a precursor to some form of aggregation: "tell me all the customers that bought this widget on the 3rd of September", "how much ice cream did I sell in December in Alaska?" These queries require us to read lots of data and then do something with the result set. For a database such as Oracle (and not all vendors are the same here) we need to identify all of the rows that satisfy the query predicate, fetch them from disk and process them; sometimes the most efficient route is to bring back all of the data first and then check it for compliance. Either way we are reading a large amount of data, and if it all sits on a single large disk it all travels along a single IO path, and therein lies the problem. CPU speeds and processing power have increased greatly (I still remember the 4.77MHz IBM PC), but disk IO speeds have not shown the same improvement trend. Too much data going down a single path is going to be a bottleneck.
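
To put some rough numbers on that single-path problem, here is a back-of-envelope sketch in Python. The figures in it (a 1TB scan and a sustained 100MB/s per disk or IO path) are assumptions picked for illustration, not measurements of any particular kit:

  # Back-of-envelope: how long does it take to read 1TB of warehouse data?
  # The 100MB/s sustained read per disk / IO path is an assumed, illustrative
  # figure, not a measurement of any particular system.

  SCAN_BYTES = 1 * 1024**4           # a 1TB full scan
  MB_PER_SEC_PER_PATH = 100          # assumed sustained read rate per IO path

  def scan_hours(num_paths):
      """Hours to pull SCAN_BYTES off disk when spread over num_paths IO paths."""
      bytes_per_sec = num_paths * MB_PER_SEC_PER_PATH * 1024**2
      return SCAN_BYTES / bytes_per_sec / 3600

  for paths in (1, 4, 16):
      print(f"{paths:2d} IO path(s): about {scan_hours(paths):.1f} hours to scan 1TB")

Whatever the exact figures, spreading the same data across many spindles and IO paths is what buys the scan time back, which is exactly why the "one big disk per terabyte" advice falls apart for a warehouse that actually gets queried.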

The other thing that a data warehouse does is process the result set returned from the disk, and that often means the data needs to be sorted. For large result sets the sort cannot always happen in memory, so what happens? We spill the sort to disk. And writing large volumes of data to RAID5 disk is potentially a performance killer: at first things seem quick, because the battery-backed cache absorbs the writes, but once we saturate the cache things slow to the true disk write rate. Personally, I have never seen a SAN with enough cache to cope with a 100GB sort.
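
To see why the cache only postpones the pain, here is another rough sketch; the cache size, the rate the database writes its sort runs, and the rate the array can destage to the RAID5 spindles are all assumed, illustrative numbers:

  # Rough model of a 100GB sort spilling onto a RAID5 SAN.
  # All of the sizes and rates below are assumptions for illustration only.

  SPILL_MB      = 100 * 1024   # sort data that no longer fits in memory
  CACHE_MB      = 8 * 1024     # assumed battery-backed write cache on the array
  INCOMING_MBPS = 400          # assumed rate the database writes sort runs
  DESTAGE_MBPS  = 80           # assumed rate the array empties cache to RAID5
                               # (RAID5 pays a parity read-modify-write penalty)

  # Phase 1: writes land in cache at full speed; the cache fills at the
  # difference between the incoming rate and the destage rate.
  fill_secs = CACHE_MB / (INCOMING_MBPS - DESTAGE_MBPS)

  # Phase 2: once the cache is full, every new write has to wait for a destage,
  # so the spill is throttled to the raw RAID5 write rate.
  remaining_mb = SPILL_MB - INCOMING_MBPS * fill_secs
  slow_secs = remaining_mb / DESTAGE_MBPS

  print(f"cache saturates after roughly {fill_secs:.0f} seconds")
  print(f"the remaining {remaining_mb/1024:.0f}GB crawls out over ~{slow_secs/60:.0f} minutes")

The exact numbers matter less than the shape: the cache masks the RAID5 write penalty for the first few gigabytes, and a big warehouse sort blows straight through it.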

(more to follow later...)