Techniques to reduce IO - partitioning and compression

Last time I wrote a few general notes about the storage of fact and reference data. I will now look at partitioning and compression in slightly more detail.

Partitioning is a way to divide a large table into separately managed chunks, the contents of each chunk being determined by the partitioning key. Although it is possible to create a table with a single partition (and this does have uses!) most partitioned tables will have multiple partitions. With Oracle there are three possible types of partition key: Hash, Range, and List. Other database vendors also offer partitioning (most recently Microsoft, with range partitioning in SQL Server 2005), but the partitioning from most vendors is often restricted to a single type, usually hash.

Range and list partitions are perhaps the simplest to understand: each table has a key column (or columns, in the case of a compound key), and rows are allocated to partitions based on the value of the key. With range partitioning a partition stores rows whose key values are less than the partition's upper bound, and with list partitioning a partition stores rows whose key matches one of the values in the list for that partition. With hash partitioning the column value is first hashed to determine which partition to use; the hashing algorithm works best if the number of partitions in the table is an exact power of two. Composite partitioning is also possible; here the partitions are further divided into sub-partitions. In Oracle 8i sub-partitions were limited to hash, but by 10g we can also use list (but not range) sub-partitions.
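
By way of illustration, here is a minimal sketch of the three forms in Oracle DDL; the table and column names are invented for the example:

```sql
-- Range: each partition holds key values below its upper bound
CREATE TABLE sales_range
( sale_date DATE
, amount    NUMBER )
PARTITION BY RANGE (sale_date)
( PARTITION p2005 VALUES LESS THAN (DATE '2006-01-01')
, PARTITION p2006 VALUES LESS THAN (DATE '2007-01-01') );

-- List: each partition holds an explicit set of key values
CREATE TABLE sales_list
( region VARCHAR2(10)
, amount NUMBER )
PARTITION BY LIST (region)
( PARTITION p_emea VALUES ('UK', 'FR', 'DE')
, PARTITION p_amer VALUES ('US', 'CA') );

-- Hash: the key is hashed to pick a partition; works best with
-- a power-of-two number of partitions
CREATE TABLE sales_hash
( cust_id NUMBER
, amount  NUMBER )
PARTITION BY HASH (cust_id) PARTITIONS 4;
```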

Partitioning schemes first and foremost allow us to improve database manageability by dividing large tables into smaller, more manageable chunks; however there is also the possibility of a performance boost. Consider a query against a partitioned table: if we include the partition key in the predicate the database knows which partitions hold the required data, and the other partitions are not accessed at all. Similarly, if all the data to be removed from a data warehouse is held in a single partition (probably not a hash partition!) then we can simply truncate or drop that partition, which is far, far quicker than deleting from a table based on a predicate.
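
Using the hypothetical sales_range table sketched above, both benefits look something like this:

```sql
-- The predicate on the partition key means the optimiser need
-- only visit the p2006 partition (partition elimination)
SELECT SUM(amount)
FROM   sales_range
WHERE  sale_date >= DATE '2006-01-01'
AND    sale_date <  DATE '2007-01-01';

-- Removing old data is a near-instant metadata operation
-- rather than a row-by-row DELETE
ALTER TABLE sales_range TRUNCATE PARTITION p2005;
-- or, to remove the partition entirely:
ALTER TABLE sales_range DROP PARTITION p2005;
```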

Although hash partitioning can work well in OLTP systems, it is my opinion that it is less useful in a data warehouse. As we are not normally aware of which partition contains which key values, we are unlikely to be able to exploit any of the partition manipulation operations, such as partition exchange or partition truncation, as part of our data load process. We are also unlikely to achieve much partition elimination for queries that involve more than one partition key value; in the real world people often query on related groups of items, such as the cities in a region, and the hashing algorithm would most likely spread that data across many partitions. A further limitation of hash partitioning is that we are unlikely to be able to set the tablespace for a partition to read-only, as we cannot guarantee that newly arrived data will not hash into that partition.
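
To see what hash partitioning gives up here, consider the sort of load that range partitioning makes possible. This is a sketch of partition exchange, again using the hypothetical sales_range table, in which a loaded staging table is swapped into the partitioned table as a simple data dictionary operation:

```sql
-- A plain staging table of identical shape to the target
CREATE TABLE sales_stage AS
SELECT * FROM sales_range WHERE 1 = 0;

-- ... bulk load sales_stage here ...

-- Add an empty partition for the new period and swap the
-- staging table in; no data is physically moved
ALTER TABLE sales_range ADD PARTITION p2007
  VALUES LESS THAN (DATE '2008-01-01');
ALTER TABLE sales_range
  EXCHANGE PARTITION p2007 WITH TABLE sales_stage;
```
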
The concept behind partitioning has been part of Oracle for a long while; Oracle 8i introduced the first incarnation of true partitioned tables, and before that Oracle 7.3 had partition views. A partition view used multiple physical tables of the same design to hold the data, with a UNION ALL view over the set of tables presenting them as a single entity; each table only contained data relevant to its "partition", and this was typically enforced using check constraints. The query optimiser knew about this construction and would only access the tables that contained data relevant to the query. Although the initialisation parameter PARTITION_VIEW_ENABLED=TRUE is no longer supported from Oracle 10gR1, it seems that by default the value is treated as true; that is, partition views still work. However, partition views are in no way as flexible as partitioned tables and should certainly not be considered for new applications.
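
For historical interest, a partition view looked roughly like this (names invented for the illustration):

```sql
-- One physical table per "partition", each with a check
-- constraint describing the data it is allowed to hold
CREATE TABLE sales_2005
( sale_date DATE CHECK (sale_date <  DATE '2006-01-01')
, amount    NUMBER );

CREATE TABLE sales_2006
( sale_date DATE CHECK (sale_date >= DATE '2006-01-01')
, amount    NUMBER );

-- The UNION ALL view presents the tables as one; the optimiser
-- uses the check constraints to skip irrelevant tables
CREATE OR REPLACE VIEW sales AS
SELECT sale_date, amount FROM sales_2005
UNION ALL
SELECT sale_date, amount FROM sales_2006;
```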

Table compression is another good way to reduce IO. The rationale behind this is that we are reading lots of rows from a table, so the more rows we can read in one hit the better, as the overhead of uncompressing the rows is less than the overhead of reading the uncompressed rows from disk. Compression works with both tables and partitions (you can even mix compressed and uncompressed partitions within a single table). Unlike index compression, table compression only works on bulk insert; you can still use non-bulk techniques to insert into a compressed table, but the data inserted will not be compressed, and updating rows also causes those rows to be stored uncompressed. Compression is organised by database block and relies on the detection of repeating data patterns within the block. The repeating patterns are tokenised and stored in a look-up area within the block, and the data itself is replaced with the tokens. Enhanced compression can be obtained by ordering the data being inserted so as to maximise the number of values that can be replaced with tokens.
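
A minimal sketch of this in Oracle, assuming a hypothetical sales_staging source table; the APPEND hint requests a direct-path (bulk) insert, without which the rows would be stored uncompressed:

```sql
-- Create the table compressed (individual partitions can also
-- be created or rebuilt compressed)
CREATE TABLE sales_history
( sale_date DATE
, region    VARCHAR2(10)
, amount    NUMBER )
COMPRESS;

-- Direct-path insert; ordering by a low-cardinality column
-- groups repeating values into the same blocks, so more of
-- them can be replaced with tokens
INSERT /*+ APPEND */ INTO sales_history
SELECT sale_date, region, amount
FROM   sales_staging
ORDER  BY region;

COMMIT;
```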