Comparing Oracle 10g Aggregation Techniques

What with all this work I've been doing the past few weeks with Oracle's BI tools, I've got a real hankering at the moment for some real, honest-to-goodness database work. Something using the command line, perhaps tinkering around with trace files, working on a blog posting where more than one other person in the world knows what I'm on about. I was particularly pleased therefore to make a start towards the end of this week on one of my UKOUG Conference papers, where I'll be comparing the various aggregation technologies in the 10g database.

The thinking behind it is this: In Oracle Database 10g, you can summarise data using regular materialized views, or you can use GROUP BY ... ROLLUP to create materialized views that contain multiple levels of aggregation. In addition, if you're using the OLAP Option, you can create summaries using analytic workspaces, and create views over them using OLAP_TABLE so that they mimic regular star schemas. But which approach is fastest to aggregate initially, and then to keep refreshed with new data? Which one takes up the least space, and which returns data the fastest? What if you could tackle these questions "scientifically", with reproducible scripted test cases and a published methodology, and then open it all up for review? And make it something that the average DBA could understand and work with? That's what I'm trying to achieve.
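To make that a bit more concrete, here's the sort of thing I mean by the first two options - a plain single-level materialized view, and one built with GROUP BY ... ROLLUP so that several levels of aggregation live in the one object. This is just a sketch against the SH sample schema; the MV names and column choices are mine rather than anything from the eventual paper, and I've left the OLAP_TABLE view out for now:

```sql
-- Option 1: a plain materialized view at a single level of aggregation
CREATE MATERIALIZED VIEW sales_by_month_mv
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT t.calendar_month_desc,
       p.prod_category,
       SUM(s.amount_sold) AS amount_sold,
       COUNT(*)           AS row_count
FROM   sh.sales s,
       sh.times t,
       sh.products p
WHERE  s.time_id = t.time_id
AND    s.prod_id = p.prod_id
GROUP  BY t.calendar_month_desc, p.prod_category;

-- Option 2: one MV holding several levels via GROUP BY ... ROLLUP
-- (GROUPING_ID tells the aggregation levels apart)
CREATE MATERIALIZED VIEW sales_rollup_mv
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT t.calendar_year,
       t.calendar_month_desc,
       p.prod_category,
       GROUPING_ID(t.calendar_year,
                   t.calendar_month_desc,
                   p.prod_category) AS gid,
       SUM(s.amount_sold)           AS amount_sold,
       COUNT(*)                     AS row_count
FROM   sh.sales s,
       sh.times t,
       sh.products p
WHERE  s.time_id = t.time_id
AND    s.prod_id = p.prod_id
GROUP  BY ROLLUP(t.calendar_year, t.calendar_month_desc, p.prod_category);
```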

The actual presentation is in the DBA stream, so I'm trying to avoid GUI tools and keep it all to the command line where possible. That means that instead of using OWB or AWM, I'll be using the DBMS_ODM package to create the GROUP BY ... ROLLUP summaries and DBMS_AW_XML to create the analytic workspaces, pointing people to the GUI tools if they want to use them later. One important thing will be separating the data load from the data aggregation, and making sure we use features such as partitioning and compression on the analytic workspace side, along with whatever the equivalent is on the relational side (I'm not so up on that part). At the end of the day though, it's got to stand up to the scrutiny of the DBA crowd at the UKOUG conference, so a key thing will be to get it reviewed by a few people before I finally submit the paper.
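On the relational side, the sort of thing I have in mind for the partitioning and compression comparison is sketched below - the table and MV names, partition boundaries and column list are all just placeholders rather than the final test case:

```sql
-- A partitioned, compressed copy of the fact table (names and partition
-- boundaries are placeholders, purely to illustrate the storage options)
CREATE TABLE sales_fact (
  prod_id     NUMBER       NOT NULL,
  time_id     DATE         NOT NULL,
  cust_id     NUMBER       NOT NULL,
  amount_sold NUMBER(10,2) NOT NULL
)
COMPRESS
PARTITION BY RANGE (time_id) (
  PARTITION sales_2004 VALUES LESS THAN (DATE '2005-01-01'),
  PARTITION sales_2005 VALUES LESS THAN (DATE '2006-01-01'),
  PARTITION sales_max  VALUES LESS THAN (MAXVALUE)
);

-- The relational summaries can use the same storage options, so the
-- comparison with a compressed composite in the analytic workspace
-- stays reasonably like-for-like
CREATE MATERIALIZED VIEW sales_month_summary_mv
  COMPRESS
  PARTITION BY RANGE (time_id) (
    PARTITION sm_2004 VALUES LESS THAN (DATE '2005-01-01'),
    PARTITION sm_2005 VALUES LESS THAN (DATE '2006-01-01'),
    PARTITION sm_max  VALUES LESS THAN (MAXVALUE)
  )
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT time_id,
       prod_id,
       SUM(amount_sold) AS amount_sold,
       COUNT(*)         AS row_count
FROM   sales_fact
GROUP  BY time_id, prod_id;
```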

I'm working on setting up the test cases at the moment, and my first thought is to use the SH sample schema as my sample dataset, but I'll have to work out whether it's big enough or representative enough to be worth working with. In reality, I think I need a sample dataset that's gigabytes in size, to see whether performance on a desktop test differs from a real data warehouse - simulating this could be an issue.
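If SH turns out to be too small, one option is simply to multiply SH.SALES up into a bigger copy - something along these lines, where the table name and the x10 scaling factor are just placeholders:

```sql
-- Empty copy of the SALES structure, plus a copy number so the
-- duplicated rows aren't completely identical
CREATE TABLE sales_big AS
SELECT s.*, 1 AS copy_no
FROM   sh.sales s
WHERE  1 = 0;

-- Direct-path insert of SH.SALES ten times over (scale to taste)
INSERT /*+ APPEND */ INTO sales_big
SELECT s.*, c.copy_no
FROM   sh.sales s,
       (SELECT level AS copy_no FROM dual CONNECT BY level <= 10) c;

COMMIT;

-- Rough check on the size we've ended up with
SELECT COUNT(*) FROM sales_big;
```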

Whatever though, it should be an interesting couple of weeks and I'll try and post updates on the blog as I progress. To be honest, I don't know which technology will come out on top, but it's an interesting topic that'll allow me to combine my knowledge of OLAP with a bit of good old-fashioned database performance work. Should be good.