Dear Uncle Peter...
I have a medium-sized data warehouse, about 4 terabytes, and have noticed that performance is not always consistent. This surprises me as it is running on decent hardware (12 CPU Sun SPARC, 24GB RAM, and industrial-strength SAN storage) and the system stats from our monitoring guys show “$%@! all” resource usage most of the time. We are using Oracle 9.2 and UNIX files for storage – the storage admins said that files would be best for them to configure as raw was a bit difficult.

I spoke to the DBA and he said that we were not using direct IO (whatever that is) but we could change that just by doing an alter system command. He suggested that perhaps the best time to make the change is when we upgrade to 10g in a couple of months’ time – we could then pretend the improvement was caused by the upgrade and not because the setting was wrong. He is a helpful DBA. He also suggested that we should change the datafiles to ASM – he could do this with something called RMAN at the time of the upgrade; he said this would be quick to do as we had ten terabytes of SAN disk spare, so we could use that to speed the migration.

What do you think?

Yours
Bob
No, not a real letter (I don’t get letters), and I am not really an agony uncle.
But what do I think? Well, for one thing I would not be too happy with a DBA who would suggest hiding a change to direct IO in a database upgrade. All change (especially in production or live systems) needs to be documented, tested, and to have a way back to the current state if it does not work. I am also suspicious of his motives in sneaking in the ASM change; it sounds more like collecting another experience for the resume than a thought-through change. I am not too happy with the storage guys either – configuring raw storage is not that uncommon and should be well documented.
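For what it is worth, the direct IO change Bob’s DBA mentions is not quite the casual alter system he makes it sound: as I recall, on 9.2 the relevant parameter, filesystemio_options, is static, so the change has to go to the spfile and only takes effect after the instance is bounced – all the more reason to treat it as a proper, documented change. A rough sketch (assuming an spfile is in use, and that your platform supports the options you ask for):

```sql
-- See what we are running with today
SELECT value FROM v$parameter WHERE name = 'filesystemio_options';

-- Static parameter: write it to the spfile; it takes effect
-- only after the instance is restarted
ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE = SPFILE;
-- 'SETALL' requests both direct and asynchronous IO where the
-- platform supports them; 'DIRECTIO' asks for direct IO alone
```

And, of course, test it somewhere that is not production first, and keep the old spfile value written down so there is a way back.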
Back to the consistency-of-response problem: storage configuration is not the only possible cause of problems like this. Before ‘fixing the problem’ we need to be pretty sure that we have identified the problem. And this means you have to measure things – you have to identify what has inconsistent performance and then zero in on what aspects of that ‘slow thing’ take the most time. You also need to document what else is happening at the same time within the database, within the storage system, and in the wider network. I have seen network traffic make queries appear slow; I have seen automatically allocated degrees of parallelism lead to extremely inconsistent performance as users fight for resource; I have seen IO contention as too many queries try to access the same physical device at the same time. Oh, and I have also seen problems with non-direct IO and double caching – so it could be the problem, but it does not have to be.
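The measuring need not be exotic, either. Even before reaching for Statspack, the 9.2 wait interface gives a first cut at where the time goes; a sketch, assuming you can query the V$ views:

```sql
-- Top wait events since instance startup (times in centiseconds).
-- Snapshot this during a 'good' period and again during a 'bad' one;
-- the difference shows where the extra time is being spent.
SELECT event, total_waits, time_waited
  FROM v$system_event
 WHERE event NOT LIKE 'SQL*Net message from client'  -- largely idle time
 ORDER BY time_waited DESC;
```

If the deltas are dominated by IO events during the slow spells, then the storage and direct IO story becomes worth pursuing; if they are dominated by something else, you have just saved yourself an upgrade-sized disappointment.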