Testing times

As an organisation, my employers firmly believe in working to standards: we are ISO certified, and all of the technical staff in our outsourcing group have a minimum of ITIL Foundation-level certification. Many of our people also hold vendor certifications in the products they use in their day-to-day working lives. (OK, we also make sure we have a good headcount of certified professionals because it enhances our partner status with various vendors, and it impresses potential customers.)
So, we are strong on process. This means that some things take a while to happen. We need to test and document the test; nothing happens on a production system without the electronic equivalent of a piece of paper saying that somebody is happy the change does exactly what it should and that it can be applied to the production system to resolve some issue or other. Or so the theory goes.
Recently we patched a customer’s Oracle 9.2 test database to the most recent patch-set, tested it and found no problems, rolled it out to live, and hit a bug with star transformation. We found a patch on Metalink, retested, and patched. And then we found a problem with our statistics-collection routine taking 15 hours instead of two. This one is going through Oracle’s Service Request mechanism; my guys have a bit of work to do to capture trace data and prepare a test case for support. We can see what is going wrong; we just need the evidence so that the problem can be fixed. For now we have put in a work-around to get enough stats in place for the on-line day.
Neither of these problems was seen on the test system: the first because we did not run that user query; the second did occur, but on a far smaller test system (less than 1 TB) it was not obvious. So do we need to revise the way we test, or accept that sometimes things will get through and have the mechanisms in place to deal with them? I think the latter.