Thoughts on the New DW Features in Oracle Database 11gR2
Well it's been about a week since Oracle Database 11gR2 became available for download on Linux, and we've since had a chance to play around with it and see what's new. The Linux release, in 32-bit and 64-bit versions, is available for download from OTN now, with other Unix releases coming later in the year and Windows probably in the first quarter of 2010.
At Rittman Mead we'd been on the 11gR2 database beta, and kept an eye out for new BI and data warehousing features coming along with this release. As well as the database itself, we were keen to see what's new in the release of OWB that accompanies this version of the database, as it introduces a bunch of new features around integration with SOA, Oracle Data Integrator and Oracle BI Enterprise Edition. I'll get on to my first impressions of OWB in a future posting, but what's in the 11gR2 release of the database that might be of interest?
For me, the big buzz around the beta (which couldn't be revealed due to the NDA) was columnar compression, something that on the face of it did a similar thing to column-store databases such as Vertica (hence my interest in interviewing Mike Stonebraker last year), Sybase IQ and SAND. Column-store databases turn conventional database storage ideas on their head by storing columns of data in database blocks rather than rows, which means you get more relevant data per block when you are retrieving just individual columns of data (as opposed to whole rows), something that's fairly common in DSS-style applications. Now Oracle have always maintained that "one database fits all", with the Oracle database being suited to both transactional and DSS applications, so it's been interesting over the past couple of years hearing vendors such as Vertica (advocating column-store) and Netezza (advocating specialized MPP appliances built on custom hardware) saying the opposite and apparently skimming off some of Oracle's largest data warehouse customers. Of course Oracle addressed this competitive threat in a big way last year with the HP Oracle Database Machine and HP Oracle Exadata Storage Server, which adopted some of the ideas put forward by Netezza (but not all of them, if you listen to Netezza, and I wonder if there'll be an "Exadata 2" announced at next month's Open World which closes some of this gap?). The new column storage (or more correctly, compression) option in Oracle Database 11gR2 seemed to be addressing the threat from Vertica, who interestingly have a number of ex-Oracle Server Tech people in their product team.
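To make the row-store versus column-store point concrete, here's the kind of DSS-style query that benefits from column storage (SALES here is just a hypothetical wide fact table, used purely for illustration):

    -- Only two columns of the fact table are needed to answer the query;
    -- a column-store just reads the blocks holding PROD_ID and AMOUNT_SOLD,
    -- whereas a row-store has to scan every block, dragging all the other,
    -- unused columns of each row along with it
    SELECT prod_id, SUM(amount_sold)
    FROM   sales
    GROUP  BY prod_id;

The wider the table and the fewer the columns a query touches, the bigger the win, which is why the approach suits DSS-style scanning and aggregation rather than OLTP-style single-row access.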
Columnar Compression has now made it into the production release of 11gR2; however, it turns out that it's only available if you use Exadata storage to hold your data files. According to Kevin Closson this is for technical (as opposed to commercial) reasons, and it certainly enhances the appeal of Exadata and gives us another way of improving the performance of database queries when using this new hardware unit. The way it works is that you have two new COMPRESS options. The first, which apparently gives around a 10x compression ratio, stores data in blocks in what are called "compression units", which are organized into individual columns of data and then compressed; the second builds on this approach but compresses the data even further, to around 40x, at the expense of query performance. License-wise, prior to the production release I was under the impression that both of these new compression types required the Advanced Compression option; however, I'm not sure if this is the case now that it's only available with Exadata. It might be a regular feature of the database that's only available if you have this particular bit of hardware; best check with your Oracle rep on this.
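As a rough sketch of what the DDL looks like (assuming the production syntax follows the documented COMPRESS FOR QUERY and COMPRESS FOR ARCHIVE keywords, and remembering this only works on Exadata storage; SALES is again a hypothetical fact table):

    -- Warehouse compression: data organized into compression units,
    -- apparently giving around a 10x compression ratio
    CREATE TABLE sales_hcc
    COMPRESS FOR QUERY HIGH
    AS SELECT * FROM sales;

    -- Archive compression: builds on the same approach but compresses
    -- further, to around 40x, at the expense of query performance
    CREATE TABLE sales_arch
    COMPRESS FOR ARCHIVE HIGH
    AS SELECT * FROM sales;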
So, how does columnar compression compare to the pure column-store approach of Vertica? Well I haven't done the benchmarks yet (and I'm not sure you're allowed to publish them anyway, under the terms of the Oracle license), but you can imagine how each vendor will position their offering as the superior one. Mike Stonebraker would probably argue that tacking hybrid column storage onto a row-based database is just a superficial change, as the engine within the database is still working with rows of data, whilst Oracle would probably argue that their hybrid approach keeps the database manageable, reduces support costs (through only supporting one type of database) and maintains the performance of random row access. The actual loading of data, and the overhead of maintaining a whole new database technology, have always been the weak points of the specialist column-store vendors (though ones like Vertica are still innovating and introducing their own hybrid storage), and it'll be interesting to see whether any announcements around the next generation of Exadata storage at Open World push any more of the query processing load down to the storage units.
The other main new feature in 11gR2 that we were excited about was "in-memory parallel execution". This sounds fairly innocuous (you probably think it's done in memory anyway?) but it's pretty important, as it allows parallel queries to make much better use of all the memory in a grid of what are now generally 64-bit database server nodes. Now you obviously need a fairly hefty clustered database setup to test this, but the general idea seems to be that queries that run in parallel (something that's made much more automatic in the 11gR2 release) will, if the query is suitable, cause fragments of the table to be cached in the memory of individual cluster nodes, so that each node's parallel servers have all the data they need to work on located directly in that node's memory. Previously the data would either be read directly from disk (which always happened prior to 11gR2, and will still happen for very large tables) or read into the buffer cache of one particular database server node (which happened for small tables prior to 11gR2, and will still happen for very small tables now). So the important new development here is the local caching of fragments of the table in the memory of each database server node, again something that's not too dissimilar to the data distribution that you get with vendors such as Vertica and Netezza, but with all of it happening automatically and without the DBA needing to physically separate out the data during the data load. Oracle are positioning this as a complementary technology to Exadata Smart Scans, and again it'll be interesting to see how this new feature pans out in the real world and how Netezza and Vertica react to the announcement (and how much they say it validates their MPP vs SMP approach). I wonder also if, at some point, all this in-memory stuff (TimesTen, in-memory parallel execution etc.) is married up with Oracle's BI tools to address some of the in-memory tools from vendors such as Microsoft, SAP and QlikTech that are enjoying a fair bit of popularity in the BI market at the moment.
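From what I can see in the documentation, the switch that drives all of this automatic behaviour is the new PARALLEL_DEGREE_POLICY parameter; a minimal sketch, assuming you have a suitable RAC cluster to try it on:

    -- Setting PARALLEL_DEGREE_POLICY to AUTO enables automatic degree of
    -- parallelism, parallel statement queuing and, crucially here,
    -- in-memory parallel execution; the database then decides which table
    -- fragments to cache in the buffer caches of the individual nodes
    ALTER SYSTEM SET parallel_degree_policy = AUTO;

Leave it at its default of MANUAL and parallel execution behaves as it did in previous releases.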
There were a number of other new features announced with the 11gR2 release, including "Analytic Functions 2.0", which I believe Pete will be blogging about soon (there's a quick taster at the end of this post); some minor enhancements to the two new partitioning schemes (interval and reference); easier ways to set up RAC (including an option to install databases in single-node RAC mode when initially setting them up); and a new Universal Installer that first made its appearance in the 11g release of Oracle Fusion Middleware. Altogether, if you're an Exadata customer you've got some pretty interesting new storage and querying options opened up by columnar compression, and customers using RAC and parallel query can make use of the new in-memory parallel query feature. Once we set up a suitable RAC environment no doubt we'll be giving this all a thorough run-through over the next few months, and it'll be interesting to see if this is enhanced with any new hardware at Open World in a few weeks' time.
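As promised, a quick taster of the "Analytic Functions 2.0" material ahead of Pete's post; assuming the production release matches what we saw in the beta, it includes functions such as LISTAGG and NTH_VALUE, shown here against the classic SCOTT.EMP demo table:

    -- String aggregation: one comma-separated list of names per department
    SELECT deptno,
           LISTAGG(ename, ', ') WITHIN GROUP (ORDER BY ename) AS employees
    FROM   emp
    GROUP  BY deptno;

    -- NTH_VALUE: the second-highest salary shown alongside each row
    SELECT deptno, ename, sal,
           NTH_VALUE(sal, 2) OVER (PARTITION BY deptno ORDER BY sal DESC
                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                        AND UNBOUNDED FOLLOWING) AS second_sal
    FROM   emp;

I'll leave the details to Pete, but it looks like a useful set of additions for anyone doing reporting work in straight SQL.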