Interview with Mike Stonebraker
And now for something a bit different. Some of you in the Oracle data warehousing world might remember Lilian Hobbs, who used to work in the Server Tech part of Oracle and worked on areas such as materialized views, partitioning and information lifecycle management, and who was co-author of Oracle Database 10g Data Warehousing, published by Digital Press. Well, Lilian left Oracle a year or so ago but we stayed in touch, and then I noticed that she'd ended up working with Vertica, one of the new column-store databases that compete with Oracle in the high-end data warehousing space. What makes Vertica particularly interesting is that Mike Stonebraker, one of the original figures in the relational database world, was involved in the formation of Vertica and Lilian offered to set an interview with him up for the blog.
Now I'm certainly aware that there's quite a few differing opinions out there on what the best architecture for data warehouse databases is, and of course Oracle recently released Exadata and the HP Oracle Database Machine which competes directly in this space. Vertica, and other products like Sybase IQ and SAND address the problem in a different way in that they store data "column-orientated", so I thought it'd be interesting to hear from Mike why he thinks column-store is the ideal solution for DSS-style applications. Feel free to add comments to the posting; given that columns-store and Exadata seem to be a hot-topics I'd be keen to hear any insights into the two technologies.
Anyway, here's the interview, and thanks to Mike and Lilian for setting this up:
[Mark Rittman] Mike, you have a distinguished history in the database market, being involved in the creation of databases such as Ingres and Postgres, and now being involved in a new startup, Vertica. Tell us a bit about how you became involved in databases, how you got involved with Vertica and what your new company is offering?
[Mike Stonebraker] I received a PhD from the University of Michigan in Computer Science in 1971 and joined the faculty at the University of California, Berkeley. As you know an Assistant Professor must "publish or perish". It was clear that my PhD topic was a dead end, and I had to find something else. It was a happy accident that I stumbled into the DBMS area in 1972.
At the time, Codd's pioneering paper had just appears, and there was immediate controversy over whether the relational model could be efficiently implemented. Hence, I started (with Eugene Wong) the Ingres project to demonstrate the feasibility of Codd's ideas.
After building and commercializing Ingres over the next 8 years, I moved on to build Postgres. In fact, Vertica is the 5th DBMS that I have built, first as a research prototype and then commercialized.
At some point I realized that column stores would clobber row stores in the data warehouse marketplace, and set about building a research prototype at M.I.T. (with help from a bunch of other professors and students). The success of this prototype (C-Store) led to Vertica Systems, which I founded (with Andy Palmer)
[Mark Rittman] Vertica to my understanding differs from the Oracle database in that it stores data in a "column-based" form. Tell us a bit more about this, how it differs to the Oracle database and why you think it has certain advantages?
[Mike StoneBraker] Most data warehouses either use a star schema or something close to a star schema. The central "fact" table in such a schema usually has many attributes. 50 attributes is very common and 200 is not unusual. So assume a fact table with 60 attributes.
However, the query from a business analyst invariably only accesses 3 or 4 of these 60 attributes.
A "row store" such as Oracle will read all 60 attributes because the attributes needed are intermixed with all of the others. In contrast, a column store, such as Vertica, will read only the relevant attributes. Note that this will result in a factor of 15-20 less I/O activity.
In addition, a column store has other benefits. Compression is more effective in a column store than a row store, because any storage block only contains one kind of thing. In contrast a storage block in Oracle contains many different kinds of objects. Obviously, it is easier to compress one thing than many things. In head-to-head bakeoffs, Vertica compression is typically a factor of 3 better than Oracle's. This further lowers I/O activity in a column store.
In customer bakeoffs against Oracle, Vertica has yet to win a benchmark by less than a factor of 30, largely because of these two considerations.
[Mark Rittman] Vertica also has a "shared-nothing" architecture. Again, what does this mean, how does it differ from Oracle's Real Application Clusters, and why do you think this is a superior architecture for data warehousing?
[Mike Stonebraker] A "shared nothing" architecture means that each computer system has its own private memory and private disk. This architecture is followed by essentially all high performance, scalable, DBMSs, including Teradata, Netezza, Greenplum, Paraccel, DB2 and Vertica. It is also used by most of the high-end e-commerce platforms, inclusing Amazon, Akamai, Yahoo, Google, and Facebook.
Oracle RAC does not run on a shared nothing system. It was built many years ago to run on a "shared disk" architecture. In this world, a computer system has private memory but shares a disk system with other computer systems. Such a "disk cluster" was popularized in the 1990's by Sun and HP, among others In the 2000's this architecture has been replaced by "grid computing", which uses shared nothing.
Shared disk has well-known scalability problems, when applied to DBMSs. These result from the requirement that each system have its own lock table and buffer pool, which must be synchronized with their peers. This synchronization is painful and has serious performance problems, which limit the scalability of shared disk implementations. In fact, the largest Oracle RAC configuration I am aware if is 6 systems.
In contrast shared nothing DBMSs, such as Vertica, are routinely run on 50-100 nodes.
In summary, shared nothing scales much better than shared disk. hence, it is the superior architecture.
[Mark Rittman] Now that Oracle have released Exadata and the HP Oracle Database Machine, with predicate pushing, return of just the result set from a query and a shared-nothing disk architecture, does this not negate the benefits of Vertica? Why should an Oracle customer invest in another database technology?
[Mike Stonebraker] I have 2 comments concerning this question.
First, Exadata is intended to overcome the scalability limitations of RAC. Hence, it is an appliance with a total of 120 CPUs in 22 systems. Clearly, 22 is better than 6, and Oracle has managed to "push out" the RAC limit noted above. However, the scalability limit is still vastly lower than that available in a shared nothing system, such as Vertica.
Second, Exadata is still Oracle and is still a row store. Vertica on 120 systems will outperform Exadata by the same factor of 30-100. Hence, one can run Vertica on a comparable hardware platform and get 1-2 orders of magnitude better response time. Alternately, one can run Vertica on a low end server and get comparable performance to that offered by Exadata on a huge complex of machines.
[Mark Rittman] You presented a paper at the 2007 VLDB Conference on something called the H-Store project, that proposed running high-end OLTP databases entirely in RAM with SQL being replaced by Ruby or Python. Tell us a bit about this project and what it means to the typical Oracle (or even Vertica) customer?
[Mike Stonebraker] Oracle is selling "one size fits all" -- i.e. a single DBMS code line intended to solve any kind of application. Vertica is proof that one can do vastly better with a specialized architecture oriented to data warehousing. In fact, a different specialized architecture will run circles around Oracle on OLTP. The details would take me sometime to explain, so I would like to refer the interested reader to the H-store literature. If you google "H-store VLDB", you will get to three recent papers that explain the system in detail.
[Mark Rittman] Although Vertica appears to offer several benefits to data warehouse customers, how would you propose incorporating a column-store database into a typical business intelligence & data warehousing architecture that also needs to incorporate for example OLAP, Data Mining and Master Data Management technologies? Does it replace the Oracle database, OLAP and so on?
[Mike Stonebraker] I think any CIO would be crazy to suggest "cold turkey", i.e. replacing one component of a complex system with another by flipping a switch. Instead, I would recommend that he set up a new system, such as Vertica, in parallel with his existing production system. His ETL load system would load both, and he could direct queries to either one. Over time he could shift his load onto the new engine and eventually disconnect the old one. Such a gradual transition from old to new technology is way less ricky than "cold turkey"
[Mark Rittman] Mike, you've been involved in the database industry for many years. How do you see the industry going in the future, particularly with regard to high-end business intelligence and data warehousing customers?
[Mike Stonebraker] Think petabytes -- and eventually exabytes.