Thoughts on Oracle Data Integrator

I've been working fairly solidly with Oracle Data Integrator over the past few weeks, and I think I've used it enough now to form a few opinions on the product. I did some work with a client who wanted to evaluate its ability to handle Changed Data Capture, I've been writing an article on it for OTN, and I've been scoping out the chapter on it for the forthcoming book. All of these tasks have required me to actually deliver something with it, where I couldn't just change the requirement if the tool didn't do something well, so it's been about as close as you can get to using it on a real project. I've also had a chance to discuss how the tool is used with some of the ex-Sunopsis people who now work at Oracle, so hopefully I've got a good idea of how the tool is being positioned going into the future.

Just to recap: Oracle Data Integrator is an ETL tool that Oracle acquired when it took over Sunopsis late last year. Data Integrator is being positioned as the ETL tool to use as part of Oracle Fusion Middleware: it's just as at home loading data into a non-Oracle database as an Oracle one, it can read from and publish to Web Services, and its repository can reside on more or less any database, but it still uses the OWB paradigm of using the target database to do the ETL processing. In a way, it's like OWB but written in a database-agnostic way, which I guess is why Oracle were interested in it in the first place. I wrote about ODI in its previous Sunopsis Data Conductor guise late last year; this link contains an archive of all the relevant blog postings.

If you drew a Venn diagram of the functionality of OWB and the functionality of ODI, you'd get a considerable overlap, with the intersection covering data integration, extraction from all sorts of databases, loading into Oracle databases, data quality and error handling, support for OLAP concepts such as loading slowly changing dimensions, and the creation of processes and workflows.

ODI would have, as its unique capabilities, the ability to load non-Oracle databases, native support for Changed Data Capture on both Oracle and non-Oracle databases, support for real-time ETL processes, built-in support for Web Services, and native support for extracting from non-Oracle databases (i.e. without the need for Oracle Gateways or ODBC). OWB would have, as its unique set of features, data modelling including OLAP data modelling, support for Oracle OLAP, a richer set of data mapping operators, the Data Profiler, transportable modules, and change management/metadata change propagation.

So, one way of positioning ODI would be to say that it's an alternative to OWB when performing the data integration and staging part of your warehouse build, particularly when you need to:

  • Capture new and changed data from a database
  • Source data from non-Oracle databases using native utilities
  • Capture data from, or publish transformed data to, a Web Service
  • Publish real-time ETL processes that respond to data pushed through them

One way of visualizing this is using the following diagram of the Oracle DW technology stack.

As you can see, ODI is in a way a potential replacement for, or supplement to, the Enterprise ETL Option for Warehouse Builder - especially as, in the 10.2.0.1 and 10.2.0.2 releases of OWB, Slowly Changing Dimension support has a number of bugs, and the Change Data Capture and Real-Time mapping features originally in the beta version were pulled prior to the production release. ODI therefore makes a fairly good case for replacing at least some of the more complex and technical aspects of OWB's functionality, at least in the stages prior to loading into the warehouse structures proper, and it's an even more compelling story if you're dealing with non-Oracle sources and targets, or indeed a SOA environment.

So, where ODI fits in with the existing Oracle DW stack is as a complement, or an alternative, to some of the functionality OWB provides up until you get to the ODS and analytic layers of the warehouse. Beyond that, OWB comes into its own: the OLAP data modelling features, support for building Discoverer EULs and BI Beans presentations, support for loading AWs. These are areas ODI leaves well alone, which means that one option going into the future is to use ODI together with OWB on your more complex data warehouse projects, perhaps paying for ODI ($12k per target database CPU, $4k per source database CPU) by foregoing the Enterprise ETL license ($10k per CPU on each database OWB is deployed to). I suspect Oracle would prefer you to license them both, but I'd say that if your data integration needs are more complex than OWB can handle, consider ODI as a way of building the staging layer of the data warehouse, with OWB used to build the layers from the ODS up.

Data Quality is another area that both tools handle well, but differently. As you probably know, OWB now comes with the Data Quality Option which, apart from the well-known Data Profiler, also allows you to define error handling on all your target objects, with error rows neatly moved off into error-handling tables. Although it uses PL/SQL and the Exceptions Into clause, in 10g this sort of code is just as fast - if not faster - than LOG ERRORS, and it also allows you to keep all your error handling in one place, defined as data rules, rather than scattered all over your mappings.
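For comparison, here's what the 10g DML error logging equivalent looks like; the SALES_FACT and STG_SALES tables are just made-up examples:

```sql
-- Create an error-logging table for a hypothetical SALES_FACT target
-- (this creates ERR$_SALES_FACT by default).
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'SALES_FACT');
END;
/

-- Load the target, diverting bad rows to the error table rather than
-- failing the whole statement.
INSERT INTO sales_fact (sale_id, cust_id, amount)
SELECT sale_id, cust_id, amount
FROM   stg_sales
LOG ERRORS INTO err$_sales_fact ('nightly_load')
REJECT LIMIT UNLIMITED;
```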

ODI takes a slightly different approach. It allows you to define "virtual" constraints on tables, which are then enforced by Control Knowledge Modules, with erroneous rows again moved off into error tables. What's good about ODI's approach is that it effectively "firewalls off" erroneous data in the source tables: bad rows are copied into error tables in a separate work-area schema on the source database, and never get as far as being loaded into the target environment.

This has the advantage of reducing the amount of data loading the target database has to carry out, as it doesn't have to load all the data, errors and all, before finding out which rows contain errors. This is particularly useful if your load window is small and you could do without unnecessarily processing rows that contain errors anyway - these are "firewalled off" by the virtual constraint and Control Knowledge Module and never come near the target database. Where OWB comes into its own, though, is the Data Profiler: with ODI, you need to know in advance what data rules (i.e. constraints) to apply, whereas OWB helps you determine them from the structure and semantics of your data.
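To make the "firewall" idea concrete, here's a hand-written sketch of the sort of check a Control Knowledge Module performs - the CUSTOMERS table, the work-area schema and the E$_CUSTOMERS error table are all illustrative names, not ODI's actual generated code:

```sql
-- Move rows violating a declared "virtual" not-null rule into an error
-- table in the work-area schema on the source database.
INSERT INTO work.e$_customers (err_mess, cust_id, cust_name, cust_email)
SELECT 'CUST_EMAIL must not be null', cust_id, cust_name, cust_email
FROM   src.customers
WHERE  cust_email IS NULL;

-- Only the clean rows ever travel on towards the target.
SELECT c.cust_id, c.cust_name, c.cust_email
FROM   src.customers c
WHERE  c.cust_email IS NOT NULL;
```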

Another area I found interesting with ODI was the support for Changed Data Capture. Some of the OWB old hands will know that CDC was originally going to be a feature of OWB10gR2, along with Real-Time mappings, but these were pulled from the product during the beta cycle, presumably so that they could actually get the product out of the door in a reasonable time. ODI, however, has pretty much full support for Changed Data Capture, both for Oracle and non-Oracle databases; for Oracle data sources, you can capture changes either using triggers, or using the Asynchronous HotLog Changed Data Capture process I blogged about last year, with ODI acting as a graphical front-end for the process. I'm currently writing an article on this for OTN so I won't go into the full details, but in short, the process for setting up CDC in this way on an Oracle 10g data source is as follows:

Firstly, enable your data server model (ODI terminology for an Oracle schema) for journalizing, using either the simple trigger method, or the more fully-featured consistent method that allows the capture of sets of tables using the LogMiner feature of Oracle 9i and 10g.
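To give you a feel for what the simple trigger method amounts to, here's a rough sketch of the pattern - a hypothetical ORDERS table, with journal table and trigger names made up for illustration rather than being the code ODI's Journalizing Knowledge Module actually generates:

```sql
-- A journal table recording the primary keys of changed rows, plus a
-- trigger that populates it - the essence of trigger-based journalizing.
CREATE TABLE j$orders (
  jrn_flag  CHAR(1),   -- 'I' for insert/update, 'D' for delete
  jrn_date  DATE,      -- when the change was captured
  order_id  NUMBER     -- primary key of the journalized table
);

CREATE OR REPLACE TRIGGER t$orders
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  IF DELETING THEN
    INSERT INTO j$orders VALUES ('D', SYSDATE, :OLD.order_id);
  ELSE
    INSERT INTO j$orders VALUES ('I', SYSDATE, :NEW.order_id);
  END IF;
END;
/
```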

If you're brave, ODI will even configure the source database to support Asynchronous HotLog Change Data Capture, although this isn't recommended for production systems as it bounces the database halfway through (that'll keep your users/developers on their toes...).

Then you select one or more tables to be part of the CDC set, and use the Designer application to start the journal. Taking a look at the steps ODI takes to create the journal, you can see that it sets up CDC in the standard way that Oracle 9i and 10g support it: it creates subscriber tables and sets up the capture and publish processes - pretty much all the things I did manually in my previous article.
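If you've read that earlier article, these steps will look familiar; as a rough sketch, and with hypothetical change set, schema and table names, the manual publishing side on a 10g source looks something like this:

```sql
-- Roughly what gets set up under the covers for asynchronous HotLog CDC;
-- all the names below are made up for illustration.
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
    change_set_name    => 'ODI_ORDERS_SET',
    description        => 'Change set for the ORDERS journal',
    change_source_name => 'HOTLOG_SOURCE');

  DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
    owner             => 'CDCPUB',
    change_table_name => 'ORDERS_CT',
    change_set_name   => 'ODI_ORDERS_SET',
    source_schema     => 'SRC',
    source_table      => 'ORDERS',
    column_type_list  => 'ORDER_ID NUMBER, ORDER_TOTAL NUMBER',
    capture_values    => 'both',
    rs_id             => 'y',
    row_id            => 'n',
    user_id           => 'n',
    timestamp         => 'n',
    object_id         => 'n',
    source_colmap     => 'n',
    target_colmap     => 'y',
    options_string    => NULL);

  -- Start the capture process for the change set.
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name => 'ODI_ORDERS_SET',
    enable_capture  => 'y');
END;
/
```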

Now I'll be honest here: it took me a pretty long time to get this running properly, but the problems were more to do with the complexity of CDC on Oracle - getting the capture processes set up properly, and clearing out failed setups while I was working out how to do it - than issues with ODI per se, although one annoying aspect of ODI did make things unnecessarily difficult. Basically, when you create your connections through to the source Oracle database, you naturally create them using the credentials of the source Oracle user account. The problem comes when ODI tries to write data to the work-area schema it creates on the source database: it needs the CREATE ANY TABLE, SELECT ANY TABLE, DELETE ANY TABLE etc. privileges, together in fact with the DBA role if you want it to configure the database for you, which in production is a bit of a no-no. In fact, to get it working properly, I had to create the source database connection using the SYSTEM account credentials; the advice from Oracle is to do this initially, then set it back to the credentials of the source schema after setup is complete, but it was a bit of a hassle, and something that prolonged the exercise by about a week, to be honest. So the advice here is, especially if you're using CDC: when you create your Oracle data server, use the SYSTEM account as the login (apparently the password is encrypted when stored in the repository, so it's not the end of the world).
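For reference, grants along these lines were what it was after before I gave up and used SYSTEM; the odi_work user name is made up, and the exact list will vary with what you ask ODI to do:

```sql
-- Privileges the ODI work-area user needed before journalizing would
-- set up cleanly (user name and exact list are illustrative).
GRANT CREATE ANY TABLE TO odi_work;
GRANT SELECT ANY TABLE TO odi_work;
GRANT INSERT ANY TABLE TO odi_work;
GRANT DELETE ANY TABLE TO odi_work;
GRANT DROP ANY TABLE   TO odi_work;

-- Plus, if you want ODI to configure the database for CDC itself:
GRANT DBA TO odi_work;  -- a no-no in production, as mentioned above
```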

Anyway, once you've set up journalizing, actually selecting the new and changed data is a doddle - none of the fiddling around selecting from subscriber views that you get if you set it up yourself. You still need to extend and then purge the subscriber window, which you can do manually (see the sketch below) or incorporate into a process flow, but then you just tick a box to say "work with journalized data" when adding the data store to an interface, and Bob's your uncle.
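If you do want to handle the window manually, the underlying Oracle calls are straightforward; the subscription name here is hypothetical:

```sql
-- Extend the subscriber window before reading the changed rows...
BEGIN
  DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(subscription_name => 'ODI_ORDERS_SUB');
END;
/

-- ...consume the new and changed rows from the subscriber views here...

-- ...then purge the window once the rows have been processed.
BEGIN
  DBMS_CDC_SUBSCRIBE.PURGE_WINDOW(subscription_name => 'ODI_ORDERS_SUB');
END;
/
```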

Beyond that point, you just deal with new and changed data the same way as data from the original table - a neat solution.

The obvious next question, once you've got Changed Data Capture working, is how to integrate the trickle of changes into your target database in real-time. In OWB, assuming you'd got CDC working by setting it up manually outside of OWB, you'd probably design a mapping that read from the subscriber views and ran on a regular basis, say every minute or so, giving you a latency of around a minute, depending on how long the CDC propagation and your mapping took. ODI takes a different approach, in that you can define a process flow that reads from the journalized table, but waits at the start for data to appear in the journal before executing. The process can therefore be set to run, listen for changed data, and then kick off when either a certain number of journal rows appear or a set period of time passes, while the last step in the process flow kicks the whole thing off again.
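ODI packages this "listen, then fire" behaviour up for you, but to illustrate the pattern, here's a hand-rolled PL/SQL version that polls the hypothetical j$orders journal table from earlier (you'd need execute rights on DBMS_LOCK for the sleep):

```sql
-- Poll the journal until enough rows arrive or a timeout passes, then
-- hand over to the processing step - the essence of the ODI approach.
DECLARE
  l_rows     PLS_INTEGER;
  l_waited   PLS_INTEGER := 0;
  c_min_rows CONSTANT PLS_INTEGER := 100;   -- fire after this many changes
  c_timeout  CONSTANT PLS_INTEGER := 300;   -- ...or after five minutes
BEGIN
  LOOP
    SELECT COUNT(*) INTO l_rows FROM j$orders;
    EXIT WHEN l_rows >= c_min_rows OR l_waited >= c_timeout;
    DBMS_LOCK.SLEEP(10);                    -- poll every ten seconds
    l_waited := l_waited + 10;
  END LOOP;
  -- At this point the interface would run against the journaled data,
  -- and the final step of the process flow would re-submit the wait.
END;
/
```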

So, in conclusion, ODI looks like a pretty nifty additional toolset when working with OWB on a data integration project, or when building a data warehouse where the integration and staging element is non-trivial. If you need to deal with non-Oracle sources, you need to capture changed data (possibly in real-time) on Oracle or any other platform, or you're doing data integration in a SOA environment, it's a handy box of tricks that makes the Oracle stack a more capable data integration platform. When Oracle isn't the target platform, it makes things possible that in reality just aren't practical with OWB, and reading the tea leaves, ODI is going to be the integration point for the metadata used by OBI EE and for loading data into Hyperion Essbase. In one sense it could be seen as an alternative to Oracle Gateways or the Enterprise ETL Option for OWB, but in reality, I hope on future projects it's an additional piece of kit that we use alongside OWB, as I think the two tools complement each other once you get to know what each of them does well.