Getting Started with Sunopsis Data Conductor

If you were reading this blog a month or so ago, you may well have seen a posting about Oracle's purchase of Sunopsis. Sunopsis are a French company with offices around the world that specializes in ETL tools; Oracle have bought them because they bring to Oracle the ability to load and transform data on all data platforms, not just the Oracle database, whilst keeping to the current OWB philosophy of using the database as the ETL engine, rather than using a separate ETL hub with its own language, database and so on.

I met up with some of the Sunopsis team in the UK last week to go through their product architecture, and to get hold of an evaluation copy of Sunopsis Data Conductor, the main part of their wider ETL and data movement suite. I'll be covering Data Conductor in the forthcoming Oracle Business Intelligence book I'm currently finalizing, and so I thought it worthwhile putting the product through its paces and seeing how it compares to Oracle Warehouse Builder. Data Conductor comes with an evaluation guide and dataset as part of the download, and so I've worked through this first, and I'll move on to transforming and loading my own data later on.

Taking a quick look at the product architecture first: like Warehouse Builder, Sunopsis Data Conductor uses the target data warehouse, or in some circumstances the source database, to do the data transformation rather than using a separate ETL server. The rationale behind this is that set-based transformations on the target data warehouse are usually faster than row-by-row transformations and inserts performed by an ETL server, and by buying Sunopsis Data Conductor, you preserve and leverage the investment you've made in your database engine rather than paying all over again for a separate ETL box to do the work for you.
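
To make the set-based argument concrete, here's a hypothetical sketch (the table and column names are mine, not from the product or the demo schema) of the kind of statement that runs inside the target database: a single INSERT ... SELECT does the join and the transformation in one pass, where a row-by-row ETL server would fetch each source record, transform it in its own engine and issue a separate insert.

```sql
-- Hypothetical set-based load running inside the target database.
-- One statement joins the sources and derives the measure, rather than
-- processing each order line individually on a separate ETL server.
INSERT INTO trg_sales (customer_id, product_id, sale_date, amount)
SELECT o.customer_id,
       oi.product_id,
       o.order_date,
       oi.quantity * oi.unit_price
FROM   src_orders o
JOIN   src_order_items oi ON oi.order_id = o.order_id;
```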

Like Warehouse Builder, Data Conductor is based around a repository that can be held in any relational database. It stores metadata about the source and target databases, details of the mappings (called "Interfaces" in Data Conductor), the process flows and details of the execution results.

Data Conductor is written in Java, and therefore communicates with source and target databases via JDBC, and other sources (LDAP directories, flat files, ERP systems and so on) via methods such as JMS, JCA, JNDI and JDBC/OS. Various applications, such as the Designer (equivalent to the Design Manager in OWB), the Operator (like the Control Center Manager), the Metadata Navigator (like the OWB Browser), Security Manager and so on talk directly to the repository, and the Sunopsis Agent acts as the scheduler and carries out non-database activity such as calling Web Services, working with SOA components and so on.

So, how do we start a project and load up some data? The evaluation kit comes with a built-in database (Hypersonic SQL), so at this point there's no need to sort out connectivity to Oracle. After getting the demo database up and running, I start up the Designer application and look at the repository that's been created for me.

Like Warehouse Builder, Data Conductor organizes work into projects, and in this case the project is called "Sales Administration". Projects have Interfaces, the equivalent of Warehouse Builder mappings, and Packages, the equivalent of Process Flows. Procedures map to Transformations; Variables and Sequences are fairly self-explanatory; User Functions are bits of functionality written in SQL, Java or whatever; and Knowledge Modules are the implementation methods that actually load data into an object.
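
To give a flavour of a User Function, here's the sort of thing you might register once and then reuse across interfaces; the function name, logic and Oracle-flavoured syntax below are my own example, not something shipped with the demo.

```sql
-- Hypothetical user function (Oracle-style syntax) that buckets an age
-- into a band; once registered, it could be reused in any mapping expression.
CREATE OR REPLACE FUNCTION age_band (p_age IN NUMBER) RETURN VARCHAR2 IS
BEGIN
  RETURN CASE
           WHEN p_age < 30 THEN 'Under 30'
           WHEN p_age < 50 THEN '30 to 49'
           ELSE '50 and over'
         END;
END;
/
```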

Data Conductor takes the approach of splitting a data mapping into the business rules, the "what" of the data mapping, and the flow, the "how" of the mapping. For example, a business analyst might create a data interface that declaratively says that the sales fact table is populated from the orders table, the order items table and data from an XML document, whilst the database developer would pick a flow implementation type that actually uses SQL to get data from the tables, XQuery to get the XML data and a bit of Java to join it all together. These flows are encapsulated into "Knowledge Modules", a bunch of which come with the product and which you can extend yourself if you want to. Knowledge Modules are split into "Loading" modules for getting data from a source onto a target platform, "Implementation" modules to handle situations such as insert/updates, slowly changing dimensions, bulk loads and so on, and some other module types that I haven't worked with yet. The difference between Data Conductor and OWB, though, is that where OWB mixes up the Oracle-specific details of the implementation with the business logic of the mapping, Data Conductor separates them out. It's an interesting approach, and one that we could have done with on a couple of projects before, where we had a team of non-IT business analysts trying to capture business logic in OWB data mappings but not knowing the specifics of how you efficiently load data into an Oracle database, let alone more complicated concepts such as SCDs and match-merge.

Anyway, as well as containing details of mappings, the Designer tool also contains details of the source and target databases. To see this, we switch to the Models tab and take a look at the sources and targets.

In this case, our source data is in an Orders Application and some flat files, whilst the target is a Sales Administration application. After applying a few data rules to the source application, so as to exclude rows that would fail an insert, I then go on to create the data mapping.
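
The sort of data rule I mean here is a simple not-null or referential check on the source; as a hypothetical example (table and column names are my guesses, not lifted from the demo schema), you might flag customer rows pointing at a city that doesn't exist, so they can be diverted rather than failing the insert into the target.

```sql
-- Hypothetical data rule check against the source: find customer rows
-- whose city reference is missing or doesn't resolve, so they can be
-- diverted to an error table instead of breaking the load.
SELECT c.*
FROM   src_customer c
WHERE  c.city_id IS NULL
   OR  NOT EXISTS (SELECT 1 FROM src_city ct WHERE ct.city_id = c.city_id);
```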

As I mentioned earlier, Data Conductor splits a mapping into the interface - the "business rules" of the mapping - and the implementation, the specifics of how you best load data into your target database. This implementation can include loading slowly changing dimensions, reading from a Web Service, writing to a Web Service or bulk-loading a database, and by splitting this from the business rules, you can define your mappings logically and change the actual implementation type if your technology platform changes. Quite neat.

Anyway, the first step is to define the interface. I go back to the Projects tab, and right click on the Interfaces node to create a new interface.

Then, the right-hand part of the screen shows me a tabbed view of the interface components. I name the interface "Pop. TRG_CUSTOMER", and then switch to the Diagram tab to map the data. I drop the TRG_CUSTOMER table onto the Target Datastore panel, and then the SRC_CUSTOMER, SRC_AGE_GROUP and SRC_SALES_PERSON tables onto the Sources panel. I then join the tables together, and add the details of the other mappings that aren't just column-to-column copies; Data Conductor uses the SQL native to the platform you're working with, and you can choose to perform transformations on the source, staging or target databases, depending on what's most appropriate.
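
Behind the diagram, the interface boils down to something like the following query. I'm guessing at the join columns and expressions here (the demo schema's exact column names may differ), but it shows the shape of the business rules the diagram captures.

```sql
-- Approximate SELECT behind the "Pop. TRG_CUSTOMER" interface: the three
-- sources joined, with the non-trivial mappings written as SQL expressions.
-- Column names and join conditions are assumptions for illustration.
SELECT c.cust_id                                   AS customer_id,
       UPPER(c.last_name) || ', ' || c.first_name  AS customer_name,
       ag.age_group_name                           AS age_range,
       sp.first_name || ' ' || sp.last_name        AS sales_person
FROM   src_customer     c
JOIN   src_age_group    ag ON c.age BETWEEN ag.age_min AND ag.age_max
JOIN   src_sales_person sp ON sp.sales_person_id = c.sales_person_id;
```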

Now that I've defined the business rules, I switch over to the Flows tab and specify how the data will actually be loaded. The Flows work with Knowledge Modules, "encapsulated best practices" that tell Data Conductor the most efficient way to load an Oracle database, load a generic database, get data from an XML document and so on. In my case, I'll pick the generic "SQL to SQL" Loading knowledge module to get my data out and across, and then the "SQL Incremental Update" implementation knowledge module to load data into the target database.
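
Conceptually, an incremental update implementation ends up doing something along these lines: update the target rows that already exist, then insert the ones that don't. The real knowledge module works through its own staging and flag tables, and the table and column names below are assumptions on my part, so treat this as a sketch of the idea rather than the generated code.

```sql
-- Sketch of the incremental update pattern: refresh existing target rows
-- from a staging copy of the transformed source data, then insert new rows.
UPDATE trg_customer t
SET    (customer_name, age_range, sales_person) =
       (SELECT s.customer_name, s.age_range, s.sales_person
        FROM   stg_customer s
        WHERE  s.customer_id = t.customer_id)
WHERE  EXISTS (SELECT 1 FROM stg_customer s WHERE s.customer_id = t.customer_id);

INSERT INTO trg_customer (customer_id, customer_name, age_range, sales_person)
SELECT s.customer_id, s.customer_name, s.age_range, s.sales_person
FROM   stg_customer s
WHERE  NOT EXISTS (SELECT 1 FROM trg_customer t
                   WHERE  t.customer_id = s.customer_id);
```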

At this point I could run the interface as it stands, but instead I pull this mapping together with some others into a Package, the equivalent of an OWB process flow. I create the package using the Designer...

and then start stringing interfaces and procedures together to create a process flow.

Again, I can execute the process now by clicking the Execute button, or I can do one further step and create a Scenario from the package, which then creates something I can schedule and run using a third-party scheduling tool. Finally, I go back to the package and run it, and switch over to the Operator application to check the progress of the package. By clicking on the individual interfaces, I can see what's loaded into the target tables, and also check the error tables to see what data needs correcting.
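
Checking the error tables is just a query against the database; as a hypothetical example (the error table and column names here are my assumptions, not necessarily what the product generates), you could summarise what got rejected and why like this.

```sql
-- Hypothetical summary of rows diverted during the load, grouped by the
-- reason they were rejected. Table and column names are illustrative only.
SELECT err_type, err_mess, COUNT(*) AS rows_rejected
FROM   err_trg_customer
GROUP  BY err_type, err_mess
ORDER  BY rows_rejected DESC;
```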

Now there's obviously a lot more to the product than that, but it's a good start and it's interesting to compare it to OWB and other ETL tools. Like OWB, it's Java-based and uses the target database to do the data transformation (or the source database, if more appropriate), but unlike OWB it splits mappings into business logic and technology implementation, which seems fairly logical and certainly caters for the multitude of platforms the product supports. Going into the future, I'd like to hook it up to my Oracle database and a few other platforms, try out some of the other Sunopsis products in the suite, and try to get it working with Web Services. For the time being though, it certainly looks promising, and it'll be interesting to see how it eventually gets integrated into the wider Oracle BI/Fusion Middleware platform.