Simple Steps to Sustainable ETL
Recently Stewart posted his first blog post here on the Rittman Mead Blog; it drew quite a bit of comment about using techniques external to the ETL tool to do things more efficiently, or to do things that could not otherwise be done.
Notwithstanding the claims of some DW appliance vendors that you just "wheel it in and slap your data on (just load, and don't worry where it is) and then you do the analytics", a lot of organisations follow the long-established route of regular batch data loads to a data warehouse. These traditional data warehouses often have long life spans: they run day in, day out for many years, and in my opinion to do that you need sustainable, supportable ETL code.
In all likelihood the code will need to be maintained by developers or application support staff who were not involved in the original development. It is also likely that the ETL code will need to be modified over time, as source systems tend to change more frequently than the target data warehouse.
Sound, sustainable data warehouse design needs to take into account that change is inevitable and must allow for it to occur with minimal impact on the data warehouse as a whole. It must also make it simple for a new developer or support engineer to come on board: a consistent approach to coding and no hidden surprises outside the ETL framework. Below are some of the principles I try to adopt in a data warehouse development.
- Layering. Source systems will change - sometimes the changes are minor, such as when a source system version increments and small changes in interface specifications occur; sometimes they are more major, as when one source application is replaced by another or the data transfer mechanism changes from remote database link to external table, or from batch to change data capture. But by placing the code that extracts source data in its own layer we can often reduce the impact of many of these changes on the data warehouse. Ideally we need only modify a simple map that extracts the source-specific data and one that transforms it into the form used within the data warehouse. Restricting change to just a few maps reduces the risk of breaking other parts and simplifies the testing required to implement the change. Oracle BI Apps uses a similar concept with its source-dependent and source-independent loads. A sketch of this layering appears after this list.
- Single purpose mappings. By this I mean loading a single target or distributing data to a set of related targets. If you ask "what does this mapping do?" and the answer is "x AND y" then that "AND" is a clue that too much is happening. Single purpose does not necessarily mean simple; sometimes complexity cannot be avoided, especially in mappings that handle data validation. Recently I wrote a map to validate membership of a ragged, skip-level hierarchy: it was a chain of splitters, union-alls and joiners, but it was easy to follow and (under OWB) ran as a single set-based SQL operation.
- Don't hide functionality. ETL tools often allow the inclusion of "custom" procedures or functions, but what actually goes on in these "code boxes" is hidden from the ETL tool - is it a simple function that modifies a value, or is it a major piece of ETL code in its own right? I have seen "functions" that update tables in addition to returning a value. Often functions can be replaced with the ETL tool's own expression operator - why plunge into a function when we can just as easily write a case statement or a simple decode? And as mentioned recently on the blog, sometimes an analytic view as the data source is the clearest way to go; "hidden" functions to look up previous or next values are a nightmare to support, whereas a view can expose previous values in descriptively named columns (see the second sketch after this list).
- Keep data moving in sets. ETL is about getting the right thing done both accurately and quickly. Databases are great at manipulating sets of data; serialising data is slow. So avoid operations that need to manipulate one row at a time, or worse still, the same row many times. Often functions that supply defaults or perform data value mappings can be replaced by simple in-line expressions or joins to look-up tables (see the third sketch after this list).
- Don't revisit data. If you have a map to load data into a staging area followed by a map to delete unneeded rows, then perhaps you should filter the unneeded rows out before loading, in a single mapping. Insert followed by delete is wasteful of time and processing, and could prevent useful features such as segment compression from being utilised (see the last sketch below).
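To make the layering point concrete, here is a minimal sketch. All of the object names (ext_orders_v, src_orders, stg_orders, source_db_link) are hypothetical; the idea is simply that only the source-dependent view knows the shape of the source, so a source change is contained in one place.

    -- Source-dependent layer: the only object that knows the source's column
    -- names and transport mechanism. A source change is absorbed here.
    CREATE OR REPLACE VIEW ext_orders_v AS
    SELECT ord_no    AS order_id,        -- rename source columns to warehouse terms
           cust_ref  AS customer_code,
           ord_dt    AS order_date,
           ord_value AS order_amount
    FROM   src_orders@source_db_link;    -- could equally be an external table

    -- Source-independent layer: loads staging from the view and never needs
    -- to know where the data actually came from.
    INSERT INTO stg_orders (order_id, customer_code, order_date, order_amount)
    SELECT order_id, customer_code, order_date, order_amount
    FROM   ext_orders_v;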
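For the analytic-view approach to previous values, a sketch along these lines (the meter reading table and column names are invented for illustration) keeps the logic visible and set-based rather than buried in a lookup function:

    -- Each row carries its previous reading in a descriptively named column,
    -- so the mapping never calls a row-by-row lookup function.
    CREATE OR REPLACE VIEW meter_readings_v AS
    SELECT meter_id,
           reading_date,
           reading_value,
           LAG(reading_value) OVER (PARTITION BY meter_id
                                    ORDER BY reading_date) AS previous_reading_value
    FROM   stg_meter_readings;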
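Similarly, a per-row function that maps codes and supplies a default can usually be replaced by an outer join plus an in-line expression; a sketch, again with made-up names:

    -- Outer join to the look-up table; NVL supplies the default in-line,
    -- so the whole mapping stays as one set-based statement.
    SELECT c.customer_id,
           NVL(l.region_name, 'UNKNOWN') AS region_name
    FROM   stg_customers c
    LEFT OUTER JOIN region_lookup l
           ON l.source_region_code = c.src_region_code;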
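Finally, the load-then-delete pattern can usually be collapsed into a single filtered insert; a sketch with hypothetical names:

    -- Instead of load-then-delete:
    --   INSERT INTO stg_transactions SELECT * FROM ext_transactions_v;
    --   DELETE FROM stg_transactions WHERE txn_status = 'CANCELLED';
    -- ...filter at load time, so each row is touched only once and a
    -- direct-path load into a compressed segment remains possible.
    INSERT /*+ APPEND */ INTO stg_transactions (txn_id, txn_date, txn_amount, txn_status)
    SELECT txn_id, txn_date, txn_amount, txn_status
    FROM   ext_transactions_v
    WHERE  txn_status <> 'CANCELLED';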