Analytic Views as OWB Data Source
Readers of my old blog - much of it now ported here to the Rittman Mead blog - may remember my fondness for things SQL (rather than PL/SQL), and analytic functions in particular. It is not that I can't "do" PL/SQL; it is more that I think there is sometimes a better way. Recently, I came across one such case in an Oracle Warehouse Builder (OWB) project.
OWB is a lovely tool and can produce some really nice SQL, along with visual mapping diagrams that are easy to follow: "take these columns from this table and those columns from another table, join them on these columns, and insert the results into that table". Simple transforms that can be represented by a CASE construct or a SQL function are just as easy to add using the expression operator.
The thing I am not too happy with is the addition of PL/SQL processes to the map, either as a pre-map or post-map operator (yes, I know that is not their purpose, but I have seen them used like that) or as an operator in the mapping. This can hide the purpose of the map and means that you need to jump out of the OWB design tool and look at the source of the process to find out what is going on.

On this project the customer had three data sources for one of the dimensions, each with its own set of attributes and effective dates for those attributes. The need was to construct a single attribute history table (similar to an SCD type 2 table) from the three sources. The existing code used a very complex cursor-based package to "fill in the nulls" on the target table.
I always go back to "what are we trying to achieve?" when looking at refactoring a process - getting hung up on optimising the "as now" can miss so many opportunities for other approaches. In this case I saw an opportunity to replace the "build a table with nulls for missing data, then update it to fill in the blanks" method with an analytic view in the database, used as the OWB source for a simple insert-into-target mapping.
As a simplified example, consider three tables, A, B and C, each with a common key column, some attributes and an effective date for the attribute set: A with columns PK, A1, A2, A3, DATE_EFFECTIVE; B with columns PK, B1, B2, DATE_EFFECTIVE; C with columns PK, C1, C2, DATE_EFFECTIVE. We need to create a table, ATTRIBUTE_HISTORY, with columns PK, A1, A2, A3, B1, B2, C1, C2, DATE_EFFECTIVE, where the attribute columns report the attribute values current at DATE_EFFECTIVE.
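To make the example concrete, the tables might be declared along these lines. This is only a sketch: the column types are assumptions, not taken from the real project, and each table's effective-date column is named DATE_EFFECTIVE to match the view SQL that follows.

```sql
-- Hypothetical DDL for the simplified example; types are illustrative only.
CREATE TABLE a (
  pk             NUMBER       NOT NULL,
  a1             VARCHAR2(30),
  a2             VARCHAR2(30),
  a3             VARCHAR2(30),
  date_effective DATE         NOT NULL
);

CREATE TABLE b (
  pk             NUMBER       NOT NULL,
  b1             VARCHAR2(30),
  b2             VARCHAR2(30),
  date_effective DATE         NOT NULL
);

CREATE TABLE c (
  pk             NUMBER       NOT NULL,
  c1             VARCHAR2(30),
  c2             VARCHAR2(30),
  date_effective DATE         NOT NULL
);

-- The target attribute history table
CREATE TABLE attribute_history (
  pk             NUMBER       NOT NULL,
  a1 VARCHAR2(30), a2 VARCHAR2(30), a3 VARCHAR2(30),
  b1 VARCHAR2(30), b2 VARCHAR2(30),
  c1 VARCHAR2(30), c2 VARCHAR2(30),
  date_effective DATE         NOT NULL
);
```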
This can be achieved using a full outer join between the three tables and the analytic function LAST_VALUE(). LAST_VALUE() is one of those often overlooked analytic functions that seems to have little real purpose until you realise that the IGNORE NULLS option allows you to "copy down" the last non-null value, and the windowing clause allows you to consider only rows "before" the current one. The other SQL goody to use in this view is the COALESCE function: on the three possible date effective columns to return the non-null date_effective, and similarly to return the first non-null value of PK.
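To see the "copy down" effect in isolation, here is a tiny standalone Oracle query (the literal data is made up purely for illustration):

```sql
-- LAST_VALUE ... IGNORE NULLS carries the most recent non-null value forward.
WITH t AS (
  SELECT 1 AS pk, 'x' AS val, DATE '2010-01-01' AS d FROM dual UNION ALL
  SELECT 1,       NULL,       DATE '2010-02-01'      FROM dual UNION ALL
  SELECT 1,       'y',        DATE '2010-03-01'      FROM dual
)
SELECT pk, d,
       LAST_VALUE(val IGNORE NULLS)
         OVER (PARTITION BY pk ORDER BY d) AS val_filled
FROM   t;
-- The middle row's val_filled is 'x': the null is skipped and the
-- previous non-null value is copied down.
```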
So we get:

CREATE OR REPLACE VIEW my_view AS
SELECT pk
,      LAST_VALUE(a1 IGNORE NULLS) OVER (PARTITION BY pk ORDER BY date_effective) AS a1
,      LAST_VALUE(a2 IGNORE NULLS) OVER (PARTITION BY pk ORDER BY date_effective) AS a2
,      LAST_VALUE(a3 IGNORE NULLS) OVER (PARTITION BY pk ORDER BY date_effective) AS a3
,      LAST_VALUE(b1 IGNORE NULLS) OVER (PARTITION BY pk ORDER BY date_effective) AS b1
,      LAST_VALUE(b2 IGNORE NULLS) OVER (PARTITION BY pk ORDER BY date_effective) AS b2
,      LAST_VALUE(c1 IGNORE NULLS) OVER (PARTITION BY pk ORDER BY date_effective) AS c1
,      LAST_VALUE(c2 IGNORE NULLS) OVER (PARTITION BY pk ORDER BY date_effective) AS c2
,      date_effective
FROM   (SELECT COALESCE(a.pk, b.pk, c.pk) AS pk
        ,      a.a1, a.a2, a.a3
        ,      b.b1, b.b2
        ,      c.c1, c.c2
        ,      COALESCE(a.date_effective, b.date_effective, c.date_effective) AS date_effective
        FROM   a
        FULL OUTER JOIN b
          ON a.pk = b.pk AND a.date_effective = b.date_effective
        FULL OUTER JOIN c
          ON c.pk = NVL(a.pk, b.pk) AND c.date_effective = NVL(a.date_effective, b.date_effective)
       );
In the absence of an explicit windowing clause, an analytic function with an ORDER BY defaults to RANGE UNBOUNDED PRECEDING - which is, of course, just what we need here. We just expose this view to OWB and use it as the data source for a straight insert into the target table.
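The SQL behind such a mapping then boils down to something like this (a sketch, not the code OWB actually generates):

```sql
-- Set-based load: a straight insert from the analytic view into the target.
INSERT INTO attribute_history
  (pk, a1, a2, a3, b1, b2, c1, c2, date_effective)
SELECT pk, a1, a2, a3, b1, b2, c1, c2, date_effective
FROM   my_view;
```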
There are changes we would need to make to the view to use it in a real data warehouse - for example, we need to get the current set of attributes from the target table to act as the initial values for the columns, and we need to code default values for truly new values of PK - but these are not too complicated to do.
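As an illustration of the second point, one way to default the attributes of a brand-new PK is to wrap each LAST_VALUE() call in an NVL(). This is only a sketch: the 'N/A' default is an assumption, and the inline view is abbreviated.

```sql
-- Sketch: supply a default where no earlier non-null value exists,
-- i.e. the first appearance of a truly new PK.
SELECT pk
,      NVL(LAST_VALUE(a1 IGNORE NULLS)
             OVER (PARTITION BY pk ORDER BY date_effective), 'N/A') AS a1
-- ... the same NVL() wrapper around the other attribute columns ...
,      date_effective
FROM   (SELECT COALESCE(a.pk, b.pk, c.pk) AS pk
        ,      a.a1 -- remaining columns and joins as in my_view above
        ,      COALESCE(a.date_effective, b.date_effective, c.date_effective) AS date_effective
        FROM   a
        FULL OUTER JOIN b
          ON a.pk = b.pk AND a.date_effective = b.date_effective
       );
```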
For me the test of the new approach is performance, and in this case the set-based operation (even with two full outer joins) was far better than the iterative PL/SQL route it replaced.
Edit to include example query plans
As with any development or test system, the plans I obtain may not be representative of "real world" plans. I only use a few rows in my tests; production examples of this technique would be likely to involve a few thousand rows.
Plan with no indexes
---------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |     9 |  1350 |    34  (12)| 00:00:01 |
|   1 |  WINDOW SORT                |      |     9 |  1350 |    34  (12)| 00:00:01 |
|   2 |   VIEW                      |      |     9 |  1350 |    33  (10)| 00:00:01 |
|   3 |    UNION-ALL                |      |       |       |            |          |
|*  4 |     HASH JOIN OUTER         |      |     7 |  1141 |    17  (12)| 00:00:01 |
|   5 |      VIEW                   |      |     7 |  1036 |    13   (8)| 00:00:01 |
|   6 |       UNION-ALL             |      |       |       |            |          |
|*  7 |        HASH JOIN OUTER      |      |     4 |   136 |     7  (15)| 00:00:01 |
|   8 |         TABLE ACCESS FULL   | A    |     4 |    72 |     3   (0)| 00:00:01 |
|   9 |         TABLE ACCESS FULL   | B    |     4 |    64 |     3   (0)| 00:00:01 |
|* 10 |        HASH JOIN ANTI       |      |     3 |    81 |     7  (15)| 00:00:01 |
|  11 |         TABLE ACCESS FULL   | B    |     4 |    64 |     3   (0)| 00:00:01 |
|  12 |         TABLE ACCESS FULL   | A    |     4 |    44 |     3   (0)| 00:00:01 |
|  13 |      TABLE ACCESS FULL      | C    |     3 |    45 |     3   (0)| 00:00:01 |
|* 14 |     HASH JOIN ANTI          |      |     2 |   118 |    17  (12)| 00:00:01 |
|  15 |      TABLE ACCESS FULL      | C    |     3 |    45 |     3   (0)| 00:00:01 |
|  16 |      VIEW                   |      |     7 |   308 |    13   (8)| 00:00:01 |
|  17 |       UNION-ALL             |      |       |       |            |          |
|* 18 |        HASH JOIN OUTER      |      |     4 |    88 |     7  (15)| 00:00:01 |
|  19 |         TABLE ACCESS FULL   | A    |     4 |    44 |     3   (0)| 00:00:01 |
|  20 |         TABLE ACCESS FULL   | B    |     4 |    44 |     3   (0)| 00:00:01 |
|* 21 |        HASH JOIN ANTI       |      |     3 |    66 |     7  (15)| 00:00:01 |
|  22 |         TABLE ACCESS FULL   | B    |     4 |    44 |     3   (0)| 00:00:01 |
|  23 |         TABLE ACCESS FULL   | A    |     4 |    44 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("C"."DATE_EFFECTIVE"(+)=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE") AND
              "C"."PK"(+)=NVL("A"."PK","B"."PK"))
   7 - access("A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE"(+) AND "A"."PK"="B"."PK"(+))
  10 - access("A"."PK"="B"."PK" AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE")
  14 - access("C"."PK"=NVL("A"."PK","B"."PK") AND
              "C"."DATE_EFFECTIVE"=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE"))
  18 - access("A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE"(+) AND "A"."PK"="B"."PK"(+))
  21 - access("A"."PK"="B"."PK" AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE")
With two-column indexes on the PK and DATE_EFFECTIVE columns of each of the three tables, the plan changes to use nested loops on the indexes.
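The indexes behind this second plan would be created along these lines. The INDEX1 to INDEX3 names match the plan below; which index sits on which table is inferred from the predicate section, so treat the mapping as an assumption.

```sql
-- Two-column indexes on (pk, date_effective) for each source table.
CREATE INDEX index1 ON a (pk, date_effective);
CREATE INDEX index2 ON b (pk, date_effective);
CREATE INDEX index3 ON c (pk, date_effective);
```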
------------------------------------------------------------------------------------------
| Id  | Operation                        | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |        |     9 |  1350 |    19  (16)| 00:00:01 |
|   1 |  WINDOW SORT                     |        |     9 |  1350 |    19  (16)| 00:00:01 |
|   2 |   VIEW                           |        |     9 |  1350 |    18  (12)| 00:00:01 |
|   3 |    UNION-ALL                     |        |       |       |            |          |
|*  4 |     HASH JOIN OUTER              |        |     7 |  1141 |    13   (8)| 00:00:01 |
|   5 |      VIEW                        |        |     7 |  1036 |    10  (10)| 00:00:01 |
|   6 |       UNION-ALL                  |        |       |       |            |          |
|*  7 |        HASH JOIN OUTER           |        |     4 |   136 |     7  (15)| 00:00:01 |
|   8 |         TABLE ACCESS FULL        | A      |     4 |    72 |     3   (0)| 00:00:01 |
|   9 |         TABLE ACCESS FULL        | B      |     4 |    64 |     3   (0)| 00:00:01 |
|  10 |        NESTED LOOPS ANTI         |        |     3 |    81 |     3   (0)| 00:00:01 |
|  11 |         TABLE ACCESS FULL        | B      |     4 |    64 |     3   (0)| 00:00:01 |
|* 12 |         INDEX RANGE SCAN         | INDEX1 |     1 |    11 |     0   (0)| 00:00:01 |
|  13 |      TABLE ACCESS FULL           | C      |     3 |    45 |     3   (0)| 00:00:01 |
|  14 |     MERGE JOIN ANTI              |        |     2 |   118 |     5  (20)| 00:00:01 |
|  15 |      TABLE ACCESS BY INDEX ROWID | C      |     3 |    45 |     2   (0)| 00:00:01 |
|  16 |       INDEX FULL SCAN            | INDEX3 |     3 |       |     1   (0)| 00:00:01 |
|* 17 |      SORT UNIQUE                 |        |     7 |   308 |     3  (34)| 00:00:01 |
|  18 |       VIEW                       |        |     7 |   308 |     2   (0)| 00:00:01 |
|  19 |        UNION-ALL                 |        |       |       |            |          |
|  20 |         NESTED LOOPS OUTER       |        |     4 |    88 |     1   (0)| 00:00:01 |
|  21 |          INDEX FULL SCAN         | INDEX1 |     4 |    44 |     1   (0)| 00:00:01 |
|* 22 |          INDEX RANGE SCAN        | INDEX2 |     1 |    11 |     0   (0)| 00:00:01 |
|  23 |         NESTED LOOPS ANTI        |        |     3 |    66 |     1   (0)| 00:00:01 |
|  24 |          INDEX FULL SCAN         | INDEX2 |     4 |    44 |     1   (0)| 00:00:01 |
|* 25 |          INDEX RANGE SCAN        | INDEX1 |     1 |    11 |     0   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("C"."DATE_EFFECTIVE"(+)=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE") AND
              "C"."PK"(+)=NVL("A"."PK","B"."PK"))
   7 - access("A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE"(+) AND "A"."PK"="B"."PK"(+))
  12 - access("A"."PK"="B"."PK" AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE")
  17 - access("C"."PK"=NVL("A"."PK","B"."PK") AND
              "C"."DATE_EFFECTIVE"=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE"))
       filter("C"."DATE_EFFECTIVE"=NVL("A"."DATE_EFFECTIVE","B"."DATE_EFFECTIVE") AND
              "C"."PK"=NVL("A"."PK","B"."PK"))
  22 - access("A"."PK"="B"."PK"(+) AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE"(+))
  25 - access("A"."PK"="B"."PK" AND "A"."DATE_EFFECTIVE"="B"."DATE_EFFECTIVE")
With any plan you need to consider whether it is the appropriate way to access your data - keep an open mind; full table scans or nested loops are not always bad!