DW Wisdom (3) - departmental business

The first two parts of this discussion covered the physical aspects of conventional DW wisdom. Today I will cover some of the business drivers for data warehouses.

When people found that their transactional systems were unsuited for BI reporting (perhaps because of the performance impact of running BI on a transactional system, or because the transactional system did not hold all of the data required for reporting), they started to look towards dedicated data warehouses.

At first, departmental DW systems were the norm, typically because it was a department that funded them; “Purchasing” paid for a system to report on product profitability and volumes purchased, while “Marketing” was perhaps more interested in promotion performance and customer profitability. DW performance was (or should have been!) good, as all of the required data was in one place. But this led to three distinct disadvantages:

  • Departments often used their own pet data sources. Often these sources held similar, but not identical, information to the sources used in other departments.
  • There was duplication of data, as people required similar fact data for their analysis; for example, stock levels form key metrics for both purchasing and logistics. This had two effects: the duplication (or worse) of batch data load processes, and wasted physical storage.
  • It was inherently difficult to access information outside of the department’s usual subject domain, mainly because early query tools struggled to give adequate performance when querying multiple databases.

Cleansing data: Even within a single departmental DW, dimensional data is not always consistent. Duplicate and near-duplicate data can arrive from multiple source systems (and sometimes even from a single source). This gives us these challenges:
  • Identify true duplicates
  • Identify records that have a high probability of being duplicates
  • Identify records that contain significant errors – that is, records that fail the business rules for loading
The first and last points are relatively easy to handle in a database, and can often be achieved by the use of database integrity rules. The second is harder to manage; after all, with what degree of confidence can we say that two similar addresses apply to the same customer, or that two identical addresses apply to different people (for example, there are four of us living in my house)? The ultimate resolution for all three cases is that only clean (or acceptably clean) data should be applied to a data warehouse; duplicate data must be rejected, or at least sidelined for human intervention. In the case of duplicates and near duplicates, both records must be sidelined, as we cannot tell with any degree of certainty which is the true record. But identifying data that cannot be automatically applied is only part of the story – we must trace the problems back to the source systems and seek resolution there too.
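The classification above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production matcher: the record layout, the 0.85 similarity cut-off, and the single business rule (a record must carry a postcode) are all illustrative assumptions, and `difflib.SequenceMatcher` stands in for whatever fuzzy-matching logic a real ETL tool would use.

```python
# Sketch of the three-way split described above: clean records, records
# sidelined as (near-)duplicates, and records rejected by business rules.
# Threshold, rules, and record shape are illustrative assumptions.
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # assumed cut-off for "high probability" duplicates

def passes_business_rules(record):
    # Example rule only: a customer record must carry a non-empty postcode.
    return bool(record.get("postcode", "").strip())

def similarity(a, b):
    # Proportion of matching characters between the two address strings (0.0-1.0).
    return SequenceMatcher(None, a["address"].lower(), b["address"].lower()).ratio()

def classify(incoming, existing):
    """Return (clean, sidelined, rejected) lists for a batch of records."""
    clean, sidelined, rejected = [], [], []
    for rec in incoming:
        if not passes_business_rules(rec):
            rejected.append(rec)  # fails load rules: trace back to the source system
            continue
        matches = [e for e in existing if similarity(rec, e) >= SIMILARITY_THRESHOLD]
        if matches:
            # Near (or exact) duplicate: sideline BOTH sides for human review,
            # since we cannot tell automatically which is the true record.
            sidelined.append(rec)
            sidelined.extend(m for m in matches if m not in sidelined)
        else:
            clean.append(rec)
    return clean, sidelined, rejected
```

Note that the sidelined list deliberately carries both the incoming record and its existing match, reflecting the point that neither can be trusted until a human resolves them.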

Restricting queries to a single data warehouse had two immediate paybacks:

  • boosted query performance, as joins to disparate remote sources were avoided
  • more security over data sources: authorised people only accessed approved sources

Although the first point is perhaps less important now, as query tools are optimised to work with multiple data sources, the second point is especially important in today's climate of enhanced compliance management.