Sunday, May 8, 2011

Data validations for slowly changing dimension

In data warehousing world, it is advisable to have all dimensions as type II dimension. Overwriting the history removes all trails of how the attributes of the dimensional entity changed over time. I have come across scenarios where the client shoots down the idea of type II dimension citing the possibility of bloated dimension table along-with more complex ETL to maintain type II attributes. How many us haven't faced a scenario where a type I dimension made a bank's (read Wachovia) revenue or advertisement spending roll up into another bank's (read Wells Fargo) account as WFB bought Wachovia? If anybody is still not convinced with the benefits of type II dimension, Kimball's book is a recommended reading.

I don't dispute the fact that dimensional history tracking makes the table grow faster. The design of the dimension plays a pivotal role in correctly identifying the attributes for which history tracking is necessary. Fast changing attributes can be problematic. In most cases I have seen "hybrid" type 2 dimensions with some attributes tracked over time while the other ones are updated in place. This can keep the growth restricted (type II dimension on "Tums"). I can't resist mentioning a type II (slowly changing) dimension that I came across in a large internet company. This dimension was for the keywords that advertisers placed their bids on. Bid price was changing very often - many times within a day. Bid price was considered as an attribute and housed in the main dimension table. The results was a multi-billion record dimension table!!!

Slowly changing dimension makes the ETL process complex. If not tested thoroughly buggy dimensional table can have devastating effect. There are times when I execute couple of checks to validate the sanity of the dimensional data. These checks can be built into the ETL step. It is even better to define them as declarative constraints wherever possible. That way constraints reside closest to data making it impossible to bypass them.

Primary key in the dimensional table: The surrogate key generated by ETL subsystem usually serves as primary key. As long as we are using database's sequence number object, then it is guaranteed that the generated number is unique. The primary key constraint enforced by unique index helps when reporting queries have predicates on different attributes of the dimension.

Uniqueness of active record: If product dimension is maintained as type II dimension then product ID (identifier coming from operational system) may occur may times in the dimension table. All occurrences of records with a product ID needs to be inactive except the last one which serves as a most current record. Oracle supports a declarative way of defining this constraint.

create table product_dim (
   product_sid NUMBER ,
   product_id  NUMBER,
   product_name VARCHAR2(256 CHAR),
   product_package VARCHAR2(64 CHAR) ,
   effective_start_date DATE NOT NULL ,
   effective_end_date DATE NOT NULL,
   current_row_indicator VARCHAR2(1) NOT NULL,
   CONSTRAINT product_dim_pk PRIMARY KEY (product_sid) USING INDEX) ;
  
INSERT INTO product_dim VALUES (1,1,'Coca Cola','Red package',DATE '2000-01-01',DATE '2000-01-23','N') ;

INSERT INTO product_dim VALUES (2,1,'Coca Cola','Blue package',DATE '2000-01-23',DATE '2000-01-25','N') ;

INSERT INTO product_dim VALUES (3,1,'Coca Cola','Green package',DATE '2000-01-25',DATE '2050-01-01','Y') ;

CREATE UNIQUE INDEX product_dim_prod_uk ON product_dim(CASE WHEN current_row_indicator = 'Y' THEN product_id ELSE NULL END) ;


Effective dates of records with same operational key needs to be non-overlapping:

I have come across dimensions which obey the above two constraints but create trouble by having records with overlapping dates. They come into existence because of "buggy" ETL code that maintains the dimensions. It will manifest itself by inflating the metric values when joined with fact table . Following is a simple example:

CREATE TABLE sales_fact (
   sale_day_sid NUMBER ,
   product_sid   NUMBER,
   product_id    NUMBER ,
   sale_amount   NUMBER ,
   CONSTRAINT sales_fact_prod_fk FOREIGN KEY (product_sid) REFERENCES product_dim(product_sid) ,
   CONSTRAINT sales_fact_date_fk FOREIGN KEY (sale_day_sid) REFERENCES day_dim(day_sid)) ;

INSERT INTO sales_fact VALUES (4765,1,1,100) ;
INSERT INTO sales_fact VALUES (4772,1,2,150) ;
INSERT INTO sales_fact VALUES (4776,1,3,200) ;


# Intentionally introduced record in product dimension to have overlapping dates
INSERT INTO product_dim VALUES (4,1,'Coca Cola','Yellow Package',DATE '2000-01-28',DATE '2050-01-01','N') ;



The above insert statement will not trigger the unique constraint violation because the record status is "inactive". Such records get introduced often by incorrect manipulation of end dating the existing record and creating a new one during type II dimension maintenance.

First issue: query against this dimension will return two records for a given product_id and a specific date.

SELECT *
FROM    product_dim
WHERE  DATE '2000-01-29' >= effective_start_date AND
              DATE '2000-01-29' < effective_end_date
AND      product_name = 'Coca Cola'

Second issue: If there is a need to aggregate the total sale amount of the product for all time periods and roll it up against the product description valid on 29th January, then the query will look like the following:

SELECT SUM(sale_amount)
FROM   sales_fact a,
             product_dim b
WHERE  a.product_id = b.product_id
AND      b.effective_start_date <= DATE '2000-01-29' AND
AND      b.effective_end_date > DATE '2000-01-29'
AND      b.product_name = 'Coca Cola'

This will  inflate the total sale amount.

A quick way to check such error will be to introduce a data quality check at the end of ETL

SELECT product_id , effective_start_date,
             effective_end_date, next_effective_start_date, next_effective_end_date
FROM   (SELECT product_id, effective_start_date, effective_end_date,
                           next_effective_start_date, next_effective_end_date,
                           WM_PERIOD(effective_start_date, effective_end_date )  first_period ,
                           WM_PERIOD(next_effective_start_date,next_effective_end_date ) next_period
              FROM   (SELECT product_id,effective_start_date, effective_end_date,
                                         LEAD(effective_start_date) OVER (PARTITION BY product_id
                                                   ORDER BY effective_start_date) next_effective_start_date,
                                         LEAD(effective_end_date) OVER (PARTITION BY product_id
                                                   ORDER BY effective_start_date) next_effective_end_date      
                           FROM   product_dim)
              WHERE next_effective_start_date IS NOT NULL
             )
WHERE WM_OVERLAPS(first_period,next_period) > 0






No comments:

Post a Comment