Tuesday, May 17, 2011

Enabling Oracle's table level constraints in parallel

In a DW environment, fact tables are often inordinately large. They can easily run into several billion rows, and at times hundreds of billions, especially when the fact tables hold the most granular data. Creating foreign key constraints on such fact tables can be quite a challenge.

For the daily data load this issue can be handled easily, assuming the fact table is partitioned on day or another time-based column and the daily processing doesn't need to touch partitions relating to previous days. In that case the incoming daily data is loaded into a non-partitioned table with a structure identical to that of the fact table. We enable the foreign key constraints on that table and then exchange it with the corresponding partition of the main table, carrying the constraints along.
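A minimal sketch of that flow, assuming a day-partitioned click_fact; the staging table, constraint, and partition names are illustrative:

CREATE TABLE click_fact_stage AS SELECT * FROM click_fact WHERE 1 = 0 ;

-- load the incoming day's data into click_fact_stage, then:
ALTER TABLE click_fact_stage ADD CONSTRAINT click_fact_stage_prod_fk
      FOREIGN KEY (product_sid) REFERENCES product_dim (product_sid) ;

ALTER TABLE click_fact EXCHANGE PARTITION p_20110517 WITH TABLE click_fact_stage
      INCLUDING INDEXES WITHOUT VALIDATION ;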

Recently I faced a different problem: I had to drop the existing foreign key constraints from a fact table because the product master table it pointed to had to undergo a significant data patch. Once that operation was over, I needed to recreate the foreign key constraints. The fact table in question had several billion records. A simple statement like the following ran for hours as a single thread and didn't complete.

ALTER TABLE click_fact ADD CONSTRAINT click_fact_prod_fk FOREIGN KEY (product_sid) REFERENCES product_dim(product_sid) ;

After reading through the documentation for Oracle 11.2 and several other articles, here is what I did to exploit Oracle's parallelism:

ALTER TABLE click_fact ADD CONSTRAINT click_fact_product_fk FOREIGN KEY (product_sid) REFERENCES product_dim (product_sid) ENABLE NOVALIDATE; -- enables constraint but does not validate existing data

ALTER SESSION ENABLE PARALLEL DDL;
ALTER TABLE click_fact PARALLEL ;
ALTER TABLE click_fact MODIFY CONSTRAINT click_fact_product_fk VALIDATE ;
ALTER TABLE click_fact NOPARALLEL ;

In TOAD's session browser I saw the following query running in parallel while the constraint was being validated. Using an outer join to validate the constraint is a very prudent approach.

SELECT /*+ all_rows ordered */
       a.rowid,
       :1,
       :2,
       :3
FROM   CLICK_FACT a,
       PRODUCT_DIM b
WHERE  (A.PRODUCT_SID IS NOT NULL)
AND    (B.PRODUCT_SID(+) = A.PRODUCT_SID)
AND    (B.PRODUCT_SID IS NULL)
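Once the validation completes, it is easy to confirm that the constraint ended up both enabled and validated; a quick sanity check against the USER_CONSTRAINTS dictionary view:

SELECT constraint_name, status, validated
FROM   user_constraints
WHERE  table_name = 'CLICK_FACT'
AND    constraint_name = 'CLICK_FACT_PRODUCT_FK' ;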

Sunday, May 8, 2011

Data validations for slowly changing dimension

In the data warehousing world, it is advisable to have all dimensions as type II dimensions. Overwriting the history removes all trace of how the attributes of the dimensional entity changed over time. I have come across scenarios where the client shoots down the idea of a type II dimension, citing the possibility of a bloated dimension table along with the more complex ETL needed to maintain type II attributes. How many of us haven't faced a scenario where a type I dimension made one bank's (read Wachovia) revenue or advertising spend roll up into another bank's (read Wells Fargo) account after WFB bought Wachovia? If anybody is still not convinced of the benefits of type II dimensions, Kimball's book is recommended reading.

I don't dispute the fact that dimensional history tracking makes the table grow faster. The design of the dimension plays a pivotal role in correctly identifying the attributes for which history tracking is necessary. Fast-changing attributes can be problematic. In most cases I have seen "hybrid" type II dimensions with some attributes tracked over time while the others are updated in place. This can keep the growth restricted (a type II dimension on "Tums"). I can't resist mentioning a type II (slowly changing) dimension that I came across in a large internet company. This dimension was for the keywords that advertisers placed their bids on. The bid price changed very often - many times within a day. Bid price was treated as an attribute and housed in the main dimension table. The result was a multi-billion record dimension table!!!

A slowly changing dimension makes the ETL process complex. If not tested thoroughly, a buggy dimension table can have a devastating effect. There are a couple of checks I execute to validate the sanity of the dimensional data. These checks can be built into the ETL step, but it is even better to define them as declarative constraints wherever possible. That way the constraints reside closest to the data, making it impossible to bypass them.

Primary key on the dimension table: The surrogate key generated by the ETL subsystem usually serves as the primary key. As long as we are using the database's sequence object, the generated number is guaranteed to be unique. The primary key constraint, enforced by a unique index, also helps when reporting queries have predicates on different attributes of the dimension.
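On Oracle this typically means a dedicated sequence per dimension; a minimal sketch (the sequence name and cache size are illustrative):

CREATE SEQUENCE product_dim_seq CACHE 1000 ;

-- The ETL step populates the surrogate key with product_dim_seq.NEXTVAL;
-- the primary key constraint on product_sid (defined below) enforces uniqueness.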

Uniqueness of the active record: If the product dimension is maintained as a type II dimension, then the product ID (the identifier coming from the operational system) may occur many times in the dimension table. All occurrences of records with a given product ID need to be inactive except the last one, which serves as the most current record. Oracle supports a declarative way of defining this constraint.

create table product_dim (
   product_sid NUMBER ,
   product_id  NUMBER,
   product_name VARCHAR2(256 CHAR),
   product_package VARCHAR2(64 CHAR) ,
   effective_start_date DATE NOT NULL ,
   effective_end_date DATE NOT NULL,
   current_row_indicator VARCHAR2(1) NOT NULL,
   CONSTRAINT product_dim_pk PRIMARY KEY (product_sid) USING INDEX) ;
  
INSERT INTO product_dim VALUES (1,1,'Coca Cola','Red package',DATE '2000-01-01',DATE '2000-01-23','N') ;

INSERT INTO product_dim VALUES (2,1,'Coca Cola','Blue package',DATE '2000-01-23',DATE '2000-01-25','N') ;

INSERT INTO product_dim VALUES (3,1,'Coca Cola','Green package',DATE '2000-01-25',DATE '2050-01-01','Y') ;

CREATE UNIQUE INDEX product_dim_prod_uk ON product_dim(CASE WHEN current_row_indicator = 'Y' THEN product_id ELSE NULL END) ;
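With this function-based unique index in place, an attempt to introduce a second "current" row for the same product is rejected. An illustrative example (the values are made up; the statement is expected to fail with ORA-00001 against PRODUCT_DIM_PROD_UK):

INSERT INTO product_dim VALUES (99,1,'Coca Cola','Orange package',DATE '2000-02-01',DATE '2050-01-01','Y') ;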


Effective dates of records with the same operational key need to be non-overlapping:

I have come across dimensions that obey the above two constraints but create trouble by having records with overlapping dates. They come into existence because of "buggy" ETL code that maintains the dimension. The problem manifests itself by inflating the metric values when the dimension is joined with the fact table. Following is a simple example:

CREATE TABLE sales_fact (
   sale_day_sid NUMBER ,
   product_sid   NUMBER,
   product_id    NUMBER ,
   sale_amount   NUMBER ,
   CONSTRAINT sales_fact_prod_fk FOREIGN KEY (product_sid) REFERENCES product_dim(product_sid) ,
   CONSTRAINT sales_fact_date_fk FOREIGN KEY (sale_day_sid) REFERENCES day_dim(day_sid)) ;

INSERT INTO sales_fact VALUES (4765,1,1,100) ;
INSERT INTO sales_fact VALUES (4772,1,2,150) ;
INSERT INTO sales_fact VALUES (4776,1,3,200) ;


-- Intentionally introduced record in product dimension to have overlapping dates
INSERT INTO product_dim VALUES (4,1,'Coca Cola','Yellow Package',DATE '2000-01-28',DATE '2050-01-01','N') ;



The above insert statement will not trigger the unique constraint violation because the record's status is "inactive". Such records often get introduced by incorrectly end-dating the existing record while creating the new one during type II dimension maintenance.

First issue: a query against this dimension will return two records for a given product on a specific date.

SELECT *
FROM   product_dim
WHERE  DATE '2000-01-29' >= effective_start_date
AND    DATE '2000-01-29' < effective_end_date
AND    product_name = 'Coca Cola'

Second issue: if there is a need to aggregate the total sale amount of the product over all time periods and roll it up against the product description valid on 29th January, the query will look like the following:

SELECT SUM(sale_amount)
FROM   sales_fact a,
       product_dim b
WHERE  a.product_id = b.product_id
AND    b.effective_start_date <= DATE '2000-01-29'
AND    b.effective_end_date > DATE '2000-01-29'
AND    b.product_name = 'Coca Cola'

This will inflate the total sale amount.

A quick way to catch such errors is to introduce a data quality check at the end of the ETL run:

SELECT product_id, effective_start_date, effective_end_date,
       next_effective_start_date, next_effective_end_date
FROM   (SELECT product_id, effective_start_date, effective_end_date,
               next_effective_start_date, next_effective_end_date,
               WM_PERIOD(effective_start_date, effective_end_date) first_period,
               WM_PERIOD(next_effective_start_date, next_effective_end_date) next_period
        FROM   (SELECT product_id, effective_start_date, effective_end_date,
                       LEAD(effective_start_date) OVER (PARTITION BY product_id
                            ORDER BY effective_start_date) next_effective_start_date,
                       LEAD(effective_end_date) OVER (PARTITION BY product_id
                            ORDER BY effective_start_date) next_effective_end_date
                FROM   product_dim)
        WHERE  next_effective_start_date IS NOT NULL)
WHERE  WM_OVERLAPS(first_period, next_period) > 0
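If Workspace Manager is not available, a similar check can be written with plain analytic functions. A minimal sketch, assuming (as above) that effective_end_date is exclusive; it flags any record whose successor by start date begins before the record ends:

SELECT product_id, effective_start_date, effective_end_date, next_effective_start_date
FROM   (SELECT product_id, effective_start_date, effective_end_date,
               LEAD(effective_start_date) OVER (PARTITION BY product_id
                    ORDER BY effective_start_date) next_effective_start_date
        FROM   product_dim)
WHERE  next_effective_start_date < effective_end_date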