Wednesday, September 14, 2011

Pivoting columns into rows using Pentaho

Another interesting piece of Pentaho Data Integration functionality that I played around with is pivoting delimited fields in a file into rows. Following is a description of the test case.

This transformation requires pivoting a column in the incoming row into multiple rows. Input and output are file based, so no database table structure is presented in this section.

Input Data:

The input file contains two data elements - an email address and the list of pages (properties) visited by the user. The second field is of variable length, and the number of properties in each line (e.g. 3 pages for the first email compared to 1 for the second) is not known beforehand. The email and "properties" fields are separated by a comma (","), while individual elements in the second field are separated by a semicolon (";").

Example:
Input data
a@sbcglobal.net,Auto;Mail;Finance
b@sbcglobal.net,Auto

Intended Transformation:

The transformation generates two output files. The formats of the output files are described below.
The first file contains the columns from the source pivoted into rows.
The output will look like:
Output data
a@sbcglobal.net,Auto
a@sbcglobal.net,Mail
a@sbcglobal.net,Finance
b@sbcglobal.net,Auto

The second file contains the count of unique emails per property, so this output file will look like:
Output Data
Auto,2
Mail,1
Finance,1
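
For reference, the second output amounts to a simple aggregate over the pivoted rows. A sketch, assuming the pivoted rows were loaded into a hypothetical staging table pivoted_properties(email, property_name):

SELECT property_name ,
       COUNT(DISTINCT email) email_count   -- one row per property, e.g. Auto,2
FROM   pivoted_properties                  -- hypothetical table holding the pivoted rows
GROUP BY property_name ;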

Generation of test data:

An Oracle PL/SQL block is used to generate a random data set with email IDs and the pages these users visited.
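
The block reads from an external table named property_visited_ext that exposes the raw input file to SQL. A possible definition is sketched below; the DATA_DIR directory object is taken from the script, while the file name and column sizes are assumptions:

CREATE TABLE property_visited_ext (
   email         VARCHAR2(256) ,
   property_list VARCHAR2(4000) )
ORGANIZATION EXTERNAL (
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY data_dir
   ACCESS PARAMETERS (
      RECORDS DELIMITED BY NEWLINE
      FIELDS TERMINATED BY ','
      MISSING FIELD VALUES ARE NULL )
   LOCATION ('property_visited.txt') )   -- file name is an assumption
REJECT LIMIT UNLIMITED ;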

DECLARE
   l_fp UTL_FILE.file_type := UTL_FILE.fopen('DATA_DIR','output.txt','w') ;
BEGIN
   -- Emit one "email,property" line per element of the semicolon-delimited list
   FOR c1 IN (SELECT email,
                     CASE
                        -- appending ';' guards the single-element case, where INSTR alone would return 0
                        WHEN rn = 1 THEN TRIM(SUBSTR(property_list,1,INSTR(property_list || ';' ,';') - 1))
                        -- pick the rn-th ';'-delimited element out of property_list
                        ELSE TRIM(SUBSTR(property_list,INSTR(property_list || ';' ,';',1,rn - 1) + 1, INSTR(property_list || ';' ,';',1,rn ) - INSTR(property_list || ';' ,';',1,rn - 1 ) - 1 ))
                     END property_name
               FROM  (SELECT a.* , b.column_value rn
                      FROM   (SELECT a.email, a.property_list,
                                     -- number of ';'-delimited elements in property_list
                                     LENGTH(property_list || ';' ) - LENGTH(REPLACE(property_list || ';' ,';','')) no_of_elements
                              FROM   property_visited_ext a) a,
                             -- fan each input row out into one row per list element
                             TABLE(generate_rows(a.no_of_elements)) b
                     )
              )
   LOOP
      UTL_FILE.put_line(l_fp,c1.email || ',' || c1.property_name) ;
   END LOOP ;
   UTL_FILE.fclose(l_fp) ;
END ;
/
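
The script also relies on a generate_rows table function that is not shown in the post. A minimal sketch of it, assuming it simply pipes out the integers 1..n (the NUMBER_NTT collection type matches the one used by the SCD scripts later on this page):

CREATE OR REPLACE TYPE number_ntt AS TABLE OF NUMBER ;
/
-- Returns the integers 1..p_count so each property_list element gets its own row
CREATE OR REPLACE FUNCTION generate_rows (p_count IN NUMBER)
   RETURN number_ntt PIPELINED
IS
BEGIN
   FOR i IN 1 .. NVL(p_count, 0)
   LOOP
      PIPE ROW (i) ;
   END LOOP ;
   RETURN ;
END generate_rows ;
/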

Slowly changing dimension maintenance using Pentaho

After installing the Pentaho enterprise data integration tool on my laptop, I attempted to test a few of Pentaho's features. The out-of-the-box data warehousing transformation for slowly changing dimensions appears to be very impressive. So I started playing around with it and, in due course, generated some random data for testing purposes. During my exploration some shortcomings became evident.
  • The tool CAN support hybrid dimension (type I & II mixed).
  • It will maintain effective start and end date in dimension record.
  • If the dimension happens to have a version number, it can maintain that as well.
  • I wanted to have a current record indicator and didn’t find any way to maintain it automatically.
A current record indicator (allowable values Y/N) has immense value in identifying the latest version of a record. I felt restricted by not being able to maintain it as part of the “dimension lookup/update” step. What this means is several additional steps (updates/conditional branches) to complete such a dimension maintenance ETL step, as sketched below. Additional updates against a big dimension table can turn out to be quite expensive.
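
A minimal sketch of that extra update, assuming the product_dim and changed_products structures defined in the next section and the open-ended end date of 2500-01-01 used there:

-- Hypothetical post-processing step (not a Pentaho feature): mark the
-- open-ended version of each changed operational key as current and all
-- of its other versions as not current.
UPDATE product_dim
SET    current_record_indicator =
          CASE WHEN effective_end_date = DATE '2500-01-01' THEN 'Y' ELSE 'N' END
WHERE  product_id IN (SELECT product_id FROM changed_products)
AND    current_record_indicator !=
          CASE WHEN effective_end_date = DATE '2500-01-01' THEN 'Y' ELSE 'N' END ;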




The following section describes the test case for SCD2

DROP TABLE product_dim PURGE ;

CREATE TABLE product_dim (
   product_sid NUMBER , /* surrogate key */
   product_id NUMBER , /* operational key */
   product_name VARCHAR2(256) , /* type I attribute. In place update of all records with same operational key when changes are detected in this field */
   manufacturer_name VARCHAR2(256) , /* type I attribute. In place update of all records with same operational key when changes are detected in this field */
   product_category VARCHAR2(64) , /* type II attribute. Create new record with product SID generated out of a database sequence (or similar object). End date the last record. Mark current_record_indicator = 'N' for that record */
   effective_start_date DATE,
   effective_end_date DATE , /* active record for an operational key will have effective_end_date = DATE '2500-01-01' */
   record_version NUMBER, /* version number of dimension record */
   current_record_indicator VARCHAR2(1) DEFAULT 'N', /* allowable values - Y/N */
   CONSTRAINT product_dim_pk PRIMARY KEY (product_sid) USING INDEX) ;

DROP TABLE changed_products ; -- daily modifications to product dimension

CREATE TABLE changed_products (
   product_id NUMBER,
   product_name VARCHAR2(256) ,
   manufacturer_name VARCHAR2(256),
   product_category VARCHAR2(64)) ;

Instead of the “changed_products” table, I modified the transformation to read a file, primarily to test the capability of reading a compressed file. An Oracle PL/SQL script is used to generate a random data set:
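
The scripts below use two nested table collection types that are not defined in the post; their assumed definitions (a sketch) would be:

CREATE OR REPLACE TYPE small_varchar_ntt AS TABLE OF VARCHAR2(256) ;
/
CREATE OR REPLACE TYPE number_ntt AS TABLE OF NUMBER ;
/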

DECLARE
   l_manf_names SMALL_VARCHAR_NTT := SMALL_VARCHAR_NTT('Kellogs','General Mills','Tyson','Foster Farm','Johnson','Philips') ;
   l_prod_category SMALL_VARCHAR_NTT := SMALL_VARCHAR_NTT('Baby','Beauty','Food','Health','Drugs') ;
   l_fp UTL_FILE.file_type := UTL_FILE.FOPEN('DATA_DIR','changed_products.txt','w') ;
BEGIN
   FOR i IN 1..10000
   LOOP
      -- DBMS_RANDOM.VALUE(low, high) never returns high, so use COUNT + 1 to reach every element
      UTL_FILE.put_line(l_fp, i || ',' || 'Product - ' || i || ',' ||
                        l_manf_names(TRUNC(DBMS_RANDOM.value(1, l_manf_names.COUNT + 1))) || ',' ||
                        l_prod_category(TRUNC(DBMS_RANDOM.value(1, l_prod_category.COUNT + 1)))) ;
   END LOOP ;
   UTL_FILE.fclose(l_fp) ;
END ;
/
TRUNCATE TABLE product_dim ;


DROP SEQUENCE product_sid_s ;

CREATE SEQUENCE product_sid_s START WITH 1 ;

The number of records in the incoming flat file can easily be controlled by changing the upper limit of the loop index. The first run of the transformation creates new product dimension records. Type I attribute changes require updating all records that share the same operational key with the new value (referred to as “punch through” in Pentaho). The transformation process is repeated after the incoming data file is changed to include type I and type II attribute changes. Data generation for the second run is as follows:

DECLARE
   l_manf_names SMALL_VARCHAR_NTT := SMALL_VARCHAR_NTT('Kellogs','General Mills','Tyson','Foster Farm','Johnson','Philips') ;
   l_prod_category SMALL_VARCHAR_NTT := SMALL_VARCHAR_NTT('Baby','Beauty','Food','Health','Drugs') ;
   l_product_ids NUMBER_NTT := NUMBER_NTT() ;
   l_fp UTL_FILE.file_type := UTL_FILE.FOPEN('DATA_DIR','changed_products.txt','w') ;
BEGIN
   SELECT product_id
   BULK COLLECT INTO l_product_ids
   FROM   (SELECT product_id
           FROM   (SELECT a.* , TRUNC(DBMS_RANDOM.value(1,10000)) rec_no
                   FROM   product_dim a
                   WHERE  current_record_indicator = 'Y')
           ORDER BY rec_no)
   WHERE  rownum <= 500 ;
   FOR c1 IN (SELECT column_value product_id
             FROM   TABLE(l_product_ids))
   LOOP
      -- DBMS_RANDOM.VALUE(low, high) never returns high, so use COUNT + 1 to reach every element
      UTL_FILE.put_line(l_fp, c1.product_id || ',' || 'Product - ' || c1.product_id || '.1' || ',' ||
                        l_manf_names(TRUNC(DBMS_RANDOM.value(1, l_manf_names.COUNT + 1))) || ',' ||
                        l_prod_category(TRUNC(DBMS_RANDOM.value(1, l_prod_category.COUNT + 1)))) ;
   END LOOP ;
   UTL_FILE.fclose(l_fp) ;
END ;
/
In the above script, type I and type II attributes are changed for 500 randomly selected records. Type II changes have the additional overhead of updating the effective end date of the last record with the same operational key and inserting a new record. Here also, by changing the upper limit of the loop index we can make the incoming data set bigger and test the scalability of the tool. Pentaho's ability to read compressed files is very beneficial, especially for large data sets.
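
After each run, a quick sanity check (a sketch) is to confirm that no operational key ended up with more than one current record:

-- Any row returned here indicates a broken dimension load
SELECT product_id , COUNT(*) current_versions
FROM   product_dim
WHERE  current_record_indicator = 'Y'
GROUP BY product_id
HAVING COUNT(*) > 1 ;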

Sunday, July 17, 2011

Comparison of Cathay Pacific and Singapore Airlines

This is the first time we traveled to India from the U.S. West Coast using an airline other than Singapore Airlines. We used Cathay Pacific, whose price per adult ticket came out to be $300 less than SIA's. I thought I would publish a comparison of these two airlines so that others reading this blog might make a prudent decision.

  • Both airlines fly B-747 in San Francisco – Hong Kong route during peak of the summer.
  • Cathay flies out of SFO during the daytime, so we didn’t have to struggle with half-asleep kids. SIA also had a similar option; SIA's afternoon flight results in increased wait time in Singapore.
  • Cathay Pacific's afternoon flight out of SFO seems to be perennially late. For my family it was more than two hours late, reducing their wait time in Hong Kong for the connecting flight to Chennai; at one point it posed a threat of missing that connection. During my own travel it was a little more than an hour late.
  • From Hong Kong, we had a connecting flight to Chennai. The wait time in Hong Kong is close to 4 hours, which gives enough time to come out, stretch legs, go for bathroom breaks, etc. Singapore Airlines rates terribly in this area: they take you from Hong Kong to Singapore, and then there is an 8-10 hour wait depending on which Indian city you are going to. With kids and family, the options are to book the hotel inside the airport or go for a city tour. The hotel is not the cleanest and is not worth its cost. Considering the above, Cathay Pacific got us to India very swiftly.
  • Cathay Pacific doesn’t have a connecting flight to Kolkata, which was our final destination, so we had to take a detour through Chennai. Domestic airlines (like IndiGo) may have more stringent weight/count restrictions for checked-in luggage compared to their international counterparts.
  • Cathay Pacific's in-flight entertainment options are very similar to SIA's, with personal televisions and a large number of movies in South Asian languages.
  • The seats in the San Francisco – Hong Kong flight don’t recline. This is the first time I noticed that on a long-haul flight. Instead, as the recline switch is pressed, the seat cushion (or part of it) moves forward. It is very awkward. The Hong Kong – Chennai flight was devoid of that crookedness.
  • The toilets seem to lack toiletries like combs and disposable shaving sets. I am not sure whether it was only our plane or part of a cost-cutting drive at Cathay.
  • The food appeared to be frugal and less tasty compared to SIA's. We tried the Hindu vegetarian and non-vegetarian variations; I can’t comment on the remaining choices.
  • Snacks between the meals were far apart. Even going to the kitchen and requesting some snacks for the kids hardly evoked any response. SIA beats Cathay Pacific hands down in this area. When we traveled with a very young kid, air hostesses would come and hold the baby (if needed) when mom went for a bathroom break. They kept us going with a steady supply of diapers if we ever ran out. Now that our daughters are a little grown up, they need things to keep them engaged. The SIA cabin crew always had a ready stock of coloring books, pencils and playing cards; they never seemed to run out. Cathay Pacific didn’t have anything, and the kids were on their own when it came to entertaining them and keeping them engaged. They watched movies for a while and then started getting bored.

Wednesday, June 8, 2011

SQL trick

A friend of mine posed a challenging problem today. I will restate the problem using an example.

 create table foo (
    val number ,
    lvl number);

insert into foo values (3,1);
insert into foo values (5,2);
insert into foo values (12,3);
insert into foo values (20,4);

We need to write a single SQL statement to convert the data in foo into the following form:

 (20 - (12 - (5 - 3)))

It seemed to me that the table acted like a stack, with lvl = 4 indicating the top of the stack. The problem is to construct an infix expression with the correct number of parentheses.


Here is what I did:

WITH tmp AS 
   (SELECT REPLACE(WM_CONCAT(val),',','') str
    FROM   (SELECT CASE WHEN lvl > 1 THEN '(' || val || ' - ' ELSE TO_CHAR(val) END val
            FROM   foo
            ORDER BY lvl DESC)
   ),
   max_lvl AS
   (SELECT MAX(lvl) - 1 lvl
    FROM   foo)
SELECT RPAD(a.str,LENGTH(a.str) + b.lvl, ')')
FROM   tmp a,
       max_lvl b

WM_CONCAT is an Oracle-provided (undocumented) function that ships with Oracle Workspace Manager. It combines all records in foo into a single comma-separated string. Before WM_CONCAT is applied, every value is prefixed with a parenthesis and suffixed with a minus (-) sign, except for the level 1 value. Finally the commas are replaced with an empty string and the closing right parentheses are appended at the end. I didn't have to write a single line of procedural code.
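
WM_CONCAT is undocumented, and the ORDER BY inside the inline view does not formally guarantee the aggregation order. On Oracle 11gR2 and later the same result can be produced with the documented, order-aware LISTAGG function; a sketch:

SELECT RPAD(
          -- build the prefix string in stack order, top (highest lvl) first
          LISTAGG(CASE WHEN lvl > 1 THEN '(' || val || ' - ' ELSE TO_CHAR(val) END)
             WITHIN GROUP (ORDER BY lvl DESC),
          -- pad with one closing parenthesis per level above the first
          SUM(LENGTH(CASE WHEN lvl > 1 THEN '(' || val || ' - ' ELSE TO_CHAR(val) END)) + MAX(lvl) - 1,
          ')') infix_expr
FROM   foo ;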

Tuesday, May 17, 2011

Enabling Oracle's table level constraints in parallel

In a DW environment, fact tables are often inordinately large. They can easily be several billion rows, and at times hundreds of billions, especially when the fact tables contain the most granular data. Creating foreign key constraints on such fact tables can be quite a challenge.

For a daily data load this issue can be handled easily, assuming the fact table is partitioned on day or another time-based column and the daily data processing doesn't need to touch partitions relating to previous days. In that case the incoming daily data is loaded into a non-partitioned table with a structure identical to that of the fact table. We enable the foreign key constraints on that table and then exchange the partition with the main table, including the constraints.

Recently I faced a different problem where I had to drop the existing foreign key constraint from a fact table because the product master table it pointed to had to undergo a significant data patch. After that operation was over, I needed to recreate the foreign key constraint. The fact table in question had several billion records. A simple statement like the following ran as a single thread for hours and didn't complete.

ALTER TABLE click_fact ADD CONSTRAINT click_fact_prod_fk FOREIGN KEY (product_sid) REFERENCES product_dim(product_sid) ;

After reading through the documentation for Oracle 11.2 and several other articles, here is what I did to exploit Oracle's parallelism:

ALTER TABLE click_fact ADD CONSTRAINT click_fact_product_fk FOREIGN KEY (product_sid) REFERENCES product_dim (product_sid) ENABLE NOVALIDATE; -- enables constraint but does not validate existing data

ALTER SESSION ENABLE PARALLEL DDL;
ALTER TABLE click_fact PARALLEL ;
ALTER TABLE click_fact MODIFY CONSTRAINT click_fact_product_fk VALIDATE ;
ALTER TABLE click_fact NOPARALLEL ;
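
As a quick sanity check afterwards, the data dictionary shows whether the constraint created above ended up both enabled and validated; a sketch:

SELECT constraint_name , status , validated
FROM   user_constraints
WHERE  table_name = 'CLICK_FACT'
AND    constraint_name = 'CLICK_FACT_PRODUCT_FK' ;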

In TOAD's session browser I saw the following query running in parallel while the constraint was being validated. Using an outer join to validate the constraint is a very prudent approach.

SELECT /*+ all_rows ordered */
       a.rowid, :1, :2, :3
FROM   CLICK_FACT a,
       PRODUCT_DIM b
WHERE  (A.PRODUCT_SID IS NOT NULL)
AND    (B.PRODUCT_SID(+) = A.PRODUCT_SID)
AND    (B.PRODUCT_SID IS NULL)

Sunday, May 8, 2011

Data validations for slowly changing dimension

In the data warehousing world, it is advisable to have all dimensions as type II dimensions. Overwriting the history removes all trails of how the attributes of the dimensional entity changed over time. I have come across scenarios where the client shoots down the idea of a type II dimension, citing the possibility of a bloated dimension table along with more complex ETL to maintain type II attributes. How many of us haven't faced a scenario where a type I dimension made a bank's (read Wachovia) revenue or advertisement spending roll up into another bank's (read Wells Fargo) account when WFB bought Wachovia? If anybody is still not convinced of the benefits of type II dimensions, Kimball's book is recommended reading.

I don't dispute the fact that dimensional history tracking makes the table grow faster. The design of the dimension plays a pivotal role in correctly identifying the attributes for which history tracking is necessary. Fast-changing attributes can be problematic. In most cases I have seen "hybrid" type II dimensions, with some attributes tracked over time while the others are updated in place. This can keep the growth restricted (a type II dimension on "Tums"). I can't resist mentioning a type II (slowly changing) dimension that I came across at a large internet company. This dimension was for the keywords that advertisers placed their bids on. The bid price was changing very often - many times within a day. The bid price was treated as an attribute and housed in the main dimension table. The result was a multi-billion record dimension table!

Slowly changing dimensions make the ETL process complex. If not tested thoroughly, a buggy dimension table can have a devastating effect. There are a couple of checks I execute to validate the sanity of the dimensional data. These checks can be built into the ETL steps. It is even better to define them as declarative constraints wherever possible; that way the constraints reside closest to the data, making it impossible to bypass them.

Primary key on the dimension table: The surrogate key generated by the ETL subsystem usually serves as the primary key. As long as we are using the database's sequence object, the generated number is guaranteed to be unique. The primary key constraint, enforced by a unique index, also helps when reporting queries have predicates on different attributes of the dimension.

Uniqueness of the active record: If the product dimension is maintained as a type II dimension, then a product ID (the identifier coming from the operational system) may occur many times in the dimension table. All occurrences of records with a given product ID need to be inactive except the last one, which serves as the most current record. Oracle supports a declarative way of defining this constraint.

create table product_dim (
   product_sid NUMBER ,
   product_id  NUMBER,
   product_name VARCHAR2(256 CHAR),
   product_package VARCHAR2(64 CHAR) ,
   effective_start_date DATE NOT NULL ,
   effective_end_date DATE NOT NULL,
   current_row_indicator VARCHAR2(1) NOT NULL,
   CONSTRAINT product_dim_pk PRIMARY KEY (product_sid) USING INDEX) ;
  
INSERT INTO product_dim VALUES (1,1,'Coca Cola','Red package',DATE '2000-01-01',DATE '2000-01-23','N') ;

INSERT INTO product_dim VALUES (2,1,'Coca Cola','Blue package',DATE '2000-01-23',DATE '2000-01-25','N') ;

INSERT INTO product_dim VALUES (3,1,'Coca Cola','Green package',DATE '2000-01-25',DATE '2050-01-01','Y') ;

CREATE UNIQUE INDEX product_dim_prod_uk ON product_dim(CASE WHEN current_row_indicator = 'Y' THEN product_id ELSE NULL END) ;
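
With the function-based unique index in place, an attempt to insert a second current ('Y') row for the same product_id now fails with ORA-00001. For example:

-- Fails: product_id 1 already has a row with current_row_indicator = 'Y'
INSERT INTO product_dim VALUES (99,1,'Coca Cola','Yellow package',DATE '2000-01-25',DATE '2050-01-01','Y') ;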


Effective dates of records with the same operational key need to be non-overlapping:

I have come across dimensions that obey the above two constraints but create trouble by having records with overlapping dates. They come into existence because of "buggy" ETL code that maintains the dimensions. The problem manifests itself by inflating metric values when the dimension is joined with the fact table. Following is a simple example:

CREATE TABLE sales_fact (
   sale_day_sid NUMBER ,
   product_sid   NUMBER,
   product_id    NUMBER ,
   sale_amount   NUMBER ,
   CONSTRAINT sales_fact_prod_fk FOREIGN KEY (product_sid) REFERENCES product_dim(product_sid) ,
   CONSTRAINT sales_fact_date_fk FOREIGN KEY (sale_day_sid) REFERENCES day_dim(day_sid)) ;

INSERT INTO sales_fact VALUES (4765,1,1,100) ;
INSERT INTO sales_fact VALUES (4772,1,2,150) ;
INSERT INTO sales_fact VALUES (4776,1,3,200) ;


-- Intentionally introduced record in product dimension to have overlapping dates
INSERT INTO product_dim VALUES (4,1,'Coca Cola','Yellow Package',DATE '2000-01-28',DATE '2050-01-01','N') ;



The above insert statement will not trigger a unique constraint violation because the record status is "inactive". Such records often get introduced by incorrectly end-dating the existing record and creating the new one during type II dimension maintenance.

First issue: a query against this dimension will return two records for a given product and a specific date.

SELECT *
FROM   product_dim
WHERE  DATE '2000-01-29' >= effective_start_date
AND    DATE '2000-01-29' < effective_end_date
AND    product_name = 'Coca Cola'

Second issue: if there is a need to aggregate the total sale amount of the product for all time periods and roll it up against the product description valid on January 29th, then the query will look like the following:

SELECT SUM(sale_amount)
FROM   sales_fact a,
       product_dim b
WHERE  a.product_id = b.product_id
AND    b.effective_start_date <= DATE '2000-01-29'
AND    b.effective_end_date > DATE '2000-01-29'
AND    b.product_name = 'Coca Cola'

This will inflate the total sale amount.

A quick way to catch such errors is to introduce a data quality check at the end of the ETL:

SELECT product_id, effective_start_date, effective_end_date,
       next_effective_start_date, next_effective_end_date
FROM   (SELECT product_id, effective_start_date, effective_end_date,
               next_effective_start_date, next_effective_end_date,
               WM_PERIOD(effective_start_date, effective_end_date) first_period,
               WM_PERIOD(next_effective_start_date, next_effective_end_date) next_period
        FROM   (SELECT product_id, effective_start_date, effective_end_date,
                       LEAD(effective_start_date) OVER (PARTITION BY product_id
                            ORDER BY effective_start_date) next_effective_start_date,
                       LEAD(effective_end_date) OVER (PARTITION BY product_id
                            ORDER BY effective_start_date) next_effective_end_date
                FROM   product_dim)
        WHERE  next_effective_start_date IS NOT NULL)
WHERE  WM_OVERLAPS(first_period, next_period) > 0
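
The same check can be expressed without the Workspace Manager types, which may be handy if the WMSYS objects are not available; a sketch:

-- An overlap exists whenever the next version of the same operational key
-- starts before the current version ends
SELECT product_id , effective_start_date , effective_end_date ,
       next_effective_start_date
FROM   (SELECT product_id , effective_start_date , effective_end_date ,
               LEAD(effective_start_date) OVER (PARTITION BY product_id
                    ORDER BY effective_start_date) next_effective_start_date
        FROM   product_dim)
WHERE  next_effective_start_date < effective_end_date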






Thursday, April 21, 2011

Cognos10 installation for Windows

I was struggling with the Cognos BI server installation on my laptop, which is running Windows 7. Later I faced a similar challenge on a Windows Server 2008 machine. In both cases I was looking for a default, out-of-the-box installation.

The place where I got stuck was configuring the web server; the IBM Cognos documentation is barely adequate. So I thought I would post the installation steps for everybody's benefit. I installed the Apache web server. The key is to edit the httpd.conf file to set up the virtual directory mappings.

Include conf/extra/cognos10.conf # added this line in httpd.conf

Contents of included file cognos10.conf:

ScriptAlias /ibmcognos/cgi-bin "C:/Program Files (x86)/ibm/cognos/c10/cgi-bin"
<Directory "C:/Program Files (x86)/ibm/cognos/c10/cgi-bin">
    AllowOverride None
    Options None
    Order Allow,Deny
    Allow from All
</Directory>
Alias /ibmcognos "C:/Program Files (x86)/ibm/cognos/c10/webcontent"
<Directory "C:/Program Files (x86)/ibm/cognos/c10/webcontent">
   Options None
   AllowOverride None
   Order Allow,Deny
   Allow from All
</Directory>