- The tool CAN support hybrid dimensions (type I & type II mixed).
- It maintains the effective start and end dates in the dimension record.
- If the dimension happens to have a version number, it can maintain that as well.
- I wanted to have a current record indicator and didn’t find any way to maintain it automatically (a possible post-load workaround is sketched below).
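Since the step does not appear to maintain this flag, one possible workaround (my own, not a Pentaho feature) is a post-load update that derives it from the far-future effective end date used for active records. A minimal sketch against the product_dim table defined in the test case below:
-- run after each load of product_dim; assumes active records carry effective_end_date = DATE '2500-01-01'
UPDATE product_dim
   SET current_record_indicator =
       CASE WHEN effective_end_date = DATE '2500-01-01' THEN 'Y' ELSE 'N' END ;
COMMIT ;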
The following section describes the test case for SCD2.
DROP TABLE product_dim PURGE ;
CREATE TABLE product_dim (
product_sid NUMBER , /* surrogate key */
product_id NUMBER , /* operational key */
product_name VARCHAR2(256) , /* type I attribute. In place update of all records with same operational key when changes are detected in this field */
manufacturer_name VARCHAR2(256) , /* type I attribute. In place update of all records with same operational key when changes are detected in this field */
product_category VARCHAR2(64) , /* type II attribute. Create new record with product SID generated out of a database sequence (or similar object). End date the last record. Mark current_record_indicator = 'N' for that record */
effective_start_date DATE,
effective_end_date DATE , /* active record for an operational key will have effective_end_date = DATE '2500-01-01' */
record_version NUMBER, /* version number of dimension record */
current_record_indicator VARCHAR2(1) DEFAULT 'N', /* allowable values - Y/N */
CONSTRAINT product_dim_pk PRIMARY KEY (product_sid) USING INDEX) ;
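For reference, the behavior described in the column comments above corresponds roughly to the SQL below. This is only a sketch of the assumed semantics of the Dimension Lookup/Update step, not SQL generated by Pentaho; the :new_* bind variables and :load_date are placeholders, and the product_sid_s sequence is created further below.
-- Type I (punch through): overwrite the attribute on every record with the same operational key
UPDATE product_dim
   SET product_name = :new_product_name ,
       manufacturer_name = :new_manufacturer_name
 WHERE product_id = :product_id ;
-- Type II: end date the current record and insert a new version
UPDATE product_dim
   SET effective_end_date = :load_date ,
       current_record_indicator = 'N' -- not set by the step itself; see the workaround above
 WHERE product_id = :product_id
   AND effective_end_date = DATE '2500-01-01' ;
INSERT INTO product_dim
       (product_sid, product_id, product_name, manufacturer_name, product_category,
        effective_start_date, effective_end_date, record_version, current_record_indicator)
VALUES (product_sid_s.NEXTVAL, :product_id, :product_name, :manufacturer_name, :new_product_category,
        :load_date, DATE '2500-01-01', :previous_version + 1, 'Y') ;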
DROP TABLE changed_products ; -- daily modifications to product dimension
CREATE TABLE changed_products (
product_id NUMBER,
product_name VARCHAR2(256) ,
manufacturer_name VARCHAR2(256),
product_category VARCHAR2(64)) ;
Instead of the “changed products” table, I modified the transformation to read a flat file, primarily to test the capability of reading a compressed file. An Oracle PL/SQL script is used to generate a random data set:
DECLARE
l_manf_names SMALL_VARCHAR_NTT := SMALL_VARCHAR_NTT('Kellogs','General Mills','Tyson','Foster Farm','Johnson','Philips') ;
l_prod_category SMALL_VARCHAR_NTT := SMALL_VARCHAR_NTT('Baby','Beauty','Food','Health','Drugs') ;
l_fp UTL_FILE.file_type := UTL_FILE.FOPEN('DATA_DIR','changed_products.txt','w') ;
BEGIN
FOR i IN 1..10000
LOOP
UTL_FILE.put_line(l_fp,i || ',' || 'Product - ' || i || ',' || l_manf_names(TRUNC(DBMS_RANDOM.value(1,7))) || ',' || l_prod_category(TRUNC(DBMS_RANDOM.value(1,6)))) ;
END LOOP ;
UTL_FILE.fclose(l_fp) ;
END ;
/
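Both generation scripts rely on the nested table types SMALL_VARCHAR_NTT and NUMBER_NTT, and on an Oracle directory object named DATA_DIR that UTL_FILE can write to. The type definitions are not shown here; a minimal sketch of what they are assumed to look like:
-- assumed collection type definitions (not part of the original post)
CREATE OR REPLACE TYPE small_varchar_ntt AS TABLE OF VARCHAR2(256);
/
CREATE OR REPLACE TYPE number_ntt AS TABLE OF NUMBER;
/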
TRUNCATE TABLE product_dim ;
DROP SEQUENCE product_sid_s ;
CREATE SEQUENCE product_sid_s START WITH 1 ;
The number of records in the incoming flat file can easily be controlled by changing the upper limit of the loop index. The first run of the transformation creates new product dimension records. A type I attribute change requires updating all records that share the same operational key with the new value (referred to as “punch through” in Pentaho). The transformation is then repeated after the incoming data file has been changed to include type I & type II attribute changes. Data generation for the second run is as follows:
DECLARE
l_manf_names SMALL_VARCHAR_NTT := SMALL_VARCHAR_NTT('Kellogs','General Mills','Tyson','Foster Farm','Johnson','Philips') ;
l_prod_category SMALL_VARCHAR_NTT := SMALL_VARCHAR_NTT('Baby','Beauty','Food','Health','Drugs') ;
l_product_ids NUMBER_NTT := NUMBER_NTT() ;
l_fp UTL_FILE.file_type := UTL_FILE.FOPEN('DATA_DIR','changed_products.txt','w') ;
BEGIN
SELECT product_id
BULK COLLECT INTO l_product_ids
FROM (SELECT product_id
FROM (SELECT a.* , TRUNC(DBMS_RANDOM.value(1,10000)) rec_no
FROM product_dim a
WHERE current_record_indicator = 'Y')
ORDER BY rec_no)
WHERE rownum <= 500 ;
FOR c1 IN (SELECT column_value product_id
FROM TABLE(l_product_ids))
LOOP
UTL_FILE.put_line(l_fp,c1.product_id || ',' || 'Product - ' || c1.product_id || '.1' || ',' || l_manf_names(TRUNC(DBMS_RANDOM.value(1,7))) || ',' || l_prod_category(TRUNC(DBMS_RANDOM.value(1,6)))) ;
END LOOP ;
UTL_FILE.fclose(l_fp) ;
END ;
/
In the above script, type I & type II attributes are changed for 500 randomly selected records. A type II change has the additional overhead of updating the effective end date of the last record with the same operational key and inserting a new record. Here too, the incoming data set can be made bigger (by raising the ROWNUM limit on the selected records) to test the scalability of the tool. Pentaho’s ability to read a compressed file is very beneficial, especially for large data sets.
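A couple of quick checks can be used to verify the second run (a sketch; the exact counts will vary with the random selection):
-- products that received a type II change should now have more than one version
SELECT product_id , COUNT(*) version_count , MAX(record_version) max_version
  FROM product_dim
 GROUP BY product_id
HAVING COUNT(*) > 1 ;
-- every operational key is expected to have exactly one current record
SELECT product_id
  FROM product_dim
 WHERE current_record_indicator = 'Y'
 GROUP BY product_id
HAVING COUNT(*) <> 1 ;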