Wednesday, September 14, 2011

Pivoting columns into rows using Pentaho

Another interesting Pentaho data integration tool's functionality that I played around with is: pivot delimited fields in a file into rows. Following is the description of the test case.

This transformation requires pivoting a column in the incoming row into multiple rows. Input & outputs are file based so no database table structure is presented in this section.

Input Data:

Input file contains two data elements - email address and list of pages (properties) visited by the user. Second field is of variable length and number of properties (e.g. 3 pages for the first email as compared to 1 for the second email address) is each line is not known beforehand. Email & "properties" are separated by ","  while individual elements in the second field is separated by semi-colon (";").

Example:
Input data
a@sbcglobal.net,Auto;Mail;Finance
b@sbcglobal.net,Auto

Intended Transformation:

The transformation will generate two output files. The format of the output files are described below.
First file will contain columns from source pivoted into rows.
Output will look like
Output data
a@sbcglobal.net,Auto
a@sbcglobal.net,Mail
a@sbcglobal.net,Finance
b@sbcglobal.net,Auto

Second file will contain count of unique emails per property. So this output file will look like:
Output Data
Auto,2
Mail,1
Finance,1

Generation of test data:

Oracle PL/SQL block is used to generated random data set with email ids and pages that these users visited.

DECLARE
   l_fp UTL_FILE.file_type := UTL_FILE.fopen('DATA_DIR','output.txt','w') ;
BEGIN
   FOR c1 IN (SELECT email,
                     CASE
                        WHEN rn = 1 THEN TRIM(SUBSTR(property_list,1,INSTR(property_list,';') - 1))
                        ELSE TRIM(SUBSTR(property_list,INSTR(property_list || ';' ,';',1,rn - 1) + 1, INSTR(property_list || ';' ,';',1,rn ) - INSTR(property_list || ';' ,';',1,rn - 1 ) - 1 ))
                     END property_name
               FROM  (SELECT a.* , b.column_value rn
                      FROM   (SELECT a.email, a.property_list, LENGTH(property_list || ';' ) - LENGTH(REPLACE(property_list || ';' ,';','')) no_of_elements
                              FROM   property_visited_ext a) a,
                             TABLE(generate_rows(no_of_elements)) b
                     )
              )
   LOOP
      UTL_FILE.put_line(l_fp,c1.email || ',' || c1.property_name) ;
   END LOOP ;
   UTL_FILE.fclose(l_fp) ;
END ;

No comments:

Post a Comment