My introduction to Pentaho Data Integration – Kettle (ETL Tool)

Context:

Recently I was introduced to Kettle through a new project that requires ingesting huge, dynamic data files into a data warehouse using PDI. I have good experience with the Microsoft technology stack - .NET Framework, C#, SQL Server, SSIS. The idea of open source has always fascinated me, but I have not worked much with open source tools.

So working with PDI is quite a different experience for me. I like its continuous evolution and the capabilities it offers.

The first thing I fell in love with is the ‘ETL Metadata Injection’ component. It is very helpful in making your ETL packages completely dynamic. The rest of this post talks about the ETL Metadata Injection component, listed under the ‘Flow’ category in the transformation design steps.

Requirement:

In my case the requirement was to load 100+ data files with dynamic columns. The columns are not fixed and are defined by a metadata file. The data files and the metadata files are delivered daily over FTP as flat delimited files. Naturally, I needed to normalize the data, storing each column name as an attribute name and the value as the corresponding attribute value. Each line in a data file also contains an entity id to which the attributes belong.

To visualize the data, you can think of the following structure for a data file:

Entity Id, Attribute 1, Attribute 2, Attribute 3, …, Attribute N
1, ABC Corp, A1, 21.31, …, Y
2, BCD Corp, A2, 20.44, …, Y
3, CDE Corp, A3, 22.78, …, N
4, EFG Corp, A4, 24.21, …, Y
5, GEH Corp, A5, 25.67, …, Y
20000, XYZ Corp, Z3000, 10.50, …, Y

N (the number of attributes) can grow as large as 40,000. To manage file size, the data provider splits the data vertically into multiple files, each containing 200+ columns, and repeats the Entity Id in each file so the data can be joined back together.
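To make the target shape concrete, here is a minimal sketch in plain Python (not PDI code) that melts one such wide file into (entity id, attribute name, attribute value) rows. The file names are hypothetical, and in the actual solution this reshaping is done inside PDI by the ‘Row Normalizer’ step described below, not by hand-written code.

import csv

def normalize_wide_file(data_path, out_path):
    """Melt a wide, delimited data file into (entity_id, attribute, value) rows.

    Conceptual sketch only; PDI's 'Row Normalizer' step performs this
    reshaping in the real solution.
    """
    with open(data_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, skipinitialspace=True)   # tolerate ", " separators
        writer = csv.writer(dst)
        header = next(reader)                              # ['Entity Id', 'Attribute 1', ...]
        attributes = header[1:]                            # everything after the entity id
        writer.writerow(["entity_id", "attribute_name", "attribute_value"])
        for row in reader:
            entity_id, values = row[0], row[1:]
            for name, value in zip(attributes, values):
                writer.writerow([entity_id, name, value])

# Hypothetical usage:
# normalize_wide_file("entities_part01.csv", "entities_part01_normalized.csv")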

Solution:

My first impression was that I would have to write a custom component to handle this dynamic scenario, but the ‘ETL Metadata Injection’ component came to the rescue and allowed me to load this data without writing a custom component.

‘ETL Metadata Injection’, as the name suggests, allows you to inject the metadata of a supported component at run time. For example, to define the metadata of a ‘CSV File Input’ component (the file name, the column names and the data type of each column), you normally edit the ‘CSV File Input’ component itself, but then the metadata is fixed at design time. Of course, some of the metadata can be loaded from variables, but not all of it: the file name can be defined using a variable, but the column details cannot.

Since the requirement is to load dynamic files, I could not use the simple approach of splitting the data and consuming it through multiple ‘CSV File Input’ instances with fixed column details. If the number of columns in each file were fixed at 200 and the number of attributes fixed at 40,000, I could have used 200 different instances of ‘CSV File Input’; a dirty approach, but it could have worked. In my case, however, the number of attributes keeps growing and the number of columns per file keeps changing. Therefore I load the column details into memory and inject them into the ‘CSV File Input’ using the ‘ETL Metadata Injection’ component, as sketched below. A similar approach was used to normalize the data with ‘Row Normalizer’ before saving it through ‘Table Output’.
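Here is a minimal sketch in plain Python (not PDI code) of loading the column details from the daily metadata file into memory. The file name and the two-column layout (column name, data type) are assumptions for illustration only; in PDI itself this stream would come from a step wired into the ‘ETL Metadata Injection’ mapping rather than from hand-written code.

import csv

def load_column_metadata(metadata_path):
    """Read the daily metadata file and return the column definitions that
    would be injected into the template's 'CSV File Input' step.

    Assumed metadata layout (illustration only): one row per column,
    'column_name,data_type'.
    """
    columns = []
    with open(metadata_path, newline="") as f:
        for name, data_type in csv.reader(f):
            columns.append({"name": name.strip(), "type": data_type.strip()})
    return columns

# Hypothetical usage: this list plays the role of the rows mapped onto the
# 'Fields' section of the injected 'CSV File Input' step.
# for col in load_column_metadata("attributes_metadata.csv"):
#     print(col["name"], col["type"])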

The rest of this post goes under the covers to understand more about ETL Metadata Injection.

Under the covers:

Under the covers, ETL Metadata Injection reads the XML of your package and manipulates it to update the sections that define column details and other metadata. Please note that only a few properties are exposed for writing through metadata injection, and not all components support it. As of this writing only the following components support metadata injection, but the list is continuously growing:

1. CSV Input

2. Row Denormaliser

3. Excel Input

4. Split Field

5. Fixed file input

6. Row Normalizer

7. Select Values

After the metadata is injected, this component executes the modified package. A future version is expected to support writing the modified XML to disk, with an option to choose whether or not to execute it.

If you enable detailed logging in Spoon, you can see the modified XML in the logs.
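As a rough, hypothetical illustration of the idea (not the actual PDI implementation), the sketch below rewrites the fields section of a CSV input step in a saved .ktr file using plain Python. The element names and file names are assumptions about the .ktr XML layout, and PDI performs the equivalent manipulation in memory before executing the modified package.

import xml.etree.ElementTree as ET

def inject_csv_fields(ktr_path, step_name, columns, out_path):
    """Rewrite the <fields> block of a CSV input step in a saved .ktr file.

    Rough illustration of what metadata injection does conceptually; the
    element names used here are assumptions about the .ktr layout.
    """
    tree = ET.parse(ktr_path)
    for step in tree.getroot().iter("step"):
        if step.findtext("name") == step_name:
            fields = step.find("fields")
            if fields is None:
                fields = ET.SubElement(step, "fields")
            for old in list(fields):           # drop the design-time columns
                fields.remove(old)
            for col in columns:                # add the injected columns
                field = ET.SubElement(fields, "field")
                ET.SubElement(field, "name").text = col["name"]
                ET.SubElement(field, "type").text = col["type"]
    tree.write(out_path)

# Hypothetical usage:
# inject_csv_fields("load_entities_template.ktr", "CSV file input",
#                   [{"name": "Attribute 1", "type": "String"}],
#                   "load_entities_injected.ktr")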

References:

http://wiki.pentaho.com/display/EAI/ETL+Metadata+Injection

http://www.ibridge.be/?p=194

Posted on Wednesday, February 6, 2013 12:23 PM | ETL, pentaho, kettle, PDI, Data Loading, Datawarehouse, Metadata Injection, Pentaho Data Integration


Comments on this post: My introduction to Pentaho Data Integration – Kettle (ETL Tool)

# re: My introduction to Pentaho Data Integration – Kettle (ETL Tool)
Hi,
Can you please tell me how this can be executed with C#? Is it possible?
Left by Talha on Oct 07, 2013 12:23 PM

# re: My introduction to Pentaho Data Integration – Kettle (ETL Tool)
I have a similar requirement. Can you please post an example of how to load the column details in memory and inject them into the ‘CSV File Input’ using the ‘ETL Metadata Injection’ component? What components should I use for this in place of a data grid? Also, do you have any idea how to pass the merge join key as a parameter instead of hard coding it?
Left by priyanka on Feb 26, 2014 3:35 PM
