How to run a Kettle transformation once per file

In ETL processing, it is fairly common to want to process a directory of identically formatted files, running the same transformation once per file. As far as I can tell, Pentaho Data Integration (Kettle) has no straightforward way of doing this. Below is one method of accomplishing it. It is a rather kludgy process, and I'm surprised it has to be this complicated, but for now it is the only way I know of. I got the basics from a posting on the Pentaho forums and modified it a bit so that you can unzip all the files into a directory, change a couple of variables, and run it.

Download

Installation

Unzip to a directory. There should be 6 files:

parent.kjb
foreach.kjb
do-stuff.kjb
read-file-names.ktr
set-var.ktr
process_file.ktr

Yes, it takes 3 jobs and 3 transformations to do this…can you believe it?

Configuration

parent.kjb

Open this job. The first job entry is Set Dir Path. Open the entry and modify dir.path and dir.wildcard to your liking. dir.path is the directory to list files from, and dir.wildcard is a regular expression that selects the filenames to pull. Because it is a regular expression and not a shell wildcard, something like “*.xml” just won't work. The correct regex to get all files whose extension is .xml is “.*\.xml”. This isn't a regex tutorial; go look that up on the web.
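To make the wildcard-versus-regex point concrete, here is a minimal sketch in Python. Kettle itself is Java-based, so this only illustrates the pattern and is not what Kettle runs internally; for a pattern this simple the two regex dialects agree. The file names are made up, and the use of a whole-name match is my assumption about how dir.wildcard is applied.

```python
import re

# Hypothetical file names, for illustration only.
names = ["orders.xml", "orders.xml.bak", "notes.txt", "xml"]

# The regex from the text: any characters, a literal dot, then "xml".
pattern = re.compile(r".*\.xml")

# fullmatch requires the whole name to match (an assumption about how
# dir.wildcard is applied to each file name).
matched = [n for n in names if pattern.fullmatch(n)]
print(matched)

# The shell-style wildcard is not even a valid regex here: the leading
# "*" has nothing to repeat, so compiling it fails.
try:
    re.compile("*.xml")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False
print(glob_is_valid_regex)
```

Only orders.xml survives the filter: the backup file does not end in .xml, and the bare name "xml" has no dot.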

process_file.ktr

This is the transformation that does the work on each file found. The example just reads the file as fixed-width text. You will place your own logic in here. Notice the variable used as the filename: ${filename}. This is the variable that you use in your transformation to access the file being passed in.

Execution

To execute, you will run parent.kjb.

This will run all the other jobs/transformations, culminating in your transformation (process_file.ktr) being run once for each file found in the directory you specified.
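If you would rather launch the job from a script than from Spoon, Kettle's command-line job runner, Kitchen, can be given the job file with its -file option. The sketch below only builds that command line; the path is a placeholder, and build_kitchen_command is a hypothetical helper of mine, not part of the download.

```python
import shlex

def build_kitchen_command(job_file, kitchen="kitchen.sh", level="Basic"):
    """Return the argument list for running a job with Kitchen.

    job_file is the path to parent.kjb; kitchen is the launcher script
    (assumed to be on PATH); level is the logging level.
    """
    return [kitchen, f"-file={job_file}", f"-level={level}"]

# Placeholder path; substitute the directory you unzipped the files into.
command = build_kitchen_command("/path/to/parent.kjb")
print(shlex.join(command))
```

To actually run it, pass the list to subprocess.run(command, check=True) on a machine where Kettle is installed.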

Explanation

parent.kjb calls read-file-names.ktr, which just gets a list of files from a directory (the one you specified), and passes the results back to the job. Then the job calls foreach.kjb, once for each row. That is the key to this thing working.

foreach.kjb then calls set-var.ktr, which sets up the ${filename} variable from the current row. This extra transformation is necessary because the job entry version of Set Variables, unlike the transformation step, won't let you take the value from a field. Jobs aren't about data, so you can't access field names or data in them. Once the variable is set, do-stuff.kjb is called.

do-stuff.kjb just checks that the filename variable is set, then calls process_file.ktr, which is where the action happens.
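The chain of calls is easier to follow as ordinary code. The following is a rough Python analogy of the flow described above, not anything Kettle executes: each function stands in for the job or transformation of the same name. The row layout (a single "filename" field), the whole-name regex match, and the demo file names are all assumptions made for the illustration.

```python
import os
import re
import tempfile

def read_file_names(dir_path, wildcard):
    """Stand-in for read-file-names.ktr: one row per matching file."""
    regex = re.compile(wildcard)
    return [
        {"filename": os.path.join(dir_path, name)}
        for name in sorted(os.listdir(dir_path))
        if regex.fullmatch(name)
    ]

def set_var(row, variables):
    """Stand-in for set-var.ktr: copy the row's field into a variable."""
    variables["filename"] = row["filename"]

def process_file(variables, processed):
    """Stand-in for process_file.ktr: your own logic would go here."""
    processed.append(os.path.basename(variables["filename"]))

def do_stuff(variables, processed):
    """Stand-in for do-stuff.kjb: check the variable, then do the work."""
    if not variables.get("filename"):
        raise ValueError("filename variable is not set")
    process_file(variables, processed)

def foreach(row, processed):
    """Stand-in for foreach.kjb: runs once per row."""
    variables = {}
    set_var(row, variables)
    do_stuff(variables, processed)

def parent(dir_path, wildcard):
    """Stand-in for parent.kjb: list the files, then loop over the rows."""
    processed = []
    for row in read_file_names(dir_path, wildcard):
        foreach(row, processed)
    return processed

# Demo against a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.xml", "b.xml", "readme.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    result = parent(demo_dir, r".*\.xml")
print(result)
```

In plain code the whole thing collapses to a for loop. The three jobs and three transformations exist only because the loop, the variable assignment, and the per-file work each have to live in a different kind of Kettle object.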