Question

Preface

I have automated and scripted the creation of individual .ktr files to handle the extraction and syncing of data between the Source (MySQL) and Target (Infobright) databases. One .ktr file is created for each table.

I have a set of 2 Jobs and 2 Transformations which make up a "run" to find the data sync .ktr files and queue them for execution.

Job 1 (entry point)

  • Run Transformation to search target directory for files matching wildcard passed from command line
  • For every row, run Job 2 (file looper)
  • After the run is done, do some error checking, mailing, close out

Job 2 (file looper)

  • Run Transformation to take the result and populate a variable with the filename
  • Run ${filename} Transformation to perform syncing between MySQL and Infobright
  • Perform some error checking, populate an error log, etc. Standard graceful failures and error logging

This all works perfectly. I can queue up 250+ .ktr files in my target directory, and Kitchen gets through them in about 9-15 minutes, depending on the volume of data to sync.
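
For reference, the entry-point invocation looks roughly like this (the job path and the FILE_MASK parameter name are illustrative; -file, -param:NAME=VALUE and -level are standard Kitchen flags):

    # Hypothetical entry-point call; job path and FILE_MASK are placeholders.
    ./kitchen.sh -file=/path/to/job1_entry.kjb \
                 -param:FILE_MASK="sync_*.ktr" \
                 -level=Basic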

Problem

Pentaho doesn't appear to support parallelizing this kind of abstract, looped execution of transformations: Jobs don't support output distribution the way Transformations do. I've checked the Pentaho support forums, and posted there with no response.

I'm looking to get 4 or 5 parallel threads going, each executing one of the queued results (gathered filenames). I'm hoping somebody here can provide some insight into how I can achieve this, aside from manually globbing files with filename tags and running the Kitchen job 5 times, passing in the filename tags as a parameter.

(This doesn't really address the result distribution issue, as it just runs 5 separate sequential jobs and doesn't balance the workload; a sketch of this workaround follows.)
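
Roughly, that manual workaround would look like the following (filename tags and paths are hypothetical); each Kitchen process works through its own pre-globbed batch sequentially, so an idle process can't pick up work from a slow one:

    # Hypothetical manual split: five independent Kitchen runs, one per
    # filename tag, each a sequential run over its own batch of .ktr files.
    for tag in batch1 batch2 batch3 batch4 batch5; do
        ./kitchen.sh -file=/path/to/job1_entry.kjb \
                     -param:FILE_MASK="sync_${tag}_*.ktr" &
    done
    wait    # block until all five runs finish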

EDIT: Here's the post on the Pentaho forums with images, that might help to illustrate what I'm talking about: http://forums.pentaho.com/showthread.php?162115-Parallelizing-looped-job-step

Cheers

Solution

After a lot of trial and error and a LOT of research, I've discovered the following:

  1. Kettle doesn't support load-based distribution, only round-robin (it's typically used to distribute rows of data to different steps, so load / execution time is almost never a factor).

  2. Round-robin-only distribution means each Job in the distribution will handle the same number of results (in my case, each Job Executor step handles 9 transformations, regardless of how long each one might take; see the sketch after this list).

  3. The workaround (round-robin distribution rather than true parallelization) was simpler than I expected once I fully grasped how Kettle processes and passes results: I only needed to move the job execution from my parent Job into the first Transformation, using the Job Executor step.

  4. Because of this distribution method, it is beneficial to have the long-running transformations sit next to each other in the result rows, so that they are spread evenly across the Job Executor copies.
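
To make #1 and #2 concrete, here is a minimal shell simulation of round-robin assignment (it only mimics Kettle's behaviour, it is not Kettle code); with, say, 36 result rows and 4 copies, every copy gets exactly 9 rows no matter how long each .ktr takes:

    # Simulate Kettle's round-robin: row i always goes to copy (i mod 4).
    # 36 rows -> 9 per copy, regardless of per-row execution time.
    i=0
    while read -r ktr; do
        echo "$ktr" >> "copy_$((i % 4)).txt"
        i=$((i + 1))
    done < filelist.txt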

I did add a reply to my thread on the Pentaho Forums, providing a picture of my solution.

Unfortunately, per #1, it appears as though there's no support for my original goal.

OTHER TIPS

Can't this be done with a transformation with an input rowset -> Job Executor step?

Job Executor step:

  • Filename can be parameterized in the Parameters tab
  • "No of rows to send" = 1 in the Row grouping tab
  • "No. of copies" changed from 1 to whatever you need
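
As a rough sketch, such a wrapper transformation could then be launched with Pan, Kitchen's counterpart for transformations (the file path is a placeholder); with N copies of the Job Executor step and one row sent per group, up to N child jobs run concurrently inside the one Pan process:

    # Hypothetical wrapper-transformation run; -file and -level are
    # standard Pan flags. Parallelism comes from the step copies inside.
    ./pan.sh -file=/path/to/file_dispatch.ktr -level=Basic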
