Question

I have a coordinator on Oozie that runs a series of tasks, each of which depends on the output of the previous one. Each task outputs a dated folder and looks for the output of its predecessor using

${coord:latest(0)}

This all worked fine on my dev cluster when nothing else was running. Every 5 minutes Oozie would queue up another job, and within those 5 minutes the previous job had already run, so when the new job was set up it would see the directory it needed.

I run into problems on the production cluster. The jobs get submitted but sit in a queue for a while before they run, yet Oozie still queues up another one every 5 minutes. In its initialization stage each new job is assigned its 'previous' folder, but that folder hasn't been created yet because its predecessor hasn't run, so the 'latest' function gives it the same input as the previous job. I then end up with 10 jobs all taking the same input...

What I need is a way of strictly preventing the next job in a coordinator sequence from even being created until its predecessor has finished running. Is there a way this can be done?

Thanks for reading


Solution

This is exactly the use case Oozie was designed to solve. Oozie waits for all data dependencies to be available before launching an action.

Take a look at the following configuration in your coordinator.xml:

    <datasets>
        <dataset name="my_data" frequency="${coord:days(1)}" initial-instance="2013-01-27T00:00Z">
            <uri-template>YOUR_DATA/${YEAR}${MONTH}${DAY}</uri-template>
        </dataset>
    ...
    </datasets>

    <input-events>
        <data-in name="my_data" dataset="my_data">
            <instance>${coord:current(-1)}</instance>
        </data-in>
    </input-events>

    <output-events>
        <data-out name="my_data" dataset="my_data">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>

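The "coord:current(-1)" in input-events refers to the previous output. Oozie resolves the dataset URI template to "yesterday" and checks whether that data exists in HDFS by looking for a success flag, which by default is an empty file named "_SUCCESS" directly under the output directory. Oozie keeps waiting for this flag before launching the current workflow. Unlike "coord:latest(0)", which picks whatever instance is newest at the time it is evaluated, "coord:current(-1)" is fixed relative to each action's nominal time, so queued jobs cannot all end up with the same input.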
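Adapted to the 5-minute schedule from the question, a complete coordinator might look roughly like the sketch below. This is only an illustration under some assumptions: the minute-level folder layout, the event names "previous_output" and "current_output", the "${workflow_path}" variable, and the "inputDir"/"outputDir" property names are all hypothetical and would need to match your own workflow.

```xml
<coordinator-app name="my_coordinator" frequency="${coord:minutes(5)}"
                 start="${start_time}" end="${end_time}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <datasets>
        <!-- One dataset instance every 5 minutes; the folder layout is an assumption -->
        <dataset name="my_data" frequency="${coord:minutes(5)}"
                 initial-instance="2013-01-27T00:00Z" timezone="UTC">
            <uri-template>YOUR_DATA/${YEAR}${MONTH}${DAY}${HOUR}${MINUTE}</uri-template>
            <!-- Explicit form of the default: wait for this file inside the folder -->
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <!-- The action stays in WAITING until the previous instance has its done flag -->
        <data-in name="previous_output" dataset="my_data">
            <instance>${coord:current(-1)}</instance>
        </data-in>
    </input-events>
    <output-events>
        <data-out name="current_output" dataset="my_data">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <app-path>${workflow_path}</app-path>
            <configuration>
                <!-- Hypothetical property names; the workflow reads them as ${inputDir} and ${outputDir} -->
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('previous_output')}</value>
                </property>
                <property>
                    <name>outputDir</name>
                    <value>${coord:dataOut('current_output')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```

If the task does not write "_SUCCESS" itself (MapReduce jobs normally do), the workflow would need to create that file as its final step, otherwise the next action would never start.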

By the way, you can also set

    <coordinator-app name="my_coordinator" frequency="${coord:days(1)}" start="${start_time}" end="${end_time}" ...>

to define the start and end times of the coordinator job, so that it can catch up on backlog data.
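If you also want to make sure that queued actions run strictly one at a time and in order, the coordinator's controls block is worth a look. The values below are only a sketch and should be tuned for your cluster; the block goes inside coordinator-app, before datasets.

```xml
<controls>
    <!-- Minutes a materialized action may wait for its input; -1 means no timeout -->
    <timeout>-1</timeout>
    <!-- At most one action of this coordinator runs at any time -->
    <concurrency>1</concurrency>
    <!-- Run waiting actions oldest first -->
    <execution>FIFO</execution>
</controls>
```

Note that Oozie still creates the later actions on schedule; they simply stay in the waiting state until their input data is available.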

Licensed under: CC-BY-SA with attribution