Strict coordinator job ordering on Oozie

Question

This is exactly the use case that Oozie was designed to solve. Oozie waits for all data dependencies to be satisfied before launching a workflow.

Take a look at the following configuration in your coordinator.xml:

    <datasets>
        <dataset name="my_data" frequency="${coord:days(1)}" initial-instance="2013-01-27T00:00Z">
            <uri-template>YOUR_DATA/${YEAR}${MONTH}${DAY}</uri-template>
        </dataset>
    ...
    </datasets>

    <input-events>
        <data-in name="my_data" dataset="my_data">
            <instance>${coord:current(-1)}</instance>
        </data-in>
    </input-events>

    <output-events>
        <data-out name="my_data" dataset="my_data">
           <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>

The "coord:current(-1)" in input-events refers to the previous dataset instance, i.e. the output of the previous run. Oozie resolves the dataset URI template to yesterday's directory and checks whether that data exists in HDFS by looking for a done flag, which by default is an empty file named "_SUCCESS" directly under the output directory. Oozie keeps waiting for this flag before launching the current workflow.
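
If you want the flag to be explicit rather than implied, the dataset can declare it with a done-flag element. This is only a sketch: the timezone value is my assumption (pick whatever matches your data), and "_SUCCESS" is the same as the default, so it only makes sense to spell it out if your job writes a different marker file.

    <dataset name="my_data" frequency="${coord:days(1)}"
             initial-instance="2013-01-27T00:00Z" timezone="UTC">
        <uri-template>YOUR_DATA/${YEAR}${MONTH}${DAY}</uri-template>
        <!-- File Oozie looks for inside the resolved directory.
             An empty <done-flag></done-flag> makes Oozie check only
             that the directory itself exists. -->
        <done-flag>_SUCCESS</done-flag>
    </dataset>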

btw, you can also set

    <coordinator-app name="my_coordinator" frequency="${coord:days(1)}" start="${start_time}" end="${end_time}" ...>

to define the start time and end time of the coordinator job, so that it can catch up on backlog data.
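
When catching up on a backlog, the controls element is what keeps the materialized actions running one at a time and oldest first. A minimal sketch, assuming you want strictly sequential runs; as far as I know these values are already the defaults, so this only makes the ordering explicit:

    <coordinator-app name="my_coordinator" frequency="${coord:days(1)}" start="${start_time}" end="${end_time}" ...>
        <controls>
            <!-- at most one action of this coordinator runs at a time -->
            <concurrency>1</concurrency>
            <!-- oldest pending action is started first -->
            <execution>FIFO</execution>
        </controls>
        ...
    </coordinator-app>

Together with the "coord:current(-1)" input dependency, each day's action then starts only after the previous day's output has been flagged as complete.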