Here is the working solution (it has been running for several days now):
<coordinator-app name="Xvlr-parser-coordinator" frequency="60"
                 start="2013-03-07T16:35Z" end="2113-01-01T00:35Z" timezone="Europe/Moscow" xmlns="uri:oozie:coordinator:0.3">
    <controls>
        <timeout>3</timeout>
        <concurrency>1</concurrency>
    </controls>
    <datasets>
        <dataset name="xvlrInputDataset" frequency="${coord:hours(1)}" initial-instance="2013-03-07T16:35Z" timezone="Europe/Moscow">
            <uri-template>${nameNode}/staging/landing/source/xvlr/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
            <done-flag></done-flag>
        </dataset>
        <dataset name="xvlrOutputDataset" frequency="${coord:hours(1)}" initial-instance="2013-03-07T16:35Z" timezone="Europe/Moscow">
            <uri-template>${nameNode}/masterdata/source/xvlr/archive/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
            <done-flag></done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="xvlrInputEvent" dataset="xvlrInputDataset">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <output-events>
        <data-out name="xvlrOutputEvent" dataset="xvlrOutputDataset">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <app-path>${oozieAppHomeCatalog}/sub-workflows/Xvlr-parser-subworkflow.xml</app-path>
            <configuration>
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('xvlrInputEvent')}</value>
                </property>
                <property>
                    <name>outputDir</name>
                    <value>${coord:dataOut('xvlrOutputEvent')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
What does it do?
- The coordinator started from 2013-03-07T16:35Z, so all previously collected data was passed through the underlying workflow (an MR job invocation that does the parsing).
- While working through these "past time datasets" (dataset time earlier than the current time), the workflow ran back to back, one instance at a time: it consumed /pastdate/hour_00, then immediately started on /pastdate/hour_01, etc.
- Once the coordinator reached the present time, it started invoking the workflow once per hour (as designed: 05:35, 06:35 ... 23:35).
- Note the timeout declaration. I did have missing datasets: for example, there was no data for the 10th hour of the first of March. The action for that hour waited 3 minutes for its input and then timed out.
The problem is solved.
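
The one-by-one catch-up and the 3-minute wait both come from the `<controls>` block. Below is a minimal annotated sketch of that block only, as I understand the Oozie coordinator controls. The `<execution>` element is not in my coordinator above; it is added here purely as an assumption to make the ordering explicit (FIFO, oldest first, is the documented default).

```xml
<!-- Sketch of the <controls> block only; it sits inside <coordinator-app>. -->
<controls>
    <!-- Minutes a materialized action waits for its input data
         before it is given up on (timed out). -->
    <timeout>3</timeout>
    <!-- At most one action runs at a time, so backlog hours
         are processed one after another rather than in parallel. -->
    <concurrency>1</concurrency>
    <!-- Assumption, not in the original coordinator: run the oldest
         pending action first. FIFO is the documented default. -->
    <execution>FIFO</execution>
</controls>
```

Since both datasets declare an empty `<done-flag>`, a dataset instance counts as available as soon as its directory exists, so "missing" here means the hourly directory was never created.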