Question

The workflow I'm trying to build is this: enqueue many related tasks to run in parallel (at least several thousand), and once all of the related jobs finish, have a single finalization job execute.

I can't figure out how to get that single finalization job to execute. I'd like it to run as soon as possible after all the related tasks finish. However, the only approach I can think of is to resort to a single-threaded polling job that checks whether all tasks have finished and then enqueues the finalization task.

I've looked at the pipeline documentation https://code.google.com/p/appengine-pipeline/ and watched http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html, which at first seemed promising, but I haven't been able to find a good solution in either.

After considering the pipeline library some more, I think I see a pattern that could scale the number of waited-on jobs up to the high count I need.

Have a batch-enqueuing job: it enqueues one batch of tasks at a time, then starts another batch-enqueuing job that waits on the enqueued batch to complete. Finally, when there are no more batches to execute, the aggregator job is run.
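
In pipeline terms I imagine it looking roughly like the sketch below, where BatchWorker and Aggregator are placeholder child pipelines of my own and BATCH_SIZE is an arbitrary choice:

    import pipeline  # the appengine-pipeline library linked above

    BATCH_SIZE = 100

    class BatchRunner(pipeline.Pipeline):
        def run(self, items, offset=0):
            batch = items[offset:offset + BATCH_SIZE]
            if not batch:
                yield Aggregator()  # no batches left: run the finalization job
                return
            futures = []
            for item in batch:
                # Yielding a child pipeline returns a future for its result.
                futures.append((yield BatchWorker(item)))
            # Start the next batch-enqueuing job only after this batch completes.
            with pipeline.After(*futures):
                yield BatchRunner(items, offset + BATCH_SIZE)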

Is that the pattern to be using for large numbers of waited on jobs?


Solution 2

I never posted this description since I hadn't implemented it yet, but the following is a high-level description of how a single task can be executed after a large number of required tasks complete, on a system such as GAE, without depending on a third-party framework.

Fan Out

  1. fanoutCount = the number of fan-out tasks
  2. Create and commit a tracking entity with fanoutCount, completeCount = 0, and a byte array with one bit per fan-out task
    • Since the maximum entity size is 1 MB, this will support millions of jobs. It might still be safest to impose an upper limit
  3. Iterate over the tasks, enqueuing each with the entity key and a task index (an incremented count); see the sketch after this list
    • You might want to implement a strategy to recover a partially completed fan-out. The tasks must be idempotent, so you could simply restart the fan-out if the same work is identified each time
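
A minimal sketch of steps 2-3, assuming the GAE Python runtime with the ndb and taskqueue APIs; the FanoutJob model, the /consumer URL, and the queue name are names I've made up:

    from google.appengine.api import taskqueue
    from google.appengine.ext import ndb

    class FanoutJob(ndb.Model):
        # Tracking entity: one bit per fan-out task, plus counters.
        fanout_count = ndb.IntegerProperty(required=True)
        complete_count = ndb.IntegerProperty(default=0)
        done_bits = ndb.BlobProperty()

    def start_fanout(work_items):
        # Step 2: create and commit the tracking entity.
        job = FanoutJob(
            fanout_count=len(work_items),
            complete_count=0,
            done_bits=b'\x00' * ((len(work_items) + 7) // 8))
        job_key = job.put()
        # Step 3: enqueue one task per work item, tagged with its index.
        for index, item in enumerate(work_items):
            taskqueue.add(
                url='/consumer',
                params={'job': job_key.urlsafe(),
                        'index': str(index),
                        'item': str(item)},  # whatever identifies the work
                queue_name='fanout-work')
        return job_key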

Task Consumer

  1. Perform whatever business logic is desired; I've assumed the entity key is a sufficient identifier
  2. Enqueue an entry on a fan-in queue with the job entity's key and the task index (see the sketch after this list)
  3. Note on task idempotency: there is no guarantee a task won't execute more than once. You can either
    • a) execute the business logic and the fan-in queue message in a transaction, checking inside it that the business logic has not already been performed, or
    • b) make the individual business-logic steps idempotent (the fan-in will ignore duplicate messages)
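
Continuing the sketch above, here is a consumer handler under option b; webapp2 is the stock GAE Python framework, and do_business_logic is a stand-in for your own (idempotent) logic:

    import webapp2
    from google.appengine.api import taskqueue

    class ConsumerHandler(webapp2.RequestHandler):
        def post(self):
            job = self.request.get('job')
            index = self.request.get('index')
            do_business_logic(self.request.get('item'))  # assumed idempotent
            # Step 2: report completion to the fan-in queue.
            taskqueue.add(
                url='/fanin',
                params={'job': job, 'index': index},
                queue_name='fanin')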

Fan In

  1. Consume the fan-in queue
  2. In a transaction:
    • Get the tracking entity (a candidate for simple caching)
    • Set the bit of the array corresponding to the task index to 1 (if it was already 1, we are double-processing; just continue)
    • Increment completeCount
    • If completeCount == fanoutCount, enqueue your finalization job
    • Commit the transaction
  3. Note on concurrency: a single entity can only sustain so many transactions per second, though the logic itself is simple. Consider processing messages in batches, with low (or no) concurrency, especially if you only have one fan-out executing at a time. A sketch of the transactional step follows.
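
This reuses the FanoutJob model from the fan-out sketch; the transactional task insert means the finalization job is enqueued if and only if the commit succeeds:

    from google.appengine.api import taskqueue
    from google.appengine.ext import ndb

    @ndb.transactional
    def record_completion(job_key, index):
        # index is the integer task index parsed from the fan-in message.
        job = job_key.get()
        bits = bytearray(job.done_bits)
        byte, bit = index // 8, index % 8
        if bits[byte] & (1 << bit):
            return  # bit already set: duplicate message, just continue
        bits[byte] |= 1 << bit
        job.done_bits = bytes(bits)
        job.complete_count += 1
        job.put()
        if job.complete_count == job.fanout_count:
            # Enqueued only if the surrounding transaction commits.
            taskqueue.add(url='/finalize',
                          params={'job': job_key.urlsafe()},
                          transactional=True)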

OTHER TIPS

Waiting for each batch to complete makes your process much more serial - it will take longer to run that way.

If high numbers of varargs are a problem, as a workaround you could have a fan-in task corresponding to each fan-out task, assuming a fan-out doesn't branch into more than about 10 tasks at a time.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow