Question

I recently had a play around with Hadoop and was impressed with its scheduling, management, and reporting of MapReduce jobs. It appears to make the distribution and execution of new jobs quite seamless, allowing the developer to concentrate on the implementation of their jobs.

I am wondering if anything exists in the Java domain for the distributed execution of jobs that are not easily expressed as MapReduce problems? For example:

  • Jobs that require task co-ordination and synchronization. For example, they may involve sequential execution of tasks yet it is feasible to execute some tasks concurrently:

                   .-- B --.
            .--A --|       |--.
            |      '-- C --'  |
    Start --|                 |-- Done
            |                 |
            '--D -------------'
    
  • CPU-intensive tasks that you'd like to distribute but that don't produce any outputs to reduce, such as image conversion or resizing.

So is there a Java framework/platform that provides such a distributed computing environment? Or is this sort of thing acceptable/achievable using Hadoop - and if so are there any patterns/guidelines for these sorts of jobs?


Solution

I have since found Spring Batch and Spring Batch Integration, which appear to address many of my requirements. I will let you know how I get on.

OTHER TIPS

Take a look at Quartz. As far as I know, it supports managing jobs remotely and clustering several machines to run jobs.

I guess you are looking for a workflow engine for CPU-intensive tasks (also known as a "scientific workflow", e.g. http://www.extreme.indiana.edu/swf-survey). But I'm not sure how distributed you want it to be. Usually all workflow engines have a "single point of failure".

I believe quite a few problems can be expressed as map-reduce problems.

For problems that you can't restructure to fit MapReduce, you can look at setting up your own solution using Java's ExecutorService. It is limited to one JVM and is quite low level, but it does make coordination and synchronization easy.
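As a rough illustration of the single-JVM approach, here is a minimal sketch of the A/B/C/D graph from the question. It assumes Java 8 or later and layers CompletableFuture on top of an ExecutorService, which is a convenience I am adding rather than something the answer prescribes. The class name WorkflowSketch and the task() helper are hypothetical placeholders for real work.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkflowSketch {

    // Hypothetical stand-in for real work: records the task name when it runs.
    static Runnable task(String name, List<String> log) {
        return () -> log.add(name);
    }

    // Runs the graph from the question and returns the order in which tasks completed:
    // A and D start immediately, B and C start once A finishes, Done waits for B, C and D.
    static List<String> runWorkflow(ExecutorService pool) {
        List<String> log = new CopyOnWriteArrayList<>();

        CompletableFuture<Void> a = CompletableFuture.runAsync(task("A", log), pool);
        CompletableFuture<Void> d = CompletableFuture.runAsync(task("D", log), pool);
        CompletableFuture<Void> b = a.thenRunAsync(task("B", log), pool);
        CompletableFuture<Void> c = a.thenRunAsync(task("C", log), pool);

        // Block until the final step has run.
        CompletableFuture.allOf(b, c, d)
                .thenRun(task("Done", log))
                .join();
        return log;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            System.out.println(runWorkflow(pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

The relative order of B, C and D varies between runs, but A always precedes B and C, and Done always comes last. None of this crosses a JVM boundary, so it only covers the coordination part of the question, not the distribution part.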

ProActive Scheduler seems to fit your requirements, especially the complex workflows with task coordination that you mentioned. It is open source and Java based. You can use it to run anything: Hadoop jobs, scripts, Java code, and so on.

Disclaimer: I work for the company behind it

Try the Redisson framework. It provides a simple API to execute and schedule java.util.concurrent.Callable and java.lang.Runnable tasks. See its documentation on the distributed Executor service and Scheduler service.

Licensed under: CC-BY-SA with attribution