Workload Distribution / Parallel Execution in JAVA

https://stackoverflow.com/questions/1039820

22-07-2019

Question

I have a situation where I need to distribute work across multiple Java processes running in different JVMs, probably on different machines.

Let's say I have a table with records 1 to 1000. I am looking for work to be collected and distributed in sets of 10. Let's say records 1-10 go to workerOne, then records 11-20 go to workerThree, and so on. Needless to say, workerOne never does the work of workerTwo unless workerTwo couldn't do it.

This example is based purely on a database, but I believe it could be extended to any system, be it file processing, email processing and so forth.

I have a small feeling that the immediate response would be to go for a Master/Worker approach. However here we are talking about different JVMs. Even if one JVM were to come down the other JVM should just keep doing its work.

Now the million dollar question: are there any good (production-ready) frameworks that would let me do this? Concrete implementations for specific needs such as database records, file processing, email processing and the like would also help.

I have seen the Java Parallel Execution Framework, but I am not sure whether it can be used across different JVMs, or whether the others would keep going if one were to come down. I believe workers could be on multiple JVMs, but what about the master?

More Info 1: Hadoop would be a problem because of the JDK 1.6 requirement. That's a bit too much.

Thanks, Franklin

Solution

You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each one of the workers just keeps waiting on the queue for something to show up. When it does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. It's simple, and people have been doing it that way for a long time, so there's a lot of information about it on the net.
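To make the flow concrete, here is a minimal single-JVM sketch. It assumes Java 8 or later and uses an in-memory `BlockingQueue` as a stand-in for a real message broker; in the multi-JVM setup from the question the queue would live in a broker and each worker would be its own process. The class and method names (`QueueWorkersDemo`, `chunk`, `run`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the in-memory queue stands in for a real message broker.
class QueueWorkersDemo {

    // Producer side: package record ids first..last into chunks of at most chunkSize.
    static List<List<Integer>> chunk(int first, int last, int chunkSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        for (int start = first; start <= last; start += chunkSize) {
            List<Integer> ids = new ArrayList<>();
            for (int id = start; id < start + chunkSize && id <= last; id++) {
                ids.add(id);
            }
            chunks.add(ids);
        }
        return chunks;
    }

    // Starts workerCount workers that pull chunks until the queue stays empty,
    // then returns every record id that was processed.
    static List<Integer> run(int first, int last, int chunkSize, int workerCount) {
        BlockingQueue<List<Integer>> queue =
                new LinkedBlockingQueue<>(chunk(first, last, chunkSize));
        ConcurrentLinkedQueue<Integer> processed = new ConcurrentLinkedQueue<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread worker = new Thread(() -> {
                try {
                    List<Integer> work;
                    // Each chunk is handed to exactly one worker; a worker that
                    // dies simply stops polling and the others keep draining.
                    while ((work = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
                        processed.addAll(work); // "processing" is just recording the id
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return new ArrayList<>(processed);
    }

    public static void main(String[] args) {
        System.out.println(run(1, 1000, 10, 3).size()); // prints 1000
    }
}
```

Note that this sketch does not cover a worker that dies after taking a chunk but before finishing it; with a real broker that is usually handled by acknowledging a message only after the work is done.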

OTHER TIPS

Might want to look into MapReduce and Hadoop

Check out Hadoop

I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.

If you want to do this yourself, you will need a work manager that keeps track of jobs to do, jobs in progress, and jobs that were never finished and need to be rescheduled. The workers then ask for something to do, do it, and send the result back, asking for more.
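A rough sketch of such a work manager could look like the following. It assumes a simple lease: a job that has not been reported back within `leaseMillis` is treated as lost and rescheduled. The class name, method names and the lease idea are assumptions for illustration, and the remote transport between workers and manager is left out entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical work manager; all names are invented for this sketch.
class WorkManager {
    private final Deque<Integer> pending = new ArrayDeque<>();      // jobs to do
    private final Map<Integer, Long> inProgress = new HashMap<>();  // job id -> lease deadline (ms)
    private final Map<Integer, String> results = new HashMap<>();   // finished jobs
    private final long leaseMillis;

    WorkManager(int jobCount, long leaseMillis) {
        this.leaseMillis = leaseMillis;
        for (int id = 1; id <= jobCount; id++) {
            pending.add(id);
        }
    }

    // A worker asks for something to do; returns null when nothing is pending.
    // The current time is passed in so the behaviour is easy to test.
    synchronized Integer requestJob(long nowMillis) {
        requeueExpired(nowMillis);
        Integer job = pending.poll();
        if (job != null) {
            inProgress.put(job, nowMillis + leaseMillis);
        }
        return job;
    }

    // A worker sends a result back; a second result for the same job is ignored.
    synchronized void complete(int job, String result) {
        inProgress.remove(job);
        pending.remove(Integer.valueOf(job)); // in case it was already rescheduled
        results.putIfAbsent(job, result);
    }

    // Jobs whose lease ran out are assumed lost and go back on the pending queue.
    private void requeueExpired(long nowMillis) {
        Iterator<Map.Entry<Integer, Long>> it = inProgress.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                Integer job = entry.getKey();
                it.remove();
                pending.add(job);
            }
        }
    }

    synchronized boolean allDone() {
        return pending.isEmpty() && inProgress.isEmpty();
    }

    synchronized int finishedCount() {
        return results.size();
    }
}
```

A worker loop would then call `requestJob(System.currentTimeMillis())`, do the work, and call `complete(job, result)` until `requestJob` returns null.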

You may want to elaborate on what kind of work you want to do.

The problem you've described is definitely best solved using the master/worker pattern.

You should have a look at JavaSpaces (part of the Jini framework); it's really well suited to this kind of thing. Basically you just want to encapsulate each task to be carried out inside a Command object, subclassing as necessary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble when done.

Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.
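The Command idea can be sketched without any Jini dependency. The example below deliberately does not use the JavaSpaces API: an in-memory queue stands in for the space, and a single loop stands in for the workers. `Command`, `SumRange` and `CommandSpaceDemo` are hypothetical names, and summing record ids is just a placeholder for real work.

```java
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical base class: one unit of work plus the id used to reassemble results.
abstract class Command {
    final int id;

    Command(int id) {
        this.id = id;
    }

    abstract long execute();
}

// Example subclass: sums the record ids in [from, to] as placeholder work.
class SumRange extends Command {
    private final int from;
    private final int to;

    SumRange(int id, int from, int to) {
        super(id);
        this.from = from;
        this.to = to;
    }

    @Override
    long execute() {
        long sum = 0;
        for (int record = from; record <= to; record++) {
            sum += record;
        }
        return sum;
    }
}

class CommandSpaceDemo {
    // Writes one Command per chunk into the stand-in "space", lets a worker loop
    // take and execute them one at a time, then reassembles the partial results.
    static long run(int first, int last, int chunkSize) {
        Queue<Command> space = new ConcurrentLinkedQueue<>();
        int nextId = 0;
        for (int start = first; start <= last; start += chunkSize) {
            int end = Math.min(start + chunkSize - 1, last);
            space.add(new SumRange(nextId++, start, end));
        }

        Map<Integer, Long> partials = new TreeMap<>(); // keyed by command id
        Command command;
        while ((command = space.poll()) != null) {     // the "worker" takes one entry
            partials.put(command.id, command.execute());
        }

        long total = 0;
        for (long partial : partials.values()) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run(1, 1000, 10)); // prints 500500
    }
}
```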

If you work on records in a single database, consider performing the work within the database itself using stored procedures. The gain from processing the records on different machines might be negated by the cost of retrieving and transmitting the work between the database and the computing nodes.

For file processing it could be a similar case. Working on files in a (shared) filesystem might put heavy I/O pressure on the OS.

And the cost of maintaining multiple JVMs on multiple machines might be overkill too.

And for the question: I once used JADE (Java Agent Development Environment) for a distributed simulation. Its multi-machine support and message-passing nature might help you.

I would consider using JGroups for that. You can cluster your JVMs, and one of the nodes can be selected as master and distribute the work to the other nodes by sending messages over the network. Alternatively, you can partition your work items up front and have the master node manage the distribution of the partitions: partition-1 goes to JVM-4, partition-2 goes to JVM-3, partition-3 goes to JVM-2, and so on. If JVM-4 goes down, the master node will notice and tell one of the other nodes to pick up partition-1 as well. Another alternative that is easier to use is Redis pub/sub support: http://redis.io/topics/pubsub. But then you will have to maintain Redis servers, which I don't like.
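The bookkeeping the master would do for the partition approach can be sketched in plain Java. This is only the assignment logic under assumed names (`PartitionMaster`, `nodeDown`, `ownerOf`); it makes no JGroups calls. In a real cluster, `nodeDown` would presumably be triggered by the group membership change the master observes, and the new assignment would be sent to the nodes as messages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical master-side bookkeeping; cluster messaging is intentionally omitted.
class PartitionMaster {
    private final Map<Integer, String> owner = new TreeMap<>(); // partition -> node
    private final List<String> liveNodes = new ArrayList<>();

    // Initial round-robin assignment of partitions 1..partitionCount to the nodes.
    PartitionMaster(List<String> nodes, int partitionCount) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        liveNodes.addAll(nodes);
        for (int p = 1; p <= partitionCount; p++) {
            owner.put(p, liveNodes.get((p - 1) % liveNodes.size()));
        }
    }

    // Called when the master notices that a node has left the cluster:
    // its partitions are handed round-robin to the surviving nodes.
    void nodeDown(String node) {
        liveNodes.remove(node);
        if (liveNodes.isEmpty()) {
            throw new IllegalStateException("no live nodes left");
        }
        int next = 0;
        for (Map.Entry<Integer, String> entry : owner.entrySet()) {
            if (entry.getValue().equals(node)) {
                entry.setValue(liveNodes.get(next % liveNodes.size()));
                next++;
            }
        }
    }

    String ownerOf(int partition) {
        return owner.get(partition);
    }
}
```

For example, with nodes JVM-4, JVM-3 and JVM-2 and six partitions, partition-1 and partition-4 start on JVM-4; after `nodeDown("JVM-4")` they move to JVM-3 and JVM-2 respectively, while the other partitions stay where they were.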

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow