Question

What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?

Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:

A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)

or

B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.
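
For concreteness, a minimal sketch of the per-machine step in option B might look like this (`processDocument` here is just a placeholder for the real CPU-bound work):

    // Minimal sketch of option B's per-machine step: process a batch of
    // documents with Scala parallel collections.
    // On Scala 2.13+ parallel collections live in the separate
    // scala-parallel-collections module and need this import;
    // on 2.12 and earlier, .par is available without it.
    import scala.collection.parallel.CollectionConverters._

    object ParallelBatch {
      // Placeholder for the real, computationally intensive processing.
      def processDocument(doc: String): String = doc.toUpperCase

      def main(args: Array[String]): Unit = {
        val documents: Vector[String] = Vector.fill(1000)("some document text")

        // .par spreads the map across the machine's cores using the default
        // fork/join pool (roughly one worker per available core).
        val results = documents.par.map(processDocument)

        println(s"Processed ${results.size} documents")
      }
    }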

Solution

The answer depends on the following question: is your Scala code able to fully utilize all of the available cores? If the parts of a document can be processed with little synchronization between them, or there is some other way to parallelize the algorithm without lock contention, then "B" is the way to go. In that case, configure one mapper per node and let that mapper use the cores as best it can.
If the gain from parallelization is not that good, and adding more threads (cores) does not improve performance roughly linearly, then "A" can be the better way. The efficiency of "A" also depends on the amount of RAM available: you will need enough memory for 10 mappers per node.
I suspect the ideal solution lies somewhere in between. So my suggestion is to develop a mapper that takes the number of threads to use as a parameter, then run a few tests, increasing the number of threads per mapper while decreasing the number of mappers per node; a rough sketch of such a mapper follows.
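
A rough sketch of such a tunable mapper, assuming the new (org.apache.hadoop.mapreduce) API, documents arriving one per input line, and a made-up configuration key "docs.threads.per.mapper" for the thread count:

    import java.util.concurrent.ForkJoinPool

    import scala.collection.mutable.ArrayBuffer
    import scala.collection.parallel.CollectionConverters._
    import scala.collection.parallel.ForkJoinTaskSupport

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.Mapper

    class TunableDocMapper extends Mapper[LongWritable, Text, Text, Text] {
      private type Ctx = Mapper[LongWritable, Text, Text, Text]#Context

      // Stand-in for the real CPU-bound computation.
      private def processDocument(doc: String): String = doc.toUpperCase

      // Buffer the split's documents; the heavy work runs once, in cleanup().
      // (This assumes one split's documents fit in the mapper's heap.)
      private val buffer = ArrayBuffer.empty[String]

      override def map(key: LongWritable, value: Text, context: Ctx): Unit =
        buffer += value.toString

      override def cleanup(context: Ctx): Unit = {
        // Thread count comes from a (made-up) job parameter so that runs with
        // 1, 2, 5, 10... threads per mapper can be compared.
        val threads = context.getConfiguration.getInt("docs.threads.per.mapper", 1)

        val docs = buffer.toVector.par
        docs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threads))

        // Compute in parallel, then write results single-threaded:
        // a mapper's Context is not safe to share across threads.
        val results = docs.map(d => (d, processDocument(d))).seq
        results.foreach { case (doc, out) =>
          context.write(new Text(doc.take(32)), new Text(out))
        }
      }
    }

The thread count can then be varied between runs (e.g. with the generic -D option if the job goes through ToolRunner) while the number of concurrent mappers per node is lowered accordingly.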

OTHER TIPS

Hadoop is going to offer a lot more than just parallelization. It offers a platform to distribute work, a scheduler for handling concurrent jobs, a distributed filesystem, the ability to perform a distributed reduce, and fault tolerance. That said, it is a complicated system and can sometimes be difficult to work with.

If you plan to have multiple users submitting many different jobs, Hadoop is the way to go (out of the two options). However, if you are devoting a cluster to always be processing documents through the same function, you could, without too much trouble, develop a system with Scala parallel collections and actors for inter-machine communication. The Scala solution would give you more control, the system could respond in real time, and you wouldn't have to deal with a lot of Hadoop configuration that doesn't pertain to your task.
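
For illustration only, a very rough sketch of that kind of setup, assuming Akka for the actor layer (the actor, message, and function names here are made up):

    import akka.actor.{Actor, ActorSystem, Props}
    import scala.collection.parallel.CollectionConverters._

    // Hypothetical messages exchanged between a coordinator and the workers.
    final case class Batch(docs: Seq[String])
    final case class BatchDone(results: Seq[String])

    class DocWorker extends Actor {
      // Placeholder for the real document processing.
      private def processDocument(doc: String): String = doc.toUpperCase

      def receive: Receive = {
        case Batch(docs) =>
          // Use all local cores for this batch, then reply to the sender.
          sender() ! BatchDone(docs.toVector.par.map(processDocument).seq)
      }
    }

    object WorkerMain extends App {
      val system = ActorSystem("doc-processing")
      system.actorOf(Props(new DocWorker), "worker")
    }

A coordinator on one machine would then send Batch messages to remote workers and collect the BatchDone replies; the scheduling, data distribution, and fault tolerance that Hadoop provides out of the box would be yours to build.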

If you need to run varied jobs over large amounts of data (larger than would fit on a single node), then use Hadoop. I can give you more information if you describe your requirements in more detail.

Update: one million is a fairly small number. You might want to do some calculations and see how long it would take on a single machine with parallel collections. The advantage here is that the development time is minimal!
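
Purely as a back-of-envelope illustration: at a hypothetical 100 ms of CPU per document, one million documents on a single 10-core machine with parallel collections is roughly 1,000,000 × 0.1 s / 10 ≈ 10,000 s, i.e. around three hours.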

Hadoop is not very good at processing a lot of small files; it is much better suited to a small number of very large files. Is there any way you can merge the files before processing them, or are they all totally different? Hadoop takes care of distribution and parallelism itself, so there is no need to explicitly send X docs to Y machines. I also don't think you should use Hadoop only as a distribution mechanism; that is not what it's made for. You should either use a real map/reduce, or build your own system for whatever you are trying to do, but not try to bend Hadoop to your will.
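
If merging is an option, one common workaround is to pack the small documents into a single Hadoop SequenceFile before running the job. A minimal sketch, assuming the documents are plain-text files in a hypothetical local directory named input-docs:

    import java.nio.file.{Files, Paths}
    import scala.jdk.CollectionConverters._

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{SequenceFile, Text}

    object PackDocs extends App {
      val conf = new Configuration()

      // One SequenceFile holding (file name -> file contents) pairs.
      val writer = SequenceFile.createWriter(
        conf,
        SequenceFile.Writer.file(new Path("docs.seq")),
        SequenceFile.Writer.keyClass(classOf[Text]),
        SequenceFile.Writer.valueClass(classOf[Text])
      )

      try {
        // "input-docs" is a made-up directory of small text documents.
        Files.list(Paths.get("input-docs")).iterator().asScala.foreach { p =>
          writer.append(new Text(p.getFileName.toString), new Text(Files.readString(p)))
        }
      } finally writer.close()
    }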

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow