Question

I'd like to find a good, robust MapReduce framework that can be used from Scala.


Solution

To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.

Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html

SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html

Update (5 Oct 2011):

There is also the Scoobi framework, which is impressively expressive.

OTHER TIPS

Hadoop (http://hadoop.apache.org/) is language-agnostic.

Personally, I've become a big fan of Spark:

http://spark-project.org/

It gives you in-memory cluster computing, which significantly reduces the overhead you would otherwise incur from disk-intensive MapReduce operations.
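For a rough sense of what that looks like, here is a minimal word-count sketch against Spark's RDD API; the master URL and HDFS paths are placeholders, and the exact setup details vary by Spark version.

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkWordCount {
      def main(args: Array[String]): Unit = {
        // "local[2]" and the paths below are placeholders for illustration.
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
        val sc   = new SparkContext(conf)

        // Load the input into an RDD and cache it in memory, so repeated
        // passes over it avoid re-reading from disk.
        val lines = sc.textFile("hdfs:///path/to/input.txt").cache()

        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///path/to/output")
        sc.stop()
      }
    }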

You may be interested in scouchdb, a Scala interface to CouchDB.

Another idea is to use GridGain. ScalaDudes have an example of using GridGain with Scala. And here is another example.

A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.

For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also an effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.

There is also a new Scala wrapper for Cascading from Twitter, called Scalding. After looking very briefly over the documentation for Scalding, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading works on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors, e.g. joining on a key as a String in one place and as a Long in another, lead to run-time failures.
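To make the type-safety point concrete, here is a rough sketch in the style of Scalding's fields-based word-count example (the input/output arguments are assumptions). Fields are addressed by Symbols such as 'word rather than by static types, so a mismatched join key only shows up when the job actually runs:

    import com.twitter.scalding._

    // Fields-based Scalding job: 'line and 'word are run-time field names,
    // not compile-time types, so a type error on a join key would only
    // surface as a run-time failure.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }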

To further jshen's point:

Hadoop Streaming simply uses standard Unix streams: your code (in any language) just has to read from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure it as the combiner as well).
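As a sketch of what that looks like in Scala, a streaming mapper can be an ordinary program that reads lines from stdin and emits tab-separated key/value pairs on stdout (the word-count logic is just for illustration):

    import scala.io.Source

    // Word-count mapper for Hadoop Streaming: read lines from stdin and
    // emit one "word<TAB>1" record per word on stdout. A matching reducer
    // would read these records back, grouped by key, and sum the counts.
    object StreamingMapper {
      def main(args: Array[String]): Unit = {
        for {
          line <- Source.stdin.getLines()
          word <- line.split("\\s+") if word.nonEmpty
        } println(s"$word\t1")
      }
    }

You would then point the Hadoop Streaming jar at the compiled program via its -mapper and -reducer options, along with -input and -output paths.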
