Question

I am looking for ways to send work to multiple computers on my university's computer grid.

It currently runs Condor and also offers Hadoop.

My question is: should I interface R with Hadoop or with Condor for my projects?

For the sake of discussion, let's assume we are talking about embarrassingly parallel tasks.

P.S.: I've seen the resources described in the CRAN task views.


Solution

You can do both.

You can use HDFS for your data sets and Condor for your job scheduling: Condor places the executors on machines, while HDFS and Hadoop's MapReduce features process your data (assuming your problem maps onto the MapReduce model). That way you're using the most appropriate tool for each job. Condor is a job scheduler, and as such does that work better than Hadoop; HDFS and the MapReduce framework are things Condor doesn't have, but they are very useful for jobs running on Condor.
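On the Condor side, a submit description for a batch of R tasks might look like the sketch below. Everything here is an assumption about your setup: the script name analysis.R, the memory request, the output file names, and the count of 100 tasks are all placeholders you'd replace with your own values. Each queued instance gets its process index as an argument so it can work on its own slice of an embarrassingly parallel problem.

# r_jobs.sub -- hypothetical HTCondor submit description for a batch of R tasks
universe                = vanilla
executable              = /usr/bin/Rscript
arguments               = analysis.R $(Process)
transfer_input_files    = analysis.R
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
request_memory          = 1GB
output                  = out.$(Process).txt
error                   = err.$(Process).txt
log                     = jobs.log
queue 100

You would submit this with condor_submit r_jobs.sub; Condor then worries about finding machines, restarting evicted jobs, and bringing the per-task output files back.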

Personally, I would look at using HDFS to share data among jobs that run discretely as Condor jobs. Especially in a university environment, where shared compute resources are not 100% reliable and can come and go at will, Condor's resilience in this kind of setup is going to make getting work done a whole lot easier.
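As a rough sketch of what such a worker could look like in R, the script below pulls a shared input from HDFS, processes its own chunk, and pushes the result back. It assumes the hadoop client is on the PATH of the execute machine and that the HDFS paths /data/input.csv and /data/results/ exist; it uses the standard hadoop fs -get/-put commands rather than an R-specific HDFS package, and the chunking scheme is just one way to split the work.

# analysis.R -- hypothetical worker script launched by the Condor submit file above
args    <- commandArgs(trailingOnly = TRUE)
task_id <- as.integer(args[1])            # $(Process) index passed in by Condor

# Pull the shared input from HDFS into this job's local scratch directory
# (the path /data/input.csv is an assumption -- replace with your own dataset)
system("hadoop fs -get /data/input.csv input.csv")
dat <- read.csv("input.csv")

# Each task handles its own block of rows -- embarrassingly parallel by row chunks
rows   <- seq_len(nrow(dat))
chunk  <- split(rows, cut(rows, 100, labels = FALSE))[[task_id + 1]]
result <- summary(dat[chunk, ])

# Save the per-task result locally (Condor transfers it back on exit)
# and optionally push it to HDFS so downstream jobs can pick it up
outfile <- sprintf("result_%03d.rds", task_id)
saveRDS(result, outfile)
system(sprintf("hadoop fs -put -f %s /data/results/", outfile))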

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow