Question

This is a somewhat generic question, for which I apologize, but I can't produce a code example that reproduces the behavior. My question is this: I'm scoring a largish data set (~11 million rows, 274 dimensions) by subdividing it into a list of data frames and then running a scoring function on 16 cores of a 24-core Linux server using mclapply. Each data frame in the list is allocated to a spawned instance and scored, returning a list of data frames of predictions. While mclapply is running, the various R instances spend more time in uninterruptible sleep than they spend running. Has anyone else experienced this with mclapply? I'm a Linux neophyte; from an OS perspective, does this make any sense? Thanks.

Was it helpful?

Solution

You need to be careful when using mclapply to operate on large data sets. It's easy to create too many workers for the amount of memory on your computer and the amount of memory used by your computation. The memory requirements are hard to predict due to the complexity of R's memory management, so it's best to monitor usage carefully with a tool such as `top` or `htop`.
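One way to put that advice into practice is to cap the worker count based on an estimate of per-worker memory. The sketch below is a rough heuristic, not a rule: the chunk, the 3x copy factor, and the free-memory figure are all assumptions you would replace with values observed on your own system.

```r
library(parallel)

# Stand-in for one chunk of the real data (an assumption for illustration)
chunk <- data.frame(matrix(runif(1e5 * 10), ncol = 10))

# Assume scoring may make ~3x copies of the chunk (a guess; measure with top/htop)
per_worker_bytes <- as.numeric(object.size(chunk)) * 3

# Free memory in bytes; read the real value from free(1) or htop (8 GB assumed here)
free_bytes <- 8e9

# Cap mc.cores so the forked workers should fit in RAM with headroom
workers <- max(1, min(detectCores() - 1,
                      floor(free_bytes / per_worker_bytes)))
```

Even with a cap like this, keep `htop` open during a run, since copy-on-write pages shared at fork time can become private copies as workers modify data.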

You may be able to decrease memory usage by splitting your work into more but smaller tasks, since that can reduce the memory needed by each computation. I don't think the choice of `mc.preschedule` affects memory usage much, since mclapply will never fork more than `mc.cores` workers at a time, regardless of the value of `mc.preschedule`.
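A minimal sketch of the more-but-smaller-tasks approach, assuming a toy data frame and a placeholder scoring function (the real data and model are not shown in the question):

```r
library(parallel)

# Placeholder for the real scoring function (an assumption)
score_chunk <- function(df) {
  transform(df, pred = x * 0.5)
}

big_df <- data.frame(x = runif(1e6))

# Split into many small chunks rather than one chunk per core
chunks <- split(big_df, cut(seq_len(nrow(big_df)), 64, labels = FALSE))

# With mc.preschedule = FALSE, each task gets a fresh fork, but still
# at most mc.cores workers (and their chunks) are resident at once
preds <- mclapply(chunks, score_chunk,
                  mc.cores = 8, mc.preschedule = FALSE)

result <- do.call(rbind, preds)
```

With 64 chunks and 8 cores, each worker holds roughly 1/64 of the data instead of 1/8, which can substantially lower peak memory at the cost of some extra forking overhead.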

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow