Question

I have an R script with the following source code:

genofile <- read.table("D_G.txt", header = TRUE, sep = ",")    # load the whole file into memory
genofile <- genofile[genofile$"GC_SCORE" > 0.15, ]             # keep rows with GC_SCORE above 0.15
cat(unique(as.vector(genofile[, 2])), file = "GF_uniqueIDs.txt", sep = "\n")  # write the unique IDs from column 2

D_G.txt is a huge file, about 5 GBytes.

Now, the computation is performed on a Microsoft HPC cluster, so when I submit the job it gets split across different physical nodes; in my case each node has 4 GB of RAM.

Well, after a variable amount of time, I get the infamous "cannot allocate vector of size xxx Mb" error message. I've tried to use a switch which limits the usable memory:

--max-memory=1GB

but nothing changed.

I've tried Rscript 2.15.0, both 32-bit and 64-bit, with no luck.

Solution

The fact that your dataset as such should fit in the memory of one node does not mean that it also fits in memory while an analysis is being performed on it. Analyses often cause copying of the data, and inefficient programming on your side can increase memory usage further. Setting the switch and limiting the memory use of R only makes things worse: it does not reduce the memory the analysis actually needs, it only caps the maximum R is allowed to use, so R just fails sooner. And using a 32-bit OS is always a bad idea memory-wise, as the maximum memory that a single process can address on a 32-bit OS is less than 4 GB.
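
One common source of avoidable memory use with read.table is parsing every column of a wide file when only a few are needed. As an illustration only (the actual layout of D_G.txt is not known here, so the column count and positions below are hypothetical), unneeded columns can be dropped at read time with colClasses:

classes <- rep("NULL", 10)                   # hypothetical: assume D_G.txt has 10 columns
classes[2] <- "character"                    # the ID column (column 2 in the original script)
classes[5] <- "numeric"                      # hypothetical position of GC_SCORE
genofile <- read.table("D_G.txt", header = TRUE, sep = ",",
                       colClasses = classes, stringsAsFactors = FALSE)
# only the two retained columns are ever held in memory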

Without more details it is hard to help you any further with this problem. In general I would recommend cutting the dataset up into smaller and smaller pieces until you succeed; a sketch of such chunked processing is given below. I assume that your problem is embarrassingly parallel, so cutting the dataset up further does not change the output.
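
A minimal sketch of that chunked approach, assuming D_G.txt is comma-separated with a header line, that the IDs are in column 2, and that GC_SCORE is the filter column (as in your script), so that only one chunk is held in memory at a time:

col_names <- names(read.table("D_G.txt", header = TRUE, sep = ",", nrows = 1))

con <- file("D_G.txt", open = "r")
invisible(readLines(con, n = 1))             # skip the header line

chunk_size <- 1e6                            # rows per chunk; tune to the node's RAM
ids <- character(0)

repeat {
  chunk <- tryCatch(
    read.table(con, header = FALSE, sep = ",", nrows = chunk_size,
               col.names = col_names, stringsAsFactors = FALSE),
    error = function(e) NULL)                # read.table errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break

  keep <- chunk[chunk$GC_SCORE > 0.15, ]     # same filter as the original script
  ids  <- unique(c(ids, as.vector(keep[, 2])))

  if (nrow(chunk) < chunk_size) break        # last, partial chunk
}
close(con)

cat(ids, file = "GF_uniqueIDs.txt", sep = "\n")

Because the running result is only the vector of unique IDs, each chunk can be discarded as soon as it has been filtered, which keeps the peak memory use at roughly one chunk plus the IDs seen so far.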

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow