Question

I've recently started using R for data analysis. Now I've got a problem ranking a big query dataset (~1 GB as ASCII, more than my laptop's 4 GB of RAM in binary form). Using bigmemory::big.matrix for this dataset is a nice solution, but passing such a matrix 'm' to the gbm() or randomForest() algorithms causes the error:

cannot coerce class 'structure("big.matrix", package = "bigmemory")' into a data.frame

class(m) outputs the following:

[1] "big.matrix"
attr(,"package")
[1] "bigmemory"

Is there a way to correctly pass a big.matrix instance into these algorithms?


Solution

I obviously can't test this with data at your scale, but I can reproduce your errors by using the formula interface of each function:

require(bigmemory)
#Toy stand-in for the real data: a 1000 x 5 matrix of 0/1 integers
m <- matrix(sample(0:1, 5000, replace = TRUE), 1000, 5)
colnames(m) <- paste("V", 1:5, sep = "")

bm <- as.big.matrix(m, type = "integer")

require(gbm)
require(randomForest)

#Throws the error you describe
rs <- randomForest(V1 ~ ., data = bm)
#Runs without error (with a warning about the response only having two values)
rs <- randomForest(x = bm[,-1], y = bm[,1])

#Throws the error you describe
rs <- gbm(V1 ~ ., data = bm)
#Runs without error
rs <- gbm.fit(x = bm[,-1], y = bm[,1])

Not using the formula interface for randomForest is fairly common advice for large data sets: the formula machinery can be quite inefficient. If you read ?gbm, you'll see a similar recommendation steering you towards gbm.fit for large data as well.
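
For a larger run you will probably also want to set gbm.fit's main tuning parameters explicitly rather than rely on the defaults. A minimal sketch, continuing the toy data above; the distribution choice and tuning values here are illustrative assumptions, not recommendations:

#Hedged sketch: a more explicit gbm.fit call on the toy data above.
#Tuning values are placeholders, not recommendations.
rs <- gbm.fit(x = bm[,-1],
              y = bm[,1],
              distribution = "bernoulli", #0/1 response in this toy example
              n.trees = 100,
              interaction.depth = 2,
              shrinkage = 0.01,
              verbose = FALSE)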

Other tips

Numeric objects often occupy more memory than they do disk space: each "double" element in a vector or matrix takes 8 bytes. And when you coerce an object to a data.frame, it may need to be copied in RAM. You should avoid functions and data structures outside those supported by the bigmemory/big*** suite of packages. "biglm" is available, but I doubt that you can expect gbm() or randomForest() to recognize and use the facilities of the "big" family.
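
Two quick illustrations of those points. object.size() shows the 8-bytes-per-double rule, and, if a linear model is enough for your problem, the biganalytics package from the bigmemory family can fit a biglm on a big.matrix without a data.frame copy (assuming I recall its biglm.big.matrix wrapper correctly; the call below continues the toy 'bm' object from the answer):

#Each double costs 8 bytes: a million doubles is ~8 MB in RAM
#(plus a small fixed header)
object.size(numeric(1e6))

#Hedged sketch: a linear model fit directly on a big.matrix,
#assuming the biglm.big.matrix wrapper from biganalytics
require(biganalytics)
fit <- biglm.big.matrix(V1 ~ V2 + V3, data = bm)
summary(fit)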
