Question

I tried several times to apply the pmml function from package pmml to a random forest model ('model.rf') created by package randomForest:

> library(randomForest)
> dim(data)
[1]  32000 76
> model.rf <- randomForest(x=data[,2:76],y=data[,1],type='regression',ntree=150)
> library(pmml)
> model.rf.pmml<-pmml(model.rf)

Each time it took several hours on my Windows 8 system (i7-4500U, 8 GB RAM) until R crashed.

The model is quite large. The .RData file (containing only the model) is approx. 10 MB on disk, and:

> model.rf$forest$nrnodes
[1] 5819

Is the crash due to insufficient memory? I noticed that the R process occupied virtually all of the available memory before crashing. If so, what kind of system would be required to convert my model to PMML?
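For reference, a minimal sketch (base R only) of how the in-memory footprint of the model can be checked before attempting the conversion:

# Sketch: report the in-memory size of the fitted forest
format(object.size(model.rf), units = "MB")

# Trigger a garbage collection and report current memory usage
gc()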

Also, from the iris example it seems the size on disk increases by a factor of ~15, since XML is not a compressed format, unlike R data files:

> library(randomForest)
> library(pmml)
> iris.rf <- randomForest(Species ~ ., data=iris, ntree=20)
> save(iris.rf,file='iris.rf.RData')
> iris.rf.pmml<-pmml(iris.rf)
> saveXML(iris.rf.pmml,file='iris.rf.xml')

iris.rf.RData --> 4 KB
iris.rf.xml   --> 59 KB

Is this factor constant? Will the PMML version of my model be ~150 MB on disk?
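The factor from the iris example can be computed directly from the files on disk (a small sketch, assuming the two files created above are in the working directory):

# Sketch: compare on-disk sizes of the compressed .RData file and the XML file
rdata.size <- file.info("iris.rf.RData")$size
xml.size   <- file.info("iris.rf.xml")$size
xml.size / rdata.size   # roughly the ~15x factor noted above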


Solution

Unfortunately, the R pmml package does have both memory and speed limitations. When I released the present version, I did not realize how big "big data" could be! I should add that Windows is not very good at memory efficiency: there have been many models I could not export on a Windows machine, but I was able to produce the exact same model faster and with better memory usage on a Linux or Mac computer.

I have been working on improvements on both fronts for the next release. For now, as a reference point: for an RF model with 500 trees, applied to a dataset with 50 variables and 50,000 rows (~18 MB), creating the PMML model took 5 hours on a Linux machine; the average number of nodes per tree was 4,000. A general rule of thumb is that the memory used to save a pmml object is ~2.5x that of the R object, as you found, and the memory used just to save the object as an XML file is a major factor. In the present (not yet released) state of the package, the same conversion took 1 hour 15 minutes instead of 5 hours.

The numbers above are for a Linux machine; I expect them to be more than double on a Windows machine. Please consider using a non-Windows machine for the analysis of large datasets; I am sure this applies to most R packages, not just pmml!
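If you want to see where the time goes on your own model, a small sketch like the following (using base R's system.time together with the same pmml/saveXML calls from the question) times the in-memory conversion and the on-disk serialization separately:

library(randomForest)
library(pmml)
library(XML)

# Time the in-memory conversion to a PMML object
t.convert <- system.time(model.rf.pmml <- pmml(model.rf))

# Time the serialization of that object to an XML file on disk
t.save <- system.time(saveXML(model.rf.pmml, file = "model.rf.pmml"))

t.convert
t.save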

OTHER TIPS

You could use the r2pmml package when working with large random forest models. This package relies on the Java PMML class model and XML libraries and, as a result, is about a thousand times faster than the standard pmml package. Performance is the same whether you run it on Windows or *NIX. All things considered, your model should be exportable in a couple of seconds.

I have used the r2pmml library to export a 5 GB random forest PMML file in about one minute on my laptop. The trick is to give the JVM enough heap space so that it doesn't need to do much garbage collection:

# Set JVM heap limits before any rJava-based package (such as r2pmml) is loaded
options(java.parameters = c("-Xms8G", "-Xmx16G"))

library("randomForest")
library("r2pmml")

model.rf <- randomForest(x = data[, 2:76], y = data[, 1], type = 'regression', ntree = 150)

# Convert the fitted forest directly to a PMML file on disk
r2pmml(model.rf, "/tmp/rf.pmml")