Question

I have researched this extensively without finding a solution. I have cleaned my data set as follows:

library("raster")
impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x) , 
mean(x, na.rm = TRUE))
losses <- apply(losses, 2, impute.mean)
colSums(is.na(losses))
isinf <- function(x) (NA <- is.infinite(x))
infout <- apply(losses, 2, is.infinite)
colSums(infout)
isnan <- function(x) (NA <- is.nan(x))
nanout <- apply(losses, 2, is.nan)
colSums(nanout)

The problem arises running the predict algorithm:

options(warn=2)
p  <-   predict(default.rf, losses, type="prob", inf.rm = TRUE, na.rm=TRUE, nan.rm=TRUE)

All the research says it should be NA's or Inf's or NaN's in the data but I don't find any. I am making the data and the randomForest summary available for sleuthing at [deleted] Traceback doesn't reveal much (to me anyway):

4: .C("classForest", mdim = as.integer(mdim), ntest = as.integer(ntest), 
       nclass = as.integer(object$forest$nclass), maxcat = as.integer(maxcat), 
       nrnodes = as.integer(nrnodes), jbt = as.integer(ntree), xts = as.double(x), 
       xbestsplit = as.double(object$forest$xbestsplit), pid = object$forest$pid, 
       cutoff = as.double(cutoff), countts = as.double(countts), 
       treemap = as.integer(aperm(object$forest$treemap, c(2, 1, 
           3))), nodestatus = as.integer(object$forest$nodestatus), 
       cat = as.integer(object$forest$ncat), nodepred = as.integer(object$forest$nodepred), 
       treepred = as.integer(treepred), jet = as.integer(numeric(ntest)), 
       bestvar = as.integer(object$forest$bestvar), nodexts = as.integer(nodexts), 
       ndbigtree = as.integer(object$forest$ndbigtree), predict.all = as.integer(predict.all), 
       prox = as.integer(proximity), proxmatrix = as.double(proxmatrix), 
       nodes = as.integer(nodes), DUP = FALSE, PACKAGE = "randomForest")
3: predict.randomForest(default.rf, losses, type = "prob", inf.rm = TRUE, 
       na.rm = TRUE, nan.rm = TRUE)
2: predict(default.rf, losses, type = "prob", inf.rm = TRUE, na.rm = TRUE, 
       nan.rm = TRUE)
1: predict(default.rf, losses, type = "prob", inf.rm = TRUE, na.rm = TRUE, 
       nan.rm = TRUE)
Was it helpful?

Solution

Your code is not entirely reproducible (there's no running of the actual randomForest algorithm) but you are not replacing Inf values with the means of column vectors. This is because the na.rm = TRUE argument in the call to mean() within your impute.mean function does exactly what it says -- removes NA values (and not Inf ones).

You can see this, for example, by:

impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x), mean(x, na.rm = TRUE))
losses <- apply(losses, 2, impute.mean)
sum( apply( losses, 2, function(.) sum(is.infinite(.))) )
# [1] 696

To get rid of infinite values, use:

impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x), mean(x[!is.na(x) & !is.nan(x) & !is.infinite(x)]))
losses <- apply(losses, 2, impute.mean)
sum(apply( losses, 2, function(.) sum(is.infinite(.)) ))
# [1] 0

OTHER TIPS

One cause of the error message:

NA/NaN/Inf in foreign function call (arg X)

When training a randomForest is having character-class variables in your data.frame. If it comes with the warning:

NAs introduced by coercion

Check to make sure that all of your character variables have been converted to factors.

Example

set.seed(1)
dat <- data.frame(
  a = runif(100),
  b = rpois(100, 10),
  c = rep(c("a","b"), 100),
  stringsAsFactors = FALSE
)

library(randomForest)
randomForest(a ~ ., data = dat)

Yields:

Error in randomForest.default(m, y, ...) : NA/NaN/Inf in foreign function call (arg 1) In addition: Warning message: In data.matrix(x) : NAs introduced by coercion

But switch it to stringsAsFactors = TRUE and it runs.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top