Question

I am running a Bayesian multiple regression in R (a numerical response variable modeled on 3 explanatory factor variables) with the MCMCregress function from the MCMCpack package.

Unfortunately, the error "Error in eval(expr, envir, enclos) : NA/NaN/Inf in foreign function call (arg 17)" is thrown with my dataset.

Here's example code which reproduces my regression task and the error:

require(MCMCpack)

# Function to create a reproducible example
set.seed(0)
example.dataframe <- function(size) {
  y  <- runif(size, 1, 25)
  x1 <- paste(letters[runif(size, min = 1, max = 25)])
  x2 <- paste(letters[runif(size, min = 1, max = 25)])
  x3 <- paste(letters[runif(size, min = 1, max = 25)])
  data.frame(y, x1 = as.factor(x1), x2 = as.factor(x2), x3 = as.factor(x3))
}

### Bayesian linear regression with small dataset
df <- example.dataframe(10)
model <- MCMCregress(y ~ x1 + x2 + x3 - 1, data = df)
# Fails !
# Error in eval(expr, envir, enclos) :
# NA/NaN/Inf in foreign function call (arg 17)

When the data frame is bigger, the error is not thrown:

### Bayesian linear regression with bigger dataset
df <- example.dataframe(100)
model <- MCMCregress(y ~ x1 + x2 + x3 - 1, data = df)
# Works !

summary(model)
# Iterations = 1001:11000
# Thinning interval = 1 
# Number of chains = 1 
# Sample size per chain = 10000 
# 
# 1. Empirical mean and standard deviation for each variable,
# plus standard error of the mean:
#   
#             Mean     SD Naive SE Time-series SE
# x1a      5.13964  7.823  0.07823        0.07520
# x1b     14.05264  7.289  0.07289        0.07289
# ...

I looked into the package's CRAN documentation but did not find a clear hint about the error and its cause.

Any suggestions why the error is thrown in the first case and not in the second?


Solution

The basic problem is that with the smaller data set you don't have enough information to estimate all the parameters in the model (that is, you have no residual degrees of freedom left). If you run a classical linear regression, you'll see that the R-squared of your model with the smaller data set is 1. In other words, 100% of the variation in the outcome around its mean is accounted for by the regression model.
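You can see the imbalance directly by counting the columns of the design matrix the formula expands to: with three dummy-coded factors, there are more parameters than the 10 observations can support. A sketch that re-creates the question's example.dataframe:

```r
set.seed(0)
example.dataframe <- function(size) {
  y  <- runif(size, 1, 25)
  x1 <- paste(letters[runif(size, min = 1, max = 25)])
  x2 <- paste(letters[runif(size, min = 1, max = 25)])
  x3 <- paste(letters[runif(size, min = 1, max = 25)])
  data.frame(y, x1 = as.factor(x1), x2 = as.factor(x2), x3 = as.factor(x3))
}

df <- example.dataframe(10)
# model.matrix() expands the formula the same way lm() would
X <- model.matrix(y ~ x1 + x2 + x3 - 1, data = df)
nrow(X) # 10 observations
ncol(X) # more dummy-coded columns (parameters) than observations
```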

Just to be clear, this problem has nothing to do with MCMCregress. Here's your smaller data set using R's classical linear regression function lm, which exhibits the same underlying problem (lm doesn't throw an error, but it reports NA for the coefficients it cannot estimate and a perfect fit):

# data set
set.seed(0)
example.dataframe <- function(size) {
  y  <- runif(size, 1, 25)
  x1 <- paste(letters[runif(size, min = 1, max = 25)])
  x2 <- paste(letters[runif(size, min = 1, max = 25)])
  x3 <- paste(letters[runif(size, min = 1, max = 25)])
  data.frame(y, x1 = as.factor(x1), x2 = as.factor(x2), x3 = as.factor(x3))
}

# classical linear regression with small data set
df <- example.dataframe(10)
model <- lm(y ~ x1 + x2 + x3 - 1, data = df)
summary(model)
# notice the R-squared is 1 and the coefficients that cannot be
# estimated are reported as NA

So what's the solution? Either use the full data set or decrease the number of parameters estimated (that is, don't use as many inputs on the right-hand side of the equation). Both of these procedures will increase the degrees of freedom in your model.
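As a quick check on the classical side, df.residual() reports how many residual degrees of freedom remain after a fit, and both fixes move it away from zero. A sketch re-using the question's example.dataframe:

```r
set.seed(0)
example.dataframe <- function(size) {
  y  <- runif(size, 1, 25)
  x1 <- paste(letters[runif(size, min = 1, max = 25)])
  x2 <- paste(letters[runif(size, min = 1, max = 25)])
  x3 <- paste(letters[runif(size, min = 1, max = 25)])
  data.frame(y, x1 = as.factor(x1), x2 = as.factor(x2), x3 = as.factor(x3))
}

small <- example.dataframe(10)
big   <- example.dataframe(100)

df.residual(lm(y ~ x1 + x2 + x3 - 1, data = small)) # 0 on this seed: nothing left over
df.residual(lm(y ~ x1, data = small))               # positive: fewer parameters
df.residual(lm(y ~ x1 + x2 + x3 - 1, data = big))   # positive: more data
```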

Here's an error-free example using either approach:

# (1) solution 1: fewer parameters estimated
df <- example.dataframe(10)
model <- MCMCregress(y ~ x1, data = df)

# (2) solution 2: more data used
df <- example.dataframe(100)
model <- MCMCregress(y ~ x1 + x2 + x3 - 1, data = df)

For more information you might want to read up on the concept of degrees of freedom from statistics.

Update: There's also another solution of sorts. You can combine variables on the right-hand side of the equation into a smaller set using a dimension-reduction technique such as factor analysis. Here's a crude example:

# (3) solution 3: dimension reduction (e.g., factor analysis)
require(psych) # for the "fa" function
df <- example.dataframe(10)
# convert the factors to numeric codes so they can be fed to fa()
df$x1 <- as.numeric(df$x1)
df$x2 <- as.numeric(df$x2)
df$x3 <- as.numeric(df$x3)
fa.fit <- fa(df[, 2:4], rotate = "varimax") # extracts one factor by default
model <- lm(y ~ fa.fit$scores)

Ultimately, trying to estimate more parameters than you have data points is like turning water into wine or straw into gold -- it's not possible. Your only hope is to estimate fewer parameters, acquire more data, or realize that some of your variables are in fact proxies for one another (or combine them into a smaller set of latent variables).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow