Question

I am running a Bayesian multiple regression in R (a numerical response variable modeled on 3 explanatory factor variables) with the MCMCregress function from the MCMCpack package.

Unfortunately, the error "Error in eval(expr, envir, enclos) : NA/NaN/Inf in foreign function call (arg 17)" is thrown with my dataset.

Here's example code which reproduces my regression task and the error:

require(MCMCpack)

# Function to create a reproducible example
set.seed(0)
example.dataframe <- function(size) {
  y  <- runif(size, 1, 25)
  x1 <- paste(letters[runif(size, min = 1, max = 25)])
  x2 <- paste(letters[runif(size, min = 1, max = 25)])
  x3 <- paste(letters[runif(size, min = 1, max = 25)])
  data.frame(y, x1 = as.factor(x1), x2 = as.factor(x2), x3 = as.factor(x3))
}

### Bayesian linear regression with small dataset
df <- example.dataframe(10)
model <- MCMCregress(y ~ x1 + x2 + x3 - 1, data = df)
# Fails !
# Error in eval(expr, envir, enclos) :
# NA/NaN/Inf in foreign function call (arg 17)

When the data frame is bigger, the error is not thrown:

### Bayesian linear regression with bigger dataset
df <- example.dataframe(100)
model <- MCMCregress(y ~ x1 + x2 + x3 - 1, data = df)
# Works !

summary(model)
# Iterations = 1001:11000
# Thinning interval = 1 
# Number of chains = 1 
# Sample size per chain = 10000 
# 
# 1. Empirical mean and standard deviation for each variable,
# plus standard error of the mean:
#   
#             Mean     SD Naive SE Time-series SE
# x1a      5.13964  7.823  0.07823        0.07520
# x1b     14.05264  7.289  0.07289        0.07289
# ...

I looked into the package's CRAN documentation but did not find a clear hint about the error and its cause.

Any suggestions why the error is thrown in the first case and not in the second?


Solution

The basic problem is that with the smaller data set you don't have enough information to estimate all the parameters in the model (that is, you have no residual degrees of freedom left). If you run a classical linear regression, you'll see that the R-squared of your model with the smaller data set is 1. In other words, 100% of the variation in the outcome around its mean is accounted for by the regression model.
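You can see the imbalance directly by counting the columns of the design matrix the formula expands to: with three dummy-coded factors, there are more parameters than the 10 observations can support. A sketch that re-creates the question's example.dataframe:

```r
set.seed(0)
example.dataframe <- function(size) {
  y  <- runif(size, 1, 25)
  x1 <- paste(letters[runif(size, min = 1, max = 25)])
  x2 <- paste(letters[runif(size, min = 1, max = 25)])
  x3 <- paste(letters[runif(size, min = 1, max = 25)])
  data.frame(y, x1 = as.factor(x1), x2 = as.factor(x2), x3 = as.factor(x3))
}

df <- example.dataframe(10)
# model.matrix() expands the formula the same way lm() would
X <- model.matrix(y ~ x1 + x2 + x3 - 1, data = df)
nrow(X) # 10 observations
ncol(X) # more dummy-coded columns (parameters) than observations
```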

Just to be clear, this problem has nothing to do with MCMCregress. Here's your smaller data set using R's classical linear regression function lm, which exhibits the same underlying problem (lm doesn't throw an error, but it reports NA for the coefficients it cannot estimate and a perfect fit):

# data set
set.seed(0)
example.dataframe <- function(size) {
  y  <- runif(size, 1, 25)
  x1 <- paste(letters[runif(size, min = 1, max = 25)])
  x2 <- paste(letters[runif(size, min = 1, max = 25)])
  x3 <- paste(letters[runif(size, min = 1, max = 25)])
  data.frame(y, x1 = as.factor(x1), x2 = as.factor(x2), x3 = as.factor(x3))
}

# classical linear regression with small data set
df <- example.dataframe(10)
model <- lm(y ~ x1 + x2 + x3 - 1, data = df)
summary(model)
# notice the R-squared is 1 and the coefficients that cannot be
# estimated are reported as NA

So what's the solution? Either use the full data set or decrease the number of parameters estimated (that is, don't use as many inputs on the right-hand side of the equation). Both of these procedures will increase the degrees of freedom in your model.
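As a quick check on the classical side, df.residual() reports how many residual degrees of freedom remain after a fit, and both fixes move it away from zero. A sketch re-using the question's example.dataframe:

```r
set.seed(0)
example.dataframe <- function(size) {
  y  <- runif(size, 1, 25)
  x1 <- paste(letters[runif(size, min = 1, max = 25)])
  x2 <- paste(letters[runif(size, min = 1, max = 25)])
  x3 <- paste(letters[runif(size, min = 1, max = 25)])
  data.frame(y, x1 = as.factor(x1), x2 = as.factor(x2), x3 = as.factor(x3))
}

small <- example.dataframe(10)
big   <- example.dataframe(100)

df.residual(lm(y ~ x1 + x2 + x3 - 1, data = small)) # 0 on this seed: nothing left over
df.residual(lm(y ~ x1, data = small))               # positive: fewer parameters
df.residual(lm(y ~ x1 + x2 + x3 - 1, data = big))   # positive: more data
```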

Here's an error-free example using either approach:

# (1) solution 1: fewer parameters estimated
df <- example.dataframe(10)
model <- MCMCregress(y ~ x1, data = df)

# (2) solution 2: more data used
df <- example.dataframe(100)
model <- MCMCregress(y ~ x1 + x2 + x3 - 1, data = df)

For more information you might want to read up on the concept of degrees of freedom from statistics.

Update: There's also another solution of sorts. You can combine variables on the right-hand side of the equation into a smaller set using a dimension-reduction technique such as factor analysis. Here's a crude example:

# (3) solution 3: dimension reduction (e.g., factor analysis)
require(psych) # for the "fa" function
df <- example.dataframe(10)
# convert the factors to numeric codes so they can be fed to fa()
df$x1 <- as.numeric(df$x1)
df$x2 <- as.numeric(df$x2)
df$x3 <- as.numeric(df$x3)
fa.fit <- fa(df[, 2:4], rotate = "varimax") # extracts one factor by default
model <- lm(y ~ fa.fit$scores)

Ultimately, trying to estimate more parameters than you have data points is like turning water into wine or straw into gold -- it's not possible. Your only hope is to estimate fewer parameters, acquire more data, or realize that some of your variables are in fact proxies for one another (or combine them into a smaller set of latent variables).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow