Question

A data frame which has invalid characters in the column names is causing an error in rlm().

Taking a deeper look, it appears that within rlm() the variable xvars contains the names of the formula's explanatory variables, but with backticks around the offending names. When xvars is then used to index a data frame, namely mf[xvars], it causes the following error:

Error in `[.data.frame`(mf, xvars) : undefined columns selected

Is this the expected behavior? (I realize the key phrase here is "invalid characters".) Curiously, calling lm() on the same model and data frame causes no problems.

library(MASS)  # provides rlm()

# SAMPLE DATA
mydf <- data.frame(matrix(rnorm(36),ncol=6))
colnames(mydf) <- c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2")

rlm(y~., data=mydf)  # Error

lm(y~., data=mydf)   # No Problem

# Clean up column names
colnames(mydf) <- make.names(colnames(mydf))
rlm(y~., data=mydf) # No Problem 

Taking a look at MASS:::rlm.formula, it appears the error is
caused by mf[xvars] in the following lines:

xlev <- if (length(xvars) > 0L) {
    xlev <- lapply(mf[xvars], levels)
    xlev[!sapply(xlev, is.null)]
}

Any thoughts why the backticks are being added but then causing an error?


Additional Info

I copied the rlm() function, added dput(mf) and dput(xvars), and got the following values. Note that the value of xvars differs from the names assigned above (i.e., backticks are added), whereas the names of mf are the same as the names given above.

# dput yielded
mf <- structure(list(y = c(-0.242914027018629, 0.724255425682537, -0.0578467214604185, -0.274193999595702, -0.38985000750839, 0.406046200943395), x1 = c(1.53071709960635, -1.87493297716611, 1.0936519723035, -0.977011182431237, -0.510890461021046, 1.20136627562427), x2 = c(-0.801995963036553, 1.30590232081605, 0.635922235436178, -1.86824341731708, -2.76797814532917, -0.497992681627495), `x1^2` = c(0.914146279518207, 0.103458073891876, -1.29818230391818, -0.629048606358592, 1.71534374557621, 0.922690967521984), `x2^2` = c(-0.0879726513660469, 1.05299413769867, 1.01955640371072, 0.546413685721721, 0.947757793667223, -0.0998700630220064), `x1:x2` = c(-0.757490494166813, 1.31307393014016, 1.90233916482184, 0.68844011701049, -1.28717997826724, -0.581800325341162)), .Names = c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2"), terms = y ~     x1 + x2 + `x1^2` + `x2^2` + `x1:x2`, row.names = c(NA, 6L), class = "data.frame")
xvars <- c("x1", "x2", "`x1^2`", "`x2^2`", "`x1:x2`")

mf[xvars]  
# Error in `[.data.frame`(mf, xvars) : undefined columns selected


# Removing the backticks from xvars eliminates the error.
xvars2 <- gsub("`", "", xvars, fixed = TRUE)
mf[xvars2]  # No Error

Solution

Your issue boils down to the fact that you are using non-syntactic variable names.

These should be used with caution, and without expectation that package authors will be able to anticipate any issues that may arise.

To quote from the help for formula:

Variable names can be quoted by backticks like this in formulae, although there is no guarantee that all code using formulae will accept such non-syntactic names.

The issue lies in how xvars is created in rlm.formula:

xvars <- as.character(attr(mt, "variables"))[-1L]

and then in how it is used later on:

xlev <- if (length(xvars) > 0L) {
        xlev <- lapply(mf[xvars], levels)
        xlev[!sapply(xlev, is.null)]
    }

which, as you show, does not work.

The as.character() call that creates xvars deparses each variable, so non-syntactic names come out wrapped in backticks. If the names are already backticked, they end up double-backticked.

i.e. if the column name was "x1^2", the element in xvars becomes "`x1^2`".

This fails with [.data.frame, for example:

x <- data.frame(a = 1)
x[, '`a`']

Error in `[.data.frame`(x, , "`a`") : undefined columns selected

because the column name is "a", not "`a`".

If you backtick the column name itself, i.e. if the column name was "`x1^2`", the element in xvars becomes "``x1^2``", which again is not a column in your data.frame.
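As a rough illustration of the mismatch, here is a minimal sketch using a made-up data frame (the data and the names bare and stripped are assumptions for this example, not part of MASS). It compares the deparsed variable names with the bare symbol names that all.vars() returns:

```r
# Hypothetical data; names are assigned afterwards so they stay non-syntactic.
set.seed(1)
mf <- data.frame(matrix(rnorm(40), ncol = 4))
colnames(mf) <- c("y", "x1", "x1^2", "x1:x2")

mt <- terms(y ~ ., data = mf)

# The expression quoted above from rlm.formula: deparsed variable names.
xvars <- as.character(attr(mt, "variables"))[-1L]
print(xvars)

# A backticked string is not a column name, so indexing with it cannot match.
"`x1^2`" %in% names(mf)   # FALSE
"x1^2" %in% names(mf)     # TRUE

# Two ways to recover names that do match the columns:
bare     <- all.vars(mt)                        # bare symbol names
stripped <- gsub("`", "", xvars, fixed = TRUE)  # strip the backticks
mf[stripped]                                    # no error
```

Stripping backticks with gsub() is only a workaround for inspecting the problem; it does not change what rlm() does internally.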

The reason lm works is that it does not define and use xvars in this way; instead, it uses model.matrix to build the design matrix x directly and passes it to lm.fit.
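To see that the model.matrix route copes with these names, a small sketch along the lines of the question's sample data (20 rows instead of 6, and a fixed seed, are assumptions made here so the fit is not saturated):

```r
# Hypothetical data modelled on the question's example.
set.seed(1)
mydf <- data.frame(matrix(rnorm(120), ncol = 6))
colnames(mydf) <- c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2")

# The design matrix is built from the model frame, with no lookup of
# backticked strings in the data frame.
X <- model.matrix(y ~ ., data = mydf)
dim(X)   # rows x (intercept + 5 predictors)

fit <- lm(y ~ ., data = mydf)
coef(fit)
```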

If you want to fit the model y ~ x1 + x2 + x1:x2 + x1^2 + x2^2, then you can use

rlm(y ~ x1*x2 + I(x1^2) + I(x2^2), data = mydf)

In this case you only need three columns in your data.frame (or objects in your evaluation environment): y, x1 and x2. The I() function lets you perform arithmetic on a variable inside a formula: terms.formula treats an I() call as a symbol, so operators such as ^ inside it keep their arithmetic meaning.
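A minimal sketch of that approach, assuming the MASS package is installed and using simulated data (the data frame d and its coefficients are invented for illustration):

```r
library(MASS)  # provides rlm()

# Hypothetical data with only the three syntactic columns y, x1 and x2.
set.seed(42)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x1^2 + rnorm(50, sd = 0.3)

# Squared terms and the interaction are built inside the formula,
# so no non-syntactic column names are needed.
fit <- rlm(y ~ x1 * x2 + I(x1^2) + I(x2^2), data = d)
names(coef(fit))
```

The model frame columns are then named after the deparsed terms, such as "I(x1^2)", which is also what xvars contains, so the lookup inside rlm.formula matches.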

Licensed under: CC-BY-SA with attribution