How to retrieve a list of the original variable names from a GLM call in R?

https://stackoverflow.com/questions/21115226

28-09-2022

Question

When using the glm function in R, one can use functions like addNA or log inside the formula argument. Let's say we have a data frame Data with four columns: Class and var1, which are factors, and var2 and var3, which are numeric. We fit:

Model <- glm(data    = Data,
             formula = Class ~ addNA(var1) + var2 + log(var3),
             family  = binomial)

In the glm output, var1 now appears as addNA(var1) (e.g. in Model$xlevels), while var3 appears as log(var3).

Is it possible to retrieve a list from the glm output that indicates that var1, var2 and var3 were extracted from the dataframe, without addNA(var1) or log(var3) appearing in the variable names?

More generally, is it possible, after the call to glm has been made, to infer which columns glm extracted from the input data frame before any transformations, cross terms, etc. were generated inside the function?

Solution

This works:

all.vars(formula(Model)[-2])
## [1] "var1" "var2" "var3"

The [-2] indexing removes the response variable from the formula. Note, however, that the internally stored model frame holds the transformed variables rather than the original ones:

names(model.frame(Model))
## [1] "Class"       "addNA(var1)" "var2"        "log(var3)"
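For anyone who wants to try this without the original data, here is a minimal reproducible sketch. The data frame is synthetic (its values are invented purely for illustration), and the object names predictors and frame_names are not from the question.

```r
# Synthetic stand-in for the asker's data frame; values are made up.
set.seed(1)
n <- 200
Data <- data.frame(
  Class = factor(sample(c("a", "b"), n, replace = TRUE)),
  var1  = factor(sample(c("x", "y", NA), n, replace = TRUE)),
  var2  = rnorm(n),
  var3  = runif(n, min = 1, max = 10)
)

Model <- glm(Class ~ addNA(var1) + var2 + log(var3),
             data = Data, family = binomial)

# Predictor columns as named in the data frame (response dropped by [-2])
predictors <- all.vars(formula(Model)[-2])
predictors
## [1] "var1" "var2" "var3"

# Columns of the stored model frame keep the transformed labels
frame_names <- names(model.frame(Model))
frame_names
## [1] "Class"       "addNA(var1)" "var2"        "log(var3)"

# The recovered names can be used to index the original data directly
all(predictors %in% names(Data))
## [1] TRUE
```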

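If you want the raw names including the response, then all.vars(getCall(Model)$formula) should work, provided the formula was written inline in the call to glm. If the formula was passed as a variable, the stored call only holds that variable's name, and all.vars(formula(Model)) is the safer choice.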
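The difference between reading the formula from the stored call and reading it from the fitted object can be seen in a small sketch. The data are synthetic again, and my_formula, inline_fit and stored_fit are hypothetical names chosen for this example.

```r
set.seed(2)
Data <- data.frame(
  Class = factor(sample(c("a", "b"), 100, replace = TRUE)),
  var2  = rnorm(100),
  var3  = runif(100, min = 1, max = 10)
)

# Case 1: formula written inline, so the call stores the formula itself
inline_fit  <- glm(Class ~ var2 + log(var3), data = Data, family = binomial)
inline_vars <- all.vars(getCall(inline_fit)$formula)
inline_vars
## [1] "Class" "var2"  "var3"

# Case 2: formula held in a variable, so the call stores only the symbol
my_formula <- Class ~ var2 + log(var3)
stored_fit <- glm(my_formula, data = Data, family = binomial)
stored_call_vars <- all.vars(getCall(stored_fit)$formula)
stored_call_vars
## [1] "my_formula"

# formula() rebuilds the formula from the model's terms in both cases
stored_vars <- all.vars(formula(stored_fit))
stored_vars
## [1] "Class" "var2"  "var3"
```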

OTHER TIPS

The object returned by glm is a list that includes call, formula, and terms components, and you should be able to extract whichever parts you want from those. If you really want just the source names (which are fairly obvious from the returned terms), run a gsub on the term names to remove everything up to the "(" and to remove the trailing ")".
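A rough sketch of that string-stripping idea follows. The labels are typed in by hand here, on the assumption that they match what attr(terms(Model), "term.labels") would return for the model in the question, and strip_wrappers is a hypothetical helper, not part of R. The pattern only handles a single wrapping function with one argument, so the last lines show parsing the label with all.vars as a possible fallback.

```r
# Labels assumed to match attr(terms(Model), "term.labels")
term_labels <- c("addNA(var1)", "var2", "log(var3)")

# Hypothetical helper: drop everything up to the last "(" and a trailing ")"
strip_wrappers <- function(x) sub("\\)$", "", sub("^.*\\(", "", x))

stripped <- strip_wrappers(term_labels)
stripped
## [1] "var1" "var2" "var3"

# Extra arguments inside the call are left behind by the pattern
tricky <- strip_wrappers("poly(var2, 2)")
tricky
## [1] "var2, 2"

# Parsing the label and collecting its variables avoids that problem
parsed <- all.vars(parse(text = "poly(var2, 2)"))
parsed
## [1] "var2"
```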

Licensed under: CC-BY-SA with attribution
