I can confirm Dan Brown's answer: the error seems to be caused by having factors in the data. I wrote the following code to turn factors into dummy variables. It is not especially pretty, but it does the job.
library("foreach")
# Helper function; call factor_to_dummy() below instead of using this directly.
# Takes a column name (pointing to a factor variable) and a dataset;
# returns a data frame containing a 1-in-K coding for this factor variable.
col_to_dummy <- function(colname, data) {
  levelnames <- levels(data[[colname]])
  # dummy has K columns, where K is the number of levels of the factor in colname;
  # column i is 1 where the observation has level i and 0 otherwise
  dummy <- foreach(i = seq_along(levelnames), .combine = cbind) %do% {
    as.numeric(as.numeric(data[[colname]]) == i)
  }
  dummy <- as.data.frame(dummy)
  names(dummy) <- paste0(colname, ":", levelnames)
  dummy
}
factor_to_dummy <- function(obsdata) {
  # find the columns containing a factor variable
  col_factor <- unlist(lapply(obsdata, is.factor))
  # if there are none, there is nothing to do
  if (!any(col_factor)) {
    return(obsdata)
  }
  # otherwise, convert each factor column to dummy variables using col_to_dummy;
  # each resulting data frame is column-bound to the dataset without its factors
  # (drop = FALSE keeps a data frame even if only one non-factor column remains)
  foreach(colname = names(which(col_factor)), .combine = cbind,
          .init = obsdata[, !col_factor, drop = FALSE]) %do% {
    col_to_dummy(colname, obsdata)
  }
}
Some solutions out there use model.matrix, but be aware that by default model.matrix uses a reference level (absorbed into the intercept) and then a 1-of-(K-1) coding scheme for all factors. You will need to tinker with the contrasts.arg argument to get a full 1-of-K coding.
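For comparison, here is a minimal sketch of what that tinkering could look like. It assumes a toy data frame (df, with columns x and color, both made up for illustration) and passes identity contrasts for every factor through contrasts.arg while dropping the intercept. Treat it as one possible way to get 1-of-K columns out of model.matrix, not the only one.

```r
# Toy data, purely illustrative
df <- data.frame(x = c(1.5, 2.0, 3.5),
                 color = factor(c("red", "green", "red")))

# Default behaviour: an intercept column plus K-1 columns per factor
mm_default <- model.matrix(~ ., data = df)

# Full 1-of-K coding: request non-contrast (identity) matrices for every
# factor column and remove the intercept with "- 1"
is_fac <- vapply(df, is.factor, logical(1))
full_contrasts <- lapply(df[is_fac], contrasts, contrasts = FALSE)
mm_full <- model.matrix(~ . - 1, data = df, contrasts.arg = full_contrasts)

colnames(mm_default)
colnames(mm_full)
```

Note that model.matrix returns a numeric matrix rather than a data frame, and it names the columns by pasting the variable name and the level together (without the ":" separator used above).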
This code is easy to use. Once the function definitions have been run, you can simply do:
df_with_dummy_vars <- factor_to_dummy(original_df)
All factor columns will be converted to dummy variables.
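To make the result concrete without needing foreach installed, here is a rough base-R sketch of the same idea applied to a toy data frame. The name factor_to_dummy_base and the example columns (height, eye) are hypothetical and only for illustration; this is a variant written for this example, not the code above.

```r
# Hypothetical base-R variant of factor_to_dummy(); illustration only
factor_to_dummy_base <- function(obsdata) {
  is_fac <- vapply(obsdata, is.factor, logical(1))
  if (!any(is_fac)) {
    return(obsdata)
  }
  dummies <- lapply(names(obsdata)[is_fac], function(colname) {
    lv <- levels(obsdata[[colname]])
    # n x K matrix: entry is 1 where the observation has that level, else 0
    d <- as.data.frame(outer(as.character(obsdata[[colname]]), lv, "==") * 1)
    names(d) <- paste0(colname, ":", lv)
    d
  })
  # bind the non-factor columns and all dummy blocks side by side
  do.call(cbind, c(list(obsdata[, !is_fac, drop = FALSE]), dummies))
}

# Toy data, purely illustrative
original_df <- data.frame(height = c(1.7, 1.8, 1.6),
                          eye = factor(c("blue", "brown", "blue")))
df_with_dummy_vars <- factor_to_dummy_base(original_df)
names(df_with_dummy_vars)
# "height" "eye:blue" "eye:brown"
```

Each level gets its own 0/1 column named column:level, and the non-factor columns are carried over unchanged.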