Question

I am running an ordinary logistic regression using the caret package in R. I have a binary response variable coded 1 or 0, called SALE_FLAG, and 140 predictor variables, which I transformed to dummy variables using the dummyVars function.

data <- dummyVars(~., data = data_2, fullRank=TRUE,sep="_",levelsOnly = FALSE )
dummies<-(predict(data, data_2))
model_data<- as.data.frame(dummies)

This gives me a data frame to work with. All of the variables are numeric. Next I split into training and testing:

trainIndex <- createDataPartition(model_data$SALE_FLAG, p = .80,list = FALSE)
train <- model_data[ trainIndex,]
test  <- model_data[-trainIndex,]

Time to train my model using the train function:

model <- train(SALE_FLAG ~ ., data = train, method = "glm")

Everything runs fine and I get a model. But when I run the predict function, it does not give me what I need:

predict(model, newdata =test,type="prob")

and I get an error:

Error in dimnames(out)[[2]] <- modelFit$obsLevels : 
  length of 'dimnames' [2] not equal to array extent

On the other hand, when I replace "prob" with "raw" for type in the predict function, I get predictions, but I need probabilities so that I can convert them into a binary variable using my threshold.

I am not sure why this happens. I did the same thing without the caret package and it worked as it should:

model2 <- glm(SALE_FLAG ~ ., family = binomial(logit), data = train)
predict(model2, newdata =test, type="response")

I spent some time looking at this, but I am not sure what is going on, and it seems very weird to me. I have tried many variations of the train function: instead of the formula interface, I passed x and y. I also tried method = 'bayesglm' as a check, and it gave me the same error. I hope someone can help me out. I don't strictly need the train function to get what I need, but caret is a good package with lots of tools and I would like to figure this out.

Solution

Show us str(train) and str(test). I suspect the outcome variable is numeric, which makes train think that you are doing regression. That should also be apparent from printing model. Make it a factor if you want to do classification.

Max
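
For readers who want to try the suggestion above, here is a minimal sketch of what it might look like. It is not the asker's data: the toy data frame and the names toy, train_df, test_df, model_num and model_fac are made up for illustration, and the 0.5 threshold is only a placeholder. The level labels "no" and "yes" are likewise an arbitrary choice; they are used because factor levels that are valid R names tend to cause fewer problems with caret's class probabilities.

```r
library(caret)

# Hypothetical toy data standing in for the asker's model_data
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$SALE_FLAG <- rbinom(n, 1, plogis(0.8 * toy$x1 - 0.5 * toy$x2))

ctrl <- trainControl(method = "cv", number = 5)

# With a numeric 0/1 outcome, train() treats the problem as regression
# (caret also emits a warning about a two-valued numeric outcome here)
model_num <- train(SALE_FLAG ~ ., data = toy, method = "glm", trControl = ctrl)
print(model_num$modelType)

# Convert the outcome to a factor so train() does classification
toy$SALE_FLAG <- factor(toy$SALE_FLAG, levels = c(0, 1), labels = c("no", "yes"))

idx <- createDataPartition(toy$SALE_FLAG, p = 0.80, list = FALSE)
train_df <- toy[idx, ]
test_df  <- toy[-idx, ]

model_fac <- train(SALE_FLAG ~ ., data = train_df, method = "glm", trControl = ctrl)
print(model_fac$modelType)

# type = "prob" now returns one column of probabilities per factor level
probs <- predict(model_fac, newdata = test_df, type = "prob")

# Apply a threshold (0.5 is only an example value) to the "yes" column
pred <- ifelse(probs$yes > 0.5, 1, 0)
head(probs)
```

Printing the fitted object, or inspecting its modelType element as done here, is one way to check whether train treated the outcome as regression or classification before calling predict.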

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow