Question

I am using the Titanic dataset from Kaggle and want to learn a simple logistic regression model.

I read in the train and test data, and train$Survived, train$Sex, test$Survived and test$Sex are all factors.

I would like to perform a very simple logistic regression with Sex being the only independent variable.

fit <- glm(formula = Survived ~ Sex, family = binomial)

It seems to look okay to me:

> fit

Call:  glm(formula = Survived ~ Sex, family = binomial)

Coefficients:
(Intercept)      Sexmale  
      1.057       -2.514  

Degrees of Freedom: 890 Total (i.e. Null);  889 Residual
Null Deviance:      1187 
Residual Deviance: 917.8    AIC: 921.8

Problem is, I am unable to apply this learned model to the test data. When I do the following:

predict(fit, train$Sex)

I get a vector with 891 values, which is the number of examples in the training set.

I can't seem to find any information on how to do this right.

Any help would be greatly appreciated!


Solution

I'm posting an answer to correct a couple of points that seem to have gotten confused. There really is no single predict function as such; that is what the help page means when it says "predict" is a "generic function". Generic functions sometimes have a fun.default method, but predict has no default method, so dispatch is entirely on the basis of the class of the first argument. Each method has its own help page, and the help page for "predict" lists several of them. Package authors need to write their own predict methods for new classes.
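As a rough illustration of that dispatch, the sketch below fits the same kind of model on a small synthetic data frame and looks up the method that predict() would call. The data are a made-up stand-in for the Titanic training set, and passing data = train is an assumption (the call in the question does not show a data argument).

```r
# Synthetic stand-in for the Titanic training data (NOT the real dataset)
set.seed(1)
train <- data.frame(
  Sex = factor(sample(c("female", "male"), 200, replace = TRUE))
)
# Made-up survival probabilities, chosen only so the two groups differ
train$Survived <- factor(
  rbinom(200, 1, ifelse(train$Sex == "female", 0.74, 0.19))
)

fit <- glm(Survived ~ Sex, data = train, family = binomial)

# The class vector of the first argument drives S3 dispatch
class(fit)  # "glm" "lm"

# Look up the method registered for class "glm"
glm_method <- getS3method("predict", "glm")
is.function(glm_method)

# List the predict methods visible in the current session
methods(predict)
```

Because class(fit) starts with "glm", a call to predict(fit, ...) ends up in predict.glm, whose help page (?predict.glm) documents the newdata and type arguments.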

Logistic regression predates the machine learning paradigm, so expecting it to "predict classes" is somewhat unrealistic. Even the fact that you can get a "response" prediction is a gift compared with what the software would have provided 30 years ago, when some of us were taking our regression classes. One needs to understand that predicted probabilities are generally not 0 or 1 but something in between. If the user wants to set a threshold and determine how many cases exceed it, that is an analyst's decision, and the analyst needs to make whatever transformation to categories they deem worthwhile.
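A minimal sketch of that analyst-side step, assuming a synthetic stand-in for the data and an arbitrary cutoff of 0.5 (both are illustrative choices, not something the model supplies):

```r
# Synthetic stand-in for the Titanic data (NOT the real dataset)
set.seed(1)
train <- data.frame(
  Sex = factor(sample(c("female", "male"), 200, replace = TRUE))
)
train$Survived <- factor(
  rbinom(200, 1, ifelse(train$Sex == "female", 0.74, 0.19))
)
test <- data.frame(
  Sex = factor(c("female", "male", "male", "female"),
               levels = levels(train$Sex))
)

fit <- glm(Survived ~ Sex, data = train, family = binomial)

# type = "response" returns probabilities rather than log-odds
prob <- predict(fit, newdata = test, type = "response")

# The cutoff is the analyst's choice; 0.5 is only an example
threshold <- 0.5
pred_class <- factor(ifelse(prob > threshold, "1", "0"),
                     levels = levels(train$Survived))

# How many test cases exceed the threshold
sum(prob > threshold)
```

A different cutoff may be more appropriate when the costs of false positives and false negatives differ.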

Executing predict(fit, train$Sex) would be expected to give a result as long as the training set, so I'm guessing that you meant to try predict(fit, test$Sex) and were disappointed. If that's the case, it should have been predict(fit, list(Sex = test$Sex)). R needs the newdata argument to be a value that can be coerced to a data frame, so a named list of values is the minimum requirement for the predictors.
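The corrected call can be sketched as follows. The data frames here are synthetic stand-ins (not the Kaggle files), and fitting with data = train is an assumption made to keep the example self-contained.

```r
# Synthetic stand-ins for the Titanic train and test sets (NOT the real data)
set.seed(1)
train <- data.frame(
  Sex = factor(sample(c("female", "male"), 200, replace = TRUE))
)
train$Survived <- factor(
  rbinom(200, 1, ifelse(train$Sex == "female", 0.74, 0.19))
)
test <- data.frame(
  Sex = factor(sample(c("female", "male"), 50, replace = TRUE),
               levels = levels(train$Sex))
)

fit <- glm(Survived ~ Sex, data = train, family = binomial)

# newdata as a named list: the name must match the variable in the formula
pred_list <- predict(fit, list(Sex = test$Sex))

# newdata as a data frame gives the same predictions
pred_df <- predict(fit, newdata = test)

# One prediction per test row, not per training row
length(pred_list) == nrow(test)

# The default scale is the linear predictor (log-odds);
# ask for type = "response" to get probabilities instead
pred_prob <- predict(fit, newdata = test, type = "response")
```

Passing the whole test data frame is usually the simpler option once the model has more than one predictor, since every variable in the formula has to be present under its original name.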

If predict.glm gets a malformed value for its second argument, newdata, it falls back on the original data and uses the linear predictors that are retained in the model object.

Licensed under: CC-BY-SA with attribution