Two-class model with predicted scores needed - classification or regression approach

https://datascience.stackexchange.com/questions/73910

11-12-2020

Question

In my problem, step one is to build a model that classifies cases as True or False (1 or 0 would work equally well). Once the optimum model is found, step two is to retrieve the predicted probabilities and use them to work out the optimum threshold (based on an accuracy measure) for classifying my data from then on.

I know that for step two I need a classification accuracy measure, e.g. the F1 score, to work out the optimum threshold. However, I am debating the best approach for step one. My initial thought was to build a two-class classification model. The model I'm using is a radial SVM implemented in R. My issue is how to get predicted scores from caret once the optimum model is found; at the moment I set classProbs=T (full code below). I don't know if this is ideal, as I read somewhere (I can't find the link now, sorry!) that to do this caret actually runs a second model, which doesn't seem like a great idea to me. The only information on this in the package documentation seems to be:

classProbs a logical; should class probabilities be computed for classification models (along with predicted values) in each resample?

I have also searched SO for another way to get predicted values using caret but have had no success.

My other idea was to use a regression model for step one, with 0 and 1 as the only y values, and then perhaps implement a custom error function that allows for predicted values outside the range 0 to 1. My thought here is that I don't necessarily want to penalise a model if the prediction is above 1 and the actual value is 1.

I'm not sure which approach is better, or whether there is a third way I'm not thinking of, so any help or suggestions would be great! I think the first approach is probably better, provided the class probabilities are not calculated by a separate model. If anyone has found further documentation on what this setting actually does, that would also be great!

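library(caret)

# x_train, y_train, seeds and m.c.c (a custom summary function) are defined elsewhere.
# With classProbs = TRUE, y_train must be a factor whose levels are valid R names.
svm_tests <- train(x = x_train,
                   y = y_train,
                   method = "svmRadial",
                   scale = FALSE,
                   tuneGrid = expand.grid(C = 0.1, sigma = 0.05),
                   trControl = trainControl(method = "repeatedcv",  # 5-fold CV, repeated 5 times with different seeds
                                            repeats = 5,
                                            number = 5,
                                            seeds = seeds,
                                            summaryFunction = m.c.c,
                                            classProbs = TRUE))  # needed to get probabilities from predict()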


scores_svm <-  predict(svm_tests, 
                       x_valid,
                       type = "prob")

Answer

This is a classification problem, not a regression problem, so your first instinct was correct. Most models give you a probability estimate for each prediction with little effort: with tree-based methods or a linear model, the probabilities come out of the fit directly. An SVM is different. Because of how it works, it outputs a decision value (which side of the separating boundary a case falls on, and how far from it), not a probability. So it makes sense that caret has to do some extra computation to turn those decision values into probabilities.
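To see that extra computation in isolation, here is a minimal sketch that fits a radial SVM with kernlab, the package that, to my understanding, backs caret's "svmRadial" method. The toy data and the names x, y, fit and probs are made up for illustration. The assumption is that setting prob.model = TRUE makes ksvm fit an additional sigmoid on the SVM decision values, which is the "second model" the question refers to; the SVM itself is not retrained differently.

```r
library(kernlab)

# Toy two-class data, for illustration only (not the asker's data)
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 2), ncol = 2))
y <- factor(rep(c("neg", "pos"), each = 50))

# Radial-kernel SVM; prob.model = TRUE asks kernlab to also fit a
# probability model on top of the decision values
fit <- ksvm(x, y,
            type = "C-svc",
            kernel = "rbfdot",
            kpar = list(sigma = 0.05),
            C = 0.1,
            prob.model = TRUE)

# One row per case, one column per class level
probs <- predict(fit, x, type = "probabilities")
head(probs)
```

The hyperparameter values simply mirror the ones in the question; they are not tuned for this toy data.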

For the specifics, see: https://stats.stackexchange.com/questions/335527/what-are-the-predicted-probabilities-from-an-svm
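For step two of the question, choosing a cut-off by F1 score, a rough base-R sketch could look like the following. The helper names f1_at_threshold and best_f1_threshold, the threshold grid, and the toy vectors are all hypothetical; in practice prob_pos would be the positive-class column returned by predict(..., type = "prob") on held-out data.

```r
# Hypothetical helper: F1 score for the positive class at one cut-off
f1_at_threshold <- function(prob_pos, actual_pos, threshold) {
  predicted_pos <- prob_pos >= threshold
  tp <- sum(predicted_pos & actual_pos)
  fp <- sum(predicted_pos & !actual_pos)
  fn <- sum(!predicted_pos & actual_pos)
  if (tp == 0) return(0)  # treat F1 as 0 when there are no true positives
  2 * tp / (2 * tp + fp + fn)
}

# Hypothetical helper: evaluate a grid of cut-offs and return the first best one
best_f1_threshold <- function(prob_pos, actual_pos,
                              thresholds = seq(0.05, 0.95, by = 0.05)) {
  scores <- vapply(thresholds,
                   function(t) f1_at_threshold(prob_pos, actual_pos, t),
                   numeric(1))
  best <- which.max(scores)
  list(threshold = thresholds[best], f1 = scores[best])
}

# Toy scores and labels, for illustration only
prob_pos   <- c(0.12, 0.33, 0.42, 0.63, 0.81, 0.92)
actual_pos <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

result <- best_f1_threshold(prob_pos, actual_pos)
print(result)
```

The threshold should be chosen on data that was not used to fit the model (for example the validation set in the question), otherwise the selected cut-off is likely to be optimistic.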

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange