Question

When we do k-fold cross-validation, should we just use the classifier that has the highest test accuracy? What is generally the best approach for obtaining a classifier from cross-validation?


Solution

You do cross-validation when you want to do either of these two things:

  • Model Selection
  • Error Estimation of a Model

Model selection can come in different scenarios:

  • Selecting one algorithm vs others for a particular problem/dataset
  • Selecting hyper-parameters of a particular algorithm for a particular problem/dataset

(Please note that if you are both selecting an algorithm - better to call it a model - and also doing a hyper-parameter search, you need to do nested cross-validation. See: Is Nested-CV really necessary?)
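As a rough illustration (assuming scikit-learn; the estimator, parameter grid, and synthetic data below are placeholders, not anything prescribed by this answer), nested CV wraps a hyper-parameter search inside an outer cross-validation loop:

    # Sketch of nested cross-validation with scikit-learn (assumed library).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)

    # Inner loop: hyper-parameter search for one algorithm.
    inner_search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

    # Outer loop: estimates the error of "SVC + this search procedure" as a whole.
    outer_scores = cross_val_score(inner_search, X, y, cv=5)
    print(outer_scores.mean(), outer_scores.std())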

Cross-validation ensures, to some degree, that the error estimate is as close as possible to the generalization error of that model (although this is very hard to approximate). By observing the average error across folds you get a good projection of the expected error for a model built on the full dataset. It is also important to observe the variance of the estimate, that is, how much the error varies from fold to fold. If the variation is too high (considerably different values), the model will tend to be unstable. Bootstrapping is another method that provides a good approximation in this sense. I suggest reading carefully Section 7 of "The Elements of Statistical Learning", freely available at: ELS-Stanford
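For instance, a small sketch of looking at both the average and the spread of the fold scores (assuming scikit-learn; the classifier and dataset are just examples):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)

    # The mean projects the expected error of a model built on the full dataset;
    # the standard deviation across folds signals how stable the model is.
    print(f"mean accuracy: {scores.mean():.3f}, std across folds: {scores.std():.3f}")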

As has been mentioned before, you must not take the model built on any single fold. Instead, you have to rebuild the model with the full dataset (the one that was split into folds). If you have a separate test set, you can use it to try this final model, obtaining a similar (and most surely higher) error than the one obtained by CV. You should, however, rely on the estimated error given by the CV procedure.
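A minimal sketch of that workflow, assuming scikit-learn (the estimator and dataset are placeholders, and the separate test set is optional):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0)
    cv_scores = cross_val_score(model, X_train, y_train, cv=10)  # error estimate from CV

    final_model = model.fit(X_train, y_train)  # rebuild on the full training data
    print("CV estimate:", cv_scores.mean(), "separate test set:", final_model.score(X_test, y_test))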

After performing CV with different models (algorithm combinations, etc.), choose the one that performed best regarding both the error and its variance across folds. You will then need to rebuild that model with the whole dataset. Here comes a common confusion in terms: we commonly refer to model selection thinking that the model is the ready-to-predict model built on data, but in this case it refers to the combination of algorithm + preprocessing procedures you apply. So, to obtain the actual model you need for making predictions/classification, you have to build it using the winning combination on the whole dataset.
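As a sketch of that selection step (assuming scikit-learn; the candidate combinations and dataset are illustrative), each candidate is an algorithm plus its preprocessing, and the winner is rebuilt on all the data:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    candidates = {
        "scaled logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
        "scaled SVC": make_pipeline(StandardScaler(), SVC()),
    }

    results = {name: cross_val_score(est, X, y, cv=10) for name, est in candidates.items()}
    winner = max(results, key=lambda name: results[name].mean())

    # The selected "model" is the recipe, not any fold's fitted object:
    final_model = candidates[winner].fit(X, y)
    print(winner, np.mean(results[winner]), np.std(results[winner]))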

The last thing to note is that if you are applying any kind of preprocessing that uses the class information (feature selection, LDA dimensionality reduction, etc.), it must be performed inside every fold, not beforehand on the whole dataset. This is a critical aspect. You should do the same if you are applying preprocessing methods that involve direct information from the data (PCA, normalization, standardization, etc.). You can, however, apply preprocessing that does not depend on the data (e.g., deleting a variable following expert opinion, but this is fairly obvious). This video can help you in that direction: CV the right and the wrong way
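One common way to keep such preprocessing inside each fold (assuming scikit-learn; the feature selector, scaler, and classifier are just examples) is to wrap everything in a pipeline that is refit on the training part of every fold:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    pipe = make_pipeline(
        StandardScaler(),              # data-dependent: fitted per fold
        SelectKBest(f_classif, k=10),  # uses class labels: must stay inside the CV loop
        SVC(),
    )

    print(cross_val_score(pipe, X, y, cv=10).mean())

Fitting the scaler or the feature selector on the full dataset before splitting would leak information from the test folds into training, which is exactly the "wrong way" the linked video describes.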

Finally, here is a nice explanation of the subject: CV and model selection

Other tips

No, you don't select any of the k classifiers built during k-fold cross-validation. First of all, the purpose of cross-validation is not to come up with a predictive model, but to evaluate how accurately a predictive model will perform in practice. Second, for the sake of argument, let's say you were to use k-fold cross-validation with k = 10 to find out which one of three different classification algorithms would be the most suitable for solving a given classification problem. In that case, the data is randomly split into k parts of equal size. One of the parts is reserved for testing and the remaining k-1 parts are used for training. The cross-validation process is repeated k (fold) times so that on every iteration a different part is used for testing.

After running the cross-validation you look at the results from each fold and consider which classification algorithm (not any of the trained models!) is the most suitable. You don't want to choose the algorithm that has the highest test accuracy on one of the 10 iterations, because it may just have happened randomly that the test data on that particular iteration contained very easy examples, which then led to high test accuracy. What you want to do is choose the algorithm which produced the best accuracy averaged over all k folds. Now that you have chosen the algorithm, you can train it using your whole training data and start making predictions in the wild.
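A sketch of that procedure (assuming scikit-learn; the three algorithms and the dataset are only placeholders): compare the accuracy averaged over all 10 folds, then train the winning algorithm on the whole training data:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    algorithms = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "random forest": RandomForestClassifier(random_state=0),
        "k-NN": KNeighborsClassifier(),
    }

    # Average over all folds, never the best single-fold score.
    means = {name: cross_val_score(clf, X, y, cv=10).mean() for name, clf in algorithms.items()}
    best = max(means, key=means.get)

    final_model = algorithms[best].fit(X, y)  # train the chosen algorithm on all the training data
    print(best, means[best])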

This is beyond the scope of this question, but you should also optimize the model's hyperparameters (if any) to get the most out of the selected algorithm. People usually perform hyperparameter optimization using cross-validation.
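For example (again assuming scikit-learn; the grid values are arbitrary), a grid search driven by cross-validation might look like this:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
    search.fit(X, y)

    print(search.best_params_, search.best_score_)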

Let us assume you have training data, of which you use 80% for training and the remaining 20% as validation data. We can train on the 80% and test on the remaining 20%, but it is possible that the 20% we took does not resemble the actual test data, and the model might perform badly later. To prevent this, we can use k-fold cross-validation.

If you have different models and want to know which performs best on your dataset, k-fold cross-validation works great. You can look at the validation errors across the k validation runs and choose the better model based on that. This is generally the purpose of k-fold cross-validation.
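A small sketch of the contrast between one fixed 80/20 split and k-fold CV (assuming scikit-learn; the classifier and dataset are arbitrary):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import KFold, cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=0)

    # One fixed 80/20 split: the score depends heavily on which 20% was held out.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    print("single split:", clf.fit(X_tr, y_tr).score(X_val, y_val))

    # 5-fold CV: every example is used for validation exactly once.
    scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
    print("5-fold mean:", scores.mean())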

As for a single model: if you check it with k-fold cross-validation, you can get an approximation of the test error, but when you actually train it for the final time, you can use the complete training data. (It is assumed here that the whole data together will perform better than a part of it. That might not always be the case, but it is the general assumption.)

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange