K fold cross validation reduces accuracy
29-11-2019
Question
I am working on a machine learning classifier, and when I arrive at the moment of dividing my data into a training set and a test set, I want to compare two different approaches. In one approach I just split the dataset into a training set and a test set, while in the other approach I use k-fold cross-validation.
The strange thing is that with cross-validation the accuracy decreases: if I get 0.87 with the first approach, with cross-validation I get 0.86.
Shouldn't cross-validation increase my accuracy? Thanks in advance.
Solution
Chance plays a big role when the data is split. For example, maybe the training set contains a particular combination of features, maybe it doesn't; maybe the test set contains a large proportion of regular, "easy" instances, maybe it doesn't. As a consequence, the performance varies depending on the split.
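To see this concretely, here is a minimal sketch (assuming scikit-learn, a synthetic dataset, and logistic regression, none of which come from the original question) that evaluates the same kind of model on several different random splits; the reported accuracy typically fluctuates from seed to seed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data; the question's actual dataset is unknown.
X, y = make_classification(n_samples=300, random_state=0)

# The same kind of model, evaluated on five different random splits:
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"seed={seed}: accuracy={clf.score(X_te, y_te):.3f}")
```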
Let's imagine that the performance of your classifier varies between 0.80 and 0.90 depending on the split:
In one approach I just split the dataset into a training set and a test set
With this approach you throw the dice only once: maybe you're lucky and the performance will be close to 0.90, or maybe you're not and it will be close to 0.80.
while in the other approach I use k-fold cross-validation.
With this approach you throw the dice $k$ times, and the performance is the average across these $k$ runs. This estimate is more accurate than the previous one because, by averaging over several runs, the performance is more likely to be close to the mean, i.e. the most common case.
Conclusion: k-fold cross-validation isn't meant to increase performance; it's meant to provide a more accurate measure of the performance.
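As an illustrative sketch of those $k$ throws of the dice (again assuming scikit-learn and a synthetic dataset), `cross_val_score` produces one score per fold, which you then average:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same illustrative setup as above; not the question's actual data or model.
X, y = make_classification(n_samples=300, random_state=0)

# k = 5 throws of the dice: one accuracy per fold, then the average.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold:", scores.round(3))
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```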
OTHER TIPS
K-fold cross-validation trains k different models, each tested on the observations that were not used in its training. There is no reason you would get a higher or lower score with cross-validation: you are not using the same model as in your reference case, nor the same test set. The two approaches you describe are different, although I would not recommend using cross-validation alone.
Indeed, consider cross-validation as a way to validate your approach rather than to test the classifier. Typically, cross-validation is used in the following situation: take a large dataset; split it into a train set and a test set, and perform k-fold cross-validation on the train set only. The optimization of your model's hyperparameters is guided by the cross-validation score. Once you have the optimal hyperparameter settings, train your model with those settings on the full train set, and simply compute its accuracy on the test set.
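A minimal sketch of that workflow, assuming scikit-learn, a synthetic dataset, an SVM, and a toy hyperparameter grid (all illustrative choices, not prescribed by the answer):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Illustrative synthetic data standing in for the "large dataset".
X, y = make_classification(n_samples=300, random_state=0)

# 1. Hold out a test set that cross-validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. k-fold cross-validation on the train set guides the hyperparameter search.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)  # refit=True (the default) retrains the best
                              # model on the full train set

# 3. A single, final accuracy on the untouched test set.
print("best C:", search.best_params_["C"])
print("test accuracy: %.3f" % search.score(X_test, y_test))
```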