K fold cross validation reduces accuracy
29-11-2019
Question
I am working on a machine learning classifier, and when I arrive at the moment of dividing my data into a training set and a test set, I want to compare two different approaches. In one approach I just split the dataset into a training set and a test set, while in the other approach I use k-fold cross-validation.
The strange thing is that with cross-validation the accuracy decreases: if I get 0.87 with the first approach, with cross-validation I get 0.86.
Shouldn't cross-validation increase my accuracy? Thanks in advance.
Solution
Chance plays a big role when the data is split. For example, maybe the training set contains a particular combination of features, maybe it doesn't; maybe the test set contains a large proportion of regular, "easy" instances, maybe it doesn't. As a consequence, the performance varies depending on the split.
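To see this concretely, here is a minimal sketch (assuming scikit-learn, a synthetic dataset, and logistic regression, none of which come from the original question) that evaluates the same kind of model on several different random splits; the reported accuracy typically fluctuates from seed to seed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data; the question's actual dataset is unknown.
X, y = make_classification(n_samples=300, random_state=0)

# The same kind of model, evaluated on five different random splits:
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"seed={seed}: accuracy={clf.score(X_te, y_te):.3f}")
```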
Let's imagine that the performance of your classifier varies between 0.80 and 0.90 depending on the split:
In one approach I just split the dataset into a training set and a test set
With this approach you throw the dice only once: maybe you're lucky and the performance will be close to 0.90, or maybe you're not and it will be close to 0.80.
while in the other approach I use k-fold cross-validation.
With this approach you throw the dice $k$ times, and the performance is the average across these $k$ runs. This estimate is more accurate than the previous one because, by averaging over several runs, the performance is more likely to be close to the mean, i.e. the most common case.
Conclusion: k-fold cross-validation isn't meant to increase performance; it's meant to provide a more accurate measure of the performance.
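As an illustrative sketch of those $k$ throws of the dice (again assuming scikit-learn and a synthetic dataset), `cross_val_score` produces one score per fold, which you then average:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same illustrative setup as above; not the question's actual data or model.
X, y = make_classification(n_samples=300, random_state=0)

# k = 5 throws of the dice: one accuracy per fold, then the average.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold:", scores.round(3))
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```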
OTHER TIPS
K-fold cross-validation trains k different models, each tested on the observations that were not used in its training. There is no reason you would get a higher or lower score with cross-validation: you are not using the same model as in your reference case, nor the same test set. The two approaches you describe are different, although I would not recommend using cross-validation alone.
Indeed, consider cross-validation as a way to validate your approach rather than to test the classifier. Typically, cross-validation is used in the following situation: take a large dataset; split it into a train set and a test set, and perform k-fold cross-validation on the train set only. The optimization of your model's hyperparameters is guided by the cross-validation score. Once you have the optimal hyperparameter settings, train your model with those settings on the full train set, and simply compute its accuracy on the test set.
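A minimal sketch of that workflow, assuming scikit-learn, a synthetic dataset, an SVM, and a toy hyperparameter grid (all illustrative choices, not prescribed by the answer):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Illustrative synthetic data standing in for the "large dataset".
X, y = make_classification(n_samples=300, random_state=0)

# 1. Hold out a test set that cross-validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. k-fold cross-validation on the train set guides the hyperparameter search.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)  # refit=True (the default) retrains the best
                              # model on the full train set

# 3. A single, final accuracy on the untouched test set.
print("best C:", search.best_params_["C"])
print("test accuracy: %.3f" % search.score(X_test, y_test))
```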