Problem

I am using Convolutional Neural Networks (CNNs), and I just want to ask whether the way I split my training/validation/testing sets is correct.

I have a total of 55 subjects. I plan to split them roughly 80-10-10: training (44 subjects), validation (5 unseen subjects), and testing (6 unseen subjects).

Should the validation set consist of unseen subjects as well? Or can I shuffle the whole training set and use a part of it (10-20%) as the validation set?

I have read that with N-fold cross-validation, the whole training set (all instances) is shuffled and then split into N folds, and the model is trained N times and the scores averaged. However, in the case of Neural Networks or CNNs, we usually don't use cross-validation since it is very computationally expensive.
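(For reference, a minimal sketch of the shuffled N-fold procedure I'm describing, using scikit-learn; the random data and the toy classifier are only placeholders standing in for my images and CNN.)

```python
# Shuffled N-fold cross-validation: shuffle, split into N folds, train N times,
# average the scores. Placeholder data and a toy classifier keep this runnable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(550, 20)              # placeholder features (not real images)
y = np.random.randint(0, 7, size=550)    # placeholder expression labels

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```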

I'm just wondering which is correct, since with a validation set of unseen subjects my model starts to overfit after 3-5 epochs and doesn't really learn at all. On the other hand, if I use 10-20% of the training set as my validation set, my model learns with reasonable accuracy (45-50%) using a 3-layer CNN, but when tested on the unseen testing set, my top-1 accuracy is only around 15-16%.

Thank you very much.


Solution

NB: This advice assumes your goal is to recognise expressions in pictures of any person, not just the people from your training data.

Should the validation set consist of unseen subjects as well?

Yes. This will give you the most accurate measure of performance on the task you actually want to use the network for, so you can choose the model that generalises best and take it forward to testing.

You would only use a simple random split if the end goal of your trained network is to recognise expressions from images of the people in the training set.
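One way to implement such a subject-wise split is scikit-learn's GroupShuffleSplit, which guarantees that no subject appears on both sides. This is only a sketch with made-up array names (X, y, subject_ids), assuming you can attach a subject ID to every image:

```python
# Sketch: split by subject so validation faces are never seen in training.
# Array names and shapes are illustrative, not taken from the question.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(5500, 64, 64, 1)                 # placeholder images
y = np.random.randint(0, 7, size=5500)              # placeholder labels
subject_ids = np.random.randint(0, 55, size=5500)   # subject ID per image

gss = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, val_idx = next(gss.split(X, y, groups=subject_ids))

# Sanity check: no subject appears in both sets.
assert set(subject_ids[train_idx]).isdisjoint(subject_ids[val_idx])
```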

Or can I shuffle the whole training set and use a part of it (10-20%) as the validation set?

No. If you take random samples where the same face appears in both training and CV, you will get an over-estimate of generalisation performance. I have seen this effect first hand in the Kaggle State Farm Distracted Driver contest, and you should see it discussed in the forums there. Those forum threads might also give you ideas for improving performance.

if I use 10-20% of the training set as my validation set, my model learns with reasonable accuracy (45-50%) using a 3-layer CNN

This is a data leak between training and cross-validation: the network has learnt to correctly classify expressions in images of people it has already seen, and that is what you are measuring when you split the data this way. It is not surprising that the test result does not match the promising values from cross-validation.

I'm just wondering which is correct, since with a validation set of unseen subjects my model starts to overfit after 3-5 epochs and doesn't really learn at all

You are over-fitting regardless of how you split train and CV. When the CV split is done incorrectly, then in addition to over-fitting, the data leak gives you bad guidance.

It looks like you have very little diversity in the training data, whilst training an image classifier from scratch requires a lot of data. Consider:

  • Collecting more labelled data, perhaps some other dataset you can download.

  • Adding a lot of regularisation to your model, e.g. multiple dropout layers (illustrated in the fine-tuning sketch after this list).

  • Using a pre-trained image classifier network (e.g. VGG-19 or Inception) as a starting point and fine-tuning it for your classification task (see the sketch after this list).

  • Using full k-fold cross-validation regardless of the computational cost, to mitigate the problems of a small training set (see the GroupKFold sketch below). This won't help solve your training problems directly, but it will give you a better shot at tuning your network hyper-parameters once you solve that issue.
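To illustrate the regularisation and fine-tuning points, here is a hedged Keras sketch, not a recipe: the input size, head layer sizes and the 7-class output are assumptions, and VGG-19 is just one of several reasonable pre-trained backbones.

```python
# Sketch: frozen VGG-19 backbone plus a small, dropout-regularised head.
# Input shape, layer sizes and class count are assumptions for illustration.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False          # start by training only the new head

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                      # regularisation
    layers.Dense(7, activation="softmax"),    # e.g. 7 expression classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Once the head has converged, unfreeze the top VGG blocks and continue
# training with a much lower learning rate to fine-tune them.
```

Freezing the backbone first and only later unfreezing a few top blocks helps avoid wrecking the pre-trained features with large early gradients.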
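And for the last point: combining k-fold with the subject-wise idea above gives GroupKFold, so every fold validates on subjects that fold never trained on. Again a sketch with placeholder data and a toy classifier standing in for the CNN:

```python
# Sketch: subject-wise k-fold cross-validation with GroupKFold, the grouped
# counterpart of ordinary KFold. Placeholder data keeps this runnable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X = np.random.rand(550, 20)
y = np.random.randint(0, 7, size=550)
subject_ids = np.random.randint(0, 55, size=550)    # subject ID per sample

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=subject_ids, cv=GroupKFold(n_splits=5))
print("per-fold accuracy:", scores, "mean:", scores.mean())
```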

Other tips

A validation set is, by definition, obtained by taking a part of the training dataset. It cannot consist of unseen subjects (I am assuming that by "unseen" you mean subjects whose labels you do not know, or whom you want to check on), because then we would not be able to validate (check the accuracy/precision of our model) in the first place. I think what you are probably doing is taking the validation set from the same part of the training dataset that you also use for training, which leads to overfitting. Let me elaborate:

Let us assume you have 100 documents. Use 80 for training and the remaining 20 for validation. DO NOT use those 20 documents for training; if you do, you will overfit (I think this is probably the explanation for your overfitting). The separate test dataset (unseen documents) can then be used for the final evaluation.
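A minimal sketch of that 80/20 split with scikit-learn's train_test_split (the documents and labels here are made up):

```python
# Sketch of the 80/20 split described above; documents and labels are placeholders.
from sklearn.model_selection import train_test_split

docs = [f"document {i}" for i in range(100)]    # 100 placeholder documents
labels = [i % 2 for i in range(100)]            # placeholder labels

train_docs, val_docs, train_labels, val_labels = train_test_split(
    docs, labels, test_size=0.2, random_state=0)
# Train only on train_docs; the 20 val_docs are held out purely for validation.
```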

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange