Training, Testing and Validation Dataset [closed]

https://datascience.stackexchange.com/questions/85012

14-12-2020

Question

I'm training a U-Net model for tumor segmentation. I have a dataset of 400 patients. The images are CT scans (3D volumes) that I divide into 2D slices (about 30k 2D images in total).

I am currently splitting the dataset into 10% test data, 18% validation data and 72% training data. I split the test and training data by patient (i.e. the patients used for testing are not the same as the ones used for training). Afterwards, I shuffle the 2D images and split them into training and validation datasets (i.e. the same patient can appear in both the training and the validation dataset, but not with the same slices of the stack).

I have two questions:

Should I split the train/validation dataset by patient too?
Are the train/validation/test percentages appropriate for my problem?

Solution

Generally, the exact numbers (percentages) do not matter.

What matters is that your split (train/validation/test) does two things: it represents the real-world situation, and it lets you check that the model generalises, given that it is evaluated on the held-out sets.

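So what does that mean here exactly? You have 30k images and 400 patients. Scans from different patients will most likely differ from each other, so you should also split the train/validation data by patient. With a slice-level shuffle, neighbouring slices of the same scan can land in both sets, which tends to make the validation score look better than it really is. Splitting by patient checks that the model generalises to slightly different distributions of images.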
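A patient-level split can be sketched as follows. This is only a minimal illustration, not the asker's pipeline: the function names, the 75 slices per patient and the integer patient IDs are assumptions made up for the example, and the 72/18/10 fractions are simply the ones from the question.

```python
import random

def split_by_patient(patient_ids, fractions=(0.72, 0.18, 0.10), seed=0):
    """Assign each unique patient to exactly one of train/val/test."""
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    return {
        "train": set(unique[:n_train]),
        "val": set(unique[n_train:n_train + n_val]),
        "test": set(unique[n_train + n_val:]),   # whatever is left over
    }

def assign_slices(slice_patient_ids, groups):
    """For each split, collect the indices of slices whose patient is in it."""
    indices = {name: [] for name in groups}
    for i, pid in enumerate(slice_patient_ids):
        for name, members in groups.items():
            if pid in members:
                indices[name].append(i)
                break
    return indices

# Hypothetical data: 400 patients with 75 slices each (30,000 slices).
slice_patient_ids = [pid for pid in range(400) for _ in range(75)]
groups = split_by_patient(slice_patient_ids)
indices = assign_slices(slice_patient_ids, groups)
print({name: len(idx) for name, idx in indices.items()})
```

Because patients, not slices, are shuffled, no patient contributes images to more than one set. If scikit-learn is available, `GroupShuffleSplit` or `GroupKFold` with the patient ID as the group does the same job.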

As for the percentages: you need to make sure that what ends up in the train, validation and test sets represents your problem. This can mean splitting by patient, splitting by some other feature, checking the distribution of the data, and so on. What it does not mean is that having, say, 12% of the data in one set is enough on its own to make the split sound.

What does that mean? Let's say you have 1000 rows of data. You split them 90%/10%, so the holdout has 100 data points. But suppose most of the 900 training rows are very similar to each other, and they differ from the 100 points in the holdout. Is this a good split? Obviously not, because your model learns nothing that applies to the holdout data.
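One rough way to spot such a mismatch is to compare a simple statistic across the sets. The sketch below assumes a hypothetical per-slice label (1 if the slice contains tumor, 0 otherwise) and an arbitrary tolerance of 0.05. Both are illustrative choices, and a real check would look at more than one statistic.

```python
def positive_fraction(labels, indices):
    """Fraction of the slices in `indices` that contain tumor (label 1)."""
    if not indices:
        return 0.0
    return sum(labels[i] for i in indices) / len(indices)

def check_balance(labels, split_indices, tolerance=0.05):
    """Return per-split positive fractions and whether they agree within tolerance."""
    fractions = {
        name: positive_fraction(labels, idx)
        for name, idx in split_indices.items()
    }
    spread = max(fractions.values()) - min(fractions.values())
    return fractions, spread <= tolerance

# Toy example: every tumor slice ended up in the training set.
labels = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
split_indices = {"train": [0, 1, 2, 3, 4, 5], "holdout": [6, 7, 8, 9]}
fractions, balanced = check_balance(labels, split_indices)
print(fractions, balanced)
```

Here half of the training slices are positive while the holdout has none, so the check flags the split even though the percentages themselves look reasonable.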

OTHER TIPS

As a rule of thumb, use about 60% of the data for the training set and 20% each for the validation and test sets. I'm not familiar with tumor segmentation, but as long as the images from the same patient are different, with a relevant level of difference, that should be enough.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange