Question

I have a matrix with 358,367 rows, where each row is a DNA sequence from the human genome. I want to build a classification model in R using the XGBoost algorithm and 83 features (dinucleotides, trinucleotides, etc.).

How should I split the data into a training set and a test set?

For example, 70% for training and 30% for testing? Or 30% for training and 70% for testing?

Solution

There is no "golden rule" here. Your data set is a very convenient size, neither too large nor too small. Sounds like a very exciting project!

Here is how I often proceed in comparable settings.

  1. Do all splits stratified by the response or, if the rows are not independent but clustered by some grouping variable (e.g. the same family of sequences), use grouped sampling. The important rule is to avoid any leakage across splits (see the splitting sketch after this list).

  2. Set aside 10%-15% of the rows for testing. Don't touch them until the analysis is complete. Act as if you would never use this test set.

  3. Select a loss function and a relevant performance measure.

  4. Fit a random forest without tuning and use its out-of-bag (OOB) error as a benchmark (see the benchmark sketch after this list).

  5. Tune the XGBoost parameters with 5-fold cross-validation and an iterative grid search: start with very wide parameter ranges and then narrow them step by step. The number of boosting rounds is optimized automatically by early stopping (see the tuning sketch after this list).

  6. Choose the final model and present its cross-validation performance.

  7. Only at the very end, reveal the test performance.
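
As a concrete illustration of steps 1-2, here is a minimal sketch of a stratified 85/15 split with the caret package. The data frame `dat` and the factor response column `label` are hypothetical names standing in for your feature matrix and class labels.

```r
library(caret)

set.seed(2024)                 # reproducible split
test_frac <- 0.15              # hold out 10-15% for the final test

## createDataPartition() samples within each level of `label`,
## so both sets keep the original class proportions.
train_idx <- createDataPartition(dat$label, p = 1 - test_frac, list = FALSE)

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]     # locked away until the very end

## If rows are clustered (e.g. by sequence family), split on the grouping
## variable instead, e.g. with caret::groupKFold(), to avoid leakage.
```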
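
For step 4, an untuned random forest gives a benchmark almost for free, because its out-of-bag (OOB) error needs no extra validation split. A sketch with the ranger package, reusing the hypothetical `train` data frame from above:

```r
library(ranger)

## Default settings are usually good enough for a benchmark.
rf <- ranger(label ~ ., data = train, num.trees = 500, seed = 2024)
rf$prediction.error            # OOB misclassification rate: the score to beat
```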
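
For step 5, xgboost's built-in xgb.cv() handles the 5-fold cross-validation and early stopping in one call, so the grid search is just a loop over candidate parameter combinations. This sketch assumes a binary response (for more classes, switch the objective to "multi:softprob"); the parameter ranges are illustrative starting values, not recommendations.

```r
library(xgboost)

## Features as a numeric matrix, labels as 0/1 integers.
X <- as.matrix(train[, setdiff(names(train), "label")])
y <- as.integer(train$label) - 1L
dtrain <- xgb.DMatrix(data = X, label = y)

## First, deliberately wide grid; narrow it in later passes.
grid <- expand.grid(eta = c(0.05, 0.1),
                    max_depth = c(3, 6, 9),
                    min_child_weight = c(1, 5))

cv_results <- lapply(seq_len(nrow(grid)), function(i) {
  params <- list(objective = "binary:logistic",
                 eval_metric = "logloss",
                 eta = grid$eta[i],
                 max_depth = grid$max_depth[i],
                 min_child_weight = grid$min_child_weight[i])
  cv <- xgb.cv(params = params, data = dtrain,
               nrounds = 2000, nfold = 5, stratified = TRUE,
               early_stopping_rounds = 50, verbose = 0)
  data.frame(grid[i, ],
             best_iter  = cv$best_iteration,
             cv_logloss = min(cv$evaluation_log$test_logloss_mean))
})
do.call(rbind, cv_results)     # inspect, then narrow the grid and rerun
```

After each pass, keep the best few combinations, shrink the ranges around them, and repeat until the cross-validated metric stops improving.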

Licensed under: CC-BY-SA with attribution