Can a Gradient Boosting Regressor be tuned on a subset of the data and achieve the same result?

datascience.stackexchange https://datascience.stackexchange.com/questions/11327

16-10-2019
Question

I am working with a large data set (~9M rows with 20+ features). Is it ok to tune via grid search on a fraction of the data (~100k rows) to determine optimal hyperparameters? This is mostly for choosing max_features, min_samples, max_depth. Trees and learning rate come later. Will I get different results tuning the fraction versus the whole data set?

Solution

You should never train or run a grid search on your entire data set, since doing so risks overfitting and reduces the accuracy of your model on new data. What you have described is actually the ideal approach: run the grid search / training on a subset of your data. Yes, you will get different results than if you had used the entire data set, but your model will be much stronger for it.

For more details on why you would want to split up / sample your data, see this question: https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set
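The workflow described in the question could be sketched as follows with scikit-learn. This is a minimal illustration, not the asker's actual pipeline: the sample size, parameter values, and synthetic data are all placeholder assumptions (in practice you would subsample your real ~9M-row frame instead of calling `make_regression`).

```python
# Hypothetical sketch: tune a GradientBoostingRegressor's tree-shape
# parameters on a random subset, as the question proposes.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in for the full data set; replace with your real X and y.
X, y = make_regression(n_samples=5000, n_features=20, noise=0.1, random_state=0)

# Draw a random subset for tuning (the question suggests ~100k of ~9M rows;
# the 1000 here is purely illustrative).
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=1000, replace=False)
X_sub, y_sub = X[idx], y[idx]

# Tune tree-shape parameters first; n_estimators and learning_rate can be
# revisited afterwards, as the question suggests. Grid values are examples.
param_grid = {
    "max_features": ["sqrt", None],
    "min_samples_leaf": [1, 5, 20],
    "max_depth": [2, 3, 5],
}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    n_jobs=-1,
)
search.fit(X_sub, y_sub)
print(search.best_params_)
```

The tuned values from `search.best_params_` can then be carried over when fitting the final model on the training split of the full data.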

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange