Problem

My company has recently engaged a consultant firm to develop a predictive model to detect defective works.

I understand that there are many ways to validate the model, for example using k-fold cross-validation, and I believe that the consultant firm will carry out the validation before submitting the model to us.

However, on the employer's side, how can I check the accuracy of the model developed by the consultant firm?

Someone suggested that I give the 2000-2015 data to the consultant firm and keep the 2016 data for our own checking. However, a model with good accuracy on the 2016 data does not imply that it will have good predictive power in the future. In my view, keeping the 2016 data for checking just adds one more test set for validation, which seems unnecessary since I already have k-fold cross-validation.

Could someone advise what the employer can do to check the consultant's model?


Solution

Cross-validation can be used for parameter tuning or model selection, but it does not give you an independent evaluation of the final model's performance.

When developing a model, you divide your data into training, validation, and test sets. In the best-case scenario, the test set is used only once, at the very end, to score the model. You should definitely keep the 2016 data.
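A minimal sketch of that arrangement, assuming a pandas DataFrame loaded from a hypothetical `defects.csv` with a `year` column and a binary `defective` target (all names here are illustrative, not from the question):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

df = pd.read_csv("defects.csv")  # hypothetical file and schema

# Data the consultant works with (they run their own k-fold CV on this).
train = df[df["year"] <= 2015]
# Data the employer keeps back for a single, final check.
holdout = df[df["year"] == 2016]

X_train, y_train = train.drop(columns=["defective", "year"]), train["defective"]
X_hold, y_hold = holdout.drop(columns=["defective", "year"]), holdout["defective"]

model = RandomForestClassifier(random_state=0)

# Cross-validation on the 2000-2015 data: useful for tuning and model selection.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy (2000-2015):", cv_scores.mean())

# The employer's check: fit on 2000-2015, score once on the unseen 2016 data.
model.fit(X_train, y_train)
print("Holdout accuracy (2016):", accuracy_score(y_hold, model.predict(X_hold)))
```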

If you give them all of your data, it is easy to end up with a model that learns your data "by heart" but does not generalize well to future years. This is overfitting. The only way to detect it is to test on data the model has never seen, which here is the 2016 set.

When used to measure model performance, cross-validation can report more than just an average accuracy, and you can use it to select the features that give the best score.
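For instance, scikit-learn's `cross_validate` can report several metrics per fold rather than a single averaged accuracy. A sketch reusing the hypothetical `X_train`, `y_train`, and `model` from the split above:

```python
from sklearn.model_selection import cross_validate

# Per-fold accuracy, precision, recall, and ROC AUC instead of one average.
results = cross_validate(
    model, X_train, y_train, cv=5,
    scoring=["accuracy", "precision", "recall", "roc_auc"],
)
for metric in ["accuracy", "precision", "recall", "roc_auc"]:
    scores = results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```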

Other tips

I would agree with the suggestion of holding out the 2016 data to check the external agency's work. Without inspecting the code, you simply cannot be sure that the k-fold cross-validation was performed properly.

Another benefit of the 2016 holdout set is that you can see whether a model trained on past years' data works well on future data. There might be concept drift in the new year as the true relationship between Y and the Xs changes. In cross-validation, by contrast, each fold belongs to the same time period as the training data.
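One way to surface such drift is a walk-forward check: train on all years before year Y, score on year Y, and watch for a drop. A minimal sketch, again assuming the hypothetical `df` with `year` and binary `defective` columns:

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
for test_year in range(2010, 2017):
    past = df[df["year"] < test_year]      # all earlier years
    current = df[df["year"] == test_year]  # the "future" year being checked
    clf.fit(past.drop(columns=["defective", "year"]), past["defective"])
    score = clf.score(current.drop(columns=["defective", "year"]),
                      current["defective"])
    print(f"trained on <{test_year}, scored on {test_year}: accuracy={score:.3f}")
```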

A lot of people are suggesting holding out the 2016 data, but what you should hold out as a test set depends on what is being predicted. If defective works do not depend on date/time, and date/time is not used in the model, it may make sense to hold out a random sample (at least with respect to date/time) for your tests.
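A sketch of such a random, date-agnostic holdout, under the same hypothetical schema (the `year` column is dropped here because time is assumed irrelevant):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = df.drop(columns=["defective", "year"])
y = df["defective"]

# Random 80/20 split, stratified on the target so defect rates match.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("Random-holdout accuracy:", clf.score(X_te, y_te))
```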

If you have features that you want to avoid extrapolating from, then split by those features. For example, if there is some common "location" property for your items, you may want to split by location identity, because you want to use the model to make predictions in new locations (and location as a one-hot feature would have no predictive value for a new location, even if your target class was correlated with existing locations).
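A sketch of a group-wise split by a hypothetical `location` column, so that no location appears in both train and test, which mimics having to predict for brand-new locations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit

X = df.drop(columns=["defective", "location"])
y = df["defective"]
groups = df["location"]  # assumed grouping feature

# One split where ~20% of locations are held out entirely.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

clf = RandomForestClassifier(random_state=0).fit(X.iloc[train_idx], y.iloc[train_idx])
print("New-location accuracy:", clf.score(X.iloc[test_idx], y.iloc[test_idx]))
```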

Ideally, the consultancy firm will help you identify correlations in the data set that you do not want to rely on for prediction, because they could harm the generalisation you need when using the model in production. If any such feature is identified, it clearly points to splitting the test data by that feature. If there is no such issue, you may as well split randomly.

License: CC BY-SA with attribution. Not affiliated with datascience.stackexchange.