Question

What is the interpretation of the test score, training score and validation score? What do they actually tell us?

What's an acceptable difference between the cross test score, validation score and test score?

If the difference between the test score and the training score is small, does that mean it is a good model/fit?

How do we identify overfitting and underfitting on the basis of the test score, training score and validation score?

Does any of these scores, or the difference among them, tell us whether we need more data (observations)?


Solution

What is the interpretation of the test score, training score and validation score? What do they actually tell us?

We usually divide our dataset into 3 parts: training data, validation data and test data. Then we analyze the scores:

Training Score: How well the model fitted (generalized over) the training data. If the model fits the training data too closely, capturing all its noise and variance, this causes overfitting, which in turn produces a poor test score: the model curved a lot to fit the training data and generalizes very poorly to new data. So, generalization is the goal.

Validation Score: This is still the experimental phase. We keep exploring our model with this dataset; the model is not yet final at this stage. We keep changing the model until we are satisfied with the validation score we get.

Test Score: This is evaluated once our model is ready. Before this step we have not touched this dataset, so it represents the real-life scenario. The higher the score, the better the model generalizes.
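To make the three scores concrete, here is a minimal sketch using scikit-learn; the dataset, model and split ratios are only illustrative assumptions:

```python
# Minimal sketch of the train / validation / test workflow (scikit-learn).
# The dataset, model and split ratios are illustrative assumptions only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# First carve out a held-out test set (20%), then split the remainder
# into training (60% overall) and validation (20% overall) sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

print("training score:  ", model.score(X_train, y_train))  # how well it fits the data it saw
print("validation score:", model.score(X_val, y_val))      # used while we keep changing the model
print("test score:      ", model.score(X_test, y_test))    # touched only once, at the very end
```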

What's an acceptable difference between the cross test score, validation score and test score?

I think there is no difference between "cross test score" and "validation score"; the correct term is "cross-validation score". Cross-validation works by interchanging the training and validation data in different ways multiple times. An image that describes it (copyright reserved by towardsdatascience.com):

[Figure: K-fold cross-validation]

So your question should really be "What's an acceptable difference between the cross-validation score and the test score?" There is no straightforward answer to this question. Here's a long discussion you might want to consider: Cross validation vs Test set.
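As a rough illustration of how the two are compared in practice, here is how a k-fold cross-validation score and a held-out test score can be computed with scikit-learn; the dataset and estimator are assumptions chosen just for the example:

```python
# Rough sketch: a k-fold cross-validation score versus a held-out test score.
# Dataset and estimator are assumptions chosen only for this example.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold CV: the training data is repeatedly re-split into training and validation folds.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validation score: %.3f (+/- %.3f)" % (cv_scores.mean(), cv_scores.std()))

# Fit on all the training data, then check the held-out test set once.
model.fit(X_train, y_train)
print("test score:             %.3f" % model.score(X_test, y_test))
```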

If the difference between the test score and the training score is small, does that mean it is a good model/fit?

Yes!! This is what we strive for, provided both scores are also reasonably high (a small gap between a low training score and a low test score still means underfitting). And oftentimes, to achieve this, we require various engineering techniques, careful evaluation and parameter tuning.
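For example, one common form of that parameter tuning is a grid search scored on validation folds; the model and grid values below are assumptions used for illustration only:

```python
# Hedged sketch: tuning one hyperparameter with a grid search scored on validation
# folds, then checking the train/test gap. Model and grid values are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Try several tree depths; GridSearchCV keeps the one with the best validation score.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 10, None]},
    cv=5,
)
search.fit(X_train, y_train)

best = search.best_estimator_
print("best params:   ", search.best_params_)
print("training score:", best.score(X_train, y_train))
print("test score:    ", best.score(X_test, y_test))  # a small gap here is what we aim for
```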

How do we identify overfitting and underfitting on the basis of the test score, training score and validation score?

Usually, a high training score with a low test score means overfitting, while a very low training score together with a low test score means underfitting. The first case is, in technical terms, low bias and high variance, which is overfitting. The second case, high bias, is underfitting. In other words, a model that is underfit will have high training and high testing error, while an overfit model will have extremely low training error but a high testing error.
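As an assumed illustration (the dataset and tree depths are arbitrary choices), the two patterns look roughly like this:

```python
# Assumed illustration: the same data fitted with an over-complex and an over-simple
# model, to show how the training/test score pattern signals over- vs underfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [
    ("deep tree (tends to overfit)  ", DecisionTreeClassifier(max_depth=None, random_state=0)),
    ("stump     (tends to underfit) ", DecisionTreeClassifier(max_depth=1, random_state=0)),
]:
    model.fit(X_train, y_train)
    # High training score + noticeably lower test score -> overfitting (low bias, high variance).
    # Low training score + similarly low test score     -> underfitting (high bias).
    print(f"{name} train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")
```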

Does any of these scores, or the difference among them, tell us whether we need more data (observations)?

This is a very good question. You can refer here, and the best explanation of this I got from Andrew Ng (this is the link). These will give you a solid understanding. But in simple words, I can give an example for intuition (it won't be precise): suppose I have 500 rows of data. I first take 100 of them and analyse the model's bias and variance. Next I take 100 + 100 = 200 rows from those 500 and again record the bias and variance. I keep doing this with 300, 400 and 500 rows. If I observe that the bias and variance keep improving (i.e. the model keeps improving) as I add data, it means adding more data might actually help. This is a very rough example, but I hope it gives the intuition.
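Here is a minimal sketch of that intuition, assuming an arbitrary dataset and model; scikit-learn's learning_curve does the re-splitting and re-training for you:

```python
# Minimal sketch of the intuition above: train on growing subsets of the data and
# watch whether the validation score keeps improving. Dataset/sizes are assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# learning_curve re-trains the model on increasing fractions of the data and
# cross-validates each time -- the library equivalent of the manual loop above.
sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
# If the validation score is still rising as n grows, more data will likely help;
# if it has flattened out, more data alone probably will not.
```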
