You train your classifier on L. You can first perform cross-validation to fit some method parameters P. With parameters P you construct a model M from the labeled data L. You then use the model M to label the unlabeled data U, and join the examples from U with the highest confidence in their assigned class to L. You then repeat the procedure until all the examples are classified.
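The loop above can be sketched roughly as follows. This is only a minimal illustration with made-up data, a plain logistic regression as the base classifier, and an arbitrary confidence threshold of 0.95 — none of these choices come from the answer itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: a small labeled set L and a larger unlabeled pool U.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_L, y_L = X[:30], y[:30]   # labeled data L
X_U = X[30:]                # unlabeled data U (true labels withheld)

threshold = 0.95            # confidence required to accept a pseudo-label (assumption)
while len(X_U) > 0:
    # construct model M from the current labeled data L
    model = LogisticRegression().fit(X_L, y_L)
    proba = model.predict_proba(X_U)
    conf = proba.max(axis=1)
    confident = conf >= threshold
    if not confident.any():
        break               # nothing confident enough left; stop early
    # join the highest-confidence examples from U (with their assigned class) to L
    X_L = np.vstack([X_L, X_U[confident]])
    y_L = np.concatenate([y_L, proba.argmax(axis=1)[confident]])
    X_U = X_U[~confident]
```

In practice you would also cross-validate the parameters P inside each iteration; it is omitted here to keep the loop readable. scikit-learn ships this scheme as `sklearn.semi_supervised.SelfTrainingClassifier` if you prefer not to hand-roll it.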
-edit-
I think the most appropriate approach is the third one. But I may not understand it right, so here goes.

You split L into L_train and L_test. You train your classifier using L_train, and you also use this classifier to classify U (per the methodology I described above). From the union of the labeled U and L_train you construct a new classifier, and with it you classify L_test. The differences between these classifications can be used for evaluation measures (classification accuracy, ...).
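A minimal sketch of that evaluation, again with made-up data and a logistic regression standing in for the classifier (both assumptions, not part of the answer). For brevity it pseudo-labels U in a single pass rather than iterating:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
# Pretend 60% of the data is the unlabeled pool U (its labels are withheld).
X_lab, X_U, y_lab, _ = train_test_split(X, y, test_size=0.6, random_state=1)
# Split the labeled part L into L_train and L_test.
X_train, X_test, y_train, y_test = train_test_split(
    X_lab, y_lab, test_size=0.3, random_state=1)

# Classifier built from L_train only.
base = LogisticRegression().fit(X_train, y_train)

# Pseudo-label U, then build a new classifier from the union of L_train and labeled U.
pseudo = base.predict(X_U)
X_union = np.vstack([X_train, X_U])
y_union = np.concatenate([y_train, pseudo])
augmented = LogisticRegression().fit(X_union, y_union)

# Classify L_test with both and compare, e.g. via accuracy.
acc_base = accuracy_score(y_test, base.predict(X_test))
acc_aug = accuracy_score(y_test, augmented.predict(X_test))
```

Comparing `acc_base` and `acc_aug` tells you whether folding the pseudo-labeled U into training helped or hurt on held-out labeled data.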