Mahout - Item exists in test data but not training data

https://stackoverflow.com/questions/18147897

24-06-2022

Question

I am trying to evaluate a simple item-based recommender using PearsonCorrelationSimilarity. I load the DataModel from a file that contains userid, itemid, preference, timestamp (in this order). My code looks something like this:

DataModel model = new FileDataModel(new File("FILE_NAME"));
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        Optimizer optimizer = new ConjugateGradientOptimizer();
        // N is the neighborhood size (its value is not shown in the question)
        return new KnnItemBasedRecommender(model, similarity, optimizer, N);
    }
};

// Train on 70% of each user's preferences; evaluate over 100% of the users
double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);

When I run it, I get lots of

INFO eval.AbstractDifferenceRecommenderEvaluator: Item exists in test data but not training data:

Does this have something to do with my DataModel or with the evaluator? I've tried both RMSRecommenderEvaluator and AverageAbsoluteDifferenceRecommenderEvaluator, but I get the same INFO notice. I also tried calling RandomUtils.useTestSeed(). When I run the same evaluation with a UserSimilarity metric, I don't have this issue.

My question is: will this affect my evaluation results?

Thank you. Dragan

Answer

Basically, you are seeing the "Item exists in test data but not training data" message because of the way evaluation happens. The data is split into two parts, a training set and a test set. The recommender is trained on the training set, and its estimates are then validated against the test set. This partition is done randomly, so yes, some items might be in the training set and not in the test set, and vice versa. For more reliable results, you should run the evaluation three or more times and average the scores.
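To make the effect of the random split concrete, here is a minimal sketch in plain Java. It is not Mahout's evaluator code: the class name `SplitDemo`, its methods, and the sample item IDs are all hypothetical, and the split is a simplified per-rating coin flip. It counts items whose ratings all land in the test part, then averages that count over several runs, the same loop shape you would use to average evaluation scores.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Illustration only: a simplified random split, not Mahout's evaluator code.
class SplitDemo {

    /** Randomly assigns each rating to training or test; returns how many items appear only in test. */
    static int countTestOnlyItems(long[] itemIds, double trainingFraction, Random random) {
        Set<Long> training = new HashSet<>();
        Set<Long> test = new HashSet<>();
        for (long itemId : itemIds) {
            if (random.nextDouble() < trainingFraction) {
                training.add(itemId);
            } else {
                test.add(itemId);
            }
        }
        test.removeAll(training); // what remains was never seen during "training"
        return test.size();
    }

    /** Averages the test-only item count over several independent random splits. */
    static double averageOverRuns(long[] itemIds, double trainingFraction, int runs, long seed) {
        Random random = new Random(seed);
        long total = 0;
        for (int i = 0; i < runs; i++) {
            total += countTestOnlyItems(itemIds, trainingFraction, random);
        }
        return (double) total / runs;
    }

    public static void main(String[] args) {
        // Hypothetical data: one entry per rating, holding only the rated item's ID.
        long[] itemIds = {1, 1, 1, 2, 2, 3, 4, 4, 5, 6};
        System.out.println(averageOverRuns(itemIds, 0.7, 5, 42L));
    }
}
```

Items with only one or two ratings (3, 5 and 6 in the sample data) are the ones most likely to end up exclusively in the test part, which is the situation the INFO message reports.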

Ideally, you would not use RandomUtils.useTestSeed() in production evaluation code. It is mostly for testing purposes: it sets the random seed to the same value every time you run your test, so you get repeatability (good for testing the internal evaluator code).
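The repeatability that a fixed seed gives can be sketched with `java.util.Random` alone. This only illustrates the general idea and does not show how Mahout's RandomUtils is implemented; the class `SeedDemo` and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.Random;

// Illustration only: a fixed seed makes a "random" sequence repeat exactly.
class SeedDemo {

    /** Draws n pseudo-random doubles from the given generator. */
    static double[] draw(Random random, int n) {
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = random.nextDouble();
        }
        return values;
    }

    /** Returns true if two generators created with the same seed produce identical sequences. */
    static boolean sameSequence(long seed, int n) {
        return Arrays.equals(draw(new Random(seed), n), draw(new Random(seed), n));
    }

    public static void main(String[] args) {
        System.out.println(sameSequence(42L, 10));
    }
}
```

With a fixed seed every run produces the same training/test split, so averaging several runs no longer smooths out the randomness of the split.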

Also, the KNN recommender is deprecated in Mahout 0.8 (recently released) and will be removed in 0.9.

Licensed under: CC-BY-SA with attribution
