Question

I am running a job based on RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob) from Mahout 0.7 and noticed that it has options such as startPhase and endPhase. I assume these let you run only part of the pipeline, provided the necessary input data exists from a prior run. However, I am having a hard time working out which phases RecommenderJob has. I am reading the source code, but that looks like it will take a while. In the meantime, can anybody shed light on how to use these options (startPhase in particular) with the RecommenderJob class?

Solution

Here is what I found:

Phase 0 is PreparePreferenceMatrixJob, which runs 3 Hadoop jobs:

PreparePreferenceMatrixJob-ItemIDIndexMapper-Reducer
PreparePreferenceMatrixJob-ToItemPrefsMapper-Reducer
PreparePreferenceMatrixJob-ToItemVectorsMapper-Reducer

Phase 1 is RowSimilarityJob, which runs 3 jobs:

RowSimilarityJob-VectorNormMapper-Reducer
RowSimilarityJob-CooccurrencesMapper-Reducer
RowSimilarityJob-UnsymmetrifyMapper-Reducer

Phase 2 is RecommenderJob itself, which runs 3 jobs:

RecommenderJob-SimilarityMatrixRowWrapperMapper-Reducer
RecommenderJob-UserVectorSplitterMapper-Reducer
RecommenderJob-Mapper-Reducer

Phase 3 is the last one and has only one job:

RecommenderJob-PartialMultiplyMapper-Reducer

Also, the output of phase 1 in the RecommenderJob class is exactly the same as the output of phases 0 and 1 of ItemSimilarityJob (only the temp directory names differ).
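To make the phase numbering concrete, here is a minimal, self-contained sketch of how a start/end phase gate can select which steps run. It is a simplified model, not Mahout's actual code: the class name PhaseGateSketch and the method phasesToRun are made up for illustration, and the phase labels simply follow the breakdown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified, hypothetical model of a phase gate; not taken from Mahout.
final class PhaseGateSketch {

    /** Returns true if the next phase lies in [startPhase, endPhase]; advances the counter either way. */
    static boolean shouldRunNextPhase(int startPhase, int endPhase, AtomicInteger currentPhase) {
        int phase = currentPhase.getAndIncrement();
        return phase >= startPhase && phase <= endPhase;
    }

    /** Returns the labels of the phases that would run for the given bounds. */
    static List<String> phasesToRun(int startPhase, int endPhase) {
        // Labels follow the phase breakdown listed in this answer.
        String[] phases = {
            "PreparePreferenceMatrixJob",
            "RowSimilarityJob",
            "RecommenderJob (3 jobs)",
            "RecommenderJob-PartialMultiplyMapper-Reducer"
        };
        AtomicInteger currentPhase = new AtomicInteger();
        List<String> selected = new ArrayList<>();
        for (String name : phases) {
            // The counter is advanced even for skipped phases, so numbering stays stable.
            if (shouldRunNextPhase(startPhase, endPhase, currentPhase)) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Resume at phase 2, with no upper bound on the end phase.
        System.out.println(phasesToRun(2, Integer.MAX_VALUE));
    }
}
```

Under this model, setting startPhase to 2 skips the preference-matrix and similarity steps, which only makes sense if the intermediate output of those earlier phases is still present in the job's temp directory.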

Other tips

Yes, that's correct. It's a fairly crude mechanism: all it does is control which of a series of MapReduce jobs are run. You do have to read the code to know what they are, and they vary by job.

If I were doing it over again, I would have just made it detect the presence of output to know which jobs to skip. (That's what I've done in my next-gen recommender project.)
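The output-detection idea could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code of the project mentioned above: the class SkipIfOutputExistsSketch, the method names, and the directory name similarityMatrix are all hypothetical, the local filesystem stands in for HDFS, and a `_SUCCESS` marker file (the kind Hadoop's default output committer writes when a job succeeds) is used as the completion signal.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: skip a step when its output is already present.
final class SkipIfOutputExistsSketch {

    /** Treats a step as complete if its output directory contains a _SUCCESS marker. */
    static boolean outputExists(Path outputDir) {
        return Files.exists(outputDir.resolve("_SUCCESS"));
    }

    /** Runs the step only when its output is missing; returns true if the step ran. */
    static boolean runIfNeeded(Path outputDir, Runnable step) throws IOException {
        if (outputExists(outputDir)) {
            return false; // output from a prior run is reused
        }
        step.run();
        // Stand-in for what a real job would leave behind on success.
        Files.createDirectories(outputDir);
        Files.createFile(outputDir.resolve("_SUCCESS"));
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempDirectory("recommender-sketch");
        Path out = temp.resolve("similarityMatrix"); // hypothetical directory name
        System.out.println(runIfNeeded(out, () -> { })); // no marker yet, so the step runs
        System.out.println(runIfNeeded(out, () -> { })); // marker present, so the step is skipped
    }
}
```

Compared with numbered phases, this approach needs no knowledge of the job's internal ordering, at the cost of having to delete stale output by hand when a step must be recomputed.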

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow