Question

I am having performance issues with precomputed item-item similarities in Mahout.

I have 4 million users and roughly the same number of items, with around 100M user-item preferences. I want to do content-based recommendation based on the cosine similarity of the TF-IDF vectors of the documents. Since computing this on the fly is slow, I precomputed the pairwise similarity of the top 50 most similar documents as follows:

  1. I used seq2sparse to produce TF-IDF vectors.
  2. I used mahout rowid to produce a Mahout matrix.
  3. I used mahout rowSimilarity -i INPUT/matrix -o OUTPUT -r 4587604 --similarityClassname SIMILARITY_COSINE -m 50 -ess to produce the top 50 most similar documents for each document.

I used Hadoop to precompute all of this. For 4 million items, the output was only 2.5GB.

Then I loaded the content of the files produced by the reducers into Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ..., using the docIndex to decode the IDs of the documents. They were already integers, but rowid re-indexed them starting from 1, so I had to map them back.
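For reference, a minimal sketch of what such a loading step could look like, assuming the layouts that rowid and rowSimilarity write (a docIndex of IntWritable/Text pairs and similarity rows of IntWritable/VectorWritable pairs, from org.apache.hadoop.io and org.apache.mahout.math); the paths, variable names and the Mahout 0.9 nonZeroes() call are assumptions:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Decode the docIndex written by "mahout rowid": row index -> original document id
// (the original ids are integer strings, as described above).
Map<Integer, Long> rowToDocId = new HashMap<Integer, Long>();
SequenceFile.Reader docIndexReader = new SequenceFile.Reader(fs, new Path("INPUT/docIndex"), conf);
IntWritable rowIndex = new IntWritable();
Text originalId = new Text();
while (docIndexReader.next(rowIndex, originalId)) {
    rowToDocId.put(rowIndex.get(), Long.parseLong(originalId.toString()));
}
docIndexReader.close();

// Read the rowSimilarity output (placeholder path): one sparse row of cosine similarities per document.
Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
SequenceFile.Reader simReader = new SequenceFile.Reader(fs, new Path("OUTPUT/part-r-00000"), conf);
IntWritable rowKey = new IntWritable();
VectorWritable rowValue = new VectorWritable();
while (simReader.next(rowKey, rowValue)) {
    long itemID1 = rowToDocId.get(rowKey.get());
    for (Vector.Element e : rowValue.get().nonZeroes()) { // iterateNonZero() on older Mahout versions
        long itemID2 = rowToDocId.get(e.index());
        corrMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(itemID1, itemID2, e.get()));
    }
}
simReader.close();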

For recommendation I use the following code:

ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);

// SamplingCandidateItemsStrategy implements both strategy interfaces; the first three
// arguments are sampling factors that limit how many candidate items are examined.
CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());
MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());

Recommender recommender = new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);

I am trying it with a limited data model (1.6M items), but I loaded all the item-item pairwise similarities into memory. I managed to load everything into main memory using 40GB.

When I want to compute recommendations for one user:

Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);

The elapsed time for the recommendation process was 554.938583083 seconds, and besides that it did not produce any recommendations. Right now I am really concerned about the performance of the recommendation. I played with the parameters of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but I didn't get any improvement in performance.

Isn't the whole idea of precomputing everything supposed to speed up the recommendation process? Could someone please tell me where and what I am doing wrong? Also, why does loading the pairwise similarities into main memory blow up the size so much? 2.5GB of files turned into 40GB of main memory in a Collection<GenericItemSimilarity.ItemItemSimilarity>. I know that the files are serialized as IntWritable/VectorWritable key-value pairs, and the key has to be repeated for every vector value in the ItemItemSimilarity matrix, but this is a little too much, don't you think?

Thank you in advance.


Solution

I stand corrected about the time needed for computing the recommendation using a Collection of precomputed values. Apparently I had put long startTime = System.nanoTime(); at the top of my code, not right before List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);. That counted the time needed to load the data set and the precomputed item-item similarities into main memory.
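In other words, the timer should wrap only the recommend() call itself; a minimal sketch of the corrected measurement (variable names are mine):

// Time only the recommendation step, not the loading of the model and similarities.
long startTime = System.nanoTime();
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);
double elapsedSeconds = (System.nanoTime() - startTime) / 1e9;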

However, I stand behind my concern about the memory consumption. I improved it by using a custom ItemSimilarity and loading a HashMap<Long, HashMap<Long, Double>> of the precomputed similarities, and I used the Trove library in order to reduce the space requirements.

Here is the code in detail. The custom ItemSimilarity:

import java.util.Collection;

import gnu.trove.map.hash.TLongDoubleHashMap;   // Trove 3.x package layout
import gnu.trove.map.hash.TLongObjectHashMap;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

/** Item-item similarity backed by precomputed cosine similarities held in Trove primitive maps. */
public class TextItemSimilarity implements ItemSimilarity {

    private final TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix;

    public TextItemSimilarity(TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix) {
        this.correlationMatrix = correlationMatrix;
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
    }

    @Override
    public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        TLongDoubleHashMap similarToItemId1 = correlationMatrix.get(itemID1);
        if (similarToItemId1 != null && !similarToItemId1.isEmpty() && similarToItemId1.containsKey(itemID2)) {
            return similarToItemId1.get(itemID2);
        }
        // Pairs that were not among the precomputed top-50 neighbors count as not similar.
        return 0;
    }

    @Override
    public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
        double[] result = new double[itemID2s.length];
        for (int i = 0; i < itemID2s.length; i++) {
            result[i] = itemSimilarity(itemID1, itemID2s[i]);
        }
        return result;
    }

    @Override
    public long[] allSimilarItemIDs(long itemID) throws TasteException {
        TLongDoubleHashMap similarToItem = correlationMatrix.get(itemID);
        return similarToItem == null ? new long[0] : similarToItem.keys();
    }
}
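A minimal sketch of how this could be wired in, assuming the correlation matrix is filled from the same rowSimilarity output as in the question (the loading fragment and variable names are assumptions):

// Accumulate the precomputed pairs into nested Trove maps: itemID1 -> (itemID2 -> similarity).
TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix = new TLongObjectHashMap<TLongDoubleHashMap>();

// ... read the rowSimilarity SequenceFiles as in the question, and for each decoded
// pair (itemID1, itemID2, value) do:
TLongDoubleHashMap similarToItem1 = correlationMatrix.get(itemID1);
if (similarToItem1 == null) {
    similarToItem1 = new TLongDoubleHashMap();
    correlationMatrix.put(itemID1, similarToItem1);
}
similarToItem1.put(itemID2, value);

// The wiring is the same as with GenericItemSimilarity; only the similarity implementation changes.
ItemSimilarity similarity = new TextItemSimilarity(correlationMatrix);
Recommender recommender = new GenericItemBasedRecommender(model, similarity,
        candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
Recommender cachingRecommender = new CachingRecommender(recommender);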

The total memory consumption, together with my data set, is 30GB when using Collection<GenericItemSimilarity.ItemItemSimilarity>, and 17GB when using TLongObjectHashMap<TLongDoubleHashMap> with the custom TextItemSimilarity. The recommendation time is 0.05 sec using Collection<GenericItemSimilarity.ItemItemSimilarity> and 0.07 sec using TLongObjectHashMap<TLongDoubleHashMap>. I also believe that the choice of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy plays a big role in the performance.

I guess if you want to save some space, use the Trove hash maps, and if you want slightly better performance, you can use Collection<GenericItemSimilarity.ItemItemSimilarity>.
