Question

I have a collection of documents in MongoDB. I am using PyMongo to access and insert into this collection. What I want to do is:

In Python, use map-reduce to efficiently query the number of times an n-gram phrase is used across the entire corpus.

I know how to do this for single words, but I am struggling to extend it to n-grams. What I don't want to do is tokenize with the NLTK library and then run map-reduce; I believe that would take the efficiency out of the solution. Thanks.


Solution

If you want an efficient system, you'll need to break down the n-grams ahead of time and index them. When I wrote the 5-Gram Experiment (unfortunately the backend is offline now, as I had to give back the hardware), I created a map of word => integer id and then stored each n-gram in MongoDB as a hex id sequence in the document key field of a collection (for example, [10, 2] => "a:2"). Randomly distributing the ~350 million 5-grams across 10 machines running MongoDB then offered sub-second query times for the whole data set.
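For illustration, a minimal PyMongo sketch of the word => integer id map and the hex key encoding might look like this (the database and collection names are assumptions, not part of the original setup):

from pymongo import MongoClient

client = MongoClient()
db = client["ngram_db"]          # database name is an assumption

word_ids = {}                    # in-memory word => integer id map

def word_to_id(word):
    """Assign each distinct word a small integer id, persisting it once."""
    if word not in word_ids:
        word_ids[word] = len(word_ids)
        db.words.insert_one({"_id": word_ids[word], "word": word})
    return word_ids[word]

def ngram_key(words):
    """Encode an n-gram as a colon-separated hex id sequence, e.g. [10, 2] => "a:2"."""
    return ":".join(format(word_to_id(w), "x") for w in words)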

You can use a similar scheme. With a document such as:

{_id: "a:2", seen: [docId1, docId2, ...]}

you'll be able to look up every document in which a given n-gram occurs.
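Continuing the sketch above (it reuses db and ngram_key), indexing a tokenized document and counting how many documents contain a given n-gram then becomes a single _id lookup; treat this as a hedged illustration rather than the original implementation:

def index_ngrams(doc_id, tokens, n=5):
    """Slide an n-word window over a tokenized document and record each occurrence."""
    for i in range(len(tokens) - n + 1):
        db.ngrams.update_one(
            {"_id": ngram_key(tokens[i:i + n])},
            {"$addToSet": {"seen": doc_id}},   # matches the {_id, seen: [...]} schema above
            upsert=True,
        )

def ngram_doc_count(words):
    """Number of documents in which the n-gram appears."""
    doc = db.ngrams.find_one({"_id": ngram_key(words)})
    return len(doc["seen"]) if doc else 0

If you need the total number of occurrences rather than the number of distinct documents, keep a counter in the same document with $inc (or use $push instead of $addToSet).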

Update: a small correction: in the system that went live I ended up using the same scheme, but encoded the n-gram keys in a binary format for space efficiency (~350M is a lot of 5-grams!); otherwise the mechanics were all the same.
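The exact binary encoding isn't spelled out above; one plausible version, assuming the word ids fit in 32 bits, is to pack the id sequence with struct and store it as BSON binary:

import struct
from bson import Binary

def binary_ngram_key(words):
    """Pack the n-gram's integer ids into a compact binary _id (hypothetical encoding)."""
    ids = [word_to_id(w) for w in words]
    return Binary(struct.pack(">%dI" % len(ids), *ids))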

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow