Question

I need to come up with a way to sort and display the most relevant data to users. Our data consists of multiple n-grams extracted from social media. We call these 'topics'.

The problem I am facing is that the data contains a lot of duplication. While no string is an exact duplicate of another, many are subsets of other strings. To a user, this information appears duplicated. Here's some sample data:

[
    {
        "count": 1.0,
        "topic": "lazy people"
    },
    {
        "count": 1.0,
        "topic": "lazy people taking"
    },
    {
        "count": 1.0,
        "topic": "lazy people taking away food stamps"
    }
]

An edge case is that the phrase "lazy people" can also be extracted from other phrases, for example "lazy people are happy". Reducing everything to the shortest common n-gram ("lazy people" in this case) does not seem like a good idea, because the end user would not be presented with the different contexts ("taking away food stamps" and "are happy").

On the other hand, always taking the longest n-gram may be too much information. That seems reasonable in the example above, but it may not always hold true.
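To make the subset relation concrete, here is a minimal sketch (the helper name and grouping logic are my own, just for illustration) that detects when one topic is a contiguous sub-phrase of another:

    from itertools import combinations

    topics = [
        {"count": 1.0, "topic": "lazy people"},
        {"count": 1.0, "topic": "lazy people taking"},
        {"count": 1.0, "topic": "lazy people taking away food stamps"},
        {"count": 1.0, "topic": "lazy people are happy"},
    ]

    def is_subphrase(short, long_):
        # True if every token of `short` appears as one contiguous run in `long_`.
        s, l = short.split(), long_.split()
        return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

    for a, b in combinations(topics, 2):
        short, long_ = sorted((a["topic"], b["topic"]), key=len)
        if is_subphrase(short, long_):
            print(f"{short!r} duplicates part of {long_!r}")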

My overall goal is to present this data in a way that is informative and ranked.

Are there any existing solutions and corresponding algorithms to solve this class of problems?

Note: Initially my question was extremely vague and unclear. In fact, that led me to change the question altogether, because what I really need is guidance on what my end result should be.

Note 2: Let me know if I've misused any terms or should modify the title of this question to help others searching for answers to this type of question.

Solution

This is a hard problem, and solutions tend to be very application specific. Typically you'd collect more than just the n-grams and counts. For example, it usually matters whether a particular n-gram is used a lot by a single person or by a lot of people. That is, if I'm a frequent poster and I'm passionate about wood carving, the n-gram "wood carving" might show up as a common term, but I'm the only person who cares about it. On the other hand, there might be many people who are into oil painting but post relatively infrequently, so the count for the n-gram "oil painting" ends up close to the count for "wood carving". It should be obvious that "oil painting" will be relevant to your users and "wood carving" will not. Without information about where the n-grams come from, though, it's impossible to say which would be relevant to more users.
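To illustrate, here is a small sketch (the input format is invented for the example) that tracks total mentions alongside distinct authors per n-gram:

    from collections import defaultdict

    # Hypothetical input: (author_id, extracted n-gram) pairs.
    posts = [
        ("alice", "wood carving"), ("alice", "wood carving"),
        ("alice", "wood carving"),
        ("bob", "oil painting"), ("carol", "oil painting"),
        ("dave", "oil painting"),
    ]

    mentions = defaultdict(int)
    authors = defaultdict(set)
    for author, ngram in posts:
        mentions[ngram] += 1
        authors[ngram].add(author)

    for ngram in mentions:
        print(ngram, "mentions:", mentions[ngram],
              "distinct authors:", len(authors[ngram]))
    # Equal mention counts, but "oil painting" reaches three authors
    # while "wood carving" comes from one.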

A common way to identify the most relevant phrases across a corpus of documents is called TF-IDF: Term frequency-inverse document frequency. Most descriptions you see concern themselves with individual words, but it's simple enough to extend that to n-grams.
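For concreteness, one common variant of the weighting, written for an n-gram g in a document d drawn from a corpus of N documents:

    tfidf(g, d) = tf(g, d) * log(N / df(g))

where tf(g, d) is how often g occurs in d and df(g) is the number of documents that contain g. N-grams that are frequent in one document but rare across the corpus score highest.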

This assumes, of course, that you can identify individual documents of some sort. You could consider each individual post as a document, or you could group all of the posts from a user as a larger document. Or maybe all of the posts from a single day are considered a document. How you identify documents is up to you.

A simple TF-IDF model is not difficult to build and it gives okay results for a first cut. You can run it against a sample corpus to get a baseline performance number. Then you can add refinements (see the Wikipedia article and related pages), always testing their performance against your pure TF-IDF baseline.
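As a minimal sketch of that baseline (assuming each post has already been reduced to a list of extracted n-gram topics; the input structure is my own):

    import math
    from collections import Counter

    # Hypothetical input: one list of extracted n-grams per post (document).
    documents = [
        ["lazy people", "lazy people taking away food stamps"],
        ["lazy people", "lazy people are happy"],
        ["oil painting"],
    ]

    N = len(documents)
    df = Counter()                  # how many documents contain each n-gram
    for doc in documents:
        df.update(set(doc))

    def tfidf(doc):
        tf = Counter(doc)
        return {g: count * math.log(N / df[g]) for g, count in tf.items()}

    # Rank the first post's topics by descending weight.
    for g, score in sorted(tfidf(documents[0]).items(), key=lambda kv: -kv[1]):
        print(f"{score:.3f}  {g}")

Sorting each document's n-grams by this weight gives a first-cut "most relevant topics" ranking to measure refinements against.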

Given the information I have, that's where I would start.

OTHER TIPS

Consider using a graph database, with a table of words containing the elements of the n-grams, and a table of n-grams with arcs to the words that each n-gram contains.

For the implementation you can use Neo4j, which also has a Python library: http://www.coolgarif.com/brain-food/getting-started-with-neo4j-in-python
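A hedged sketch of that layout using the official neo4j Python driver (the Topic/Word labels, the CONTAINS relationship, and the connection details are my own choices; the linked tutorial uses a different client library):

    from neo4j import GraphDatabase

    # Placeholder connection details.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # MERGE one node per topic and one per word, linked by CONTAINS arcs,
    # so topics that share words are connected through shared Word nodes.
    CYPHER = """
    MERGE (t:Topic {text: $topic})
    WITH t
    UNWIND split($topic, ' ') AS w
    MERGE (word:Word {text: w})
    MERGE (t)-[:CONTAINS]->(word)
    """

    with driver.session() as session:
        for topic in ["lazy people", "lazy people taking away food stamps"]:
            session.run(CYPHER, topic=topic)
    driver.close()

Querying for two Topic nodes with CONTAINS arcs to the same Word nodes then surfaces overlapping topics directly.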

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow