Best way to find document similarity

Question

The answer to your question is twofold: (a) syntactic and (b) semantic similarity.

Syntactic similarity: You have already discovered shingling, so I will focus on other aspects. Recent approaches use probabilistic models, including latent variable models, to describe syntactic patterns. The basic idea is to use a conditional probability P(f | f_c), where f is some feature and f_c is its context. The simplest example of such a model is a Markov model with words as features and the previous words as context. These models answer the question: what is the probability of a word w_n, given that the words w_1, ..., w_{n-1} occur before it in a document? This avenue leads to building language models and measuring document similarity by perplexity: train a model on one document (or collection) and check how well it predicts another, where lower perplexity indicates more similar text. For purely syntactic similarity measures, one may look at parse tree features instead of words.
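To make the perplexity idea concrete, here is a minimal sketch, assuming whitespace tokenization, a first-order (bigram) Markov model, and add-one smoothing. The function names `train_bigram` and `perplexity` are made up for illustration, and a real language model would use longer contexts and better smoothing.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Collect context and bigram counts; "<s>" marks the document start."""
    padded = ["<s>"] + tokens
    context_counts = Counter(padded[:-1])
    bigram_counts = Counter(zip(padded[:-1], padded[1:]))
    vocab = set(padded)
    return context_counts, bigram_counts, vocab


def perplexity(tokens, model):
    """Perplexity of `tokens` under a bigram model with add-one smoothing."""
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty document")
    context_counts, bigram_counts, vocab = model
    v = len(vocab) + 1  # one extra slot reserves mass for unseen words
    padded = ["<s>"] + tokens
    log_prob = 0.0
    for prev, word in zip(padded[:-1], padded[1:]):
        # P(word | prev), smoothed so unseen bigrams get non-zero probability
        p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


# Toy documents, invented for this example.
reference = "the cat sat on the mat".split()
model = train_bigram(reference)
print(perplexity("the cat sat on the mat".split(), model))
print(perplexity("stocks fell sharply on monday".split(), model))
```

The second document should score a higher perplexity than the first, because none of its bigrams were seen in the reference text. Note that perplexity is asymmetric: scoring A under a model of B is not the same as scoring B under a model of A.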

Semantic similarity: This is a much harder problem, of course. The state of the art in this direction builds on distributional semantics, which essentially says that "terms which occur in similar contexts over large amounts of data are bound to have similar meanings". This approach is data-intensive. The basic idea is to build a vector of "contexts" for each term, and then measure the similarity of these vectors, for example with cosine similarity.
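As a rough illustration of the context-vector idea, the sketch below counts the words that co-occur with each term inside a small window and compares two terms by cosine similarity. The window size, the toy corpus, and the function names `context_vectors` and `cosine` are assumptions made for this example. Practical systems use far more data and usually reweight or reduce the raw counts.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v.get(key, 0) for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for this example.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]
vecs = context_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]))
print(cosine(vecs["cat"], vecs["fuel"]))
```

Here "cat" and "dog" share the contexts "the" and "drinks", so they come out more similar than "cat" and "fuel", which share none. One simple way to go from terms to documents is to combine the vectors of a document's terms (for instance by summing them) and compare the results the same way.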

Measuring document similarity based on natural language is not easy, and a short answer here cannot do it justice, so I point you to this ACL paper, which, in my opinion, provides a pretty good picture.