Question

I'm writing a piece of Java software that has to make the final judgement on the similarity of two documents encoded in UTF-8.

The two documents are very likely to be the same, or only slightly different, because they share many features (date, location, creator, etc.), but it is their text that decides whether they really are.

I expect the texts of the two documents to be either very similar or not similar at all, so I can set a rather strict similarity threshold. For example, I could declare the two documents similar only if they have 90% of their words in common, but I would like something more robust that works equally well for short and long texts.

To sum it up I have:

  • two documents that are either very similar or not similar at all, but:
  • it is more likely for the two documents to be similar than not
  • documents can be both long (several paragraphs) and short (a few sentences)

I've experimented with SimMetrics, which has a large array of string-matching functions, but I'm most interested in suggestions about which algorithms to use.

Possible candidates I have are:

  • Levenshtein: its output is more significant for short texts
  • overlap coefficient: maybe, but will it discriminate well between documents of different lengths? (sketched just below)
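
For reference, the overlap coefficient is |A ∩ B| / min(|A|, |B|) computed over the two word sets; dividing by the smaller set is what keeps the score comparable across input lengths. A minimal sketch in Java (the whitespace tokenization is an assumption; real text would need proper tokenization):

    import java.util.HashSet;
    import java.util.Set;

    public class OverlapCoefficient {
        // |A ∩ B| / min(|A|, |B|) over word sets: the score is 1.0 when one
        // document's vocabulary is contained in the other's, regardless of
        // how different the two documents are in length.
        static double overlap(String docA, String docB) {
            Set<String> a = new HashSet<>();
            Set<String> b = new HashSet<>();
            for (String w : docA.toLowerCase().split("\\s+")) a.add(w);
            for (String w : docB.toLowerCase().split("\\s+")) b.add(w);
            Set<String> common = new HashSet<>(a);
            common.retainAll(b);
            int smaller = Math.min(a.size(), b.size());
            return smaller == 0 ? 0.0 : (double) common.size() / smaller;
        }
    }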

Also, considering two texts similar only when they are exactly the same would not work, because I'd like documents that differ only by a few words to pass the similarity test.

Solution

Levenshtein distance is the standard measure for a reason: it is easy to compute and its meaning is easy to grasp. If you are wary of the number of characters in a long document, you can compute it over words, sentences, or even paragraphs instead of characters. Since you expect the similar pairs to be very similar, that should still work well.
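
A minimal sketch of that word-level variant (the whitespace tokenization and the two-row table are choices of this sketch, not prescribed by the algorithm):

    public class WordLevenshtein {
        // Standard dynamic-programming edit distance, but over word tokens
        // instead of characters, so long documents stay tractable and a
        // one-word substitution costs 1 regardless of word length.
        static int distance(String[] a, String[] b) {
            int[] prev = new int[b.length + 1];
            int[] curr = new int[b.length + 1];
            for (int j = 0; j <= b.length; j++) prev[j] = j;
            for (int i = 1; i <= a.length; i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length; j++) {
                    int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length];
        }

        public static void main(String[] args) {
            String[] a = "the quick brown fox".split("\\s+");
            String[] b = "the quick red fox".split("\\s+");
            System.out.println(distance(a, b)); // 1: one word substituted
        }
    }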

OTHER TIPS

Levenshtein is appropriate for the edit distance between two words; if you are comparing whole documents, something like diff will probably be closer to what you need.

I would start here: http://c2.com/cgi/wiki?DiffAlgorithm. They provide links to a number of diff-style algorithms you can look into.
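
At its core, a diff finds the longest common subsequence (LCS) of the two token streams. One rough way to turn that into a score - my own choice of normalization, not something from the linked page - is 2 · LCS / (|A| + |B|) over word tokens:

    public class LcsSimilarity {
        // Length of the longest common subsequence of word tokens, the same
        // quantity a line-based diff maximizes.
        static int lcs(String[] a, String[] b) {
            int[][] dp = new int[a.length + 1][b.length + 1];
            for (int i = 1; i <= a.length; i++)
                for (int j = 1; j <= b.length; j++)
                    dp[i][j] = a[i - 1].equals(b[j - 1])
                            ? dp[i - 1][j - 1] + 1
                            : Math.max(dp[i - 1][j], dp[i][j - 1]);
            return dp[a.length][b.length];
        }

        // Dice-style score in [0, 1]: 1.0 means identical token sequences.
        static double similarity(String docA, String docB) {
            String[] a = docA.split("\\s+");
            String[] b = docB.split("\\s+");
            return 2.0 * lcs(a, b) / (a.length + b.length);
        }
    }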

Levenshtein distance is typically used to compare two words. For whole documents, popular approaches are cosine similarity and Latent Semantic Analysis.
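
A minimal cosine-similarity sketch over raw term-frequency vectors (a production version would typically add TF-IDF weighting, stemming, and stop-word removal):

    import java.util.HashMap;
    import java.util.Map;

    public class CosineSimilarity {
        // Count how often each word occurs in the document.
        static Map<String, Integer> termFreq(String doc) {
            Map<String, Integer> tf = new HashMap<>();
            for (String w : doc.toLowerCase().split("\\s+"))
                tf.merge(w, 1, Integer::sum);
            return tf;
        }

        // Cosine of the angle between the two term-frequency vectors:
        // dot(A, B) / (|A| · |B|), in [0, 1] for non-negative counts.
        static double cosine(String docA, String docB) {
            Map<String, Integer> a = termFreq(docA), b = termFreq(docB);
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
                normA += e.getValue() * e.getValue();
            }
            for (int v : b.values()) normB += v * v;
            return (normA == 0 || normB == 0) ? 0
                    : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }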

Levenshtein seems to be the best solution here. If you are trying to get a weighted similarity ranking - which I guess is the case, since you mention that the output of Levenshtein is more significant for shorter texts - then just normalize the result of the Levenshtein algorithm by dividing it by the length of the longer of the two documents.
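
A sketch of that normalization, here with a plain character-level Levenshtein (the same DP as the word-level sketch above, run on characters); dividing by the length of the longer input bounds the score to [0, 1], so a single threshold works across document sizes:

    public class NormalizedLevenshtein {
        // Plain character-level Levenshtein distance (two-row DP).
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Dividing by the longer length bounds the score to [0, 1], so a
        // single threshold (say 0.9) applies to short and long texts alike.
        static double similarity(String a, String b) {
            int longer = Math.max(a.length(), b.length());
            return longer == 0 ? 1.0
                    : 1.0 - (double) levenshtein(a, b) / longer;
        }
    }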

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow