Another off-the-wall suggestion:
The source, and hence the algorithm, for diff and similar programs is easily available. These programs compare input on a line-by-line basis and detect insertions, deletions and changes.
When comparing text strings for "closeness", counting the insertions, deletions and changes of words seems as good a measure as any.
So:
- Break each string into "words" (white space separated should be sufficient).
- Compare the two lists using the diff algorithm, treating each "word" as a "line". Use a re-sync length of 1 (the number of "lines" that need to be the same to treat the two inputs as back in sync).
- Calculate the "closeness" as the number of insertions/deletions/changes compared to the total word count.
For the two example strings this would give 1 change in 4 words, or 75% similar.
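Here is a minimal sketch of the word-level steps in Python. It makes some assumptions: it uses the standard library's difflib.SequenceMatcher rather than the actual diff source (a different matching algorithm, but it also reports insertions, deletions and replacements, and it will re-sync on a single matching word). The function name word_similarity and the sample strings are made up for illustration. The "total word count" is taken as matched words plus changed words, which is one possible reading.

```python
from difflib import SequenceMatcher


def word_similarity(s1, s2):
    """Fraction of words left unchanged when turning s1 into s2."""
    a, b = s1.split(), s2.split()  # whitespace-separated "words"
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    same = changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            same += i2 - i1
        else:
            # insert, delete or replace: count the longer side of the block
            changed += max(i2 - i1, j2 - j1)
    total = same + changed  # assumption: stands in for "total word count"
    return same / total if total else 1.0


# Hypothetical strings: one changed word out of four
print(word_similarity("the quick brown fox", "the quick brown dog"))  # 0.75
```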
If you want greater granularity, split the two words involved in each change into characters and repeat the algorithm, which gives you the fraction by which the word is similar (rather than counting the whole word as changed).
For the two example strings this would give 3 6/7 words out of 4, or about 96% similar.
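The character-level refinement could be sketched along the same lines. Again this assumes difflib.SequenceMatcher as a stand-in for diff. The names char_similarity and refined_similarity are hypothetical, and the sample strings are invented so that one seven-letter word differs by a single character. Words in a replaced block are paired up by position, which is a simplification; inserted or deleted words score zero.

```python
from difflib import SequenceMatcher


def _opcodes(a, b):
    return SequenceMatcher(None, a, b, autojunk=False).get_opcodes()


def char_similarity(w1, w2):
    """Fraction of characters unchanged between two words."""
    same = changed = 0
    for tag, i1, i2, j1, j2 in _opcodes(w1, w2):
        if tag == "equal":
            same += i2 - i1
        else:
            changed += max(i2 - i1, j2 - j1)
    return same / (same + changed) if same + changed else 1.0


def refined_similarity(s1, s2):
    """Word-level similarity where a replaced word earns partial credit."""
    a, b = s1.split(), s2.split()
    score = 0.0
    slots = 0  # matched words plus changed words
    for tag, i1, i2, j1, j2 in _opcodes(a, b):
        span = max(i2 - i1, j2 - j1)
        slots += span
        if tag == "equal":
            score += span
        elif tag == "replace":
            # assumption: pair replaced words by position within the block
            score += sum(char_similarity(x, y)
                         for x, y in zip(a[i1:i2], b[j1:j2]))
    return score / slots if slots else 1.0


# Hypothetical strings: "similar" vs "simular" is 6/7 alike,
# so the result is (3 + 6/7) / 4, roughly 0.964
print(refined_similarity("these strings are similar",
                         "these strings are simular"))
```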