How to get a % difference of two NSStrings

Question 1

Another off-the-wall suggestion:

The source, and hence the algorithm, for diff and similar programs is easily available. These compare input on a line-by-line basis and detect insertions, deletions and changes.

When comparing text strings for "closeness", the insertion, deletion or changing of words seems as good a measure as any.

So:

Break each string into "words" (white space separated should be sufficient).
Compare the two lists using the diff algorithm, treating each "word" as a "line". Use a re-sync length of 1 (the number of "lines" that need to be the same to treat the two inputs as back in sync).
Calculate the "closeness" as the number of insertions/deletions/changes compared to the total word count.
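The three steps above can be sketched roughly as follows. This is only an illustration in Python, not the diff source itself: it leans on the standard library's difflib.SequenceMatcher as a stand-in for the diff algorithm, and the function name word_similarity and the choice to divide by the longer word count are my own assumptions.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Fraction of words left unchanged between two strings (0.0 to 1.0)."""
    # Step 1: break each string into whitespace-separated "words".
    words_a, words_b = a.split(), b.split()
    total = max(len(words_a), len(words_b))  # assumption: use the longer string
    if total == 0:
        return 1.0  # two empty strings are treated as identical

    # Step 2: diff the two word lists, each word playing the role of a "line".
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)

    # Step 3: count the words touched by insertions, deletions and changes.
    changes = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes += max(i2 - i1, j2 - j1)

    # Clamp at zero: a delete followed by an insert can count more than `total`.
    return max(0.0, 1.0 - changes / total)
```

With a hypothetical pair such as "singing in the rain" and "singin' in the rain", one word out of four changes, so the function returns 0.75.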

For the two example strings this would give 1 change in 4 words, or 75% similar.

If you want greater granularity, then for each changed word split the two words into characters and repeat the algorithm. This gives you the fraction by which the word is similar, instead of counting the whole word as different.

For the two example strings this would give 3 6/7 matching words out of 4, or about 96% similar.
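A sketch of that character-level refinement, again in Python with difflib.SequenceMatcher standing in for the diff algorithm. The names fraction_same and refined_similarity are hypothetical, and pairing changed words up in order is an assumption on my part; the answer does not say how to pair them.

```python
from difflib import SequenceMatcher


def fraction_same(a, b) -> float:
    """Fraction of items unchanged between two sequences (words or characters)."""
    total = max(len(a), len(b))
    if total == 0:
        return 1.0
    opcodes = SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
    changes = sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in opcodes
        if tag != "equal"
    )
    return max(0.0, 1.0 - changes / total)


def refined_similarity(a: str, b: str) -> float:
    """Word-level diff where each changed word earns partial credit."""
    words_a, words_b = a.split(), b.split()
    total = max(len(words_a), len(words_b))
    if total == 0:
        return 1.0

    score = 0.0
    opcodes = SequenceMatcher(None, words_a, words_b, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            score += i2 - i1  # unchanged words count in full
        elif tag == "replace":
            # Assumption: pair changed words in order and compare them
            # character by character; unpaired extras score nothing.
            for word_a, word_b in zip(words_a[i1:i2], words_b[j1:j2]):
                score += fraction_same(word_a, word_b)
        # Pure insertions and deletions contribute nothing to the score.
    return score / total
```

For the hypothetical pair "singing in the rain" and "singin' in the rain", the changed word keeps 6 of its 7 characters, so the score is (3 + 6/7) / 4, roughly 0.96.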

Question 2

The question is a little vague, but I would assume that the most satisfactory results will come from using NSLinguisticTagger. If you parse each string for tags with the NSLinguisticTagSchemeLexicalClass scheme, each string will be broken down into verbs, nouns, adjectives, etc. In your example, even if you weren't spotting that singin' and singing are the same word, you'd spot that the other three words are the same and that the word at the end is a noun, so both strings are about doing something in the same thing.

It'd probably be wise to use something like a BK-tree to compare individual words where you suspect there may be a match (a noun obviously doesn't match an adverb, but two nouns may match even if their spellings differ).
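For anyone unfamiliar with BK-trees, here is a minimal sketch in Python. It assumes Levenshtein edit distance as the metric, which is the usual choice but not something stated above, and the BKTree class and its method names are hypothetical. The tagging step itself is not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # delete ch_a
                           cur[j - 1] + 1,                 # insert ch_b
                           prev[j - 1] + (ch_a != ch_b)))  # substitute
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, children), children keyed by distance to the word."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word: str, max_distance: int):
        """Return sorted (distance, word) pairs within max_distance of word."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            # The triangle inequality lets us skip every other subtree.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)
```

Building the tree from the nouns (or verbs, etc.) of one string and searching it with each same-class word from the other would then flag near matches such as singin' and singing, which are one edit apart.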

Question 3

I'd recommend dynamic time warping for such comparisons:

http://en.wikipedia.org/wiki/Dynamic_time_warping

This will, however, return the distance between two strings (so you'll get 0 for identical strings) rather than a percentage, but this is the best starting point I can think of.
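A rough sketch of dynamic time warping applied to two strings, in Python. It assumes a per-character cost of 0 for a match and 1 for a mismatch, which is my own choice rather than anything prescribed by the algorithm, and the name dtw_distance is hypothetical.

```python
import math


def dtw_distance(a: str, b: str) -> float:
    """Dynamic time warping distance between two strings, per character."""
    if not a or not b:
        # Nothing to align against: only two empty strings count as identical.
        return 0.0 if a == b else math.inf

    m = len(b)
    # prev[j] is the cheapest alignment of the characters consumed so far
    # in `a` with the first j characters of `b`.
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for ch_a in a:
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = 0.0 if ch_a == b[j - 1] else 1.0  # assumed 0/1 cost
            cur[j] = cost + min(prev[j],      # ch_a shares b[j-1] with the previous character of a
                                cur[j - 1],   # ch_a also covers b[j-2]
                                prev[j - 1])  # both strings advance together
        prev = cur
    return prev[m]
```

Note that warping absorbs repeated characters, so dtw_distance("aab", "ab") is 0 even though the strings differ. Whether that is desirable depends on what "different" should mean for your strings.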