Question

I'm thinking this may be impossible to do resonably, but I figured I would take a shot at it. So lets say I have two NSStrings. One is @"Singin' In The Rain" and the other is @"Singing In The Rain". These strings are very similar, but have a small difference. I'm trying to find a way where I could write something like the following:

NSString *stringOne = @"Singin' In The Rain";
NSString *stringTwo = @"Singing In The Rain";

float dif = [stringOne differenceFrom:stringTwo];
//dif = .9634 or something like that

One project that I did find similar to this was taken from the previous similar question on Stack Overflow: Check if two NSStrings are similar. However, this simply returns a BOOL which isn't as accurate as I need it to be. I also tried looking into the compare: documentation for NSString but it all looked too basic. Another similar thing I found is at https://gist.github.com/iloveitaly/1515464. However, this gives varying results, even saying two of the same string are different occasionally. Any advice would be much appreciated.

Was it helpful?

Solution 2

Another off the wall suggestion:

The source, and hence the algorithm, for diff and similar programs is easily available. These compare input on a line-by-line basis and detect insertions, deletions and changes.

When comparing text strings for "closeness" then the insertion, deletion or changing of words seems as good a measure as any.

So:

  1. Break each string into "words" (white space separated should be sufficient).
  2. Compare the two lists using the diff algorithm, treating each "word" as a "line", use a re-sync length of 1 (the number of "lines" that need to be the same to treat the two inputs as back in sync)
  3. Calculate the "closeness" as the number of insertions/deletions/changes compared to the total word count.

For the two example strings this would give 1:4 changes or 75% similar.

If you want greater granularity for each change split the two words into characters and repeat the algorithm giving you a fraction the word is similar by (as opposed to the whole word).

For the two example strings this would give 3 6/7 words out of 4, or 96% similar.

OTHER TIPS

The question is a little vague, but I would assume that the most satisfactory results will come from using NSLinguisticTagger. If you parse each for tags with the NSLinguisticTagSchemeLexicalClass scheme then your string will be broken down into verbs, nouns, adjectives, etc. In your example, even if you weren't spotting that singin' and singing are the same, you'd spot the other three words are the same and that the thing at the end is a noun, so they're both about doing something in the same thing.

It'd probably be wise to use something like a BK-Tree to compare individual words where you suspect there may be a match (a noun obviously doesn't match an adverb but two nouns may match even if spellings differ).

I'd recommend dynamic time warping for such comparisons:

http://en.wikipedia.org/wiki/Dynamic_time_warping

This will however return distance between two strings (so you'll get 0 for identical), but this the best starting point I can think of.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top