Question

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.

The Wikipedia article on edit distance gives some good background on the concept.

By taking "chunk transposition" into account, I mean that

Turing, Alan.

should match

Alan Turing

more closely than it matches

Turing Machine

I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.

The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).

Was it helpful?

Solution

Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF," the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.

OTHER TIPS

In the case of your application you should probably think about adapting some algorithms from bioinformatics.

For example you could firstly unify your strings by making sure, that all separators are spaces or anything else you like, such that you would compare "Alan Turing" with "Turing Alan". And then split one of the strings and do an exact string matching algorithm ( like the Horspool-Algorithm ) with the pieces against the other string, counting the number of matching substrings.

If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.

Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.

Well, that was a rather abstract answer, but I hope it points you in the right direction, but sadly it doesn't provide you with a simple formula to solve your problem.

I think you're looking for Jaro-Winkler distance which is precisely for name matching.

You might find compression distance useful for this. See an answer I gave for a very similar question.

Or you could use a k-tuple based counting system:

  1. Choose a small value of k, e.g. k=4.
  2. Extract all length-k substrings of your string into a list.
  3. Sort the list. (O(knlog(n) time.)
  4. Do the same for the other string you're comparing to. You now have two sorted lists.
  5. Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
  6. The number of k-tuples in common is your similarity score.

With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.

One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "The compressed size of "A concatenated with B". Next, compute the fraction

(C(AB) - min(C(A),C(B))) / max(C(A), C(B))

This value is called NCD(A,B) and measures similarity similar to edit distance but supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style similarity that you are describing. If two strings are similar the compressed size of the concatenation will be near the size of each alone so the numerator will be near 0 and the result will be near 0. If two strings are very dissimilar the compressed size together will be roughly the sum of the compressed sizes added and so the result will be near 1. This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression program like zlib. It is because most of the "hard" work such as heuristics and optimization has already been done in the data compression part and this formula simply extracts the amount of similar patterns it found using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few hundred byte size range you describe. For more information on this and a sample implementation just search Normalized Compression Distance (NCD) or have a look at the following paper and github project:

http://arxiv.org/abs/cs/0312044 "Clustering by Compression"

https://github.com/rudi-cilibrasi/libcomplearn C language implementation

There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.

I'm not sure that what you really want is edit distance -- which works simply on strings of characters -- or semantic distance -- choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to distinguish which is the most appropriate matching term/phrase given a specific term or phrase. In a sense what you're doing is comparing very short documents rather than strings of characters.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top