Question

I have this problem. I have a very large set (in millions) of key-value pairs with a certain unique id as a key and a string as a value (the strings might be exactly similar for 2 or more keys). I have to group these key-value pairs together as group 1 contains some id-string pairs group 2 contains some other pairs etc. The grouping needs to be done on the similarity between the strings which are actually values of the pairs. I have already implemented Levenshtein Distance between these strings and grouped the pairs with distance less than a threshold distance together. And I have implemented it the traditional (very bad) way: compare each string with every other.

I need some tips on how to optimize this. Can I actually group key-value pairs together using Map-Reduce in Hadoop? I think the input for map and reduce functions are individual and independent and hence can't be 'grouped' together. And is this a k-means clustering problem? Can you suggest some other faster and efficient techniques? Thanks.

Was it helpful?

Solution

Spell checkers make use of Burkhard-Keller trees (BK-Tree) an example is found here https://github.com/mkarlesky/csharp-bk-tree. This is very fast at testing a new word against the existing list, but also giving a "distance" measure that is based on the number of operations needed to change the string to the next. Unlike a simple "contains" test that gives you a boolean, this gives you a way to organize your available options. You can read more about it here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees. I suspect that you can use the distances to help with clustering.

I guess the main thing about bk trees is that you can continue to use the Levenshtein distance. But then maybe you were already using that? This technique isn't ideal for picking an arbitrary number of clusters like k-means does. But I did see an interesting article on taking advantage of some new parallel processing in the context of k-means that might help you speed things up in C# here:

http://www.codethinked.com/multi-threaded-k-means-clustering-in-net-40

The example isn't using strings, but I maybe the AsParallel concept will help with performance for the solution you already have?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top