Question

I have an array of items (~5000 items, each item is an English word) and a distance function between pairs of items. I want to find groups of items within the array where all the items within a group satisfy a distance criterion (e.g. every pair of items have a distance smaller than 2). The groups should generally be as large as possible, but there's no formal definition or hard requirement for this.

My implementation language is PHP, but I'm looking for general advice regarding algorithms that can handle this efficiently.

Update: I think I can solve this by building a graph where the vertices are the items, and there's an edge between items that satisfy the distance constraint. Once I build the graph I can run an algorithm like Bron–Kerbosch to list all the maximal cliques. I'll update if this works out but feel free to add your thoughts in the meantime.

Was it helpful?

Solution

How is this function defined? Is it pre-computed? If so you can iterate over the computed function representation and retrieve pairs of words based on your criterion. If not, you have no other option but to compute the function between all pairs of words (This is necessary in your graph approach). Instead of the Bron-Kerbosch algorithm, I would go for a randomization+approximation strategy like,

http://dimacs.rutgers.edu/Workshops/Challenge10/abstracts.html#ovelgonne

http://dl.acm.org/citation.cfm?id=1933306.1934471

It is based on an modularity maximization approach. Modularity is the ratio between number of outgoing edges in a cluster to the number of edges within the cluster. In your case, you will be looking for the clusters with ratio 0, and choose the biggest one of them. This algorithm is really fast and works for most datasets that I have worked with. Though a modularity based method might be overkill for this problem, IMHO I feel this is a nice way to think about the problem + the implementation of this algorithm is available online (C implementation by the authors of the paper).

OTHER TIPS

What I understand the problem is:

"A-G" stand for the words. "#" stands for the distance between the two words. And you need to find out the pair which the distance <= 2.

  A B C D E F G
A * # # # # # #
B   * # # # # #
C     * # # # #
D       * # # #
E         * # #
F           * #
G             *

Basically it needs 2-level loop, what we can do is to reduce comparison times, only compare "#" part in matrix above. Here is the code in PHP:

$result = array();
while ( ($word = array_shift($arrWordList)) !== NULL ) {
    foreach ($arrWordList as $otherWord ) {
        if ( calc_dist($word, $otherWord) <= 2 ) {
            $result[] = array($word, $otherWord);
        }
    }
}

And you can use $result to continue doing something.

Given that you want your criteria to apply to all members of a group, you need a clustering algorithm that allows for overlapping groups. That's a pretty complex subject, so the best I can do is direct you to the literature on clustering algorithms, and in particular C-Means Clustering which is designed for overlapping groups.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top