Question

I have a long (> 1000 items) list of words, from which I would like to remove words that are "too similar" to other words, until the remaining words are all "significantly different". For example, so that no two words are within an edit distance D.

I do not need a unique solution, and it doesn't have to be exactly optimal, but it should be reasonably quick (in Python) and not discard way too many entries.

How can I achieve this? Thanks.

Edit: to be clear, I can google for a Python routine that measures edit distance. The problem is how to do this efficiently, and, perhaps, in some way that finds a "natural" value of D. Maybe by constructing some kind of trie from all words and then pruning?


Solution

You can use a bk-tree, and before each item is added, check that it is not within distance D of any item already in the tree (thanks to @DietrichEpp in the comments for this idea).

You can use this recipe for a bk-tree (though any similar recipe is easily modified). Simply make two changes. First, change the line:

def __init__(self, items, distance, usegc=False):

to

def __init__(self, items, distance, threshold=0, usegc=False):

And change the line

        if el not in self.nodes: # do not add duplicates

to

        if (el not in self.nodes and  # do not add duplicates
            (threshold is None or len(self.find(el, threshold)) == 0)):  # nor near-duplicates

This ensures that an item is added only if it is not already in the tree and not within threshold of any existing item. Then, the code to remove near-duplicates from a list is simply:

from Levenshtein import distance
from bktree import BKtree
def remove_duplicates(lst, threshold):
    # Words within `threshold` of an already-kept word are silently skipped by the tree.
    tr = BKtree(iter(lst), distance, threshold)
    return tr.nodes.keys()

Note that this relies on the python-Levenshtein package for its distance function, which is much faster than the one included with the bk-tree recipe. python-Levenshtein has C-compiled components, so installing it involves a build step, but it's well worth it.
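
For reference, a quick sanity check of that distance function (assuming python-Levenshtein is installed):

from Levenshtein import distance

print(distance("kitten", "sitting"))  # 3: two substitutions and one insertion
print(distance("hello", "hello"))     # 0: identical strings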


Finally, I set up a performance test with an increasing number of words (grabbed randomly from /usr/share/dict/words) and graphed the amount of time each took to run:

import random
import time
from Levenshtein import distance
from bktree import BKtree

with open("/usr/share/dict/words") as inf:
    word_list = [l[:-1] for l in inf]  # strip the trailing newline from each line

def remove_duplicates(lst, threshold):
    tr = BKtree(iter(lst), distance, threshold)
    return tr.nodes.keys()

def time_remove_duplicates(n, threshold):
    """Test using n words"""
    nwords = random.sample(word_list, n)
    t = time.time()
    newlst = remove_duplicates(nwords, threshold)
    return len(newlst), time.time() - t

ns = range(1000, 16000, 2000)
results = [time_remove_duplicates(n, 3) for n in ns]
lengths, timings = zip(*results)

from matplotlib import pyplot as plt

plt.plot(ns, timings)
plt.xlabel("Number of strings")
plt.ylabel("Time (s)")
plt.savefig("number_vs_time.pdf")

[Plot: number of strings vs. time in seconds]

Without confirming it mathematically, I don't think the growth is quadratic; it might actually be n log n, which would make sense if inserting into a bk-tree is a log-time operation. Most notably, it runs pretty quickly with under 5000 strings, which hopefully is the OP's goal (and it's reasonable with 15000, which a naive pairwise for-loop solution would not be).
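
If you want to check the growth rate empirically rather than eyeball it, one rough sketch (reusing the ns and timings collected above, and assuming numpy is available) is to fit a line in log-log space; a slope near 1 suggests roughly linear (or n log n) behaviour, while a slope near 2 would point to quadratic growth:

import numpy as np

# Fit log(time) = slope * log(n) + intercept; the slope estimates the growth exponent.
slope, intercept = np.polyfit(np.log(list(ns)), np.log(timings), 1)
print("empirical growth exponent: %.2f" % slope)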

OTHER TIPS

Tries will not be helpful, nor will hash maps. They are simply not useful for spatial, high-dimensional problems like this one.

But the real problem here is the ill-specified requirement of "efficient". How fast is "efficient"? As a baseline, here is the simplest greedy approach:

import Levenshtein

def simple(corpus, distance):
    """Greedy filter: keep the first remaining word, drop everything closer than `distance`, repeat."""
    words = []
    while corpus:
        center = corpus[0]
        words.append(center)
        corpus = [word for word in corpus
                  if Levenshtein.distance(center, word) >= distance]
    return words

I ran this on 10,000 words selected uniformly at random from the "American English" dictionary on my hard drive, looking for sets with a minimum distance of 5; it yielded around 2,000 entries.

real    0m2.558s
user    0m2.404s
sys     0m0.012s

So, the question is, "How efficient is efficient enough"? Since you didn't specify your requirements, it's really hard for me to know if this algorithm works for you or not.

The rabbit hole

If you want something faster, here's how I would do it.

Create a VP tree, BK tree, or other suitable spatial index. For each word in the corpus, insert that word into the tree if it has a suitable minimum distance from every word in the index. Spatial indexes are specifically designed to support this kind of query.

At the end, you will have a tree whose nodes are all at least the desired minimum distance apart.
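
To make that concrete, here is a minimal, self-contained sketch of the idea (class and function names are my own, it again uses Levenshtein.distance as the metric, and it is illustrative rather than production code):

from Levenshtein import distance

class SimpleBKTree:
    # Each node is a [word, children] pair; children maps an edit distance to a child node.
    def __init__(self):
        self.root = None

    def add_if_far(self, word, min_dist):
        """Insert word only if nothing already stored is within min_dist of it; return True if inserted."""
        if self.root is None:
            self.root = [word, {}]
            return True
        if self._has_close(self.root, word, min_dist):
            return False
        node = self.root
        while True:
            d = distance(word, node[0])
            if d == 0:
                return False  # exact duplicate
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return True
            node = child

    def _has_close(self, node, word, min_dist):
        d = distance(word, node[0])
        if d < min_dist:
            return True
        # Triangle inequality: only subtrees whose edge label is within min_dist - 1 of d can hold a closer word.
        for edge, child in node[1].items():
            if d - min_dist < edge < d + min_dist:
                if self._has_close(child, word, min_dist):
                    return True
        return False

def thin_out(words, min_dist):
    """Keep a subset of words that are pairwise at least min_dist apart (greedy, order-dependent)."""
    tree = SimpleBKTree()
    return [w for w in words if tree.add_if_far(w, min_dist)]

For example, thin_out(["hello", "hallo", "hullo", "world"], 2) keeps "hello" and "world" and drops the two near-duplicates.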

Your trie thought is definitely an interesting one. This page has a great setup for fast edit distance calculations in a trie and would definitely be efficient if you needed to expand your word list to millions rather than a thousand, which is pretty small in the corpus linguistics business.
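
As a rough sketch of that trie-based idea (my own illustrative code, not the code from the linked page): store the word list in a trie and compute one row of the Levenshtein dynamic-programming table per trie node, pruning any branch whose best possible score already exceeds the allowed distance:

def make_trie(words):
    """Nested-dict trie; a None key marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root

def trie_search(root, query, max_dist):
    """Return all stored words within max_dist of query, sharing DP rows across common prefixes."""
    first_row = list(range(len(query) + 1))
    results = []
    for ch, child in root.items():
        if ch is not None:
            _search(child, ch, query, first_row, max_dist, results)
    return results

def _search(node, ch, query, prev_row, max_dist, results):
    # Build the next row of the edit-distance table for this trie edge.
    row = [prev_row[0] + 1]
    for i, qch in enumerate(query, 1):
        insert_cost = row[i - 1] + 1
        delete_cost = prev_row[i] + 1
        replace_cost = prev_row[i - 1] + (0 if qch == ch else 1)
        row.append(min(insert_cost, delete_cost, replace_cost))
    if None in node and row[-1] <= max_dist:
        results.append(node[None])
    if min(row) <= max_dist:  # otherwise no deeper word can come back under the limit
        for next_ch, child in node.items():
            if next_ch is not None:
                _search(child, next_ch, query, row, max_dist, results)

Used this way, a single traversal (e.g. trie_search(make_trie(word_list), "hello", 2)) checks a query against every stored word while sharing the dynamic-programming work across common prefixes.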

Good luck, it sounds like a fun representation of the problem!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow