You can use a bk-tree
, and before each item is added check that it is not within distance D of any others (thanks to @DietrichEpp in the comments for this idea.
You can use this recipe for a bk-tree (though any similar recipes are easily modified). Simply make two changes: change the line:
def __init__(self, items, distance, usegc=False):
to
def __init__(self, items, distance, threshold=0, usegc=False):
And change the line
if el not in self.nodes: # do not add duplicates
to
if (el not in self.nodes and
(threshold == None or len(self.find(el, threshold)) == 0)):
This makes sure there are no duplicates when an item is added. Then, the code to remove duplicates from a list is simply:
from Levenshtein import distance
from bktree import BKtree
def remove_duplicates(lst, threshold):
tr = BKtree(iter(lst), distance, threshold)
return tr.nodes.keys()
Note that this relies on the python-Levenshtein package for its distance function, which is much faster than the one provided by bk-tree. python-Levenshtein has C-compiled components, but it's worth the installation.
Finally, I set up a performance test with an increasing number of words (grabbed randomly from /usr/share/dict/words
) and graphed the amount of time each took to run:
import random
import time
from Levenshtein import distance
from bktree import BKtree
with open("/usr/share/dict/words") as inf:
word_list = [l[:-1] for l in inf]
def remove_duplicates(lst, threshold):
tr = BKtree(iter(lst), distance, threshold)
return tr.nodes.keys()
def time_remove_duplicates(n, threshold):
"""Test using n words"""
nwords = random.sample(word_list, n)
t = time.time()
newlst = remove_duplicates(nwords, threshold)
return len(newlst), time.time() - t
ns = range(1000, 16000, 2000)
results = [time_remove_duplicates(n, 3) for n in ns]
lengths, timings = zip(*results)
from matplotlib import pyplot as plt
plt.plot(ns, timings)
plt.xlabel("Number of strings")
plt.ylabel("Time (s)")
plt.savefig("number_vs_time.pdf")
Without confirming it mathematically, I don't think it's quadratic, and I think it might actually be n log n
, which would make sense if inserting into a bk-tree is a log time operation. Most notably, it runs pretty quickly with under 5000 strings, which hopefully is the OP's goal (and it's reasonable with 15000, which a traditional for loop solution would not be).