Problem

I have an application that needs to compute billions of Levenshtein distances between pairs of strings. The strings are short DNA sequences (length 70) over a 4-character alphabet. It can also be assumed that one of the strings is fixed, i.e., we are comparing one fixed string against a billion other strings.

I know that the dynamic programming implementation of the Levenshtein distance is $\mathcal{O}(mn)$, and I would like to know whether there is any room for improvement. I found these two algorithms:
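For concreteness, the $\mathcal{O}(mn)$ dynamic program I mean is the textbook one; a minimal two-row sketch in Python (function name chosen here for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic O(m*n) time, O(min(m, n)) space edit distance."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the inner row
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # D[i][0] = i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution / match
        prev = cur
    return prev[-1]
```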

  • The $\mathcal{O}(n + d^2)$ algorithm by Berghel et al., where $d$ is the edit distance. However, I can't assume that $d$ is small, so it might not give any advantage.
  • The $\log(n)^{\mathcal{O}(1/\epsilon)}$-approximation in $n^{1+\epsilon}$ time by Andoni et al. But I have two concerns regarding this:
    • Is this algorithm also fast in practice?
    • Does $\log(n)^{\mathcal{O}(1/\epsilon)}$ mean that, in the worst case, the computed edit distance is $\log(n)^{\mathcal{O}(1/\epsilon)}$ times the actual one? In that case it's too much.
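For context on the first bullet: bounds of that flavor come from restricting the DP to a diagonal band, which pays off only when a small distance threshold is acceptable. A minimal Ukkonen-style cutoff sketch (threshold `k` and function name are my own choices; it returns the exact distance when it is at most `k`, and `k + 1` otherwise):

```python
def banded_levenshtein(a: str, b: str, k: int) -> int:
    """Exact edit distance if it is <= k, else k + 1.
    Only DP cells with |i - j| <= k are computed: O(k * min(m, n)) time."""
    m, n = len(a), len(b)
    if abs(m - n) > k:
        return k + 1  # length difference alone forces distance > k
    INF = k + 1  # saturating value for cells outside the band
    prev = [j if j <= k else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        lo, hi = max(1, i - k), min(n, i + k)
        cur = [INF] * (n + 1)
        cur[0] = i if i <= k else INF
        for j in range(lo, hi + 1):
            cur[j] = min(prev[j] + 1,
                         cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return min(prev[n], k + 1)
```

Any optimal alignment with cost $d \le k$ stays within the band, so the cutoff never changes a result that is within the threshold.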

Do you know of any other algorithm/idea/approach that might be applicable?
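One direction that seems to fit these parameters: with one fixed string of length 70 over a 4-letter alphabet, bit-parallel methods apply, since the whole DP column fits in two machine words. Below is a sketch of Myers' (1999) bit-vector algorithm for the exact distance (not an approximation); the function name is mine, and Python's unbounded integers stand in for the word-sized bit vectors of the original:

```python
def myers_distance(pattern: str, text: str) -> int:
    """Exact edit distance via Myers' bit-parallel algorithm.
    Each text character costs a constant number of bitwise ops
    per ceil(m / wordsize) words; here one Python int holds all m bits."""
    m = len(pattern)
    mask = (1 << m) - 1
    top = 1 << (m - 1)
    # per-character match masks: bit i set iff pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = mask, 0, m  # +1/-1 vertical deltas, current distance
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) & mask) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 positions
        mh = pv & xh                    # horizontal -1 positions
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        ph = ((ph << 1) | 1) & mask     # carry-in 1: row 0 increases by column
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score
```

The match masks `peq` depend only on the fixed string, so they can be built once and reused across all billion comparisons.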

No correct solution

License: CC-BY-SA with attribution
Not affiliated with cs.stackexchange