Efficient algorithm for edit distance for short sequences
-
04-11-2019 - |
문제
I have an application that needs to compute billions of levenshtein distance between pairs of strings. The strings are short (70 in length) DNA sequences, consisting only of 4 characters. Also it can be assumed that one of the strings is fixed, i.e., we are comparing one fixed string to a billion other strings.
I know that the dynamic programming implementation of the levenshtein distance is $\mathcal{O}(m n)$, would like to know if there are any room for improvement. I found these two algorithms:
- $\mathcal{O}(n + d^2)$ algorithm, in which $d$ is the edit distance by Berghel et al. However I can't make the assumption that $d$ is small so it might not give any advantage
- $log(n)^{\mathcal{O}(1/\epsilon)}$ approximation in $n^{1+\epsilon}$ time by Andoni et al. But I have two concerns regarding this:
- Is this algorithm also fast in practice?
- Does $log(n)^{\mathcal{O}(1/\epsilon)}$ mean the computed edit distance in worst case is $log(n)^{\mathcal{O}(1/\epsilon)}$ times the actual one? In that case it's too much.
Do you know of any other algorithm/idea/approach that might be applicable?
올바른 솔루션이 없습니다
제휴하지 않습니다 cs.stackexchange