使用哈希表缓慢找到最长的常见子序列算法

https://cs.stackexchange.com/questions/7851

16-10-2019
|

题

我设计了一种算法来找到最长的常见子序列。这些步骤：

以。。开始 i = 0

从Ith字母开始的第一个字符串开始的第一个字母。
转到第二个字符串，寻找那个挑选的字母。
如果找不到，请返回第一个字符串，然后选择下一个字母并重复1到3，直到找到第二个字符串中的字母为止。
现在，在第二个字符串中找到了一个公共字母，将其添加到 $common_subsequence.
将其位置存储在 $index.
从第一个字符串中挑选下一个字母并执行步骤2，但是这次从 $index.
重复3到6，直到到达字符串1或字符串2的末端。
如果长度 $common_subsequence 到目前为止，大于常见子序列的长度，补充说，将LCS更改为 $common_subsequence.
将1添加到i，重复1到9，而我的第一个字符串的长度则较小。

这是一个例子：
‫‪

X=A, B, C, B, D, A, B‬‬  
‫‪Y=B, D, C, A, B, A‬‬

第一次选择 A.
寻找 A 在 Y.
现在找到了 A 添加到 $common_subsequence.
然后选择 B 从 X.
寻找 B 在 Y 但是这次开始搜索 A.
现在选择 C. 。它不在字符串2中，所以选择下一个字母 X 那是 B.
...
...
...

该算法的复杂性是theta（n*m）。

我在两种方法上实现了它。第二个使用哈希表，但是在实施后，我发现它比第一个算法要慢得多。我不能解开为什么。

这是我的实施：

第一算法：

import time
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if string is empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        start = 0 # start position in ystr
        for item in xstr[i:]:
            index = ystr.find(item, start) # position at the common letter
            if index != -1: # if common letter has found
                cs += item # add common letter to the cs
                start = index + 1
            if index == len(ystr) - 1: break # if reached end of the ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

第二个使用哈希表：

import time
from collections import defaultdict
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if strings are empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    location = defaultdict(list) # keeps track of items in the ystr
    i = 0
    for k in ystr:
        location[k].append(i)
        i += 1
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        index = -1
        reached_index = defaultdict(int)
        for item in xstr[i:]:
            for new_index in location[item][reached_index[item]:]:
                reached_index[item] += 1
                if index < new_index:
                    cs += item # add item to the cs
                    index = new_index
                    break
            if index == len(ystr) - 1: break # if reached end of the ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

解决方案

一种优化是替换访问 reached_index[item] 通过局部变量，该变量是从哈希表初始化的 for item 循环，并在循环末端存放回哈希表。

将加快事物的另一种优化是用256号尺寸替换哈希表。

许可以下： CC-BY-SA 和归因

不隶属于 cs.stackexchange