Question

I built a function that finds the longest common substring of two text files, trying window sizes in ascending order, based on the Rabin–Karp algorithm. The main function is "find_longest" and the inner functions are "make_hashtable", "extend_fingerprints" and "has_match". I'm having trouble analyzing the average-case complexity of has_match.

Denote by n1, n2 the lengths of text1 and text2, and by l the size of the current "window". fingers1 and fingers2 are the lists of fingerprints of the length-l windows of each text (has_match builds a hash table from fingers2).

def has_match(text1, text2, fingers1, fingers2, l, r):
    # bucket the positions of text2's windows by their fingerprint
    h = make_hashtable(fingers2, r)
    for i in range(len(fingers1)):
        # every position j in text2 whose window shares i's fingerprint
        for j in h[fingers1[i]]:
            if text1[i:i+l] == text2[j:j+l]:  # O(l) slice comparison
                return text1[i:i+l]
    return None
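
To make the average-case question concrete, here is a sketch of my own (the name has_match_counted is hypothetical, not part of the assignment) that also reports how many O(l) slice comparisons were performed, so the two options I list below can be checked empirically:

def has_match_counted(text1, text2, fingers1, fingers2, l, r):
    # identical logic to has_match, plus a comparison counter
    h = make_hashtable(fingers2, r)
    comparisons = 0
    for i in range(len(fingers1)):
        for j in h[fingers1[i]]:
            comparisons += 1  # one O(l) slice comparison per bucket entry
            if text1[i:i+l] == text2[j:j+l]:
                return text1[i:i+l], comparisons
    return None, comparisons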

this is "make_hashtable", here I'm pretty sure that the complexcity is O(n2-l+1):

def make_hashtable(fingers, table_size):
    # cell f collects the positions whose fingerprint equals f
    hash_table = [[] for i in range(table_size)]
    count = 0
    for f in fingers:
        hash_table[f].append(count)
        count += 1
    return hash_table
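
As a small illustration (my own toy example, not from the question), four fingerprints bucketed into a table of size 7; equal fingerprints land in the same cell:

fingers = [3, 5, 3, 0]
print(make_hashtable(fingers, 7))
# [[3], [], [], [0, 2], [], [1], []] -- positions 0 and 2 collide in cell 3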

this is "find_longest", im adding this function despite the fact that i dont need it for the complexity analyzing.

def find_longest(text1, text2, basis=2**8, r=2**17-1):
    match = ''
    l = 0  # initial "window" size
    # fingerprints of "windows" of size 0 - all are 0
    fingers1 = [0]*(len(text1)+1)
    fingers2 = [0]*(len(text2)+1)

    while match is not None:  # there was a common substring of len l
        l += 1
        extend_fingerprints(text1, fingers1, l, basis, r)
        extend_fingerprints(text2, fingers2, l, basis, r)
        match = has_match(text1, text2, fingers1, fingers2, l, r)
        print(match)
    return l-1

And this is "extend_fingerprints":

def extend_fingerprints(text, fingers, length, basis=2**8, r=2**17-1):
    count = 0
    for f in fingers:
        if count == len(fingers)-1:
            # the last window would run past the end of the text; drop it
            fingers.pop(len(fingers)-1)
            break
        # roll: extend the length-(length-1) fingerprint by one character
        fingers[count] = (f*basis + ord(text[length-1+count])) % r
        count += 1
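
As a sanity check (my own test, not part of the question), after the call with window size l, fingers[i] should equal the polynomial hash of text[i:i+l]; direct_fingerprint is a hypothetical helper:

def direct_fingerprint(s, basis=2**8, r=2**17-1):
    # Horner evaluation of the polynomial hash of the whole string s
    f = 0
    for ch in s:
        f = (f*basis + ord(ch)) % r
    return f

text = "abracadabra"
fingers = [0]*(len(text)+1)
for l in range(1, 4):
    extend_fingerprints(text, fingers, l)
    assert all(fingers[i] == direct_fingerprint(text[i:i+l])
               for i in range(len(fingers)))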

I'm torn between these two options:

1. O(n_2-l+1) + O(n_1-l+1)*O(l): treat r as a constant while n1, n2 are very large, so plenty of collisions occur in the hash table (say O(1) items in every cell, yet always some false positives), and every window of text1 triggers an O(l) slice comparison.

2. O(n_2-l+1) + O(n_1-l+1) + O(l): treat r as large enough for a decent hash function, so there are almost no collisions, which means that if two windows land in the same cell of the hash table we may assume they are actually the same text, and only a constant number of O(l) comparisons is ever performed?

Personally I lean towards the bolded option. Thanks.
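
For intuition, here is a back-of-the-envelope estimate of my own (it assumes the fingerprints behave like uniform hashes, which the original post does not claim): the expected number of bucket entries examined is about (n_1-l+1)*(n_2-l+1)/r, so for texts of a few thousand characters and r = 2**17-1 collisions are rare, while for much larger texts they dominate:

# expected slice comparisons under a uniform-hashing assumption
n1, n2, l, r = 2000, 2000, 20, 2**17-1
print((n1-l+1)*(n2-l+1)/r)  # about 30 -- option 2's regime at this size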


Solution

I think the answer is

O((n_2-l) + l*(n_1-l))

The (n_2-l) term represents the complexity of make_hashtable for the second text. The l*(n_1-l) term represents the two nested loops, which go through every item in the fingerprints of the first text and perform one O(l) comparison (a length-l slice) for each of the constant number m of items that share the same cell in the hash table.
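
One could check which regime actually holds with a quick experiment (my own sketch; it reuses the hypothetical has_match_counted from above and random lowercase texts, so a true length-20 match is effectively impossible):

import random, string

def count_comparisons(n1, n2, l, basis=2**8, r=2**17-1):
    # random texts, fingerprints for window size l, then count comparisons
    text1 = ''.join(random.choice(string.ascii_lowercase) for _ in range(n1))
    text2 = ''.join(random.choice(string.ascii_lowercase) for _ in range(n2))
    fingers1 = [0]*(n1+1)
    fingers2 = [0]*(n2+1)
    for length in range(1, l+1):
        extend_fingerprints(text1, fingers1, length, basis, r)
        extend_fingerprints(text2, fingers2, length, basis, r)
    _, comparisons = has_match_counted(text1, text2, fingers1, fingers2, l, r)
    return comparisons

print(count_comparisons(2000, 2000, 20))         # around 30 with r = 2**17-1
print(count_comparisons(2000, 2000, 20, r=101))  # tens of thousands with tiny r

Shrinking r pushes the count from option 2's near-zero collisions towards option 1's collision-heavy regime.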

Licensed under: CC-BY-SA with attribution