Algorithm: String Similarity [closed]
03-07-2021
Question
I am trying to solve this challenge on InterviewStreet: https://www.interviewstreet.com/challenges/dashboard/#problem/4edb8abd7cacd
I already have a working algorithm, but I would like to improve its performance. Do you have any suggestions on how to do so?
    # Enter your code here. Read input from STDIN. Print output to STDOUT
    N = gets.to_i
    words = []
    while words.length < N do
      words << gets.sub(/\n$/, '').strip
    end
    words.each do |word|
      count = 0
      (word.length).times do |i|
        sub = word[i..-1]
        j = 0
        while j < sub.length && sub[j] == word[j] do
          count += 1
          j += 1
        end
      end
      puts count
    end
Thanks, Greg
Solution
Your algorithm is quadratic in the worst case (for example, on a string that consists of a single repeated character). For most normal words there is no quadratic behaviour, and it works well enough (thanks to its simplicity, it probably runs faster than more sophisticated algorithms with better worst-case behaviour).
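To make the worst case concrete, here is a minimal Python sketch of the same suffix-by-suffix approach that also counts character comparisons. The function name `naive_similarity` and the counter are illustrative additions, not part of the question's code.

```python
def naive_similarity(word):
    """Return (similarity, comparisons) for the suffix-by-suffix approach."""
    total = 0
    comparisons = 0
    n = len(word)
    for i in range(n):
        j = 0
        while i + j < n:
            comparisons += 1              # one character comparison
            if word[i + j] != word[j]:
                break
            total += 1                    # one more matching prefix character
            j += 1
    return total, comparisons


print(naive_similarity("ababaa"))         # (11, 15)
print(naive_similarity("a" * 1000))       # (500500, 500500)
```

For a string of n identical characters every comparison succeeds, so the count is n(n+1)/2, which is where the quadratic behaviour comes from.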
One algorithm with linear worst-case behaviour is the Z-algorithm. I don't speak much Ruby, so for the time being, the Python version will have to do:
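    def zarray(text):
        """Z[i] = length of the longest common prefix of text and text[i:]."""
        n = len(text)
        Z = [0] * n
        if n == 0:
            return Z                      # nothing to do for an empty string
        Z[0] = n
        # [left, right] is the prefix-substring window that extends farthest right
        left, right, i = 0, 0, 1
        while i < n:
            if i > right:
                # i lies outside the window: compare from scratch
                j, k = 0, i
                while k < n and text[j] == text[k]:
                    j += 1
                    k += 1
                Z[i] = j
                if j > 0:
                    left, right = i, i + j - 1
            else:
                z = Z[i - left]           # value at the corresponding prefix position
                s = right - i + 1         # characters of the window from i onwards
                if z < s:
                    Z[i] = z              # the match ends inside the window
                else:
                    # text[i .. right] matches; extend past the window
                    j, k = s, s + i
                    while k < n and text[j] == text[k]:
                        j += 1
                        k += 1
                    Z[i] = j
                    left, right = i, i + j - 1
            i += 1
        return Z

    def similarity(s):
        return sum(zarray(s))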
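One way to gain confidence in a Z-array implementation is to compare it against a direct, quadratic computation on many small random strings. The sketch below is only a test harness; the names `brute_force_z` and `find_mismatch` and their parameters are hypothetical helpers introduced for illustration.

```python
import random


def brute_force_z(s):
    """Z-array by direct comparison; quadratic, but easy to trust."""
    n = len(s)
    z = [0] * n
    for i in range(n):
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


def find_mismatch(candidate, trials=500, max_len=12, alphabet="ab", seed=0):
    """Return the first random string on which candidate disagrees, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(1, max_len)
        s = "".join(rng.choice(alphabet) for _ in range(length))
        if list(candidate(s)) != brute_force_z(s):
            return s
    return None


print(brute_force_z("ababaa"))            # [6, 0, 3, 0, 1, 1], which sums to 11
```

With `zarray` from above in scope, `find_mismatch(zarray)` would be expected to return `None`. A small alphabet is used on purpose so that long repeated prefixes, the interesting case, occur often.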
Explanation of the algorithm:
The idea is simple (but, like most good ideas, not easy to have). Let us call a (non-empty) substring that is also a prefix of the string a prefix-substring. To avoid recomputation, the algorithm uses a window of the prefix-substring starting before the currently considered index that extends farthest to the right (initially, the window is empty).
Variables used and invariants of the algorithm:
- i, the index under consideration, starts at 1 (for 0-based indexing; the entire string is not considered) and is incremented up to length - 1
- left and right, the first and last index of the prefix-substring window; invariants:
  - left < i, left <= right < length(S), either left > 0 or right < 1,
  - if left > 0, then S[left .. right] is the maximal common prefix of S and S[left .. ],
  - if 1 <= j < i and S[j .. k] is a prefix of S, then k <= right
- An array Z, invariant: for 1 <= k < i, Z[k] contains the length of the longest common prefix of S[k .. ] and S.
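The invariants above can be written down as executable assertions. The following is a sketch only: `lcp` and `check_invariants` are hypothetical helper names, and the checker recomputes everything by brute force, so it is meant for debugging small inputs rather than for use inside the linear algorithm.

```python
def lcp(s, start):
    """Length of the longest common prefix of s and s[start:]."""
    n = 0
    while start + n < len(s) and s[n] == s[start + n]:
        n += 1
    return n


def check_invariants(s, i, left, right, z):
    """Raise AssertionError if the invariants fail just before index i is processed."""
    assert left < i
    assert left <= right < len(s)
    assert left > 0 or right < 1
    if left > 0:
        # The window is exactly the maximal prefix match starting at left.
        assert lcp(s, left) == right - left + 1
    for j in range(1, i):
        m = lcp(s, j)
        if m > 0:
            # No prefix-substring starting before i reaches past right.
            assert j + m - 1 <= right
        # Z is already final for every index before i.
        assert z[j] == m


# State of the algorithm on "ababaa" just before i = 3 is processed:
check_invariants("ababaa", 3, 2, 4, [6, 0, 3, 0, 0, 0])
```

The call at the end passes silently; changing `right` to 5 would trip the window assertion, because the prefix match starting at index 2 has length 3, not 4.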
The algorithm:
1. Set i = 1, left = right = 0 (any values with left <= right < 1 are allowed), and set Z[j] = 0 for all indices 1 <= j < length(S).
2. If i == length(S), stop.
3. If i > right, find the length l of the longest common prefix of S and S[i .. ] and store it in Z[i]. If l > 0, we have found a window extending farther right than the previous one, so set left = i and right = i+l-1; otherwise leave them unchanged. Increment i and go to 2.
4. Here left < i <= right, so the substring S[i .. right] is known: since S[left .. right] is a prefix of S, it is equal to S[i-left .. right-left].
   Now consider the longest common prefix of S with the substring starting at index i - left. Its length is Z[i-left], hence S[k] = S[i-left + k] for 0 <= k < Z[i-left] and S[Z[i-left]] ≠ S[i-left + Z[i-left]]. Now, if Z[i-left] <= right-i, then i + Z[i-left] is inside the known window, therefore
       S[i + Z[i-left]] = S[i-left + Z[i-left]] ≠ S[Z[i-left]]
       S[i + k] = S[i-left + k] = S[k] for 0 <= k < Z[i-left]
   and we see that the longest common prefix of S and S[i .. ] has length Z[i-left]. Then set Z[i] = Z[i-left], increment i, and go to 2.
   Otherwise, S[i .. right] is a prefix of S and we check how far it extends, starting the comparison of characters at the indices right+1 and right+1 - i. Let the length be l. Set Z[i] = l, left = i, right = i + l - 1, increment i, and go to 2.
Since the window never moves left, and the comparisons always start after the end of the window, each character in the string is compared at most once successfully to an earlier character in the string, and for each starting index, there is at most one unsuccessful comparison, therefore the algorithm is linear.
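The counting argument above can be checked empirically with an instrumented variant of the algorithm. This is a sketch under the assumption that it follows the same case analysis as the steps above; the name `z_with_counts` and the two counters are illustrative additions.

```python
def z_with_counts(s):
    """Return (Z, successful, unsuccessful) character comparison counts."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z, 0, 0
    z[0] = n
    left = right = 0
    hits = misses = 0
    for i in range(1, n):
        if i > right:
            j = 0                          # nothing known, compare from scratch
        elif z[i - left] < right - i + 1:
            z[i] = z[i - left]             # the match ends inside the window
            continue
        else:
            j = right - i + 1              # s[i .. right] is known to match
        while i + j < n:
            if s[j] != s[i + j]:
                misses += 1                # at most one per starting index
                break
            hits += 1                      # position i + j is never compared again
            j += 1
        z[i] = j
        if j > 0:
            left, right = i, i + j - 1
    return z, hits, misses


print(z_with_counts("ababaa"))             # ([6, 0, 3, 0, 1, 1], 4, 3)
print(z_with_counts("a" * 1000)[1:])       # (999, 0)
```

Both counters stay at or below n - 1, so the total number of comparisons is at most 2(n - 1). On the all-'a' string of length 1000, that is 999 comparisons instead of the 500500 a suffix-by-suffix scan would need.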