Algorithm: String Similarity [closed]

https://stackoverflow.com/questions/12572786

03-07-2021

Question

I am trying to solve this challenge on InterviewStreet: https://www.interviewstreet.com/challenges/dashboard/#problem/4edb8abd7cacd

I already have a working algorithm, but I would like to improve its performance. Do you have any suggestions on how to do so?

# Enter your code here. Read input from STDIN. Print output to STDOUT
N = gets.to_i
words = []

while words.length < N do
  words << gets.sub(/\n$/, '').strip
end 

words.each do |word|
  count = 0
  (word.length).times do |i|
    sub = word[i..-1]
    j=0
    while j < sub.length && sub[j] == word[j] do
      count += 1 
      j+=1
    end
  end
  puts count
end

Thanks, Greg

Solution

Your algorithm is quadratic in the worst case, for example on a string that consists of a single repeated character. For most ordinary words that behaviour does not occur, and it works well enough (thanks to its simplicity, it probably runs faster than more sophisticated algorithms with better worst-case behaviour).
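To make the worst case concrete, here is a minimal Python sketch of the same suffix-by-suffix scan with a comparison counter added. The name `naive_similarity` and the counter are illustrative assumptions, not part of the question's code.

```python
def naive_similarity(s):
    """Return (similarity, comparisons) using the direct suffix-by-suffix scan."""
    n = len(s)
    total = 0
    comparisons = 0
    for i in range(n):
        j = 0
        while i + j < n:
            comparisons += 1          # one character comparison
            if s[i + j] != s[j]:
                break
            j += 1
        total += j                    # j = common prefix length of s and s[i:]
    return total, comparisons


print(naive_similarity("ababaa"))
n = 1000
print(naive_similarity("a" * n), n * (n + 1) // 2)
```

On a string of n identical characters every comparison succeeds, so the scan starting at index i performs n - i comparisons and the total is n(n+1)/2.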

One algorithm with linear worst-case behaviour is the Z-algorithm. I don't speak much Ruby, so for the time being, a Python version will have to do:

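def zarray(text):
    """Return Z, where Z[i] is the length of the longest common prefix of text and text[i:]."""
    n = len(text)
    if n == 0:
        return []  # nothing to do for the empty string
    Z = [0] * n
    Z[0] = n  # the whole string is trivially a prefix of itself
    left, right, i = 0, 0, 1
    while i < n:
        if i > right:
            # i lies outside the window: compare from the start of the string
            j, k = 0, i
            while k < n and text[j] == text[k]:
                j += 1
                k += 1
            Z[i] = j
            if j > 0:
                left, right = i, i + j - 1
        else:
            z = Z[i - left]
            s = right - i + 1  # number of characters of text[i:] already known
            if z < s:
                # the match ends inside the window, so the old value carries over
                Z[i] = z
            else:
                # the match reaches the window's end: continue comparing past it
                j, k = s, s + i
                while k < n and text[j] == text[k]:
                    j += 1
                    k += 1
                Z[i] = j
                left, right = i, i + j - 1
        i += 1
    return Z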

def similarity(s):
    return sum(zarray(s))
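
For a quick end-to-end check, the sketch below feeds input shaped like what the question's Ruby code reads (a count on the first line, then one word per line) through a Z-array. To keep the block self-contained it uses a compact, commonly seen formulation with a half-open window `[l, r)` rather than the code above. The names `z_function` and `solve` and the exact input format are assumptions.

```python
def z_function(s):
    """Compact Z-array variant; z[i] = longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    l = r = 0                               # s[l:r] is the current window
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])     # reuse what the window already tells us
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                       # extend by direct comparison
        if i + z[i] > r:
            l, r = i, i + z[i]              # window now reaches farther right
    return z


def solve(text):
    """Assumed input: first line is the word count, then one word per line."""
    lines = text.split("\n")
    count = int(lines[0])
    return [sum(z_function(word.strip())) for word in lines[1:count + 1]]


print(solve("2\nababaa\naa\n"))
```

For "ababaa" the Z-array is [6, 0, 3, 0, 1, 1], which sums to 11; for "aa" it is [2, 1], which sums to 3.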

Explanation of the algorithm:

The idea is simple (but, like most good ideas, not easy to have). Let us call a (non-empty) substring that is also a prefix of the string a prefix-substring. To avoid recomputation, the algorithm keeps a window: of all prefix-substrings starting before the currently considered index, the one that extends farthest to the right (initially, the window is empty).

Variables used and invariants of the algorithm:

i, the index under consideration, starts at 1 (for 0-based indexing; the entire string is not considered) and is incremented to length - 1
left and right, the first and last index of the prefix-substring window; invariants:
1. left < i, left <= right < length(S), either left > 0 or right < 1,
2. if left > 0, then S[left .. right] is the maximal common prefix of S and S[left .. ],
3. if 1 <= j < i and S[j .. k] is a prefix of S, then k <= right
An array Z, invariant: for 1 <= k < i, Z[k] contains the length of the longest common prefix of S[k .. ] and S.
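
The invariants can be checked mechanically. The sketch below runs the algorithm on every string over the alphabet "ab" up to a small length and asserts the invariants at the top of each iteration and once more after the loop. The helpers `lcp`, `check_invariants`, `zarray_checked` and `check_all` are hypothetical names, and the slicing makes this version slower than linear; it is meant for verification only.

```python
from itertools import product


def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def check_invariants(S, Z, i, left, right):
    n = len(S)
    # 1. position of the window relative to i and to the string
    assert left < i and left <= right < n and (left > 0 or right < 1)
    # 2. a non-empty window is a maximal prefix-substring
    if left > 0:
        assert right - left + 1 == lcp(S, S[left:])
    for j in range(1, i):
        l = lcp(S, S[j:])
        assert Z[j] == l                       # Z is correct below i
        assert l == 0 or j + l - 1 <= right    # 3. nothing reaches past right


def zarray_checked(S):
    """Z-array for a non-empty S, asserting the invariants along the way."""
    n = len(S)
    Z = [0] * n
    Z[0] = n
    left, right, i = 0, 0, 1
    while i < n:
        check_invariants(S, Z, i, left, right)
        if i > right:
            l = lcp(S, S[i:])
            Z[i] = l
            if l > 0:
                left, right = i, i + l - 1
        else:
            z = Z[i - left]
            known = right - i + 1              # characters of S[i:] already known
            if z < known:
                Z[i] = z
            else:
                l = known + lcp(S[known:], S[right + 1:])
                Z[i] = l
                left, right = i, i + l - 1
        i += 1
    check_invariants(S, Z, i, left, right)
    return Z


def check_all(max_len):
    """Run the checked algorithm on all strings over 'ab' of length 1..max_len."""
    count = 0
    for n in range(1, max_len + 1):
        for chars in product("ab", repeat=n):
            zarray_checked("".join(chars))
            count += 1
    return count


print(check_all(8))
```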

The algorithm:

1. Set i = 1, left = right = 0 (any values with left <= right < 1 are allowed), and set Z[j] = 0 for all indices 1 <= j < length(S).
2. If i == length(S), stop.
3. If i > right, find the length l of the longest common prefix of S and S[i .. ] and store it in Z[i]. If l > 0, we have found a window extending farther right than the previous one, so set left = i and right = i+l-1; otherwise leave them unchanged. Increment i and go to 2.
4. Here left < i <= right, so the substring S[i .. right] is known: since S[left .. right] is a prefix of S, it is equal to S[i-left .. right-left].

Now consider the longest common prefix of S with the substring starting at index i - left. Its length is Z[i-left], hence S[k] = S[i-left + k] for 0 <= k < Z[i-left] and
S[Z[i-left]] ≠ S[i-left+Z[i-left]]. Now, if Z[i-left] <= right-i, then i + Z[i-left] is inside the known window, therefore
```
S[i + Z[i-left]] = S[i-left + Z[i-left]] ≠ S[Z[i-left]]
S[i + k]         = S[i-left + k]         = S[k]   for 0 <= k < Z[i-left]
```
and we see that the longest common prefix of S and S[i .. ] has length Z[i-left]. Then set Z[i] = Z[i-left], increment i, and go to 2.

Otherwise, S[i .. right] is a prefix of S and we check how far it extends, starting the comparison of characters at the indices right+1 and right+1 - i. Let the length be l. Set Z[i] = l, left = i, right = i + l - 1, increment i, and go to 2.
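
As a worked example of the steps above, this sketch records which case applies at each index. The labels "scan" (i lies outside the window), "copy" (the value is taken over from Z[i-left]) and "extend" (comparison continues past the window) and the function name `z_trace` are made up for illustration.

```python
def z_trace(S):
    """Return a list of (i, case, Z[i], left, right) tuples, one per index i >= 1."""
    n = len(S)
    Z = [0] * n
    if n:
        Z[0] = n
    left = right = 0
    trace = []
    for i in range(1, n):
        if i > right:
            case = "scan"
            j, k = 0, i                     # compare from the start of S
        else:
            z = Z[i - left]
            known = right - i + 1
            if z < known:
                Z[i] = z                    # match ends inside the window
                trace.append((i, "copy", z, left, right))
                continue
            case = "extend"
            j, k = known, right + 1         # resume just past the window
        while k < n and S[j] == S[k]:
            j += 1
            k += 1
        Z[i] = j
        if j > 0:
            left, right = i, i + j - 1
        trace.append((i, case, j, left, right))
    return trace


for row in z_trace("ababaa"):
    print(row)
```

For "ababaa" the window becomes [2, 4] at i = 2, index 3 is answered by copying Z[1] = 0, and index 4 has to extend because Z[2] = 3 reaches the end of the known part.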

Since the window never moves left and the comparisons always start after the end of the window, each character in the string is compared successfully to an earlier character at most once, and each starting index incurs at most one unsuccessful comparison. The total number of character comparisons is therefore less than 2 * length(S), so the algorithm is linear.
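
The counting argument can be checked empirically. The sketch below tallies successful and unsuccessful comparisons separately; `zarray_counting` and `check_bounds` are illustrative names, and the exhaustive check over short strings on the alphabet "ab" is only a sanity test, not a proof.

```python
from itertools import product


def zarray_counting(S):
    """Return (Z, hits, misses): the Z-array plus comparison counts."""
    n = len(S)
    if n == 0:
        return [], 0, 0
    Z = [0] * n
    Z[0] = n
    hits = misses = 0
    left = right = 0
    for i in range(1, n):
        if i > right:
            j, k = 0, i
        else:
            z = Z[i - left]
            known = right - i + 1
            if z < known:
                Z[i] = z                # no comparison needed at all
                continue
            j, k = known, right + 1
        while k < n:
            if S[j] == S[k]:
                hits += 1               # successful: the window will grow to k
                j += 1
                k += 1
            else:
                misses += 1             # at most one failure per index i
                break
        Z[i] = j
        if j > 0:
            left, right = i, i + j - 1
    return Z, hits, misses


def check_bounds(max_len):
    """Assert hits <= n - 1 and misses <= n - 1 for all short strings over 'ab'."""
    for n in range(1, max_len + 1):
        for chars in product("ab", repeat=n):
            _, hits, misses = zarray_counting("".join(chars))
            assert hits <= n - 1 and misses <= n - 1
    return True


print(zarray_counting("ababaa"))
print(zarray_counting("a" * 1000)[1:])
print(check_bounds(10))
```

On a string of 1000 identical characters this performs 999 successful comparisons and no unsuccessful ones, because every index after the first falls inside the window.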

Licensed under: CC-BY-SA with attribution
