Question

I read the article Retiring a Great Interview Problem, in which the author presents a word break problem and gives three solutions. The efficient one uses memoization, and the author says its worst-case time complexity is O(n^2), since the key insight is that SegmentString is only called on suffixes of the original input string, and there are only O(n) suffixes.

However, I find it difficult to understand why it is O(n^2). Could someone please give me a hint or a proof?

Word Break Problem: 
    Given an input string and a dictionary of words,
    segment the input string into a space-separated
    sequence of dictionary words if possible. For
    example, if the input string is "applepie" and
    dictionary contains a standard set of English words,
    then we would return the string "apple pie" as output.

The memoization algorithm from Retiring a Great Interview Problem

Map<String, String> memoized = new HashMap<>();

String SegmentString(String input, Set<String> dict) {
      if (dict.contains(input)) 
          return input;
      if (memoized.containsKey(input)) {
          return memoized.get(input);
      }
      int len = input.length();
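      // Try every split point: the prefix must be a dictionary word
      // and the remaining suffix must itself be segmentable.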
      for (int i = 1; i < len; i++) {
          String prefix = input.substring(0, i);
          if (dict.contains(prefix)) {
              String suffix = input.substring(i, len);
              String segSuffix = SegmentString(suffix, dict);
              if (segSuffix != null) {
                  return prefix + " " + segSuffix;
              }
          }
      }
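      // No split worked: cache the failure so this suffix is never re-examined.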
      memoized.put(input, null);
      return null;
}

Solution

Intuition

(It is clear that the worst case is one where no segmentation exists, so I focus on that.)

Since the recursive call is made before putting values into the memoization cache, the last (shortest) suffixes get cached first. This is because the function is first called with a string of length N, which then calls itself with a string of length N-1, which then ..., down to a string of length 1 (the loop never produces an empty suffix), which is cached and returns; then the string of length 2 is cached and returns, ..., until the string of length N is cached and returns.
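
To see this concretely, here is a minimal, self-contained harness (my own sketch, not from the article): it runs the algorithm on the worst-case input "aaab" with dict = {"a"}, using a LinkedHashMap so the cache's insertion order can be printed afterwards.

import java.util.*;

public class CacheOrder {
    // LinkedHashMap remembers insertion order, so we can inspect it later.
    static Map<String, String> memoized = new LinkedHashMap<>();

    static String SegmentString(String input, Set<String> dict) {
        if (dict.contains(input)) return input;
        if (memoized.containsKey(input)) return memoized.get(input);
        for (int i = 1; i < input.length(); i++) {
            String prefix = input.substring(0, i);
            if (dict.contains(prefix)) {
                String segSuffix = SegmentString(input.substring(i), dict);
                if (segSuffix != null) return prefix + " " + segSuffix;
            }
        }
        memoized.put(input, null);
        return null;
    }

    public static void main(String[] args) {
        SegmentString("aaab", new HashSet<>(Arrays.asList("a")));
        // Prints [b, ab, aab, aaab]: the shortest suffix is cached first.
        System.out.println(memoized.keySet());
    }
}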

As the hint suggests, only suffixes get cached, and there are only N of them. This means that by the time the top-level function gets the result of its first recursive call (i.e. on a suffix of length N-1), the cache is already populated with the N-1 suffixes.

Now, assuming the last N-1 suffixes are already cached, the for-loop makes at most N-1 recursive calls, each taking O(1) (since the answer is already cached), for a total of O(N). However, (pre-)building the cache of the last N-1 suffixes takes O(N^2) (explained below), for a total of O(N) + O(N^2) = O(N^2).
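
In other words, in the worst case the cost obeys the recurrence f(N) = f(N-1) + a*N for some constant a (f(N-1) for the first recursive call, a*N for the cache-hit loop), which unrolls to a quadratic sum:

f(N) = f(N-1) + a*N
     = f(N-2) + a*(N-1) + a*N
     = ...
     = f(1) + a*(2 + 3 + ... + N)
     = O(N^2)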


Proof by mathematical induction

This explanation can easily be translated to a formal proof using induction. Here is the gist of it:

(f(N) being the number of operations required for the function to complete on an input of length N)

Induction hypothesis -- there exists a constant c s.t. f(N) < c*N^2.

The base case is trivial -- for a string of length 1, you can find a constant c such that f(1) < c*1^2 = c

Induction step

Recap of the order in which things happen, for an input of length N+1:

Step 1: the function first calls itself on the suffix of length N, building a cache containing the answers for the last N suffixes (by the hypothesis, this costs at most f(N)).

Step 2: the function then makes O(N) more recursive calls, each taking O(1) (thanks to the test if (memoized.containsKey(input)), and the fact that the cache has already been populated in Step 1).

So we get:

f(N+1) = f(N) + a*N
       < c*N^2 + a*N          (by the hypothesis)
       < c*N^2 + 2*c*N + c    (for c large enough)
       = c*(N+1)^2

Thus f(N+1) < c*(N+1)^2, which completes the proof.
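
As a sanity check, the operation count can be measured empirically (again my own harness, not part of the proof): on the worst-case inputs "aa...ab" with dict = {"a"}, the number of for-loop iterations should come out to exactly N*(N-1)/2, i.e. roughly quadruple when N doubles.

import java.util.*;

public class OpCount {
    static Map<String, String> memoized;
    static long ops; // total for-loop iterations across all calls

    static String SegmentString(String input, Set<String> dict) {
        if (dict.contains(input)) return input;
        if (memoized.containsKey(input)) return memoized.get(input);
        for (int i = 1; i < input.length(); i++) {
            ops++;
            String prefix = input.substring(0, i);
            if (dict.contains(prefix)) {
                String segSuffix = SegmentString(input.substring(i), dict);
                if (segSuffix != null) return prefix + " " + segSuffix;
            }
        }
        memoized.put(input, null);
        return null;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("a"));
        for (int n : new int[]{100, 200, 400}) {
            memoized = new HashMap<>();
            ops = 0;
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n - 1; i++) sb.append('a');
            SegmentString(sb.append('b').toString(), dict);
            System.out.println("N=" + n + "  ops=" + ops + "  N*(N-1)/2=" + (long) n * (n - 1) / 2);
        }
    }
}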

OTHER TIPS

First, a given string of length N can be broken into N*(N-1)/2 segments, and we check whether each segment is contained in the dictionary. This costs O(N^2).

Coming back to your code: beginning with the string of length N, we break it down into two smaller strings of length 'a' and 'N - a'. And each prefix, starting at 0 and ending at 'a', is checked only once!

For each suffix of length N - a, the function likewise checks each of its prefixes once and stores the result in the memoization table. This ensures that the next time we make the same move, splitting the string at exactly this place, we only need to return the stored result, without further work (under the assumption that the map retrieves the result for a particular string in O(1)). This store-and-retrieve step also ensures that the divide-and-conquer work is done only once.

So, from the first and second points, we conclude that there are only N*(N - 1)/2 segments, each examined only once, which leads to a cost of O(N^2).
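
A quick way to verify the count: each of the N suffixes (of length m = 1, ..., N) has its for-loop run to completion at most once, checking m - 1 prefixes, so the total number of checks is

0 + 1 + 2 + ... + (N - 1) = N*(N - 1)/2 = O(N^2)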

Note: the complexity is O(N^2) only under the assumption that both dict.contains(input) and memoized.containsKey(input) cost O(1).

In general, the dimensions of the memoization table decide the complexity.

For a memoization table of one dimension (memoized[n]), the complexity is O(n); for a memoization table of two dimensions (memoized[n][n]), it is O(n^2); and so on (assuming O(1) work per entry).

Reason: with memoization, the worst-case complexity is the running time on inputs for which none of the overlapping sub-problems have been cached (pre-calculated) yet.

Now, let's say the memoization table has two dimensions (memoization[n][n]). In the worst case, each entry of the table is computed at most once, so the complexity is bounded by the size of the table and can only reach O(n^2).

Thus the dimensions of the memoization table essentially decide the worst-case time complexity.
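
As a generic illustration of this rule (my own example, unrelated to the word-break code): counting monotone lattice paths with a two-dimensional memo table memo[n][n]. Each entry takes O(1) non-recursive work and is filled at most once, so the worst case is bounded by the table size, O(n^2).

public class GridPaths {
    static long[][] memo; // memo[n][n]: two dimensions -> O(n^2) states

    // Number of monotone (right/down) paths from (i, j) to (n-1, n-1).
    static long paths(int i, int j, int n) {
        if (i == n - 1 || j == n - 1) return 1; // border cells: one path
        if (memo[i][j] != 0) return memo[i][j]; // cached sub-problem
        // Each of the n*n cells is computed once, with O(1) work per cell,
        // so the worst-case running time is bounded by the table size: O(n^2).
        memo[i][j] = paths(i + 1, j, n) + paths(i, j + 1, n);
        return memo[i][j];
    }

    public static void main(String[] args) {
        int n = 10;
        memo = new long[n][n];
        System.out.println(paths(0, 0, n)); // 48620 = C(18, 9)
    }
}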

The answer by @shx2 explains this mathematically.

If the input is of length N, then the maximum number of dictionary words it could contain is N, so you need to check all the different segments of lengths 1..N, of which there are N*(N-1)/2.

Take an input string "aaaaa":

there are N substrings 'a',

N-1 substrings 'aa',

and so on.
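
Summing these counts gives the quadratic total:

N + (N-1) + ... + 1 = N*(N+1)/2 = O(N^2)

(this differs from the N*(N-1)/2 above only by a lower-order term; both are O(N^2)).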

Licensed under: CC-BY-SA with attribution