Question

The simple (naive?) answer would be O(n), where n is the length of the shorter string, because in the worst case you must compare every pair of corresponding characters.
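For concreteness, here is a minimal sketch of that naive view in Python (the function name is mine, purely for illustration):

```python
def naive_eq(s: str, t: str) -> bool:
    """Character-by-character equality check.

    zip() stops at the shorter string, so in the worst case
    (equal strings, or strings differing only near the end) this
    examines every position: O(n) with n = min(len(s), len(t)).
    """
    for x, y in zip(s, t):
        if x != y:
            return False
    # Identical prefixes: the strings are equal iff their lengths match.
    return len(s) == len(t)
```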

So far so good. I think we can all agree that checking the equality of two equal-length strings requires O(n) runtime.

However, many (most?) languages (I'm using Python 3.7) store the lengths of strings to allow constant-time length lookups. So for two strings of unequal length, you can simply verify len(string_1) != len(string_2) in constant time, and you can verify that Python 3 does indeed make this optimization.
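One rough way to observe this yourself (a timing sketch, assuming CPython, where equality compares lengths before any characters):

```python
import timeit

a = "x" * 10_000_000
b = "x" * 9_999_999 + "y"  # same length as a, differs only at the end
c = "x" * 9_999_999        # different length from a

# Must scan (nearly) the whole string before finding the mismatch.
print(timeit.timeit(lambda: a == b, number=100))

# Rejected by the constant-time length check; no characters scanned.
print(timeit.timeit(lambda: a == c, number=100))
```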

Now, if we're checking the equality of two truly arbitrary strings (of arbitrary length), then it is far more likely (almost surely, I believe) that the strings will be of unequal length than of equal length, which (statistically) means we can nearly always compare them in constant time.

So we can compare two arbitrary strings in O(1) time on average, with a very rare worst case of O(n). Should we then consider string comparisons to be O(1) in the same way we consider hash table lookups to be O(1)?


Solution

In order to discuss the expected time complexity of an operation, you have to specify a distribution on the inputs, and also explain what you mean by $n$.

One has to be careful, however. For example, take the suggestion in the comments to use some kind of distribution over words of length at most 20. In this case, string comparison is clearly $O(1)$, since 20 is just a constant. There are several ways to avoid this:

  • Ask for a non-asymptotic time complexity. Since time complexity is highly dependent on the computation model, you can count (for example) the number of input memory cells accessed.

  • You can specify an input distribution which depends on a parameter $m$, and then ask for the asymptotic complexity in terms of $m$.

Here is an example. Given two uniformly random binary strings of length $n$, there will be roughly 4 accesses in expectation: the first mismatch occurs at position 2 on average, and each examined position costs two accesses, one per string. In contrast, if two strings $0^i1^{n-i}$ and $0^j1^{n-j}$ are chosen uniformly at random from this collection, the first mismatch occurs just after position $\min(i,j)$, whose expectation is roughly $n/3$, so the number of accesses will be roughly $(2/3)n$. These two distributions can be separated even if we use asymptotic notation: the algorithm runs in $O(1)$ on the first distribution, and in $\Theta(n)$ on the second.
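A quick Monte Carlo sketch of both distributions, using the cost model from the first bullet (two cell accesses per examined position; the helper names are mine):

```python
import random

def accesses(s, t):
    # Naive left-to-right comparison; each examined position costs
    # two accesses, one memory cell per string.
    count = 0
    for x, y in zip(s, t):
        count += 2
        if x != y:
            break
    return count

n, trials = 1000, 10_000

def uniform_string():
    # Distribution 1: a uniformly random binary string of length n.
    return [random.randint(0, 1) for _ in range(n)]

def step_string():
    # Distribution 2: 0^i 1^(n-i) with i uniform in {0, ..., n}.
    i = random.randint(0, n)
    return [0] * i + [1] * (n - i)

est1 = sum(accesses(uniform_string(), uniform_string())
           for _ in range(trials)) / trials
est2 = sum(accesses(step_string(), step_string())
           for _ in range(trials)) / trials

print(est1)             # ~4, independent of n
print(est2, 2 * n / 3)  # ~(2/3) * n
```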

Another issue is the meaning of $n$. Consider for example two input strings $0^a, 0^b$, where $a, b \sim G(1/2)$ are independent geometric random variables. On inputs of lengths $a,b$, the running time is $\Theta(\min(a,b))$. How should we express this in terms of $n = a+b$? One choice is to ask for the expected running time given that the total input length is $n$. Conditioned on $a+b=n$, the joint probability $(1/2)^a (1/2)^{n-a} = (1/2)^n$ is the same for every $a$, so $a$ is uniform on $\{1,\ldots,n-1\}$, and $$ \mathbb{E}[\min(a,b)] = \sum_{a=1}^{n-1} \frac{(1/2)^a (1/2)^{n-a}}{\sum_{a'=1}^{n-1} (1/2)^{a'} (1/2)^{n-a'}} \min(a,n-a) = \frac{1}{n-1} \sum_{a=1}^{n-1} \min(a,n-a) \approx \frac{n}{4}, $$ so the expected running time is $\Theta(n)$.
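As a sanity check, the conditional expectation can be computed exactly for small $n$ (a throwaway script, not part of the original answer):

```python
def expected_min(n):
    # Conditioned on a + b = n, a is uniform on {1, ..., n - 1},
    # so E[min(a, b)] is a plain average of min(a, n - a).
    return sum(min(a, n - a) for a in range(1, n)) / (n - 1)

for n in (10, 100, 1000):
    print(n, expected_min(n), n / 4)  # the two values converge
```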

Licensed under: CC-BY-SA with attribution