Test if there exists an integer k to add to one sequence to make it a subsequence of another sequence

cs.stackexchange https://cs.stackexchange.com/questions/117706

Question

Suppose that sequence $A$ contains $n$ integers $a_1,a_2,a_3,\ldots,a_n$ and sequence $B$ contains $m$ integers $b_1,b_2,b_3,\ldots,b_m$. We know that $m \geq n$. We assume without loss of generality that both sequences $A$ and $B$ are sorted in ascending order.

What's the fastest algorithm to determine if there exists an integer $k$ such that the sequence $a_1+k,a_2+k,a_3+k,\ldots,a_n+k$ is a subsequence of the sequence $B$?

Here is a naive algorithm, which would take $O(n(m-n))$ time. Store the sequence $B$ as a hashtable. For each element $b_j$ of $B$ (except the largest $n$ elements), use $b_j-a_1$ as your guess at $k$; you can verify this guess by checking whether each of $a_1+k,\dots,a_n+k$ is in the hashtable $B$. This takes $O(n)$ expected time per guess at $k$, and you make $m-n$ guesses, so in total the expected running time is $O(n(m-n))$. Can we do better?
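As a concrete reference, here is a minimal Python sketch of this naive algorithm (the function name is mine, and it assumes the elements of each sequence are distinct, as the hashtable membership argument implicitly does):

```python
def find_shift_naive(A, B):
    """Try k = b_j - a_1 for each viable b_j; O(n(m-n)) expected time."""
    n, m = len(A), len(B)
    B_set = set(B)  # hashtable of B's elements
    for j in range(m - n + 1):  # skip the largest elements of B
        k = B[j] - A[0]  # guess the shift that maps a_1 to b_j
        if all(a + k in B_set for a in A):
            return k
    return None  # no valid shift exists
```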

I came across this problem while trying to match two binary images (testing whether one image contains the other).


Solution

Here is a heuristic that won't always work, but should work with high probability if the integers in the arrays are chosen randomly from a large enough space.

Initialize a hashtable of counts $C$ to all zeros. Then, repeat $t$ times: randomly pick $i,j$, compute $b_j-a_i$, and increment $C[b_j-a_i]$. Finally, sort $C$ by counts, from largest count to smallest; then for each of the largest few values of $C[k']$, try $k'$ as your guess at $k$, and verify each guess.
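A minimal sketch of this sampling step in Python (the parameter names `t` and `top` are my own; each returned candidate still needs the $O(n)$ hashtable verification from the naive algorithm):

```python
import random
from collections import Counter

def sample_shift_candidates(A, B, t, top=5):
    """Sample t random differences b_j - a_i and return the most
    frequent values as candidate shifts, largest count first."""
    C = Counter()
    for _ in range(t):
        i = random.randrange(len(A))
        j = random.randrange(len(B))
        C[B[j] - A[i]] += 1  # vote for this difference as k
    return [k for k, _ in C.most_common(top)]
```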

Notice that in each iteration, the probability of incrementing $C[k]$ is at least $1/m$; whereas if $l\ne k$, we expect $C[l]$ to be incremented much more rarely (assuming the integers in the arrays are random and large enough). Thus, after $t$ iterations, we expect $\mathbb{E}[C[k]] \ge t/m$ but $\mathbb{E}[C[l]] \ll t/m$. So, once $t$ is large enough, $C[k]$ should be larger than every other entry in $C$.

How large does $t$ need to be? I expect that $t = O(m \log m)$ should suffice, based on a central limit theorem approximation for the counts $C[l]$, assuming we are willing to accept a small probability of error (which can be driven exponentially small). The constant factor hidden by the big-O notation might be non-trivial.

This is only a heuristic, and there will certainly be inputs where it fails.

OTHER TIPS

Here is an algorithm running in $\mathcal{O}(n\sqrt{n} + m\log m)$ time.

Let $w$ denote the function that, for an integer $t$, counts the number of pairs whose difference is $t$: $w(t)=\lvert\{(x,y): x\in A, y\in B, y-x=t\}\rvert$. If we had access to $w$, we could simply find its maximum and check whether it equals $n$. The main idea is to estimate $w$ using the fast Fourier transform. If the numbers are bounded, this produces the exact solution; otherwise, one can reduce modulo a sufficiently large number and then verify the candidate solutions once they are found.
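For reference, here is a direct $O(nm)$ computation of $w$ in Python; this is exactly what the FFT construction below accelerates (the function name is mine):

```python
from collections import Counter

def diff_counts(A, B):
    """w[t] = number of pairs (x, y) in A x B with y - x = t."""
    w = Counter()
    for x in A:
        for y in B:
            w[y - x] += 1
    return w  # a shift k exists iff w[k] == len(A), for distinct elements
```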

Let $N,M$ be integers (to be defined later), and $u,v\in \mathbb{R}^N$ be vectors defined as $$u[i] = \lvert\{x\in A\colon M-x\equiv i \pmod N \}\rvert$$ $$v[i] = \lvert\{y\in B\colon M+y\equiv i \pmod N \}\rvert$$ Let $w=u * v$ denote the circular convolution of these two vectors. A pair $(x,y)$ with $y-x=k$ contributes to the entry $w[(2M+k)\bmod N]$, since $(M-x)+(M+y)\equiv 2M+k \pmod N$. Hence, if there is a $k$ such that $$\forall x\in A\ \exists y\in B: y-x=k,$$ then $$ w[(2M+k) \bmod N]=\sum_{i} u[i]\, v[(2M+k-i) \bmod N]\ge n,$$ and in the absence of collisions modulo $N$ it equals $n$, the maximum value $w$ can attain. Therefore, we only need to check whether $\max_i w[i]\ge n$, and then verify the correctness of each candidate by checking the original elements. Computing $w$ by the FFT and inverse FFT takes $\mathcal{O}(N \log N)$ time, finding the maximum element takes $\mathcal{O}(N)$, and verifying a candidate takes $\mathcal{O}(n)$ steps, so overall $\mathcal{O}(N \log N)$ time and space.
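Here is a simplified Python sketch of this step using NumPy. For clarity it drops the random shift $M$ and computes the circular cross-correlation of the two count vectors directly, which counts differences $y-x$ modulo $N$; the candidate residues it returns must still be verified against the original, unreduced elements:

```python
import numpy as np

def shift_candidates_fft(A, B, N):
    """Count differences y - x (mod N) over all pairs via FFT in
    O(N log N), and return residues whose count reaches n = len(A)."""
    u = np.zeros(N)
    v = np.zeros(N)
    for x in A:
        u[x % N] += 1
    for y in B:
        v[y % N] += 1
    # Circular cross-correlation: w[d] = sum_i u[i] * v[(i + d) % N],
    # i.e. the number of pairs with y - x ≡ d (mod N).
    w = np.fft.ifft(np.conj(np.fft.fft(u)) * np.fft.fft(v)).real
    w = np.rint(w).astype(int)  # counts are integers; remove FFT noise
    return [d for d in range(N) if w[d] >= len(A)]
```

Each returned residue $d$ corresponds to shifts $k \equiv d \pmod N$ (a negative shift appears as $k \bmod N$), which is why the final check must go back to the actual elements.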

If the numbers in both sets are bounded by $N$, this is an exact solution. But if you pick $N$ too small, $w[i]\ge n$ can happen because of collisions. So we verify all elements for every index with $w[i]\ge n$; there might be several of them, but their number can be bounded. For there to be $\ell$ such indices, there must be at least $1+2+\dots + \ell$ collisions, which implies $$P[\lvert\{i\colon w[i]\ge n\}\rvert \ge \ell] \le P[\#\text{collisions}\ge \ell(\ell+1)/2].$$ There are $nm$ pairings of elements of $A$ and $B$. If we pick a prime number $N>2m$ and pick $M$ uniformly at random from $\{1,\dots,N\}$, the collision probability of each pairing is bounded by $1/(2m)$, so the expected number of collisions is at most $nm/N$, and by Markov's inequality $$P[\#\text{collisions}\ge \ell(\ell+1)/2] \le \frac{nm/N}{\ell^2/2}\le \frac{n}{\ell^2}.$$ So with probability as close to $1$ as you want, $\ell=\mathcal{O}(\sqrt{n})$. Therefore, the overall time complexity of the algorithm is $$\mathcal{O}(n\sqrt{n} + m\log m),$$ in which $m\log m$ is the FFT and iFFT step (since $N=\Theta(m)$) and $n \sqrt{n}$ is the verification step.

There are two ways I see to improve this:

  1. One can run $\log n$ separate instances of the algorithm without the verification step and take the intersection of the indices for which $w[i]\ge n$ (after shifting by $2M$). If one can show that the number of shared spurious indices drops by a factor of $1/2$ (or some other constant) in each instance, this would give a total running time of $\mathcal{O}(m\log^2 m)$.
  2. One can construct a better hashing scheme for $u$ and $v$ and use higher moments in place of Markov's inequality to obtain sharper concentration.

Nevertheless, if you're looking for a practical solution, this algorithm could work just fine. For instance, the worst-case behavior $\ell\approx\sqrt{n}$ only occurs when the sets are nearly arithmetic progressions; if the elements are close to random, the guarantee is much stronger. Furthermore, you can stop the verification step as soon as you find a mismatch.

This is a completely different algorithm, which I believe runs in $O(m\log m)$ time in the worst case and works for both integers and real numbers.

Let us assume that $A$ and $B$ are already in ascending order; otherwise, spend $O(n\log n+ m\log m)$ to sort them. We slightly strengthen the requirement: algorithm $\mathcal{A}(A, B)$ must return all indices $i$ such that $A$ can be mapped into $B$ with offset $k=b_i-a_1$, meaning that the mapping starts at $b_i$ onwards. The high-level idea is to solve the subproblems corresponding to subarrays of $A$ and merge the resulting indices in a way that only valid solutions remain.

The recursion, however, depends on how close $A$ is to an arithmetic progression. Formally, let the periodicity $\tau(A)$ be defined as $$\tau(A) = \min \{s\in\mathbb{N}^+: a_{i+s+1}-a_{i+s}= a_{i+1} - a_i \text{ for all valid } i\}.$$ In words, the increments of $A$ are periodic with minimum period $\tau(A)$.
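As the complexity paragraph at the end notes, $\tau(A)$ can be computed in $O(n)$ by applying the KMP prefix function to $\mathrm{diff}(A)$; a sketch (the function name is mine):

```python
def periodicity(A):
    """tau(A): the smallest period of the increment array diff(A),
    found via the KMP prefix function (longest border of diff(A))."""
    d = [A[i + 1] - A[i] for i in range(len(A) - 1)]
    if not d:
        return 1
    pi = [0] * len(d)  # pi[i] = length of the longest proper
    for i in range(1, len(d)):  # prefix of d[:i+1] that is also a suffix
        j = pi[i - 1]
        while j > 0 and d[i] != d[j]:
            j = pi[j - 1]
        if d[i] == d[j]:
            j += 1
        pi[i] = j
    return len(d) - pi[-1]  # minimal period of d
```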

Case I ($\tau(A)<n/3$): Let $s=\tau(A)$ and $\ell = a_s - a_1$. Recursively compute $I=\mathcal{A}(A[1:s],B)$. One observation: if $i,j\in I$ correspond to index sets $J_i, J_j$ and $b_j - b_i = \ell$, then the index sets $J_i, J_j$ can be concatenated to show $i\in \mathcal{A}(A[1:2s],B)$. This is a simple consequence of $A$ being $s$-periodic: $b_j = b_i + \ell$ ensures that the index set $J_j$ starts where $J_i$ ends. Let $$R[i] = \lvert\{j \in I\colon j>i,\ b_j - b_i \text{ divisible by } \ell\}\rvert.$$ Then $R[i]$ can be computed from the values $R[i']$ with $i'>i$, and the new index set $I'$ consists of the indices with $R[i]\ge n/s$. The cost of this step is bounded by $O(m)$.

Case II ($\tau(A)>n/3$): By definition, for $s=n/3$ there must be an index $i$ such that $a_{i+1}-a_i \neq a_{i+1+s}-a_{i+s}$. If $i\le n/3$, then $i,i+s\le 2n/3$, which certifies that $\tau(A[1:2n/3])>n/3$. Otherwise, $i>n/3$ implies that $\tau(A[n/3:n])>n/3$.

Wlog assume $\tau(A[1:2n/3])>n/3$, and choose the lower half $A'=A[1:n/2]$ to recurse on (otherwise choose the upper half; the same arguments follow). Recursively compute $I=\mathcal{A}(A',B)$. For each index $i\in I$, check whether the rest of the sequence can be found in $B$. Since both sequences are sorted, this can be done in $O(n)$ steps per index (see the sketch after the proof below), which implies an overall $O(|I|\cdot n)$ time to compute the valid indices and return them as $\mathcal{A}(A, B)$. The efficiency of this step relies on the following claim:

Claim: $|I|\le 6m/n$; that is, solutions do not overlap too much.

Proof of claim: We show that $|I|> 6m/n$ leads to a contradiction. Each index $i\in I$ is the starting point of a set of indices $J_i=\{i=j_1,\dots,j_{n/2}\}\subseteq B$ that maps $A'$ into $B$ up to some offset. Collectively, these sets contain at least $3m$ indices: $$\sum_{i\in I} |J_i| = |I|\cdot n/2\ge 6m/n \cdot n/2=3m.$$ Since $|B|=m$, by the pigeonhole principle there is at least one index $x\in B$ that appears in $3$ separate solutions: $$\exists x\in B,\ r,s,p\in I\colon\; x\in J_r\cap J_s\cap J_p.$$

Let $s$ be the median of the three, $r<s<p$. Since $x\in J_s$ and $|J_s|=n/2$, $x$ partitions $J_s$ into two parts, one of which has at most $n/4$ indices; assume it is the lower part: $$J_s=\{j_1=s, j_2, \dots, j_\ell=x\},\quad \ell\le n/4.$$ By construction, $j_1,\dots,j_\ell$ are mapped to $a_1,\dots,a_\ell$ up to some offset. But we also have $x\in J_p$, which implies a periodicity less than $\ell\le n/4$ in $A'$, contradicting $\tau(A')>n/3$. (There are a few details I will add later.)
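For completeness, the $O(n)$ per-index check used in Case II can be a greedy two-pointer scan, since both sequences are sorted; a minimal sketch (the function name is mine):

```python
def matches_with_offset(A, B, k, j0=0):
    """Check that a_1 + k, ..., a_n + k appear, in order, in B[j0:].
    Greedy matching is safe because both sequences are sorted."""
    j = j0
    for a in A:
        target = a + k
        while j < len(B) and B[j] < target:
            j += 1  # skip elements too small to match
        if j == len(B) or B[j] != target:
            return False
        j += 1  # consume the matched element
    return True
```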

Overall complexity: At each level of the recursion, we pay $O(m)$. The periodicity $\tau(A)$ can be computed in $O(n)$ by finding the longest proper suffix of $\mathrm{diff}(A)$ that is also a prefix, where $\mathrm{diff}(A)$ is the increment array $a_2-a_1, \dots, a_n-a_{n-1}$ (see the sketch above). The size of the problem shrinks by at least a factor of $1/2$ in each recursive step, so there are $O(\log n)$ levels in the worst case, which bounds the time by $O(m \log n)$. Adding the sorting cost, and since $m\ge n$, the overall complexity is bounded by the sorting time $O(m\log m)$.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange