Determine missing number in data stream
-
16-10-2019 - |
Pergunta
We receive a stream of $n-1$ pairwise different numbers from the set $\left\{1,\dots,n\right\}$.
How can I determine the missing number with an algorithm that reads the stream once and uses a memory of only $O(\log_2 n)$ bits?
Solução
You know $\sum_{i=1}^n i = \frac{n(n+1)}{2}$, and because $S = \frac{n(n+1)}{2}$ could be coded in $O(\log(n))$ bits this can be done in $O(\log n)$ memory and in one path (just find $S - \mathrm{currentSum}$, this is missing number).
But this problem could be solved in general case (for constant $k$): we have $k$ missing numbers, find out all of them. In this case instead of calculating just sum of $y_i$, calculate sum of j'st power of $x_i$ for all $1\le j \le k$ (I assumed $x_i$ is missing numbers and $y_i$ is input numbers):
$\qquad \displaystyle \begin{align} \sum_{i=1}^k x_i &= S_1,\\ \sum_{i=1}^k x_i^2 &= S_2,\\ &\vdots \\ \sum_{i=1}^k x_i^k &= S_k \end{align}$ $\qquad (1)$
Remember that you can calculate $S_1,...S_k$ simply, because $S_1 = S - \sum y_i$, $S_2 = \sum i^2 - \sum y_i^2$, ...
Now for finding missing numbers you should solve $(1)$ to find all $x_i$.
You can compute:
$P_1 = \sum x_i$, $P_2 = \sum x_i\cdot x_j$, ... , $P_k = \prod x_i$ $(2)$.
For this remember that $P_1 = S_1$, $P_2 = \frac{S_1^2 - S_2}{2}$, ...
But $P_i$ is coefficients of $P=(x-x_1)\cdot (x-x_2) \cdots (x-x_k)$ but $P$ could be factored uniquely, so you can find missing numbers.
These are not my thoughts; read this.
Outras dicas
From the comment above:
Before processing the stream, allocate $\lceil \log_2 n \rceil$ bits, in which you write $x:= \bigoplus_{i=1}^n \mathrm{bin}(i)$ ($\mathrm{bin}(i)$ is the binary representation of $i$ and $\oplus$ is pointwise exclusive-or). Naively, this takes $\mathcal{O}(n)$ time.
Upon processing the stream, whenever one reads a number $j$, compute $x := x \oplus \mathrm{bin}(j)$. Let $k$ be the single number from $\{ 1, ... n\}$ that is not included in the stream. After having read the whole stream, we have $$ x = \left(\bigoplus_{i=1}^n \mathrm{bin}(i)\right) \oplus \left(\bigoplus_{i \neq k } \mathrm{bin}(i)\right) = \mathrm{bin}(k) \oplus \bigoplus_{i \neq k } (\mathrm{bin}(i) \oplus \mathrm{bin}(i)) = \mathrm{bin}(k), $$ yielding the desired result.
Hence, we used $\mathcal{O}(\log n)$ space, and have an overall runtime of $\mathcal{O}(n)$.
HdM's solution works. I coded it in C++ to test it. I can't limit the value
to $O(\log_2 n)$ bits, but I'm sure you can easily show how only that number of bits is actually set.
For those that want pseudo code, using a simple $\text{fold}$ operation with exclusive or ($\oplus$):
$$\text{Missing} = \text{fold}(\oplus, \{1,\ldots,N\} \cup \text{InputStream})$$
Hand-wavey proof: A $\oplus$ never requires more bits than its input, so it follows that no intermediate result in the above requires more than the maximum bits of the input (so $O(\log_2 n)$ bits). $\oplus$ is commutative, and $x \oplus x = 0$, thus if you expand the above and pair off all data present in the stream you'll be left only with a single un-matched value, the missing number.
#include <iostream>
#include <vector>
#include <cstdlib>
#include <algorithm>
using namespace std;
void find_missing( int const * stream, int len );
int main( int argc, char ** argv )
{
if( argc < 2 )
{
cerr << "Syntax: " << argv[0] << " N" << endl;
return 1;
}
int n = atoi( argv[1] );
//construct sequence
vector<int> seq;
for( int i=1; i <= n; ++i )
seq.push_back( i );
//remove a number and remember it
srand( unsigned(time(0)) );
int remove = (rand() % n) + 1;
seq.erase( seq.begin() + (remove - 1) );
cout << "Removed: " << remove << endl;
//give the stream a random order
std::random_shuffle( seq.begin(), seq.end() );
find_missing( &seq[0], int(seq.size()) );
}
//HdM's solution
void find_missing( int const * stream, int len )
{
//create initial value of n sequence xor'ed (n == len+1)
int value = 0;
for( int i=0; i < (len+1); ++i )
value = value ^ (i+1);
//xor all items in stream
for( int i=0; i < len; ++i, ++stream )
value = value ^ *stream;
//what's left is the missing number
cout << "Found: " << value << endl;
}