Question

I need to recover a data block from a repeated stream of data. I'm looking to see what algorithms may already exist for this as it does not feel like a novel situation.

Here are the specifics:

  1. There is an N-length block of data contained in a stream
  2. The block is repeated many times in the stream
  3. The data is highly corrupted: some bytes could simply be wrong, whereas others can be detected as missing (erasures)
  4. There is a function F(data) which can say if a block represents valid data (the probability of a false positive is virtually zero)
  5. Even when a block is not valid data, F can also provide a probability that the block is genuine but simply too corrupted to be recovered
  6. The chance of corrupted data is very low compared to missing data

For example, say I have this data stream and wish to recover the 10 length sequence 1234567890. The data is just a rough visual example (I can't guarantee recovery is actually possible from this bit). A . represents a missing byte, and <break> indicates an unknown block of data (no data and not length known). Note also the Qs as an example of corrupt data.

23.5678901.3456789<break>2345678..1..4567QQ012345678..3456

How can I take such a stream of data and recover probable blocks of N data? As the actual data includes forward error correction, the block recovery need not be perfect. All it needs to do is produce probable reconstructed blocks of data, and the F function will attempt the error recovery. Thus I expect F will have to be called several times.

I'd like to find something better than simply calling F at each point in the stream, since the error rate could be high enough that no single contiguous run of N can be recovered -- the repetitions in the stream must be used somehow.


Solution

I think what you want is impossible to do without further assumptions on F.

Consider the following, rather simple data stream for a target data length of 5:

123441233512245113451255513555 ...

Note that not a single block contains 12345 (our original data). In order to find a proper combination of data parts, you would, in general, have to check all combinations. What is the probability that some combination other than 12345 is accepted by F? And this is on a stream without lost or corrupt symbols, and ignoring the problem of overlaps between blocks which might by accident be valid data!
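A quick brute-force check on the stream above (a sketch, using the stream and target from this answer) confirms the point: no contiguous window of length 5 equals 12345, so only a combinatorial search over fragments from different copies could reassemble it.

```python
stream = "123441233512245113451255513555"
N = 5

# Every contiguous window of length N in the stream.
windows = [stream[i:i + N] for i in range(len(stream) - N + 1)]

# Not a single window is the original data, even though the stream
# contains many (corrupted) copies of it.
print("12345" in windows)  # False
```

This is exactly why scanning alone cannot succeed here: the information is spread across copies, and recombining fragments explodes combinatorially.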

Therefore, I think keeping a buffer of size N and running F at every time step is the best you can do. Note that you can skip the evaluation of F if there is a corrupt symbol in the buffer (provided you can identify those). You still have the problem of overlap errors and wrongly valid blocks, though.
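The sliding-buffer strategy can be sketched as follows. This is a minimal illustration, not the asker's actual setup: `is_valid` stands in for the question's F, and `missing` is an assumed placeholder for a detected erasure.

```python
def scan_stream(stream, n, is_valid, missing='.'):
    """Slide an n-symbol window over the stream and call is_valid
    (the F from the question) on each window, skipping any window
    that contains a known-bad symbol."""
    hits = []
    for i in range(len(stream) - n + 1):
        window = stream[i:i + n]
        if missing in window:        # detected erasure: F cannot succeed
            continue
        if is_valid(window):
            hits.append((i, window))
    return hits
```

For example, with a toy F that only accepts the exact target:

```python
F = lambda block: block == "1234567890"
scan_stream("xx.zz1234567890yy", 10, F)  # [(5, '1234567890')]
```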

You have more potential if F has nice properties, for example if it is able to identify valid parts of data. Then you can store only maximal valid parts and piece those together. Also, formatting the data stream can help; a simple example would be markers for the beginning of a data block (that would get rid of problems with overlaps and reduce the amount of calls to F). Furthermore, usual assumptions such as "error probability is small" can make waiting for a full valid copy of the data viable; many protocols break down if virtually every other symbol is corrupt.
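To make the "markers plus repetitions" idea concrete, here is a hedged sketch under strong assumptions not in the question: the sender prepends a marker symbol to each copy of the block, and erasures are flagged with a placeholder. Framing on the marker aligns the copies, after which a per-position majority vote exploits the repetitions; F would then be run on the voted candidate.

```python
from collections import Counter

MARKER = '|'   # hypothetical start-of-block marker added by the sender
MISSING = '.'  # placeholder for a detected erasure

def vote_block(stream, n):
    """Split the stream at block markers, then reconstruct each of the
    n positions by majority vote over the aligned copies, ignoring
    positions flagged as erasures."""
    copies = [c[:n] for c in stream.split(MARKER) if c]
    result = []
    for pos in range(n):
        votes = Counter(c[pos] for c in copies
                        if pos < len(c) and c[pos] != MISSING)
        result.append(votes.most_common(1)[0][0] if votes else MISSING)
    return ''.join(result)
```

For instance, `vote_block("|12.45|1Q345|123.5", 5)` recovers `"12345"` even though no single copy is intact; the resulting candidate would still be handed to F for confirmation, since a majority of copies can agree on a wrong symbol.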

None of these are new problems. I suggest further reading on

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange