Encoding / Error Correction Challenge

https://stackoverflow.com/questions/2393362

25-09-2019
|

Question

Is it mathematically feasible to encode and initial 4 byte message into 8 bytes and if one of the 8 bytes is completely dropped and another is wrong to reconstruct the initial 4 byte message? There would be no way to retransmit nor would the location of the dropped byte be known.

If one uses Reed Solomon error correction with 4 "parity" bytes tacked on to the end of the 4 "data" bytes, such as DDDDPPPP, and you end up with DDDEPPP (where E is an error) and a parity byte has been dropped, I don't believe there's a way to reconstruct the initial message (although correct me if I am wrong)...

What about multiplying (or performing another mathematical operation) the initial 4 byte message by a constant, then utilizing properties of an inverse mathematical operation to determine what byte was dropped. Or, impose some constraints on the structure of the message so every other byte needs to be odd and the others need to be even.

Alternatively, instead of bytes, it could also be 4 decimal digits encoded in some fashion into 8 decimal digits where errors could be detected & corrected under the same circumstances mentioned above - no retransmission and the location of the dropped byte is not known.

I'm looking for any crazy ideas anyone might have... Any ideas out there?

EDIT:

It may be a bit contrived, but the situation that I'm trying to solve is one where you have, let's say, a faulty printer that prints out important numbers onto a form, which are then mailed off to a processing firm which uses OCR to read the forms. The OCR isn't going to be perfect, but it should get close with only digits to read. The faulty printer could be a bigger problem, where it may drop a whole number, but there's no way of knowing which one it'll drop, but they will always come out in the correct order, there won't be any digits swapped.

The form could be altered so that it always prints a space between the initial four numbers and the error correction numbers, ie 1234 5678, so that one would know whether a 1234 initial digit was dropped or a 5678 error correction digit was dropped, if that makes the problem easier to solve. I'm thinking somewhat similar to how they verify credit card numbers via algorithm, but in four digit chunks.

Hopefully, that provides some clarification as to what I'm looking for...

Solution

In the absence of "nice" algebraic structure, I suspect that it's going to be hard to find a concise scheme that gets you all the way to 10**4 codewords, since information-theoretically, there isn't a lot of slack. (The one below can use GF(5) for 5**5 = 3125.) Fortunately, the problem is small enough that you could try Shannon's greedy code-construction method (find a codeword that doesn't conflict with one already chosen, add it to the set).

Encode up to 35 bits as a quartic polynomial f over GF(128). Evaluate the polynomial at eight predetermined points x0,...,x7 and encode as 0f(x0) 1f(x1) 0f(x2) 1f(x3) 0f(x4) 1f(x5) 0f(x6) 1f(x7), where the alternating zeros and ones are stored in the MSB.

When decoding, first look at the MSBs. If the MSB doesn't match the index mod 2, then that byte is corrupt and/or it's been shifted left by a deletion. Assume it's good and shift it back to the right (possibly accumulating multiple different possible values at a point). Now we have at least seven evaluations of a quartic polynomial f at known points, of which at most one is corrupt. We can now try all possibilities for the corruption.

EDIT: bmm6o has advanced the claim that the second part of my solution is incorrect. I disagree.

Let's review the possibilities for the case where the MSBs are 0101101. Suppose X is the array of bytes sent and Y is the array of bytes received. On one hand, Y[0], Y[1], Y[2], Y[3] have correct MSBs and are presumed to be X[0], X[1], X[2], X[3]. On the other hand, Y[4], Y[5], Y[6] have incorrect MSBs and are presumed to be X[5], X[6], X[7].

If X[4] is dropped, then we have seven correct evaluations of f.

If X[3] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 3, and six correct evaluations.

If X[5] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 5, and six correct evaluations.

There are more possibilities besides these, but we never have fewer than six correct evaluations, which suffices to recover f.

OTHER TIPS

I think you would need to study what erasure codes might offer you. I don't know any bounds myself, but maybe some kind of MDS code might achieve this.

EDIT: After a quick search I found RSCode library and in the example it says that

In general, with E errors, and K erasures, you will need
* 2E + K bytes of parity to be able to correct the codeword
* back to recover the original message data.

So looks like Reed-Solomon code is indeed the answer and you may actually get recovery from one erasure and one error in 8,4 code.

Parity codes work as long as two different data bytes aren't affected by error or loss and as long as error isn't equal to any data byte while a parity byte is lost, imho.

Error correcting codes can in general handle erasures, but in the literature the position of the erasure is assumed known. In most cases, the erasure will be introduced by the demodulator when there is low confidence that the correct data can be retrieved from the channel. For instance, if the signal is not clearly 0 or 1, the device can indicate that the data was lost, rather than risking the introduction of an error. Since an erasure is essentially an error with a known position, they are much easier to fix.

I'm not sure what your situation is where you can lose a single value and you can still be confident that the remaining values are delivered in the correct order, but it's not a situation classical coding theory addresses.

What algorithmist is suggesting above is this: If you can restrict yourself to just 7 bits of information, you can fill the 8th bit of each byte with alternating 0 and 1, which will allow you to know the placement of the missing byte. That is, put a 0 in the high bit of bytes 0, 2, 4, 6 and a 1 in the high bits of the others. On the receiving end, if you only receive 7 bytes, the missing one will have been dropped from between bytes whose high bits match. Unfortunately, that's not quite right: if the erasure and the error are adjacent, you can't know immediately which byte was dropped. E.g., high bits 0101101 could result from dropping the 4th byte, or from an error in the 4th byte and dropping the 3rd, or from an error in the 4th byte and dropping the 5th.

You could use the linear code:

1 0 0 0  0 1 1 1
0 1 0 0  1 0 1 1
0 0 1 0  1 1 0 1
0 0 0 1  1 1 1 0

(i.e. you'll send data like (a, b, c, d, b+c+d, a+c+d, a+b+d, a+b+c) (where addition is implemented with XOR, since a,b,c,d are elements of GF(128))). It's a linear code with distance 4, so it can correct a single-byte error. You can decode with syndrome decoding, and since the code is self-dual, the matrix H will be the same as above.

In the case where there's a dropped byte, you can use the technique above to determine which one it is. Once you've determined that, you're essentially decoding a different code - the "punctured" code created by dropping that given byte. Since the punctured code is still linear, you can use syndrome decoding to determine the error. You would have to calculate the parity-check matrix for each of the shortened codes, but you can do this ahead of time. The shortened code has distance 3, so it can correct any single-byte errors.

In the case of decimal digits, assuming one goes with first digit odd, second digit even, third digit odd, etc - with two digits, you get 00-99, which can be represented in 3 odd/even/odd digits (125 total combinations) - 00 = 101, 01 = 103, 20 = 181, 99 = 789, etc. So one encodes two sets of decimal digits into 6 total digits, then the last two digits signify things about the first sets of 2 digits or a checksum of some sort... The next to last digit, I suppose, could be some sort of odd/even indicator on each of the initial 2 digit initial messages (1 = even first 2 digits, 3 = odd first two digits) and follow the pattern of being odd. Then, the last digit could be the one's place of a sum of the individual digits, that way if a digit was missing, it would be immediately apparent and could be corrected assuming the last digit was correct. Although, it would throw things off if one of the last two digits were dropped...

It looks to be theoretically possible if we assume 1 bit error in wrong byte. We need 3 bits to identify dropped byte and 3 bits to identify wrong byte and 3 bits to identify wrong bit. We have 3 times that many extra bits.

But if we need to identify any number of bits error in wrong byte, it comes to 30 bits. Even that looks to be possible with 32 bits, although 32 is a bit too close for my comfort.

But I don't know hot to encode to get that. Try turbocode?

Actually, as Krystian said, when you correct a RS code, both the message AND the "parity" bytes will be corrected, as long as you have v+2e < (n-k) where v is the number of erasures (you know the position) and e is the number of errors. This means that if you only have errors, you can correct up to (n-k)/2 errors, or (n-k-1) erasures (about the double of the number of errors), or a mix of both (see Blahut's article: Transform techniques for error control codes and A universal Reed-Solomon decoder).

What's even nicer is that you can check that the correction was successful: by checking that the syndrome polynomial only contains 0 coefficients, you know that the message+parity bytes are both correct. You can do that before to check if the message needs any correction, and also you can do the check after the decoding to check that both the message and the parity bytes were completely repaired.

The bound v+2e < (n-k) is optimal, you cannot do better (that's why Reed-Solomon is called an optimal error correction code). In fact it's possible to go beyond this limit using bruteforce approaches, up to a certain point (you can gain 1 or 2 more symbols for each 8 symbols) using list decoding, but it's still a domain in its infancy, I don't know of any practical implementation that works.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow