Prefix encoding of algebraic data types

https://cs.stackexchange.com/questions/74065

04-11-2019
|

Pergunta

I'm new to coding theory and formal proofs, and am struggling to understand how to construct and reason about prefix-free encoding algorithms on algebraic data types. I hope it's clear if I use notation from haskell below.

I'm tasked with writing a generic prefix encoding algorithm (encode) on ADTs, as well as giving guidance to my users in defining their own implementations of encode for their own types if they wish (this being a major reason I want to understand things better).

I've been jumping around sections of Berstel and Perrin's Theory of Codes which has been very helpful, but it's slow going and my goals are pretty narrow here.

One implementation

I believe I have one obvious implementation, which works like this:

for products, concatenate the results of applying encode to each of the children (arguments to the constructor):

data P a b c ... = P a b c ...

encode (P a b c...) = encode a <> encode b <> encode c <> ...

for coproducts (or a sum of products), assign to each constructor a distinct number tag (using a prefix encoding, to be pedantic) which will start the code word, then apply the product rule above

data XYZ a b c d... = X a b ... | Y c d ... | Z ...

encode (X a b...) = encode 1 <> encode a <> encode b <> ...
encode (Y c d...) = encode 2 <> encode c <> encode d <> ...
encode (Z ...)    = encode 3 <> ...
...

The product rule I'm comfortable with: I've proved to myself that the cartesian product of two independent prefix codes is also prefix.

The coproduct rule also seems to be obviously correct if we're thinking in terms of sums-of-products, and if the product rule holds.

Where I'm struggling, I think is in reasoning about recursion, for instance for inductively-defined lists:

List a = Cons a (List a) | Empty

I believe my algorithm works in the presence of recursion, but how can I go about proving this using induction? I think my trouble is that "prefix" is a property of the (infinite) set of code words, not any one.

Other possibly-better algorithms

A proof of the algorithm above would be helpful, but my goal is to have an algorithm that tends to generate compact code words (in fact I don't care about decoding at all; only that the range of encode be prefix), and so I want to understand this more fully.

For instance there is a better algorithm for List. An alternative would be:

encode l = encode (length l) <> map encode l

It occurred to me that maybe this is just a special case of a run-length encoding of a variation of the algorithm in section 1, where constructor tags are all concatenated before the concatenation of encoding the list elements.

But why should it be valid to do that? Can this algorithm for lists be generalized to produce better (more compact) prefix codes for arbitrary ADTs?

Nenhuma solução correta

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange