Question

My problem is that I have 100,000+ different elements, and my understanding is that Huffman works by assigning the most common element the code 0, the next 10, the next 110, then 1110, 11110 and so on. My question is: if the code for the nth element is n bits long, then surely once I have passed the 32nd term it is more space efficient to just send 32-bit data types as they are, such as ints for example? Have I missed something in the methodology?

Thanks for any help you can offer. My current implementation works by doing

code = (code << 1) + 2;

to generate each new code (which seems to be correct!), but the only way I could encode more than 100,000 elements would be to hold an int[] inside some makeshift new data type and read values out of that int array as one continuous long symbol... which is surely no more space efficient than just carrying a 32-bit int? Or is it more a case of Huffman's usefulness lying in its prefix codes, which let every value in a continuous bit stream be determined unambiguously?

Thanks


Solution

Your understanding is a bit off - take a look at http://en.wikipedia.org/wiki/Huffman_coding. And you have to pack the encoded bits into machine words in order to get compression - Huffman encoded data can best be thought of as a bit-stream.
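For instance, a minimal bit-packing sketch in Java might look like the following (the BitWriter name and its buffer handling are my own illustration, not taken from any particular library):

import java.io.ByteArrayOutputStream;

// Minimal sketch of a bit-stream writer: variable-length codewords are packed
// into bytes rather than being stored one codeword per machine word.
class BitWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int current = 0;  // bits accumulated so far, most-significant first
    private int nBits = 0;    // how many bits of 'current' are in use

    // Append the low 'length' bits of 'code', most-significant bit first.
    void writeBits(int code, int length) {
        for (int i = length - 1; i >= 0; i--) {
            current = (current << 1) | ((code >>> i) & 1);
            if (++nBits == 8) {   // a full byte is ready
                out.write(current);
                current = 0;
                nBits = 0;
            }
        }
    }

    // Pad the final partial byte with zeros and return the packed stream.
    byte[] finish() {
        if (nBits > 0) {
            out.write(current << (8 - nBits));
        }
        return out.toByteArray();
    }
}

With a writer like this, a 2-bit codeword and a 17-bit codeword simply sit next to each other in the output; nothing gets padded out to a full 32-bit word.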

Other tips

You seem to understand the principle of prefix codes.

Could you tell us a little more about these 100,000+ different elements you mention?

The fastest prefix codes -- universal codes -- do, in fact, involve a series of bit sequences that can be pre-generated without regard to the actual symbol frequencies. Compression programs that use these codes, as you mentioned, associate the most-frequent input symbol with the shortest bit sequence, the next-most-frequent input symbol with the next-shortest bit sequence, and so on.

What you describe is one particular kind of prefix code: unary coding. Another popular variant of the unary coding system assigns elements in order of frequency to the fixed codes "1", "01", "001", "0001", "00001", "000001", etc.
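As a concrete illustration of that "1", "01", "001", ... variant (my own sketch, using strings of 0s and 1s for readability rather than a packed bit stream):

// Sketch of the "1", "01", "001", ... unary variant.
class UnaryCode {
    // Codeword for rank r (0-based, most frequent symbol first):
    // r zeros followed by a single 1, so r + 1 bits in total.
    static String encode(int rank) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rank; i++) sb.append('0');
        return sb.append('1').toString();
    }

    // Decode by counting zeros up to the terminating 1; returns the rank.
    static int decode(String bits, int start) {
        int r = 0;
        while (bits.charAt(start + r) == '0') r++;
        return r;  // the codeword occupied r + 1 bits
    }

    public static void main(String[] args) {
        System.out.println(encode(0) + " " + encode(1) + " " + encode(2));  // 1 01 001
        System.out.println(encode(31).length());  // the 32nd symbol already costs 32 bits
    }
}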

Some compression programs use another popular prefix code: Elias gamma coding. Elias gamma coding assigns elements, in order of frequency, to the fixed set of codewords

1
010
011
00100
00101
00110
00111
0001000
0001001
0001010
0001011
0001100
0001101
0001110
0001111
000010000
000010001
000010010
...

The 32nd Elias gamma codeword is 11 bits long, roughly a third the length of the 32nd unary codeword; in general, the Elias gamma codeword for the integer n is 2*floor(log2(n)) + 1 bits long. The 100,000th Elias gamma codeword is therefore 33 bits long -- about the size of the raw 32-bit int you were considering.

If you look carefully, you can see that each Elias gamma codeword can be split into 2 parts -- the first part is more or less the unary code you are familiar with. That unary code tells the decoder how many more bits follow afterward in the rest of that particular Elias gamma codeword.
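A sketch of that two-part structure (again my own illustration, with 0/1 strings standing in for the packed bit stream):

// Elias gamma coding for positive integers n >= 1; the n-th most frequent
// symbol, counted from 1, would be encoded as the integer n.
class EliasGamma {
    // floor(log2(n)) zeros, then the binary form of n (floor(log2(n)) + 1 bits).
    static String encode(int n) {
        String binary = Integer.toBinaryString(n);   // e.g. 5 -> "101"
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) sb.append('0');
        return sb.append(binary).toString();         // e.g. 5 -> "00101"
    }

    // The run of leading zeros is the unary part: it tells the decoder how many
    // bits follow after the first 1, and therefore where the codeword ends.
    static int decode(String bits, int start) {
        int zeros = 0;
        while (bits.charAt(start + zeros) == '0') zeros++;
        int value = 1;                               // the leading 1 of the binary part
        for (int i = 0; i < zeros; i++) {
            value = (value << 1) | (bits.charAt(start + zeros + 1 + i) - '0');
        }
        return value;                                // this codeword was 2 * zeros + 1 bits
    }

    public static void main(String[] args) {
        System.out.println(encode(4));               // 00100, as in the list above
        System.out.println(encode(32).length());     // 11
        System.out.println(encode(100000).length()); // 33
    }
}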

There are many other kinds of prefix codes. Many people (confusingly) refer to all prefix codes as "Huffman codes".

When compressing some particular data file, some prefix codes do better at compression than others. How do you decide which one to use? Which prefix code is the best for some particular data file?

The Huffman algorithm -- if you neglect the overhead of the Huffman frequency table -- chooses exactly the best prefix code for each data file. There is no singular "the" Huffman code that can be pre-generated without regard to the actual symbol frequencies. The prefix code chosen by the Huffman algorithm is usually different for different files.
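To make that concrete, here is a minimal sketch of the Huffman construction itself (my own illustration; a real implementation would also canonicalize the codes and serialize the table):

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Minimal Huffman construction: repeatedly merge the two least-frequent nodes,
// then read each symbol's codeword off its path from the root.
class HuffmanSketch {
    static final class Node {
        final long freq;
        final int symbol;        // -1 marks an internal node
        final Node left, right;
        Node(long freq, int symbol, Node left, Node right) {
            this.freq = freq; this.symbol = symbol; this.left = left; this.right = right;
        }
    }

    static Map<Integer, String> buildCodes(Map<Integer, Long> frequencies) {
        PriorityQueue<Node> heap = new PriorityQueue<>((a, b) -> Long.compare(a.freq, b.freq));
        frequencies.forEach((sym, f) -> heap.add(new Node(f, sym, null, null)));
        while (heap.size() > 1) {
            Node a = heap.poll(), b = heap.poll();
            heap.add(new Node(a.freq + b.freq, -1, a, b));   // merge the two rarest subtrees
        }
        Map<Integer, String> codes = new HashMap<>();
        assign(heap.poll(), "", codes);
        return codes;
    }

    private static void assign(Node node, String prefix, Map<Integer, String> codes) {
        if (node.symbol >= 0) {
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);  // lone-symbol edge case
        } else {
            assign(node.left, prefix + "0", codes);
            assign(node.right, prefix + "1", codes);
        }
    }
}

The codes that come out depend entirely on the frequency map you feed in, which is why the table (or the frequencies themselves) has to travel alongside the compressed data -- and with 100,000+ distinct symbols that table is exactly the overhead discussed next.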

The Huffman algorithm doesn't compress very well when we really do have 100,000+ unique elements -- the overhead of the Huffman frequency table becomes so large that we often can find some other "suboptimal" prefix code that actually gives better net compression. Or perhaps some entirely different data compression algorithm might work even better in your application.

The "Huffword" implementation seems to work with around 32,000 or so unique elements, but the overwhelming majority of Huffman code implementations I've seen work with around 257 unique elements (the 256 possible byte values, and the end-of-text indicator).

You might consider simply storing your data on disk in some raw, "uncompressed" format. (With 100,000+ unique elements, you will inevitably end up storing many of those elements in 3 or more bytes.) Those 257-value implementations of Huffman compression will still be able to compress that file; they simply re-interpret its bytes as 256 different symbols.

My question is, if the code for the nth element is n bits long, then surely once I have passed the 32nd term it is more space efficient to just send 32-bit data types as they are, such as ints for example? Have I missed something in the methodology?

One of the more counter-intuitive features of prefix codes is that some symbols (the rare symbols) are "compressed" into much longer bit sequences. If you actually have 2^8 unique symbols (all possible 8 bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 8 bits or less. By allowing the compressor to expand rare values -- to use more than 8 bits to store a rare symbol that we know can be stored in 8 bits -- that frees up the compressor to use less than 8 bits to store the more-frequent symbols.
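As a small worked example of that trade-off (the frequencies here are invented purely for illustration): suppose one byte value makes up half of the input and the other 255 values share the remaining half. A prefix code that gives the common value a 1-bit codeword and each rare value a 9-bit codeword averages roughly 0.5 * 1 + 0.5 * 9 = 5 bits per symbol -- a clear win over 8 bits per symbol, even though every rare symbol "expanded" from 8 bits to 9.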

related: Maximum number of different numbers, Huffman Compression

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow