How to decode Huffman code quickly?

https://stackoverflow.com/questions/2235208

19-09-2019
Question

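I have implemented a simple compressor using pure Huffman coding on Windows. However, I do not know much about how to decode the compressed file quickly. My bad algorithm is:

Enumerate all the Huffman codes in the code table, then compare each of them with the bits in the compressed file. The result is horrible: decompressing a 3 MB file takes 6 hours.

Could you suggest a much more efficient algorithm? Should I use a hash or something?

Update: I have implemented the decoder with a state table, based on my friend Lin's advice. I think this method should be better than traversing the Huffman tree: 3 MB now decodes within 6 s.

Thanks.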

Solution

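One way to optimise the binary-tree approach is to use a lookup table. You arrange the table so that a particular encoded bit pattern can be looked up directly, making the index as wide as the maximum possible bit width of any code.

Since most codes do not use the full maximum width, each of them appears at multiple locations in the table: one location for each combination of the unused bits. The table entry indicates how many bits to discard from the input as well as the decoded output.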
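A minimal sketch of that full-width table, under stated assumptions: the four-symbol code (a=0, b=10, c=110, d=111), the names `Entry`, `build_table` and `decode`, and the MSB-first bit order are all made up for illustration and are not taken from any particular compressor. The caller supplies the number of symbols to decode, standing in for an end-of-stream marker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BITS 3                  /* longest code in the toy code below */

typedef struct {
    uint8_t symbol;                 /* decoded output */
    uint8_t length;                 /* number of input bits to discard */
} Entry;

/* Hypothetical toy prefix code: a=0, b=10, c=110, d=111 */
static const struct { uint8_t symbol, code, length; } CODES[] = {
    {'a', 0x0, 1}, {'b', 0x2, 2}, {'c', 0x6, 3}, {'d', 0x7, 3},
};

/* Store each code once for every combination of its unused low bits. */
static void build_table(Entry table[1 << MAX_BITS])
{
    for (size_t i = 0; i < sizeof CODES / sizeof CODES[0]; i++) {
        assert(CODES[i].length >= 1 && CODES[i].length <= MAX_BITS);
        unsigned pad = MAX_BITS - CODES[i].length;
        unsigned first = (unsigned)CODES[i].code << pad;
        for (unsigned j = 0; j < (1u << pad); j++) {
            table[first + j].symbol = CODES[i].symbol;
            table[first + j].length = CODES[i].length;
        }
    }
}

/* Decode `count` symbols from an MSB-first bit stream.
 * `out` must have room for count + 1 chars. */
static void decode(const uint8_t *in, size_t in_len, char *out, size_t count)
{
    Entry table[1 << MAX_BITS];
    uint32_t bitbuf = 0;            /* bits enter at the low end */
    int valid = 0;                  /* number of unread bits in bitbuf */
    size_t pos = 0;

    build_table(table);
    for (size_t i = 0; i < count; i++) {
        while (valid < MAX_BITS) {  /* refill; pad with zeros past the end */
            bitbuf = (bitbuf << 8) | (pos < in_len ? in[pos] : 0);
            pos++;
            valid += 8;
        }
        /* Peek MAX_BITS bits, look them up, consume only the code length. */
        Entry e = table[(bitbuf >> (valid - MAX_BITS)) & ((1u << MAX_BITS) - 1)];
        out[i] = (char)e.symbol;
        valid -= e.length;
    }
    out[count] = '\0';
}
```

With this toy code, the two bytes 0x5B 0x80 hold the bits 0 10 110 111 followed by padding, so decoding four symbols from them gives "abcd". Each symbol costs one table lookup regardless of its code length, at the price of a table with 2^MAX_BITS entries.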

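If the longest code is so long that the table becomes impractical, a compromise is to use a tree of smaller lookups, each with a fixed-width index. For example, you can use a 256-entry table to handle a byte. If the input code is longer than 8 bits, the table entry indicates that decoding is incomplete and directs you to a table that handles the next (up to) 8 bits. Larger tables trade memory for speed, and 256 entries is probably too small.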
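The compromise could look roughly like the sketch below. To keep the tables readable it uses 2-bit lookups instead of the 8-bit ones mentioned above, and a hypothetical five-symbol code (a=0, b=10, c=110, d=1110, e=1111); `ROOT_BITS`, `SUB_BITS`, `Entry`, `Bits` and the function names are illustrative assumptions. A root entry with length 0 means "decoding is incomplete, continue in the sub-table for this prefix".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROOT_BITS 2     /* width of the first lookup (8 in the text above) */
#define SUB_BITS  2     /* width of each second-level lookup */
#define SIZE(b)   (1u << (b))

typedef struct {
    uint8_t symbol;     /* decoded value, valid when length > 0 */
    uint8_t length;     /* bits used at this level; 0 = continue in sub-table */
} Entry;

/* Hypothetical toy code: a=0, b=10, c=110, d=1110, e=1111 */
static const struct { uint8_t symbol, code, length; } CODES[] = {
    {'a', 0x0, 1}, {'b', 0x2, 2}, {'c', 0x6, 3}, {'d', 0xE, 4}, {'e', 0xF, 4},
};

static Entry root[SIZE(ROOT_BITS)];                /* zero-initialised */
static Entry sub[SIZE(ROOT_BITS)][SIZE(SUB_BITS)]; /* one per root prefix */

/* Replicate a len-bit code over all paddings in a table indexed by width bits. */
static void fill(Entry *t, unsigned width, unsigned code, unsigned len, uint8_t sym)
{
    unsigned pad = width - len;
    for (unsigned j = 0; j < SIZE(pad); j++) {
        t[(code << pad) + j].symbol = sym;
        t[(code << pad) + j].length = (uint8_t)len;
    }
}

static void build_tables(void)
{
    for (size_t i = 0; i < sizeof CODES / sizeof CODES[0]; i++) {
        unsigned code = CODES[i].code, len = CODES[i].length;
        assert(len >= 1 && len <= ROOT_BITS + SUB_BITS);
        if (len <= ROOT_BITS) {
            fill(root, ROOT_BITS, code, len, CODES[i].symbol);
        } else {        /* root[prefix].length stays 0: decoding incomplete */
            unsigned rest = len - ROOT_BITS;
            fill(sub[code >> rest], SUB_BITS, code & (SIZE(rest) - 1), rest,
                 CODES[i].symbol);
        }
    }
}

typedef struct { const uint8_t *in; size_t len, pos; uint32_t buf; int valid; } Bits;

/* Return the next n unread bits (MSB first) without consuming them. */
static unsigned peek(Bits *b, int n)
{
    while (b->valid < n) {
        b->buf = (b->buf << 8) | (b->pos < b->len ? b->in[b->pos] : 0);
        b->pos++;
        b->valid += 8;
    }
    return (b->buf >> (b->valid - n)) & (SIZE(n) - 1);
}

/* Decode `count` symbols; `out` must have room for count + 1 chars. */
static void decode(const uint8_t *in, size_t in_len, char *out, size_t count)
{
    Bits b = { in, in_len, 0, 0, 0 };
    build_tables();
    for (size_t i = 0; i < count; i++) {
        unsigned prefix = peek(&b, ROOT_BITS);
        Entry e = root[prefix];
        if (e.length == 0) {            /* long code: consume the prefix, */
            b.valid -= ROOT_BITS;       /* then index the linked sub-table */
            e = sub[prefix][peek(&b, SUB_BITS)];
        }
        out[i] = (char)e.symbol;
        b.valid -= e.length;
    }
    out[count] = '\0';
}
```

Here the bytes 0x5B 0xBC carry the bits 0 10 110 1110 1111, so decoding five symbols gives "abcde". Short codes still finish in one lookup, and only codes longer than ROOT_BITS pay for a second one.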

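I believe this general approach is called "prefix tables", and it is what the code BobMcGee quoted is doing. A likely difference is that some compression algorithms require the prefix table to be updated during decompression; this is not needed for simple Huffman. IIRC, I first saw it in a book about bitmapped graphics file formats, which included GIF, some time before the patent panic.

It should be easy to precalculate a full lookup table, a hash-table equivalent, or a tree of small tables from a binary tree model. The binary tree is still the key representation of the code; the lookup table is just an optimisation.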

Other answers

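Why not take a look at how the GZIP source does it, specifically the Huffman decompression code in unpack.c? It is doing exactly what you are doing, except that it does it much, much faster.

From what I can tell, it uses a lookup array and shift/mask operations working on whole words to run faster. Pretty dense code, though.

EDIT: here is the complete source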

/* unpack.c -- decompress files in pack format.
 * Copyright (C) 1992-1993 Jean-loup Gailly
 * This is free software; you can redistribute it and/or modify it under the
 * terms of the GNU General Public License, see the file COPYING.
 */

#ifdef RCSID
static char rcsid[] = "$Id: unpack.c,v 1.4 1993/06/11 19:25:36 jloup Exp $";
#endif

#include "tailor.h"
#include "gzip.h"
#include "crypt.h"

#define MIN(a,b) ((a) <= (b) ? (a) : (b))
/* The arguments must not have side effects. */

#define MAX_BITLEN 25
/* Maximum length of Huffman codes. (Minor modifications to the code
 * would be needed to support 32 bits codes, but pack never generates
 * more than 24 bits anyway.)
 */

#define LITERALS 256
/* Number of literals, excluding the End of Block (EOB) code */

#define MAX_PEEK 12
/* Maximum number of 'peek' bits used to optimize traversal of the
 * Huffman tree.
 */

local ulg orig_len;       /* original uncompressed length */
local int max_len;        /* maximum bit length of Huffman codes */

local uch literal[LITERALS];
/* The literal bytes present in the Huffman tree. The EOB code is not
 * represented.
 */

local int lit_base[MAX_BITLEN+1];
/* All literals of a given bit length are contiguous in literal[] and
 * have contiguous codes. literal[code+lit_base[len]] is the literal
 * for a code of len bits.
 */

local int leaves [MAX_BITLEN+1]; /* Number of leaves for each bit length */
local int parents[MAX_BITLEN+1]; /* Number of parents for each bit length */

local int peek_bits; /* Number of peek bits currently used */

/* local uch prefix_len[1 << MAX_PEEK]; */
#define prefix_len outbuf
/* For each bit pattern b of peek_bits bits, prefix_len[b] is the length
 * of the Huffman code starting with a prefix of b (upper bits), or 0
 * if all codes of prefix b have more than peek_bits bits. It is not
 * necessary to have a huge table (large MAX_PEEK) because most of the
 * codes encountered in the input stream are short codes (by construction).
 * So for most codes a single lookup will be necessary.
 */
#if (1<<MAX_PEEK) > OUTBUFSIZ
    error cannot overlay prefix_len and outbuf
#endif

local ulg bitbuf;
/* Bits are added on the low part of bitbuf and read from the high part. */

local int valid;                  /* number of valid bits in bitbuf */
/* all bits above the last valid bit are always zero */

/* Set code to the next 'bits' input bits without skipping them. code
 * must be the name of a simple variable and bits must not have side effects.
 * IN assertions: bits <= 25 (so that we still have room for an extra byte
 * when valid is only 24), and mask = (1<<bits)-1.
 */
#define look_bits(code,bits,mask) \
{ \
  while (valid < (bits)) bitbuf = (bitbuf<<8) | (ulg)get_byte(), valid += 8; \
  code = (bitbuf >> (valid-(bits))) & (mask); \
}

/* Skip the given number of bits (after having peeked at them): */
#define skip_bits(bits)  (valid -= (bits))

#define clear_bitbuf() (valid = 0, bitbuf = 0)

/* Local functions */

local void read_tree  OF((void));
local void build_tree OF((void));

/* ===========================================================================
 * Read the Huffman tree.
 */
local void read_tree()
{
    int len;  /* bit length */
    int base; /* base offset for a sequence of leaves */
    int n;

    /* Read the original input size, MSB first */
    orig_len = 0;
    for (n = 1; n <= 4; n++) orig_len = (orig_len << 8) | (ulg)get_byte();

    max_len = (int)get_byte(); /* maximum bit length of Huffman codes */
    if (max_len > MAX_BITLEN) {
        error("invalid compressed data -- Huffman code > 32 bits");
    }

    /* Get the number of leaves at each bit length */
    n = 0;
    for (len = 1; len <= max_len; len++) {
        leaves[len] = (int)get_byte();
        n += leaves[len];
    }
    if (n > LITERALS) {
        error("too many leaves in Huffman tree");
    }
    Trace((stderr, "orig_len %ld, max_len %d, leaves %d\n",
           orig_len, max_len, n));
    /* There are at least 2 and at most 256 leaves of length max_len.
     * (Pack arbitrarily rejects empty files and files consisting of
     * a single byte even repeated.) To fit the last leaf count in a
     * byte, it is offset by 2. However, the last literal is the EOB
     * code, and is not transmitted explicitly in the tree, so we must
     * adjust here by one only.
     */
    leaves[max_len]++;

    /* Now read the leaves themselves */
    base = 0;
    for (len = 1; len <= max_len; len++) {
        /* Remember where the literals of this length start in literal[] : */
        lit_base[len] = base;
        /* And read the literals: */
        for (n = leaves[len]; n > 0; n--) {
            literal[base++] = (uch)get_byte();
        }
    }
    leaves[max_len]++; /* Now include the EOB code in the Huffman tree */
}

/* ===========================================================================
 * Build the Huffman tree and the prefix table.
 */
local void build_tree()
{
    int nodes = 0; /* number of nodes (parents+leaves) at current bit length */
    int len;       /* current bit length */
    uch *prefixp;  /* pointer in prefix_len */

    for (len = max_len; len >= 1; len--) {
        /* The number of parent nodes at this level is half the total
         * number of nodes at parent level:
         */
        nodes >>= 1;
        parents[len] = nodes;
        /* Update lit_base by the appropriate bias to skip the parent nodes
         * (which are not represented in the literal array):
         */
        lit_base[len] -= nodes;
        /* Restore nodes to be parents+leaves: */
        nodes += leaves[len];
    }
    /* Construct the prefix table, from shortest leaves to longest ones.
     * The shortest code is all ones, so we start at the end of the table.
     */
    peek_bits = MIN(max_len, MAX_PEEK);
    prefixp = &prefix_len[1<<peek_bits];
    for (len = 1; len <= peek_bits; len++) {
        int prefixes = leaves[len] << (peek_bits-len); /* may be 0 */
        while (prefixes--) *--prefixp = (uch)len;
    }
    /* The length of all other codes is unknown: */
    while (prefixp > prefix_len) *--prefixp = 0;
}

/* ===========================================================================
 * Unpack in to out.  This routine does not support the old pack format
 * with magic header \037\037.
 *
 * IN assertions: the buffer inbuf contains already the beginning of
 *   the compressed data, from offsets inptr to insize-1 included.
 *   The magic header has already been checked. The output buffer is cleared.
 */
int unpack(in, out)
    int in, out;            /* input and output file descriptors */
{
    int len;                /* Bit length of current code */
    unsigned eob;           /* End Of Block code */
    register unsigned peek; /* lookahead bits */
    unsigned peek_mask;     /* Mask for peek_bits bits */

    ifd = in;
    ofd = out;

    read_tree();     /* Read the Huffman tree */
    build_tree();    /* Build the prefix table */
    clear_bitbuf();  /* Initialize bit input */
    peek_mask = (1<<peek_bits)-1;

    /* The eob code is the largest code among all leaves of maximal length: */
    eob = leaves[max_len]-1;
    Trace((stderr, "eob %d %x\n", max_len, eob));

    /* Decode the input data: */
    for (;;) {
        /* Since eob is the longest code and not shorter than max_len,
         * we can peek at max_len bits without having the risk of reading
         * beyond the end of file.
         */
        look_bits(peek, peek_bits, peek_mask);
        len = prefix_len[peek];
        if (len > 0) {
            peek >>= peek_bits - len; /* discard the extra bits */
        } else {
            /* Code of more than peek_bits bits, we must traverse the tree */
            ulg mask = peek_mask;
            len = peek_bits;
            do {
                len++, mask = (mask<<1)+1;
                look_bits(peek, len, mask);
            } while (peek < (unsigned)parents[len]);
            /* loop as long as peek is a parent node */
        }
        /* At this point, peek is the next complete code, of len bits */
        if (peek == eob && len == max_len) break; /* end of file? */
        put_ubyte(literal[peek+lit_base[len]]);
        Tracev((stderr,"%02d %04x %c\n", len, peek,
                literal[peek+lit_base[len]]));
        skip_bits(len);
    } /* for (;;) */

    flush_window();
    Trace((stderr, "bytes_out %ld\n", bytes_out));
    if (orig_len != (ulg)bytes_out) {
        error("invalid compressed data--length error");
    }
    return OK;
}

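The typical way to decompress a Huffman code is to use a binary tree. You insert your codes into the tree so that each bit of a code represents a branch to the left (0) or to the right (1), with the decoded bytes (or whatever values you have) in the leaves.

Decoding is then just a matter of reading bits from the coded content and walking the tree one bit at a time. When you reach a leaf, emit its decoded value, return to the root, and keep reading until the input is exhausted.

Update: this page describes the technique and has nice graphics.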
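As a rough illustration of the tree walk, here is a sketch that stores the tree in an array. The `Node` layout, the hard-coded toy tree (a=0, b=10, c=110, d=111) and the MSB-first bit order are assumptions chosen for the example; a real decoder would build the tree from the code table stored in the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tree node stored in an array; a child index of -1 marks a leaf. */
typedef struct {
    int child[2];       /* child[0] = left (bit 0), child[1] = right (bit 1) */
    uint8_t symbol;     /* decoded value, meaningful only in leaves */
} Node;

/* Hypothetical toy tree for a=0, b=10, c=110, d=111 */
static const Node TREE[] = {
    /* 0: root */ {{ 1,  2}, 0},
    /* 1       */ {{-1, -1}, 'a'},
    /* 2       */ {{ 3,  4}, 0},
    /* 3       */ {{-1, -1}, 'b'},
    /* 4       */ {{ 5,  6}, 0},
    /* 5       */ {{-1, -1}, 'c'},
    /* 6       */ {{-1, -1}, 'd'},
};

/* Walk the tree one bit at a time over the first `nbits` bits of `in`.
 * Writes at most `cap` symbols plus a terminating NUL to `out` and
 * returns the number of symbols decoded. */
static size_t decode(const uint8_t *in, size_t nbits, char *out, size_t cap)
{
    size_t n = 0;
    int node = 0;                                   /* start at the root */

    for (size_t i = 0; i < nbits; i++) {
        int bit = (in[i / 8] >> (7 - i % 8)) & 1;   /* MSB-first */
        node = TREE[node].child[bit];
        if (TREE[node].child[0] < 0) {              /* reached a leaf */
            assert(n < cap);
            out[n++] = (char)TREE[node].symbol;
            node = 0;                               /* back to the root */
        }
    }
    out[n] = '\0';
    return n;
}
```

Decoding the first 9 bits of the bytes 0x5B 0x80 (0 10 110 111) this way gives "abcd". The loop does one memory access and one branch per input bit, which is the cost that the table-driven variants reduce.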

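You can perform a kind of batch lookup on top of the usual Huffman tree lookup:

Choose a bit depth (call it depth N); this is a trade-off between speed, memory, and the time invested in constructing the tables.
Build a lookup table for all 2^N bit strings of length N. Each entry may encode several complete tokens; usually there will also be leftover bits that are only a prefix of a Huffman code. For each such prefix, make a link to a further lookup table for that code.
Build the further lookup tables. The total number of tables is at most one less than the number of entries coded in the Huffman tree.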
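The steps above can be sketched as follows, reading "one table per leftover prefix" as one table per internal node of the tree. Everything concrete here is an assumption for illustration: N is 4 so that a byte splits into two lookups, the toy code is a=0, b=10, c=110, d=111, and the names `Entry`, `build_tables` and `decode` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N      4    /* bits consumed per lookup */
#define STATES 3    /* internal nodes of the toy tree = number of tables */

/* Nodes 0..STATES-1 are internal, the remaining nodes are leaves. */
typedef struct { int child[2]; uint8_t symbol; } Node;

/* Hypothetical toy tree for a=0, b=10, c=110, d=111 */
static const Node TREE[] = {
    {{3, 1}, 0}, {{4, 2}, 0}, {{5, 6}, 0},          /* internal: 0, 1, 2 */
    {{-1, -1}, 'a'}, {{-1, -1}, 'b'},               /* leaves: 3, 4 */
    {{-1, -1}, 'c'}, {{-1, -1}, 'd'},               /* leaves: 5, 6 */
};

typedef struct {
    uint8_t count;      /* complete tokens found in these N bits */
    uint8_t symbols[N]; /* at most N, since every code has at least 1 bit */
    uint8_t next;       /* table to use next (the leftover code prefix) */
} Entry;

static Entry table[STATES][1 << N];

/* For every state and every N-bit chunk, pre-walk the tree once. */
static void build_tables(void)
{
    for (int state = 0; state < STATES; state++) {
        for (unsigned chunk = 0; chunk < (1u << N); chunk++) {
            Entry *e = &table[state][chunk];
            int node = state;
            e->count = 0;
            for (int bit = N - 1; bit >= 0; bit--) {
                node = TREE[node].child[(chunk >> bit) & 1];
                if (node >= STATES) {               /* leaf: token complete */
                    e->symbols[e->count++] = TREE[node].symbol;
                    node = 0;
                }
            }
            e->next = (uint8_t)node;
        }
    }
}

/* Decode up to `count` symbols, N bits per step. `out` needs count + 1
 * chars. Returns the number of symbols written. */
static size_t decode(const uint8_t *in, size_t in_len, char *out, size_t count)
{
    int state = 0;
    size_t n = 0;

    build_tables();
    for (size_t i = 0; i < 2 * in_len && n < count; i++) {
        unsigned chunk = (i % 2 == 0) ? in[i / 2] >> 4 : in[i / 2] & 0xFu;
        const Entry *e = &table[state][chunk];
        for (unsigned k = 0; k < e->count && n < count; k++)
            out[n++] = (char)e->symbols[k];
        state = e->next;
    }
    assert(n <= count);
    out[n] = '\0';
    return n;
}
```

For the bytes 0x5B 0x80 and a count of 4 this yields "abcd": the first nibble emits "ab" and leaves the prefix 1 pending, the second emits "c" and leaves 11 pending, and the third completes "d". The explicit count matters because the padding bits after the last code would otherwise decode to extra symbols.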

For example, choosing a depth of 8, or a multiple of 8, is a good fit for bit-shifting operations.

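PS: This differs from the idea in Potatoswatter's comment on unwind's answer, and from Steve314's answer, in that it uses multiple tables. This means that all of each N-bit lookup is put to use, so it should be faster, but it makes table construction and lookup significantly trickier and consumes much more space for a given depth.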

Why not use the decompression algorithm in the same source module? It appears to be a decent algorithm.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow