Letter dictionary for JavaScript LZW compression ("only use these chars" string)

https://stackoverflow.com/questions/15413153

23-03-2022

Question

Good day to all readers and helpers. I want to use a JavaScript function I recently found that LZW-compresses a string.

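function lzw_encode(s) {
    // Prototype-less object, so phrases such as "toString" are not
    // mistaken for existing dictionary entries.
    var dict = Object.create(null);
    var data = (s + "").split("");
    var out = [];
    var currChar;
    var phrase = data[0];
    // Codes 0-255 stand for single characters, so new phrases start at 256.
    // This assumes every input character has a char code below 256.
    var code = 256;
    for (var i=1; i<data.length; i++) {
        currChar=data[i];
        if (dict[phrase + currChar] != null) {
            // Known phrase: keep extending it.
            phrase += currChar;
        }
        else {
            // Emit the code of the longest known phrase, then register the new one.
            out.push(phrase.length > 1 ? dict[phrase] : phrase.charCodeAt(0));
            dict[phrase + currChar] = code;
            code++;
            phrase=currChar;
        }
    }
    // Flush the last pending phrase.
    out.push(phrase.length > 1 ? dict[phrase] : phrase.charCodeAt(0));
    // Turn each numeric code into one UTF-16 code unit.
    for (var i=0; i<out.length; i++) {
        out[i] = String.fromCharCode(out[i]);
    }
    return out.join("");
}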

This function works very well. The only problem is that I want to transfer the encoded string via WebSockets without additional encoding (e.g. base64), and that does not always work. Sometimes the compressed string contains characters that cannot be transferred via WebSockets, and a JavaScript error is thrown saying that the string contains illegal characters. So my idea was to use only acceptable characters in the encoding process, like a "whitelist" of characters to be used for compression. From what I understand of the code, it turns each numeric code into the character with that char code, so I thought I could just create my own character set, but I don't really know how to implement it or whether it would even work.

Q1: What can I do so that my LZW encoding uses only the characters of a string that I define?
Q2: How else could I transfer, over HTTP/S, these Chinese, Arabic, and control characters that WebSockets refuse to transfer?

By the way, this is the error that Chrome throws:

Websocket message contains invalid character(s).
Uncaught Error: SYNTAX_ERR: DOM Exception 12

Update 1: I thought it might be helpful to see the decoding function as well:

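function lzw_decode(s) {
    var dict = {};
    var data = (s + "").split("");
    var currChar = data[0];
    var oldPhrase = currChar;
    var out = [currChar];
    // Must match the encoder: phrase codes start at 256.
    var code = 256;
    var phrase;
    for (var i=1; i<data.length; i++) {
        var currCode = data[i].charCodeAt(0);
        if (currCode < 256) {
            // Literal character.
            phrase = data[i];
        }
        else {
            // Dictionary code. If it is not defined yet, it must be the phrase
            // being built right now: oldPhrase plus its own first character.
            phrase = dict[currCode] ? dict[currCode] : (oldPhrase + currChar);
        }
        out.push(phrase);
        currChar = phrase.charAt(0);
        // Rebuild the same entry the encoder added at this step.
        dict[code] = oldPhrase + currChar;
        code++;
        oldPhrase = phrase;
    }
    return out.join("");
}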

I guess I would have to implement my custom charset here as well?

Solution

Determine what bytes you can and cannot send. (Hopefully from a reliable source of documentation, as opposed to testing, but verified with testing.)
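As a starting point for that testing, here is a minimal sketch of one check. It assumes the text message has to be valid UTF-8 (which RFC 6455 requires for text frames), so a string containing an unpaired surrogate code unit (0xD800 to 0xDFFF) cannot be encoded. This is only one plausible cause of the error, and the function name `hasLoneSurrogate` is made up for illustration.

```javascript
// Returns true if the string contains a surrogate code unit that is not part
// of a valid high+low pair. Such a string cannot be encoded as UTF-8.
function hasLoneSurrogate(str) {
    for (var i = 0; i < str.length; i++) {
        var c = str.charCodeAt(i);
        if (c >= 0xD800 && c <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i++; // valid pair, skip the low half
            } else {
                return true;
            }
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return true; // low surrogate without a preceding high one
        }
    }
    return false;
}
```

Running this over the output of lzw_encode for a message that fails would show whether surrogates are involved. The dictionary codes in that function grow from 256 upwards, so on long inputs they can eventually land in the surrogate range.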

Design an escape code where you use one of the valid characters as an escape character, and the next character, also one of the valid characters, encodes a byte you cannot send.
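A minimal sketch of such an escape code, working on arrays of byte values. The forbidden bytes, the escape byte and the names (`ESC`, `FORBIDDEN`, `SAFE_BASE`, `escapeBytes`, `unescapeBytes`) are all hypothetical; the real list has to come from the first step.

```javascript
// Hypothetical example values: suppose bytes 0x00, 0x0A and 0x0D cannot be sent.
var ESC = 0x5C;                          // "\" used as the escape byte (assumption)
var FORBIDDEN = [0x00, 0x0A, 0x0D, ESC]; // ESC itself must be escaped too
var SAFE_BASE = 0x30;                    // entry k is sent as the byte for "0" + k

// Replace every forbidden byte with ESC followed by a safe byte naming it.
function escapeBytes(bytes) {
    var out = [];
    for (var i = 0; i < bytes.length; i++) {
        var k = FORBIDDEN.indexOf(bytes[i]);
        if (k === -1) {
            out.push(bytes[i]);
        } else {
            out.push(ESC, SAFE_BASE + k);
        }
    }
    return out;
}

// Reverse of escapeBytes: ESC plus one safe byte becomes the original byte.
function unescapeBytes(bytes) {
    var out = [];
    for (var i = 0; i < bytes.length; i++) {
        if (bytes[i] === ESC) {
            i++;
            out.push(FORBIDDEN[bytes[i] - SAFE_BASE]);
        } else {
            out.push(bytes[i]);
        }
    }
    return out;
}
```

Each escaped byte costs one extra byte, so the scheme stays cheap as long as forbidden bytes are rare in the compressed output.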

Apply that to the output of your compressor. It is best to leave the job of compression to the compressor, and not try to saddle it with encoding. You should encode as a separate step.
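For Q1, one way to keep the two steps separate is to leave the compressor alone and re-encode its numeric codes with a whitelist alphabet afterwards. The sketch below assumes you have the codes as an array of non-negative integers (for example the values lzw_encode collects in out before it calls String.fromCharCode). The names `codesToText` and `textToCodes` and the fixed-width digit scheme are illustrative choices, not part of the original code.

```javascript
// Encode an array of non-negative integer codes as a string that uses only
// the characters in `alphabet`. Each code becomes `width` base-N digits,
// where N is the alphabet length.
function codesToText(codes, alphabet, width) {
    var base = alphabet.length;
    var out = "";
    for (var i = 0; i < codes.length; i++) {
        var n = codes[i];
        var digits = "";
        for (var d = 0; d < width; d++) {
            digits = alphabet.charAt(n % base) + digits;
            n = Math.floor(n / base);
        }
        if (n !== 0) {
            throw new Error("code " + codes[i] + " does not fit in " + width + " digits");
        }
        out += digits;
    }
    return out;
}

// Reverse of codesToText: read `width` characters at a time.
function textToCodes(text, alphabet, width) {
    var base = alphabet.length;
    var codes = [];
    for (var i = 0; i < text.length; i += width) {
        var n = 0;
        for (var d = 0; d < width; d++) {
            n = n * base + alphabet.indexOf(text.charAt(i + d));
        }
        codes.push(n);
    }
    return codes;
}
```

With a 64-character alphabet and a width of 3, codes up to 262143 fit. The price is three characters per code, which eats into the compression gain, so it is worth measuring against plain base64 of a stronger compressor.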

Don't use LZW. It is ineffective and obsolete as compared to modern methods (zlib, lz4, lzma, etc.)
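As a rough illustration of the zlib route, here is a sketch that assumes a Node.js environment, where the built-in zlib module provides deflateSync and inflateSync. In a browser you would need a different implementation. The function names `compress` and `decompress` are made up for this example.

```javascript
var zlib = require("zlib");

// Compress a string to a Buffer of raw bytes using zlib's deflate.
function compress(text) {
    return zlib.deflateSync(Buffer.from(text, "utf8"));
}

// Restore the original string from the compressed Buffer.
function decompress(buffer) {
    return zlib.inflateSync(buffer).toString("utf8");
}
```

The result is binary data rather than text, so it would go out as a binary WebSocket message, or through an encoding step such as buffer.toString("base64") if only text can be sent.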

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow