Question

I have a data structure with 6,000 elements, and for each element I need to store 7 bits of information. If I naively store it as an array of 6,000 numbers, it takes up around 22 KB. I am trying to reduce the size of the page: what is the best way to store 6,000 × 7 bits of information (it should fit in around 5 KB)? I want a "bitstream"-like data structure. I thought about encoding it into a string, or even an image, but I'm not sure exactly how. The reason I did not encode it as a string is that I cannot mathematically guarantee that none of the characters will be unprintable ASCII characters (e.g. ASCII 1-25).

Solution

Let's consider two solutions.

Base 32

For fun, let's consider using base-32 numbers. Yes, you can do that in JavaScript.

First pack four 7-bit values into one integer. Shifting by 7 (not 8) keeps the packed value to 28 bits, which is what lets four values fit in six base-32 digits:

function pack(a1, a2, a3, a4){
    return ((a1 << 7 | a2) << 7 | a3) << 7 | a4;
}

Now, convert to base 32.

function encode(n){
    // Left-pad with zeros and keep the *last* six digits.
    var str = "000000" + n.toString(32);
    return str.slice(-6);
}

Since 32^6 = 2^30 is larger than 2^28, the result is never more than six digits; the padding makes sure it's exactly six.

Going the other direction:

function decode(s){
    return parseInt(s, 32);
}

function unpack(x){
    // Shift first, then mask: the four 7-bit fields sit at bit offsets 21, 14, 7, and 0.
    var a1 = (x >> 21) & 0x7f,
        a2 = (x >> 14) & 0x7f,
        a3 = (x >> 7) & 0x7f,
        a4 = x & 0x7f;
    return [a1, a2, a3, a4];
}

All that remains is to wrap the logic around this to handle the 6000 elements. To compress:

function compress(elts){
    var str = '';
    for (var i = 0; i < elts.length; i += 4) {
        str += encode(pack(elts[i], elts[i+1], elts[i+2], elts[i+3]));
    }
    return str;
}

And to uncompress:

function uncompress(str){
    var elts = [];
    for (var i = 0; i < str.length; i += 6) {
        elts = elts.concat(unpack(decode(str.slice(i, i+6))));
    }
    return elts;
}

If you concatenate the results for all 6,000 elements, you'll have 1,500 packed numbers, which at six characters each turn into about 9 KB: 1.5 bytes per 7-bit value. It's by no means the information-theoretic maximum compression, but it's not that bad.
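As a sanity check, here is a self-contained round trip for a single group of four values. The helper names (pack4, encode6, and so on) are mine, but they mirror the functions above:

```javascript
// Pack four 7-bit values into 28 bits, render as exactly six base-32 digits,
// then reverse both steps.
function pack4(a1, a2, a3, a4) {
    return ((a1 << 7 | a2) << 7 | a3) << 7 | a4;
}
function encode6(n) {
    return ("000000" + n.toString(32)).slice(-6); // left-pad, keep last six
}
function decode6(s) {
    return parseInt(s, 32);
}
function unpack4(x) {
    return [(x >> 21) & 0x7f, (x >> 14) & 0x7f, (x >> 7) & 0x7f, x & 0x7f];
}

var sample = [0, 1, 64, 127];
var str = encode6(pack4(sample[0], sample[1], sample[2], sample[3]));
var roundTrip = unpack4(decode6(str));
// str is six characters; roundTrip recovers [0, 1, 64, 127]
```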

Unicode

First we'll pack two 7-bit values into one integer. Note that there is no trailing shift here: the result must fit in a 16-bit character code, and String.fromCharCode truncates anything wider.

function pack(a1, a2){
    return a1 << 8 | a2;
}

We'll do this for all 6,000 inputs, then use our friend String.fromCharCode to turn all 3,000 values into a 3,000-character Unicode string:

function compress(elts){
    var packeds = [];
    for (var i = 0; i < elts.length; i += 2) {
        packeds.push(pack(elts[i], elts[i+1]));
    }
    return String.fromCharCode.apply(null, packeds);
}

Coming back the other way, it's quite easy:

function uncompress(str) {
    var elts = [], code;
    for (var i = 0; i < str.length; i++) {
        code=str.charCodeAt(i);
        elts.push(code>>8, code & 0xff);
    }
    return elts;
}

This takes up two bytes per pair of 7-bit values, i.e. one byte per value, making it about 33% more efficient than the base-32 approach's 1.5 bytes per value.
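A self-contained round trip of the char-code scheme (helper names here are mine). The pack step is just a1 << 8 | a2, with no further shift, so each pair stays within a 16-bit code unit:

```javascript
// Pack pairs of 7-bit values into 16-bit char codes, then recover them.
function compressPairs(elts) {
    var packed = [];
    for (var i = 0; i < elts.length; i += 2) {
        packed.push(elts[i] << 8 | elts[i + 1]); // at most 15 bits
    }
    return String.fromCharCode.apply(null, packed);
}
function uncompressPairs(str) {
    var elts = [];
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        elts.push(code >> 8, code & 0xff);
    }
    return elts;
}

var pairSample = [0, 127, 1, 64];
var pairString = compressPairs(pairSample); // two characters for four values
```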

If the above string is going to be written out into a script tag as a JavaScript assignment such as var data="HUGE UNICODE STRING";, then quotation marks in the string will need to be escaped:

javascript_assignment = 'var data = "' + compress(elts).replace(/"/g,'\\"') + '";';
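Quotation marks are not the only hazard: packed char codes such as 0x0A (newline) or 0x5C (backslash) would also break the literal. A safer sketch (my suggestion, not part of the original answer) is to let JSON.stringify do all of the escaping:

```javascript
// JSON.stringify escapes quotes, backslashes, and control characters,
// and a JSON string literal is also a valid JavaScript string literal.
function toAssignment(packedString) {
    return 'var data = ' + JSON.stringify(packedString) + ';';
}

var tricky = 'a"b\\c\n';
var assignment = toAssignment(tricky);
// assignment is: var data = "a\"b\\c\n";
```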

The above code is not meant to be production quality; in particular, it does not handle the edge case where the number of inputs is not a multiple of four (or two).
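One way to close that gap, offered as a sketch rather than part of the original answer, is to zero-pad the input up to a multiple of the group size and remember the true length so the padding can be dropped after decoding:

```javascript
// Zero-pad `elts` to a multiple of `groupSize`, remembering the original
// length; after uncompressing, slice the result back to that length.
function padToMultiple(elts, groupSize) {
    var padded = elts.slice();
    while (padded.length % groupSize !== 0) {
        padded.push(0);
    }
    return { padded: padded, originalLength: elts.length };
}

var p = padToMultiple([5, 6, 7], 4);
// p.padded is [5, 6, 7, 0]; p.originalLength is 3
```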

OTHER TIPS

As dandavis said, it is OK to encode unprintable ASCII characters into a JSON string. But for random data that gave me 13 KB (because many characters must be escaped). You can instead encode the string as base64 and then as a JSON string; that gave me 7.9 KB for random data.

var randint = function (from, to) {
    return Math.floor(Math.random() * (to - from + 1)) + from;
}

var data = '';
for (var i = 0; i < 6000; ++i) {
    data += String.fromCharCode(randint(0, 127));
}
// encoding `data` as JSON-string at this point gave me 13KB

var b64data = btoa(data);
// encoding `b64data` as JSON-string gave me 7.9KB

To decode it:

var data = atob(b64data);
var adata = [];
for (var i = 0; i < data.length; ++i) {
    adata.push(data.charCodeAt(i));
}

There should certainly be a more efficient method of encoding your data, but I believe this one is a reasonable compromise between complexity and efficiency. PS: in some browsers you might need to write atob and btoa yourself.
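The size figures above can be checked mechanically. The sketch below assumes either a browser-style btoa global or Node's Buffer as a fallback; base64 output never needs JSON escaping, so for 6,000 input bytes the JSON string is a predictable 8,002 characters, matching the roughly 7.9 KB reported:

```javascript
// Build 6,000 random 7-bit chars, base64 them, and measure the JSON size.
function randomAsciiString(n) {
    var s = '';
    for (var i = 0; i < n; i++) {
        s += String.fromCharCode(Math.floor(Math.random() * 128));
    }
    return s;
}

var raw = randomAsciiString(6000);
var b64 = (typeof btoa === 'function')
    ? btoa(raw)                                      // browsers, recent Node
    : Buffer.from(raw, 'binary').toString('base64'); // older Node
var jsonSize = JSON.stringify(b64).length; // 8000 base64 chars + 2 quotes
```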

Actually, strings work fine if you use JSON to encode any potentially troublesome characters into JS escape codes:

var codes=",Ñkqëgdß\u001f", // (JSON-encoded chars covering all character ranges)
mySet=codes[5].charCodeAt(0).toString(2).split("").map(Number).map(Boolean).reverse();

alert(mySet); // shows: [true,true,true,false,false,true,true]


/*  broken down into bite-sized steps: (pseudo code)
char == "g" (codes[5])
"g".charCodeAt() == 103
(103).toString(2) == "1100111"
.split("").map(Number) == [1,1,0,0,1,1,1]
.map(Boolean).reverse() == [true,true,true,false,false,true,true]  */

and to fill the array, reverse the process:

var toStore= [true, false, true, false, true, false, true];
var char= String.fromCharCode(parseInt(toStore.map(Number).reverse().join(""),2));
codes+=char;

//verify (should === true):
codes[9].charCodeAt(0).toString(2).split("")
   .map(Number).map(Boolean).reverse().toString() === toStore.toString();

To export the results to an ASCII file, use JSON.stringify(codes); or, if saving to localStorage, you can just save the raw string variable, since browsers use two bytes per character of localStorage.
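The two directions can be combined into a round-trip check (helper names are mine, not dandavis's). One caveat worth padding for: charCodeAt().toString(2) drops leading zeros, so char codes below 64 decode to fewer than seven booleans unless the array is padded back out:

```javascript
// Seven booleans (least-significant bit first) to one character and back.
function boolsToChar(bools) {
    return String.fromCharCode(parseInt(bools.map(Number).reverse().join(''), 2));
}
function charToBools(ch) {
    var bits = ch.charCodeAt(0).toString(2).split('').map(Number).map(Boolean).reverse();
    while (bits.length < 7) bits.push(false); // restore dropped leading zeros
    return bits;
}

var flags = [true, false, true, false, true, false, false]; // low bit first
var back = charToBools(boolsToChar(flags));
// back equals flags, even though the char code (21) is only five bits wide
```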

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow