Let's consider two solutions.
Base 32
For fun, let's consider using base-32 numbers. Yes, you can do that in JavaScript.
First pack four 7-bit values into one integer:
function pack(a1, a2, a3, a4){
    // Each value is 7 bits wide, so shift by 7: four values fit in 28 bits.
    return ((a1 << 7 | a2) << 7 | a3) << 7 | a4;
}
Now, convert to base 32.
function encode(n){
    // Left-pad with zeros, then keep the last six digits.
    var str = "000000" + n.toString(32);
    str = str.slice(-6);
    return str;
}
A 28-bit number never needs more than six base-32 digits (six digits hold 30 bits). The padding makes sure it's exactly six.
Going the other direction:
function decode(s){
    return parseInt(s, 32);
}
function unpack(x){
    // Shift each field down first, then mask off its 7 bits.
    var a1 = (x >> 21) & 0x7f, a2 = (x >> 14) & 0x7f, a3 = (x >> 7) & 0x7f, a4 = x & 0x7f;
    return [a1, a2, a3, a4];
}
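As a quick sanity check, the four helpers can be exercised on a single group of values. This is a minimal sketch that assumes every input is an integer from 0 to 127 and that the fields are packed 7 bits apart; the names pack7, encode6, decode6, and unpack7 are made up here so they don't clash with the functions above.

```javascript
// Hypothetical helper names; assumes each value is an integer in 0..127.
function pack7(a1, a2, a3, a4) {
    // Four 7-bit fields -> one 28-bit integer.
    return ((a1 << 7 | a2) << 7 | a3) << 7 | a4;
}

function encode6(n) {
    // Base-32 digits, left-padded with zeros to exactly six characters.
    return ("000000" + n.toString(32)).slice(-6);
}

function decode6(s) {
    return parseInt(s, 32);
}

function unpack7(x) {
    // Shift each field down, then mask off its 7 bits.
    return [(x >> 21) & 0x7f, (x >> 14) & 0x7f, (x >> 7) & 0x7f, x & 0x7f];
}

var sample = [127, 0, 5, 99];
var text = encode6(pack7(sample[0], sample[1], sample[2], sample[3]));
var back = unpack7(decode6(text));
console.log(text, back); // six characters, then the same four values
```

The largest possible group, four values of 127, packs to 2^28 - 1, which still fits in six base-32 digits.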
All that remains is to wrap the logic around this to handle the 6000 elements. To compress:
function compress(elts){
    var str = '';
    for(var i = 0; i < elts.length; i+=4){
        str += encode(pack(elts[i], elts[i+1], elts[i+2], elts[i+3]));
    }
    return str;
}
And to uncompress:
function uncompress(str){
    var elts = [];
    for(var i = 0; i < str.length; i+=6){
        elts = elts.concat(unpack(decode(str.slice(i, i+6))));
    }
    return elts;
}
If you concatenate the results for all 6,000 elements, you'll have 1,500 packed numbers, which at six characters each come to 9,000 characters, or about 9K. That's 1.5 bytes per 7-bit value. It's by no means the information-theoretic maximum compression (7 bits, or 0.875 bytes, per value), but it's not that bad.
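The size arithmetic can be checked end to end. The following is only a sketch: compress32 and uncompress32 are hypothetical names that inline the pack and encode steps, the 7-bit shifts are an assumption consistent with "four 7-bit values", and the input data is made up.

```javascript
// Hypothetical names; a self-contained sketch of the base-32 scheme.
function compress32(elts) {
    var parts = [];
    for (var i = 0; i < elts.length; i += 4) {
        // Pack four 7-bit values into 28 bits, then write six base-32 digits.
        var n = ((elts[i] << 7 | elts[i + 1]) << 7 | elts[i + 2]) << 7 | elts[i + 3];
        parts.push(("000000" + n.toString(32)).slice(-6));
    }
    return parts.join("");
}

function uncompress32(str) {
    var elts = [];
    for (var i = 0; i < str.length; i += 6) {
        var x = parseInt(str.slice(i, i + 6), 32);
        elts.push((x >> 21) & 0x7f, (x >> 14) & 0x7f, (x >> 7) & 0x7f, x & 0x7f);
    }
    return elts;
}

// Made-up test data: 6,000 values in 0..127.
var input = [];
for (var k = 0; k < 6000; k++) {
    input.push((k * 37) % 128);
}

var packed = compress32(input);
console.log(packed.length);                // 9000 characters
console.log(packed.length / input.length); // 1.5 characters per value
console.log(input.length * 7 / 8);         // 5250 bytes if every bit were used
```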
Unicode
First we'll pack two 7-bit values into one integer:
function pack(a1, a2){
    // High byte holds a1, low byte holds a2; the result fits in 16 bits.
    return a1 << 8 | a2;
}
We'll do this for all 6,000 inputs, then use our friend String.fromCharCode to turn all 3,000 values into a 3,000-character Unicode string:
function compress(elts){
    var packeds = [];
    for (var i = 0; i < elts.length; i+=2) {
        packeds.push(pack(elts[i], elts[i+1]));
    }
    return String.fromCharCode.apply(null, packeds);
}
Coming back the other way, it's quite easy:
function uncompress(str) {
    var elts = [], code;
    for (var i = 0; i < str.length; i++) {
        code = str.charCodeAt(i);
        elts.push(code >> 8, code & 0xff);
    }
    return elts;
}
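Here is a small round trip through the Unicode scheme, as a sketch only. The names compress16 and uncompress16 are made up for this example, the packing is assumed to be a1 << 8 | a2 (which is what the decoder's code >> 8 and code & 0xff expect), and the input is assumed to hold an even number of integers from 0 to 127.

```javascript
// Hypothetical names; assumes an even number of integers in 0..127.
function compress16(elts) {
    var codes = [];
    for (var i = 0; i < elts.length; i += 2) {
        // High byte holds the first value, low byte the second.
        codes.push(elts[i] << 8 | elts[i + 1]);
    }
    return String.fromCharCode.apply(null, codes);
}

function uncompress16(str) {
    var elts = [];
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        elts.push(code >> 8, code & 0xff);
    }
    return elts;
}

var values = [72, 105, 0, 127, 127, 0];
var packedStr = compress16(values);
console.log(packedStr.length);        // 3 characters for 6 values
console.log(uncompress16(packedStr)); // the same six values back
```

Because both values stay below 128, the character codes never reach the surrogate range (0xD800 and up), so every code maps to a single ordinary character.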
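This takes one 16-bit character (two bytes in UTF-16) per two 7-bit values, or one byte per value, so the result is about 33% smaller than the base-32 approach. That figure only holds where the string is stored or sent as UTF-16; in UTF-8, any character with a code of 0x800 or more takes three bytes.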
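How many bytes each packed character really costs depends on the encoding used to store or send the string. The sketch below measures the UTF-8 size of a few packed characters with the standard TextEncoder; utf8Length is a made-up helper name, and the sample values are arbitrary.

```javascript
// TextEncoder is a standard global in browsers and current Node.js; it always emits UTF-8.
function utf8Length(str) {
    return new TextEncoder().encode(str).length;
}

// One packed character each, with the first value in the high byte.
console.log(utf8Length(String.fromCharCode(0 << 8 | 65)));    // 1 byte
console.log(utf8Length(String.fromCharCode(1 << 8 | 65)));    // 2 bytes
console.log(utf8Length(String.fromCharCode(127 << 8 | 127))); // 3 bytes
```

If the page is served as UTF-8 and most first values are 8 or more, the Unicode string costs about as much as the base-32 one.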
If the above string is going to be written out into a script tag as a JavaScript assignment such as var data="HUGE UNICODE STRING";, then backslashes, quotation marks, and line breaks in the string will need to be escaped:
javascript_assignment = 'var data = "' + compress(elts).replace(/[\\"]/g,'\\$&').replace(/\n/g,'\\n').replace(/\r/g,'\\r') + '";';
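An alternative to escaping by hand is to let JSON.stringify build the string literal. This is a hedged sketch rather than a complete sanitizer: toAssignment is a hypothetical helper, and the two extra replacements for U+2028/U+2029 and for "<" are precautions assumed to be useful when the output lands inside an inline script tag.

```javascript
// Hypothetical helper: build a `var data = "...";` statement from any string.
function toAssignment(str) {
    // JSON.stringify adds the surrounding quotes and escapes ", \ and control characters.
    var literal = JSON.stringify(str)
        // Line and paragraph separators end a string literal in older engines.
        .replace(/\u2028/g, "\\u2028")
        .replace(/\u2029/g, "\\u2029")
        // Keep "</script>" from closing an inline script tag early.
        .replace(/</g, "\\u003c");
    return "var data = " + literal + ";";
}

var tricky = 'a"b\\c\nd\u2028e</script>';
var statement = toAssignment(tricky);
console.log(statement);
```

Evaluating the generated statement should give back exactly the original string, including the quote, backslash, and newline.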
The above code is not meant to be production-ready, and in particular does not handle edge cases where the number of inputs is not a multiple of four (base 32) or two (Unicode).
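One possible way to cover that edge case, sketched here under the assumption that the original element count can be stored alongside the compressed string, is to pad the input with zeros before compressing and trim the result after uncompressing. The helpers padToMultiple and trimToLength are hypothetical names.

```javascript
// Hypothetical helpers; groupSize is 4 for the base-32 scheme, 2 for the Unicode one.
function padToMultiple(elts, groupSize) {
    var padded = elts.slice();
    // Append zeros until the length divides evenly into groups.
    while (padded.length % groupSize !== 0) {
        padded.push(0);
    }
    return padded;
}

function trimToLength(elts, originalLength) {
    // Drop the zeros that padToMultiple added.
    return elts.slice(0, originalLength);
}

var original = [1, 2, 3, 4, 5];
var padded = padToMultiple(original, 4);
console.log(padded.length);                         // 8
console.log(trimToLength(padded, original.length)); // the original five values
```

The count has to travel with the data because a trailing zero added as padding is indistinguishable from a genuine zero value.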