I am using lz-string library for JS compression whenever placing any data into localStorage
. I am just a user of this library - not the compression expert. But this is information which could be found around that tool...
The lz-string Goal:
lz-string was designed to fulfill the need of storing large amounts of data in
localStorage
, specifically on mobile devices. localStorage being usually limited to 5MB, all you can compress is that much more data you can store.... I (note: "I" means, Pieroxy, author of the lz-string) started out from an LZW implementation (no more patents on that), which is very simple...
So, the fundament, the base of this implemntation is LZW, which is mentioned here Javascript client-data compression by Andy E. Let me point out
- the link to Wikipedia article on LZW
- the LZW compression example.
An extract from Wikipedia - Algorithm:
The scenario described by Welch's 1984 encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence for which there is no code yet in the dictionary. The code for the sequence (without that character) is added to the output, and a new code (for the sequence with that character) is added to the dictionary.
A high level view of the encoding algorithm is shown here:
- Initialize the dictionary to contain all strings of length one.
- Find the longest string W in the dictionary that matches the current input.
- Emit the dictionary index for W to output and remove W from the input.
- Add W followed by the next symbol in the input to the dictionary.
- Go to Step 2.
How it works in case of the lz-string we can observer here:
- The source code: lz-string-1.3.3.js
Let me cite few steps from the already mentioned lz-string source:
What I did was:
- localStorage can only contain JavaScript strings. Strings in JavaScript are stored internally in UTF-16, meaning every character weight 16 bits. I modified the implementation to work with a 16bit-wide token space.
- I had to remove the default dictionary initialization, totally useless on a 16bit-wide token space.
- I initialize the dictionary with three tokens:
- An entry that produces a 16-bit token.
- An entry that produces an 8-bit token, because most of what I will store is in the iso-latin-1 space, meaning tokens below 256.
- An entry that mark the end of the stream.
- The output is processed by a bit stream that stores effectively 16 bits per character in the output string.
- Each token is stored with just as many bits that are needed according to the size of the dictionary. Hence, the first token takes 2 bits, the second to 7th three bits, etc....
Well, now we know, that by these compression techniques we get 16 bits information. We can test it in this demo: http://pieroxy.net/blog/pages/lz-string/demo.html (or/and another here)
Which converts the: Hello, world.
into
85 04 36 30 f6 60 40 03 0e 04 01 e9 80 39 03 26
00 a2
So we need the final step, let me cite again:
Well, this lib produces stuff that isn't really a string. By using all 16 bits of the UTF-16 bitspace, those strings aren't exactly valid UTF-16. By version 1.3.0, I added two helper encoders to produce stuff that we can do something with:
compress
produces invalid UTF-16 strings. Those can be stored inlocalStorage
only on webkit browsers (Tested on Android, Chrome, Safari). Can be decompressed withdecompress
to continue our example, the Hello, world.
would be converted into
҅〶惶̀Ў㦀☃ꈀ
And that's finally it. We can see, that the set of all the ...other then latin characters... comes from the final conversion into UTF-16. Hope, this will give some hints...