Question

In my application, I need to store and transmit data that contains many repeating string values (think entity names in an XML document). I have two proposed solutions:

  • A) create a string table to be stored along the document, and then use index references (using multi-byte encoding) in the document body, or
  • B) simply compress the document using gzip or a similar compression algorithm.

Which one is likely going to perform better in terms of speed and data size? (Obviously, this depends on the quality of the implementations, but assume that option A builds an array of strings dynamically and encodes the document body in some reasonable fashion).
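For concreteness, option A might look like the following minimal sketch: a dynamically built string table plus LEB128-style varints (7 payload bits per byte, high bit as a continuation flag) for the index references. The function names and encoding details are illustrative assumptions, not part of the question.

```python
def encode_varint(n: int) -> bytes:
    """Multi-byte (LEB128-style) encoding: 7 bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # more bytes follow
        else:
            out.append(b)
            return bytes(out)

def encode_with_table(tokens):
    """Build a string table dynamically; emit the body as varint indices."""
    table = []   # index -> string
    index = {}   # string -> index
    body = bytearray()
    for tok in tokens:
        if tok not in index:
            index[tok] = len(table)
            table.append(tok)
        body += encode_varint(index[tok])
    return table, bytes(body)

table, body = encode_with_table(["item", "name", "item", "item", "name"])
# table == ["item", "name"]; body encodes the indices 0, 1, 0, 0, 1
```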

Also, if option B, can you recommend a compression method more suitable for this than gzip?


Solution

gzip is only a good algorithm when the transmission/storage cost is not too high compared to the cost of CPU time. You can get better compression ratios with bzip2, 7zip, and, especially for natural language, various PPM algorithms.
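A quick way to check this trade-off on your own data is to compare the stdlib codecs directly. The sample below uses zlib (the same DEFLATE stream gzip wraps), bz2, and lzma (the algorithm behind 7zip's default filter); PPM has no stdlib implementation, so it is left out. The sample data is an assumed stand-in for your repetitive XML.

```python
import bz2
import lzma
import zlib

# Hypothetical stand-in for an XML document with many repeating entity names
data = b"<item><name>widget</name><price>9.99</price></item>" * 1000

sizes = {
    "raw":  len(data),
    "gzip": len(zlib.compress(data, 9)),  # DEFLATE, as used inside gzip
    "bz2":  len(bz2.compress(data, 9)),
    "lzma": len(lzma.compress(data)),     # LZMA, as used by 7zip by default
}
for name, size in sizes.items():
    print(f"{name:5s} {size:7d} bytes")
```

Timing each call (e.g. with `time.perf_counter`) alongside the sizes gives you the speed half of the comparison for free.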

Of course, it's not only computation (and static vs. dynamic memory requirements) vs. compression ratio that matters: different compression formats allow varying degrees of efficient random-access seeking, low-latency stream decoding, and concatenation of compressed data (e.g. `cat a.gz b.gz | gunzip -c` produces the same output as `gunzip -c a.gz; gunzip -c b.gz`).
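The concatenation property is easy to demonstrate: a gzip stream may contain multiple members, and Python's `gzip.decompress` decodes them all, mirroring the shell pipeline above.

```python
import gzip

part_a = gzip.compress(b"hello ")
part_b = gzip.compress(b"world")

# Two gzip members glued together form a valid stream whose decompressed
# output is the concatenation of the originals, like: cat a.gz b.gz | gunzip -c
combined = gzip.decompress(part_a + part_b)
print(combined)  # b'hello world'
```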

OTHER TIPS

Simply using gzip would definitely be the easiest and probably sufficient. I'd recommend trying the string table and then gzipping that to see if you get slightly better compression than with gzip alone.
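That experiment is cheap to run. The sketch below (with an assumed token stream and a deliberately naive table format: null-separated table, one index byte per token) measures gzip alone against gzip applied to the string-table encoding.

```python
import gzip

# Hypothetical token stream with heavy repetition (entity names, say)
tokens = ["item", "name", "item", "price", "item", "name"] * 500
raw = " ".join(tokens).encode()

# Build the string table dynamically; each token becomes a one-byte index here
table, index, coded = [], {}, []
for tok in tokens:
    if tok not in index:
        index[tok] = len(table)
        table.append(tok)
    coded.append(index[tok])
encoded = "\x00".join(table).encode() + b"\x01" + bytes(coded)

print("gzip(raw):    ", len(gzip.compress(raw)))
print("gzip(encoded):", len(gzip.compress(encoded)))
```

Whichever way the numbers come out on your real documents settles the question more reliably than reasoning about it in the abstract.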

It's going to depend on a lot of things that aren't addressed in your post.

Why not try the gzip method first, since it's easy to implement? Then, if it meets your speed/compression requirements, you're done and can move on to the next feature.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow