Question

I'm looking for specific suggestions or references to an algorithm and/or data structures for encoding a list of words into what would effectively turn out to be a spell-checking dictionary. The objective of this scheme is a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list in a relatively efficient manner. For example, the application might want to check 10,000 words against a 100,000-word dictionary. It is not a requirement for the encoded dictionary form to be [easily] convertible back into the original word list form - a binary yes/no result is all that is needed for each word tested against the resulting dictionary.

I am assuming the encoding scheme, to improve compression ratio, would take advantage of known structures in a given language such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words, but to be clear, the scheme must be able to encode any and all ASCII text "words".

You can assume the particular application I have in mind is for embedded devices, where non-volatile storage space is at a premium and the dictionary would be a randomly accessible read-only memory area.

EDIT: To sum up the requirements of the dictionary:

  • zero false positives
  • zero false negatives
  • very high compression ratio
  • no need for decompression

Solution

See McIlroy's "Development of a Spelling List" at his pubs page. It's a classic old paper on spellchecking on a minicomputer, whose constraints map surprisingly well onto the ones you listed. It gives a detailed analysis of affix stripping and of two different compression methods: Bloom filters and a related scheme that Huffman-codes a sparse bitset. I'd probably go with Bloom filters in preference to the method he picked, which squeezes a few more kB out at a significant cost in speed. (Programming Pearls has a short chapter about this paper.)

See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the methods above, these have no false positives.

OTHER TIPS

A Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter and http://www.coolsnap.net/kevin/?p=13) is a data structure used by some spell checkers to store the dictionary words very compactly. There is, however, a risk of false positives.
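
To make this concrete, here is a minimal Bloom filter sketch in Python; the bit-array size and hash count below are illustrative, and in practice you would size them from the word count and the false-positive rate you can tolerate:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 20, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, word):
            # Derive the k bit positions from one digest via double hashing.
            digest = hashlib.sha256(word.encode("ascii")).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            for i in range(self.num_hashes):
                yield (h1 + i * h2) % self.num_bits

        def add(self, word):
            for pos in self._positions(word):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, word):
            # False means definitely absent; True means "probably present".
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(word))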

I'd suggest a padded suffix tree. Good compression on wordlists, and excellent lookup times.

http://en.wikipedia.org/wiki/Suffix_tree

To sum up:

  • zero false positives
  • zero false negatives
  • high compression ratio
  • no need for an inverse (i.e. no decompression necessary)

I was going to suggest Bloom filters, but these have non-zero false positives.

Instead, Programming Pearls discusses a similar set of requirements (/usr/share/dict/words in 41K).

This took the approach of contracting words to stems: for example, sent was the root, to which prefixes and suffixes could be added (a naive sketch of the idea follows this list):

  • present
  • represent
  • representation
  • misrepresentation
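
A naive sketch of that idea in Python: store only stems, and at lookup time peel off known affixes until a stored stem matches. The stem and affix lists here are made up, and a real speller (McIlroy's included) restricts which affixes each stem legitimately accepts; this unrestricted version would admit some false positives:

    # Hypothetical stem and affix lists, for illustration only.
    STEMS = {"sent", "run", "cat"}
    PREFIXES = ["mis", "re", "pre"]
    SUFFIXES = ["ation", "ion", "s"]

    def check(word, depth=4):
        # Accept the word if stripping at most `depth` affixes reaches a stem.
        if word in STEMS:
            return True
        if depth == 0:
            return False
        for p in PREFIXES:
            if word.startswith(p) and check(word[len(p):], depth - 1):
                return True
        for s in SUFFIXES:
            if word.endswith(s) and check(word[:-len(s)], depth - 1):
                return True
        return False

    print(check("misrepresentation"))  # True: mis + re + pre + sent + ation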

You can get a 30%+ compression ratio out of storing words as a shared prefix plus successive suffixes, in 7-bit format. I'm not sure what this scheme is called (it resembles front coding), but it translates pretty effectively into a tree structure.

ex.: a+n+d+s|an+d+y|and+es+roid

is 26 characters, compared to:

a an ad as and any andes android

which is 33.
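
One way to read that encoding: each '|'-separated group starts with a stem that is itself a word, and each '+suffix' token contributes stem+suffix as another word. A small decoder sketch (a real implementation would test membership directly against the encoded form rather than expanding it):

    ENCODED = "a+n+d+s|an+d+y|and+es+roid"

    def decode(encoded):
        words = set()
        for group in encoded.split("|"):
            stem, *suffixes = group.split("+")
            words.add(stem)
            for suf in suffixes:
                words.add(stem + suf)
        return words

    print(sorted(decode(ENCODED)))
    # ['a', 'ad', 'an', 'and', 'andes', 'android', 'any', 'as']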

Factoring in the 12.5% saving from storing characters as 7-bit content, that's about 31% compression in total. The compression ratio depends, of course, on the size and content of your word list.

Turning this into a 26-root tree structure would probably result in lookups that are faster than a plaintext substring comparison against a flat file.

Come to think of it, if you're only using 26 characters plus two delimiters, you can encode everything in 5 bits, which is a 37.5% saving in and of itself, bringing the above example to over a 50% compression rate.
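
A minimal packing sketch, assuming an alphabet of the 26 lowercase letters plus the two delimiters (28 symbols, which fit in the 32 values a 5-bit code provides):

    ALPHABET = "abcdefghijklmnopqrstuvwxyz+|"
    CODE = {ch: i for i, ch in enumerate(ALPHABET)}

    def pack(text):
        # Concatenate 5-bit codes into one big integer, then emit bytes.
        bits = 0
        for ch in text:
            bits = (bits << 5) | CODE[ch]
        nbits = 5 * len(text)
        return bits.to_bytes((nbits + 7) // 8, "big")

    packed = pack("a+n+d+s|an+d+y|and+es+roid")
    print(len(packed))  # 17 bytes, down from 26 ASCII bytes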

I think your best bet is a compressed suffix tree / compressed suffix array. You can find a wealth of information on these structures; it's an ongoing research area, and very interesting indeed.

I'm not an expert on this, but isn't a prefix tree (trie) pretty much the standard solution to this? It stores common prefixes of words only once.
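
For reference, a minimal trie sketch in Python, using nested dicts and an end-of-word marker:

    def build_trie(words):
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True  # end-of-word marker
        return root

    def contains(trie, word):
        node = trie
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

    trie = build_trie(["a", "an", "and", "android"])
    print(contains(trie, "and"), contains(trie, "androi"))  # True False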

For pure compression, the Maximum Compression site offers some results for a 4 MB English word list; the best program compresses it to around 400 KB. Some other resources for text/word compression are the Hutter Prize page and the Large Text Compression Benchmark.

Knuth mentions a "Patricia trie" in The Art of Computer Programming, vol. 3. I've never used it for any real work, but maybe it would be helpful.

edit: what's your RAM constraint? If you have a lot more RAM than ROM available, perhaps data compression in ROM (requiring decompression into RAM) is the right way to go. If you have a medium but not large amount of RAM, you could also store portions of the data structure as compressed blobs in ROM, keep a least-recently-used cache of several of them in RAM, and dynamically decompress the appropriate blob whenever it's not in the cache. A sketch of that idea follows.
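
Here is a sketch of that last idea, assuming the word list is sorted and split into fixed-size blocks; the block size and cache size are arbitrary, and the blocks list stands in for the read-only memory area:

    import bisect
    import zlib
    from functools import lru_cache

    def build_blocks(sorted_words, block_size=256):
        # Compress runs of consecutive words; remember each block's first word.
        blocks, first_words = [], []
        for i in range(0, len(sorted_words), block_size):
            chunk = sorted_words[i:i + block_size]
            first_words.append(chunk[0])
            blocks.append(zlib.compress("\n".join(chunk).encode("ascii")))
        return first_words, blocks

    def make_checker(first_words, blocks, cached_blocks=8):
        @lru_cache(maxsize=cached_blocks)  # the LRU cache of decompressed blocks
        def load(i):
            return zlib.decompress(blocks[i]).decode("ascii").split("\n")

        def contains(word):
            # Binary-search the block whose first word <= word, then scan it.
            i = bisect.bisect_right(first_words, word) - 1
            return i >= 0 and word in load(i)

        return contains

    contains = make_checker(*build_blocks(sorted(["a", "an", "and", "android"])))
    print(contains("android"), contains("androids"))  # True False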

Licensed under: CC-BY-SA with attribution