Question

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)

I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. It's stored on permanent media, so read speeds are slow, which means I need to read through it just once, linearly.

My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.

What approach would be best?


Solution

I think a trie with the counts stored at the leaves could be faster.

Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.

A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.

For example, if you've read the characters "torto", a trie would know that the only possible word that starts this way is "tortoise".
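A minimal sketch of such a counting trie in Python (the node layout and helper name here are just illustrative assumptions, not a reference implementation):

class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> child TrieNode
        self.count = 0      # occurrences of the word ending at this node

def add_word(root, word):
    node = root
    for ch in word:
        # descend (or create) one node per character, so the walk
        # happens while the word is being read
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

root = TrieNode()
add_word(root, "tortoise")   # leaf count for "tortoise" becomes 1

Each character consumed advances exactly one node, so the lookup is interleaved with reading the word instead of being a separate step afterwards.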

If you can perform this inline search on a word faster than the hashing algorithm can hash it, you should come out ahead.

However, this is total overkill. I rambled on since you said it was purely hypothetical; I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.

OTHER TIPS

I'd use a Dictionary object where the key is the word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
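In Python, that pattern might look roughly like this (the lower-casing and the file name are assumptions taken from the description above):

counts = {}
with open('words.txt') as f:
    for line in f:
        word = line.strip().lower()   # key is the lower-cased word
        if word in counts:
            counts[word] += 1         # already present: increment
        else:
            counts[word] = 1          # first occurrence: add with a value of 1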

Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time it takes to read the data anyway, so that's what you should work on optimizing. For the algorithm (mostly the data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.

A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.

With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.

I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)

"I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups."

How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.

If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
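For example, a rough way to time one candidate against a subset in Python (the sample file name and the counting function are hypothetical stand-ins for whatever implementations you're actually comparing):

import timeit

def count_with_dict(words):
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

# load the subset once so only the counting itself is timed
with open('sample_words.txt') as f:
    words = [line.strip() for line in f]

print(timeit.timeit(lambda: count_with_dict(words), number=10))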

Use Python!

As you go through the file line by line, add each word to a set before consulting the dictionary (hash table). Only once you know a word is already in the set do you add it to the dictionary, starting at a value of 2, since the set has already recorded its first occurrence.

This takes some of the memory and computation away from asking the dictionary on every single line and handles words that occur only once more cheaply: at the end, just dump every word that is in the set but not in the dictionary out with a value of 1 (take the difference of the two collections with respect to the set).
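A rough sketch of that set-then-dictionary idea in Python (the file name is an assumption; this is just one way to spell the idea out):

seen = set()
counts = {}

with open('words.txt') as f:
    for line in f:
        word = line.strip()
        if word not in seen:
            seen.add(word)        # first occurrence: remember it, no dictionary entry yet
        elif word in counts:
            counts[word] += 1     # third or later occurrence
        else:
            counts[word] = 2      # second occurrence: it was already in the set

# words seen exactly once never made it into the dictionary
for word in seen - counts.keys():
    counts[word] = 1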

To a large extent, it depends on what you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?

A simple Python 3 script:

import collections

counts = collections.defaultdict(int)

# one word per line; strip the newline and count it
with open('words.txt') as f:
    for line in f:
        counts[line.strip()] += 1

print("\n".join("%s: %d" % (word, count) for (word, count) in counts.items()))
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow