Question

Recently in a technical interview, I was asked to write a program to find the high-frequency words (the words that appear the maximum number of times) in a text book. The program should be designed so that it processes the entire text book with minimum memory. Performance is not a concern. I was able to write a program that finds the frequency of words, but it consumed a lot of memory.

How do you make this operation less memory intensive? Any strategies/solutions?

-Snehal


Solution

You probably used hash tables, which are memory-intensive but have constant lookup time, so the performance/memory trade-off is obvious. By the time you reach the end of the book you will know your answer. Also, incrementing counters for each word is fast (because of the quick hash-table lookups).
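For reference, a minimal F# sketch of that hash-table pass (assuming the words can be streamed in as a sequence; it returns only the single most frequent word, but extending it to the top N is straightforward):

open System.Collections.Generic

// one pass over the text: bump a counter per word, then take the largest count
let topWord (words: seq<string>) =
    let counts = Dictionary<string, int>()
    for w in words do
        counts.[w] <- (if counts.ContainsKey w then counts.[w] else 0) + 1
    counts |> Seq.maxBy (fun kv -> kv.Value)

The memory cost is exactly the point here: the dictionary holds one entry per distinct word in the book.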

The other end of the spectrum is to look at the first word, then go through the entire book to see how many times that word occurs. This requires minimal memory. Then you do the same for the next word and go through the entire book again. If that word occurs more times, you add it as the top word (or to the top N words). Of course, this is extremely inefficient: if the first and third words are the same, you'll end up going through the whole book again even though you just did the same thing for the first word.

OTHER TIPS

OK, if you're only interested in the n most frequently occurring words, one way to do it is in two passes, with the first pass based on a modified Bloom filter. Instead of using a bit map to track hash occurrences, use an integer array: byte, 16-bit, 32-bit or even 64-bit entries, depending on your input size. Where a Bloom filter simply sets the bit corresponding to each of the hash values of a word, you increment the count at the hash index in the array.

The problem with this approach is that two different words can produce the same hash values. So you need to do a second pass in which you ignore words unless their hash totals are above a certain threshold, thus reducing the amount of memory you need to allocate for accurate counting.

So just create a bitmap with bits set for the highest-occurring hash values. Then, in the second pass over the words, if a word has "hits" in the bitmap for its hashes, look it up or add it to a hash table and increment its count. This minimises memory usage by creating a hash table of only the highest-occurring words.
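Here's a rough F# sketch of the idea, simplified to a single hash function, and with the second pass reading the count array directly instead of building the intermediate bitmap described above; tableSize and threshold are knobs you'd tune to your memory budget and input size:

// first pass: approximate counts in a fixed-size integer array
let tableSize = 1 <<< 20                                     // ~1M buckets; tune to your memory budget
let buckets = Array.zeroCreate<int> tableSize
let indexOf (w: string) = (w.GetHashCode() &&& 0x7fffffff) % tableSize

let firstPass (words: seq<string>) =
    for w in words do
        buckets.[indexOf w] <- buckets.[indexOf w] + 1

// second pass: only words whose bucket count clears the threshold
// get counted exactly, in a (now much smaller) hash table
let secondPass (words: seq<string>) threshold =
    let exact = System.Collections.Generic.Dictionary<string, int>()
    for w in words do
        if buckets.[indexOf w] >= threshold then
            exact.[w] <- (if exact.ContainsKey w then exact.[w] else 0) + 1
    exact

A closer match to the answer would use several independent hash functions, as a real Bloom filter does, to reduce the chance that two unrelated words share all their buckets.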

I'm a physicist, so my favourite approach is to approximate. You don't need to go through the entire text to get the most frequent words. Instead:

  • parse a chunk small enough to allow for your memory limitations,
  • skip a random amount of text,
  • repeat, combining the accumulated results,
  • stop when the list has satisfactorily converged (sketched after the note below).

If you use a memory-efficient algorithm for the smaller chunks (e.g. sorting) then you can get far faster performance than even the most efficient algorithm that reads every word.

Note: this assumes that the most frequent words occur frequently throughout the text, rather than just in one place. For English text this assumption holds, because of the frequency of words like 'the' throughout. If you're worried about this requirement, require the algorithm to complete at least one pass of the entire text.
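A rough F# sketch of that sampling loop, assuming the text lives in a file and treating "convergence" crudely as a fixed number of samples (a real version would compare successive top-n lists between iterations):

open System
open System.IO
open System.Collections.Generic

// count words in a handful of randomly placed chunks and hope the
// running totals converge to the true top-n list
let sampleTopWords (path: string) (chunkBytes: int) (samples: int) (n: int) =
    let rng = Random()
    let counts = Dictionary<string, int>()
    use stream = File.OpenRead path
    let buffer = Array.zeroCreate<byte> chunkBytes
    for _ in 1 .. samples do
        // jump to a random offset and read one chunk
        // (chunk boundaries may split a word; the approximation tolerates that)
        let maxStart = max 0L (stream.Length - int64 chunkBytes)
        stream.Position <- int64 (rng.NextDouble() * float maxStart)
        let read = stream.Read(buffer, 0, chunkBytes)
        let chunk = Text.Encoding.ASCII.GetString(buffer, 0, read)
        for w in chunk.Split([| ' '; '\r'; '\n' |], StringSplitOptions.RemoveEmptyEntries) do
            counts.[w] <- (if counts.ContainsKey w then counts.[w] else 0) + 1
    counts |> Seq.sortByDescending (fun kv -> kv.Value) |> Seq.truncate n |> List.ofSeq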

I'll probably get down-voted for this...

If the text is English and you just want to find the top 5 most frequent words, here is your program:

print "1. the\n";
print "2. of\n";
print "3. and\n";
print "4. a\n";
print "5. to\n";

Runs fast and consumes minimal memory!

If performance is really of no concern you could just go through each word in turn, check whether it's in your "top N" and, if it isn't, count all its occurrences. This way you're only storing N values. Of course, you'd be counting the same words many times, but, as you said, performance isn't an issue - and the code would be trivial (which is generally preferable, all other things being equal).

One way would be to sort the list first.

We can sort the words in place without a lot of memory (traded for slower performance).

And then we can have a simple counting loop that finds the words with maximum frequency without having to keep everything in memory, since they're in sorted order.
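Assuming the words have already been sorted (however that was done), the counting loop only needs to remember the current run and the best run seen so far, for example:

// single pass over sorted words: track the current run and the best run seen
let mostFrequentOfSorted (sortedWords: seq<string>) =
    let mutable best = ("", 0)
    let mutable current = ("", 0)
    for w in sortedWords do
        current <- if fst current = w then (w, snd current + 1) else (w, 1)
        if snd current > snd best then best <- current
    best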

Do you mean a lot of process memory? If so, one way would be to use the disk as virtual memory (i.e. write a filesystem wrapper).

A possible solution is to use a trie data structure to store all the words together with their number of occurrences.
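A small F# sketch of such a trie, with a count on each node that ends a word (just one possible layout):

// each node maps a character to a child; Count > 0 marks a complete word
type TrieNode =
    { Children: System.Collections.Generic.Dictionary<char, TrieNode>
      mutable Count: int }

let newNode () =
    { Children = System.Collections.Generic.Dictionary<char, TrieNode>()
      Count = 0 }

let insert (root: TrieNode) (word: string) =
    let mutable node = root
    for c in word do
        if not (node.Children.ContainsKey c) then
            node.Children.[c] <- newNode ()
        node <- node.Children.[c]
    node.Count <- node.Count + 1     // one more occurrence of this word

The trie shares storage between words with common prefixes, which is where the space saving over a plain hash table comes from.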

Other solutions may be found in answers to this related question: Space-Efficient Data Structure for Storing a Word List?

Like many good interview questions, the question is phrased a little ambiguously/imprecisely, to force the interviewee to ask clarifying questions and state assumptions. I think a number of the other answers here are good, as they poke at these assumptions and demonstrate big-picture understanding.

I'm assuming the text is stored 'offline' somewhere, but there is a way to iterate over each word in the text without loading the whole text into memory.

Then the F# code below finds the top N words. Its only data structure is a mapping of key-value pairs (word, frequency), and it only keeps the top N of those, so the memory use is O(N), which is small. The runtime is O(numWordsInText^2), which is poor, but acceptable given the problem constraints. The gist of the algorithm is straightforward: for each word in the text, count how many times it occurs, and if it's in the running best N, add it to the list and remove the previous minimum entry.

Note that the actual program below loads the entire text into memory, merely for convenience of exposition.

#light
// some boilerplate to grab a big piece of text off the web for testing
open System.IO 
open System.Net 
let HttpGet (url: string) = 
    let req = System.Net.WebRequest.Create(url) 
    let resp = req.GetResponse() 
    let stream = resp.GetResponseStream() 
    let reader = new StreamReader(stream) 
    let data = reader.ReadToEnd() 
    resp.Close() 
    data 
let text = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
let words = text.Split([|' ';'\r';'\n'|], System.StringSplitOptions.RemoveEmptyEntries)
// perhaps 'words' isn't actually stored in memory, but so long as we can 
// 'foreach' over all the words in the text we're good
let N = 5  // how many 'top frequency' words we want to find
let FindMin map =
    // key-value pair with minimum value in a map
    map |> Map.toSeq |> Seq.minBy snd
let Main() =
    let mutable freqCounts = Map.ofList [ ("",0) ]
    for word in words do
        let mutable count = 0
        for x in words do
            if x = word then
                count <- count + 1
        let minStr,minCount = FindMin freqCounts
        if count >= minCount then
            freqCounts <- Map.add word count freqCounts
        if Seq.length freqCounts > N then
            freqCounts <- Map.remove minStr freqCounts
    freqCounts 
    |> Seq.sortBy (fun (KeyValue(k,v)) -> -v)
    |> Seq.iter (printfn "%A")
Main()

Output:

[the, 75]
[to, 41]
[in, 34]
[a, 32]
[of, 29]

You could use a combination of external merge sort and a priority queue. Merge sort will make sure that your memory limits are honored, and the priority queue will maintain your top K words. Obviously, the priority queue has to be small enough to fit into memory.

  • First, divide the input strings into chunks, sort each chunk, and store it in secondary storage (external sorting) - O(n log n)
  • Read each chunk and, within the chunk, calculate the frequency of words, so at the end of this step each chunk is reduced to (unique word, frequency count) pairs within the chunk. O(n)
  • Start reading elements across the chunks and aggregate the counts for each word. Since the chunks are sorted, you can do it in O(n).
  • Now, maintain a min priority heap (the top of the heap is the minimum element in the heap) of K elements. Populate the heap with the first K elements; then, for each subsequent (unique word, final count), if its count is greater than the top element of the heap, pop the top and push the current word (see the sketch after this list). O(n log k)

So your final time complexity is O(n(log k + log n)).
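A sketch of that last top-K step in F#, assuming .NET's PriorityQueue (available since .NET 6) and a hypothetical mergedCounts stream standing in for the aggregated (word, count) pairs produced by the earlier steps:

// keep only the K most frequent (word, count) pairs using a min-heap
let topK (k: int) (mergedCounts: seq<string * int>) =
    let heap = System.Collections.Generic.PriorityQueue<string * int, int>()
    for (word, count) in mergedCounts do
        if heap.Count < k then
            heap.Enqueue((word, count), count)
        elif count > snd (heap.Peek()) then
            heap.Dequeue() |> ignore               // evict the current minimum
            heap.Enqueue((word, count), count)
    // drain the heap; results come out smallest count first
    [ while heap.Count > 0 do yield heap.Dequeue() ]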

Well, if you want absolutely terrible performance...

Take the first word in the book and count how many times it occurs. Take the second word in the book and count how many times it occurs. If it's more than the last word, discard the last word. And so forth... you'll end up counting the same words multiple times unless you keep a list of them somewhere, but if you really want to minimize memory, this should only require a few ints. It should run in O(n^2) time, where n is the number of words in the book.

How about creating a binary tree of word keys (as you keep reading the words from the file)? This helps to find already-seen words in O(log n). So you end up with O(n log n) for the top-word search.

The basic algorithm would be:

for each word in the file:

  1. Create a key for the given word (e.g. a weighted ASCII sum: "bat" could be 1*'b' + 2*'a' + 3*'t');
  2. Add this word to the tree. If the word already exists, increment its count.
  3. Feed the word and its current count to maintainTop5(word, count). maintainTop5() maintains a dynamic list of the top 5 counts and their associated words (see the sketch below).

At the end of the file you have the top 5 words.
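A sketch of this in F#, keying the tree on the word string itself (F#'s Map is already a balanced binary tree) rather than on a hand-computed numeric key; maintainTopN here plays the role of maintainTop5:

// keep only the n largest (word, count) pairs, replacing a word's older entry
let maintainTopN n (top: (string * int) list) (word: string) (count: int) =
    (word, count) :: List.filter (fun (w, _) -> w <> word) top
    |> List.sortByDescending snd
    |> List.truncate n

let topWordsInFile (path: string) =
    // the Map stands in for the hand-rolled word tree; top5 is the running leader board
    let mutable tree : Map<string, int> = Map.empty
    let mutable top5 : (string * int) list = []
    for line in System.IO.File.ReadLines path do      // streams the file line by line
        for word in line.Split([| ' '; '\t' |], System.StringSplitOptions.RemoveEmptyEntries) do
            let count = 1 + (tree |> Map.tryFind word |> Option.defaultValue 0)
            tree <- Map.add word count tree           // O(log n) lookup and update
            top5 <- maintainTopN 5 top5 word count
    top5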

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow