what's the most frequent word across many files

Question 1

Assume you have K machines. From each machine request the most frequent word, send these K words to each machine and total their frequency over all machines. Let the frequency of the most common word be N.

In next step, from each machine request list of all words which have frequency of at least N/K. Aggregate this list and send to each machine. Collect back frequencies across machines, sum them and find the most frequent word overall. This word is guaranteed to be the most frequent word.

Question 2

One idea is to split the work done between the machines.

Have a hash function that hashes each word to a number between 1 and K - this corresponds to the machine that the frequency of that word will be sent to.

So, let each machine send their frequencies to the applicable machines.

After this, let each machine sum up the frequencies received.

Then let each machine send their most frequent word to some machine and let that machine simply find the maximum.

Since the calculated hash will always be same for some given word, all frequencies for that word will be sent to the same machine, so the sum that that machine calculates will be the total frequency for that word.

Running time:

With a perfect hash function, this would be done in O(N + K) time, where N is the number of word frequencies per machine and K is the total number of machines, since a lot of the work will be parallelized - more precisely, each machine will send O(N) frequencies, and each machine will receive O(N) frequencies and aggregate that. Then each machine will send its most frequent word to one machine, which that machine will find the maximum of, which would take O(K). This could perhaps be optimized to O(N + log K) if we were to cleverly pass the highest frequencies between the machines (think merge-sort).

In the worst case however, this would take O(NK), since one machine will receive O(NK) frequencies.