One idea is to split the work between the machines.
Use a hash function that maps each word to a number between 1 and K; this number identifies the machine that the word's frequency will be sent to.
Each machine then sends its local frequencies to the applicable machines, and each machine sums up the frequencies it receives.
Finally, each machine sends its most frequent word to a single machine, which simply picks the maximum.
Since the hash is always the same for a given word, all frequencies for that word end up on the same machine, so the sum that machine computes is the word's total frequency.
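The scheme above can be sketched as a small simulation. This is a hypothetical illustration, not a distributed implementation: the K "machines" are just dictionaries, and the word counts in `local_counts` are made-up example data.

```python
import hashlib

K = 4  # number of machines

def machine_for(word: str) -> int:
    # Stable hash so the same word always maps to the same machine.
    digest = hashlib.sha256(word.encode()).hexdigest()
    return int(digest, 16) % K

# Step 1: each machine's local (word -> frequency) table (example data).
local_counts = [
    {"apple": 3, "banana": 2},
    {"apple": 1, "cherry": 5},
    {"banana": 4, "cherry": 2},
    {"apple": 1},
]

# Step 2: route every (word, count) pair to the machine chosen by the hash;
# Step 3: each receiving machine sums the counts as they arrive.
inboxes = [dict() for _ in range(K)]
for table in local_counts:
    for w, count in table.items():
        box = inboxes[machine_for(w)]
        box[w] = box.get(w, 0) + count

# Step 4: each machine reports its local winner; one machine takes the max.
candidates = [max(box.items(), key=lambda kv: kv[1]) for box in inboxes if box]
word, freq = max(candidates, key=lambda kv: kv[1])
print(word, freq)  # prints "cherry 7" (5 + 2 across two machines)
```

Because the routing in step 2 depends only on the word, every copy of a word's count lands in the same inbox, which is exactly the correctness argument above.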
Running time:
With a perfect hash function, this would take O(N + K) time, where N is the number of word frequencies per machine and K is the total number of machines, since much of the work is parallelized. More precisely, each machine sends O(N) frequencies, and each machine receives and aggregates O(N) frequencies. Then each machine sends its most frequent word to one machine, which finds the maximum of those K candidates in O(K) time. This could perhaps be optimized to O(N + log K) if we were to cleverly pass the highest frequencies between the machines in pairwise rounds (think merge sort).
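The O(log K) final step can be sketched as a tournament reduction: pair the machines up, pass the larger candidate of each pair upward, and halve the number of candidates each round. The function name and the candidate list below are illustrative, not part of the original description.

```python
def tournament_max(candidates):
    """Reduce a list of (word, freq) pairs to the global maximum in
    ceil(log2(K)) pairwise rounds, mimicking a merge-sort-style pass."""
    rounds = 0
    while len(candidates) > 1:
        nxt = []
        for i in range(0, len(candidates) - 1, 2):
            a, b = candidates[i], candidates[i + 1]
            nxt.append(a if a[1] >= b[1] else b)  # winner of the pair advances
        if len(candidates) % 2 == 1:
            nxt.append(candidates[-1])  # odd machine out gets a bye
        candidates = nxt
        rounds += 1
    return candidates[0], rounds

# Each machine's local winner (example data for K = 4 machines).
winners = [("apple", 5), ("banana", 6), ("cherry", 7), ("date", 1)]
best, rounds = tournament_max(winners)
print(best, rounds)  # prints "('cherry', 7) 2" -- 2 rounds for 4 machines
```

Each round halves the candidate count, so K machines need only ceil(log2(K)) communication rounds instead of K messages to one coordinator.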
In the worst case, however, this takes O(NK) time, since a poor hash function can map every word to the same machine, which then receives O(NK) frequencies to aggregate on its own.