Well... finding the top K terms is not really a big problem. One of the key ideas in this field has been "stream processing", i.e., performing the operation in a single pass over the data while sacrificing some accuracy to get a probabilistic answer. So, assume you get a stream of data like the following:
A B K A C A B B C D F G A B F H I B A C F I U X A C
What you want is the top K items. Naively, one would maintain a counter for each item and, at the end, sort the items by their counts. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the list.
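For concreteness, here is a minimal Python sketch of that naive approach on the example stream above (using the standard library's Counter, and top 3 instead of top K just to keep the output short):

```python
from collections import Counter

def naive_top_k(stream, k):
    counts = Counter(stream)        # one counter per unique item: O(U) space
    return counts.most_common(k)    # sorts by count: O(U log U) time

stream = "A B K A C A B B C D F G A B F H I B A C F I U X A C".split()
print(naive_top_k(stream, 3))       # [('A', 6), ('B', 5), ('C', 4)]
```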
In case U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.
So, people came up with the idea of "count sketches" (you can read up more on the count-min sketch page on Wikipedia). Here you maintain a hash table A of length n and create two hash functions for each item:

h1(x) = a bucket in 0 ... n-1, chosen with uniform probability

h2(x) = +1 or -1, each with probability 0.5
You then do A[h1(x)] += h2(x). The key observation is that, since each item randomly hashes to +1 or -1, E[ A[h1(x)] * h2(x) ] = count(x), where E is the expected value of the expression and count(x) is the number of times x appeared in the stream. Intuitively, any other item y that collides into the same bucket contributes count(y) * h2(y) * h2(x), which is +/- count(y) with equal probability and hence zero in expectation, while x itself contributes count(x) * h2(x)^2 = count(x).
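To make that concrete, here is a minimal single-table sketch in Python. The two hash functions are simulated with Python's built-in hash (consistent only within one process), purely as an illustrative assumption; n = 1024 is arbitrary:

```python
# A single table of length n; h1 picks the bucket, h2 picks the sign.
# A real implementation would use proper pairwise-independent hash functions.
n = 1024
A = [0] * n

def h1(x):
    return hash(("bucket", x)) % n               # bucket in 0 .. n-1

def h2(x):
    return 1 if hash(("sign", x)) % 2 else -1    # sign +1 or -1

def update(x):
    A[h1(x)] += h2(x)

def estimate(x):
    return A[h1(x)] * h2(x)                      # E[estimate(x)] = count(x)
```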
Of course, the problem with this approach is that each individual estimate still has a large variance, but that can be dealt with by maintaining several independent hash tables (each with its own h1 and h2) and combining their estimates for each item, e.g. taking the median or average of the per-table estimates (the count-min variant, whose counters are non-negative, takes the minimum).
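A sketch of that variance-reduction idea, wrapping d independent tables (each with its own salt) and taking the median of the per-table estimates; the class name and default parameters are just assumptions for illustration:

```python
import random
from statistics import median

class CountSketch:
    """d independent rows of the single-table sketch; median of the d estimates."""

    def __init__(self, d=5, n=1024, seed=0):
        rng = random.Random(seed)
        self.n = n
        # Each row gets its own salt, giving it its own effective h1/h2.
        self.rows = [(rng.getrandbits(64), [0] * n) for _ in range(d)]

    def _h1(self, salt, x):
        return hash((salt, "bucket", x)) % self.n

    def _h2(self, salt, x):
        return 1 if hash((salt, "sign", x)) % 2 else -1

    def update(self, x):
        for salt, table in self.rows:
            table[self._h1(salt, x)] += self._h2(salt, x)

    def estimate(self, x):
        # Each row gives an unbiased but noisy estimate; the median tames the variance.
        return median(table[self._h1(salt, x)] * self._h2(salt, x)
                      for salt, table in self.rows)
```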
With this sketch data structure, you are able to get an approximate frequency for each item. Now, you simply maintain a running list of the K items (say, 10) with the largest frequency estimates so far, and at the end of the stream that list is your answer.
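Putting it together, here is one way that final step could look, reusing the CountSketch class sketched above and keeping a small set of the K best candidates (a heap would serve the same purpose for larger K):

```python
# Reuses the CountSketch class from the snippet above.
stream = "A B K A C A B B C D F G A B F H I B A C F I U X A C".split()
K = 3

sketch = CountSketch()
candidates = {}              # item -> latest frequency estimate

for x in stream:
    sketch.update(x)
    candidates[x] = sketch.estimate(x)
    if len(candidates) > K:
        # Evict the weakest candidate; if it shows up again later with a
        # higher estimate, it can re-enter the list.
        candidates.pop(min(candidates, key=candidates.get))

print(sorted(candidates.items(), key=lambda kv: -kv[1]))
# Typically [('A', 6), ('B', 5), ('C', 4)]; exact here because collisions
# are unlikely with only 11 distinct items spread over 1024 buckets.
```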