Question

I am trying to figure out the system design behind Google Trends (or any other large-scale trend feature, like Twitter's trending topics).

Challenges:

  • Need to process a large amount of data to calculate trends.

  • Filtering support - by time, region, category, etc.

  • Need a way to store the data for archiving/offline processing. Filtering support might require multi-dimensional storage.

This is my assumption (I have zero practical experience with MapReduce/NoSQL technologies):

Each search item from a user will carry a set of attributes that will be stored and eventually processed.

The system would also maintain a list of searches by timestamp, region of search origin, category, etc.

Example:

Searching for the term "Kurt Cobain":

Kurt -> (timestamp, region of search origin, category, etc.)

Cobain -> (timestamp, region of search origin, category, etc.)
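A minimal sketch of what one such per-search record might look like (the field names and types here are illustrative assumptions, not an actual logging schema):

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class SearchEvent:
        """One logged search term occurrence, keyed later by term for aggregation.

        Field names are illustrative assumptions, not a real schema.
        """
        term: str            # normalized search term, e.g. "kurt"
        timestamp: datetime  # when the search happened
        region: str          # e.g. country code of the search origin
        category: str        # e.g. "music", assigned by some classifier

    # The two events produced by a single "Kurt Cobain" query,
    # if the query is split into individual terms as assumed above.
    events = [
        SearchEvent("kurt", datetime.now(timezone.utc), "US", "music"),
        SearchEvent("cobain", datetime.now(timezone.utc), "US", "music"),
    ]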

Question:

  • How do they efficiently calculate the frequency of a search term?

  • In other words, given a large data set, how do they find the top 10 most frequent items in a distributed, scalable manner?

Was it helpful?

Solution

Well... finding the top K terms is not really a big problem. One of the key ideas in this field has been "stream processing", i.e., performing the operation in a single pass over the data and sacrificing some accuracy to get a probabilistic answer. Thus, assume you get a stream of data like the following:

A B K A C A B B C D F G A B F H I B A C F I U X A C

What you want is the top K items. Naively, one would maintain a counter for each item and, at the end, sort by each item's count. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the stream.
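A minimal Python sketch of that naive exact approach, using the example stream above:

    from collections import Counter

    stream = "A B K A C A B B C D F G A B F H I B A C F I U X A C".split()

    # Exact counting: O(U) space for the counter dict, plus a sort/heap
    # over the U unique items at the end to extract the top K.
    counts = Counter(stream)
    top_k = counts.most_common(3)
    print(top_k)   # [('A', 6), ('B', 5), ('C', 4)]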

If U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.

So, people came up with the idea of "count sketches" (you can read more on the Count-Min Sketch page on Wikipedia). Here you maintain a hash table A of length n and create two hashes for each item:

h1(x) = an index in 0 ... n-1, chosen with uniform probability

h2(x) = +1 or -1, each with probability 0.5

You then do A[h1(x)] += h2(x). The key observation is that, since each item randomly hashes to +1 or -1, E[A[h1(x)] * h2(x)] = count(x), where E is the expected value of the expression and count(x) is the number of times x appeared in the stream.

Of course, the problem with this approach is that each estimate still has a large variance, but that can be dealt with by maintaining several independent sets of hash counters and combining their estimates: typically the median for this signed count sketch, or the minimum in the Count-Min variant, where counters only increase.
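A minimal sketch of this idea, with d independent rows and a median over the per-row estimates; deriving the hash functions from Python's built-in hash and per-row seeds is a simplification for illustration (a production sketch would use proper pairwise-independent hashes):

    import random
    from statistics import median

    class CountSketch:
        """Approximate frequency counter using O(d * n) space."""

        def __init__(self, n=1024, d=5, seed=0):
            rng = random.Random(seed)
            self.n = n
            self.tables = [[0] * n for _ in range(d)]
            # One seed per row, standing in for independent hash functions
            # h1 (bucket index) and h2 (random sign).
            self.seeds = [rng.random() for _ in range(d)]

        def _h1(self, x, i):
            # Bucket index in 0 ... n-1 for row i.
            return hash((self.seeds[i], 'bucket', x)) % self.n

        def _h2(self, x, i):
            # Random sign +1/-1 for row i.
            return 1 if hash((self.seeds[i], 'sign', x)) % 2 == 0 else -1

        def add(self, x):
            for i, table in enumerate(self.tables):
                table[self._h1(x, i)] += self._h2(x, i)

        def estimate(self, x):
            # Each row's A[h1(x)] * h2(x) is an unbiased estimate of count(x);
            # the median across rows keeps the variance in check.
            return median(table[self._h1(x, i)] * self._h2(x, i)
                          for i, table in enumerate(self.tables))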

With this sketch data structure, you can get an approximate frequency for each item. Now you simply maintain a list of the 10 items with the largest frequency estimates seen so far, and at the end you will have your list.
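Putting the pieces together, here is a hedged sketch of the streaming top-K step, reusing the CountSketch class from the previous sketch; ranking all distinct items seen with heapq.nlargest is a simplification, since a real system would also bound the candidate set (e.g. with a fixed-size min-heap updated as items arrive):

    import heapq

    stream = "A B K A C A B B C D F G A B F H I B A C F I U X A C".split()

    sketch = CountSketch(n=256, d=5)
    seen = set()              # candidate items; a real system would cap this too
    for item in stream:
        sketch.add(item)
        seen.add(item)

    # Rank candidates by their estimated frequency.
    top_10 = heapq.nlargest(10, seen, key=sketch.estimate)
    print(top_10)             # approximate top items, e.g. starting with A, B, C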

Other tips

How exactly a particular private company does it is likely not publicly available, and how to evaluate the effectiveness of such a system is at the discretion of the designer (be it you, Google, or whoever).

But many of the tools and much of the research are out there to get you started. Check out some of the Big Data tools, including many of the top-level Apache projects, like Storm, which allows processing of streaming data in real time.

Also check out some of the Big Data and web science conferences, like KDD or WSDM, as well as papers put out by Google Research.

How to design such a system is challenging, with no single correct answer, but the tools and research are available to get you started.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow