Question

For example, when searching something in Google, results return nigh-instantly.

I understand that Google sorts and indexes pages with algorithms etc., but I imagine it is infeasible for the results of every single possible query to be indexed (and results are personalized, which makes this even more infeasible)?

Moreover, wouldn't the hardware latency be huge? Even if all of Google's data were stored in TB/s SSDs, I imagine the latency would be huge, given the sheer amount of data to process.

Does MapReduce help solve this problem?

EDIT: Okay, so I understand that popular searches can be cached in memory. But what about unpopular searches? Even for the most obscure search I have conducted, I don't think it has ever taken longer than 5 seconds. How is this possible?


Solution

Well, I'm not sure MapReduce is what solves the problem, and it certainly isn't MapReduce alone that addresses all the questions you raised. But here are the important factors that make it feasible to answer queries with such low latency over all those TBs of data spread across different machines:

  1. distributed computing: being distributed does not just mean that the indexes are spread over different machines; they are also replicated across different clusters, which allows lots of users to run different queries with low retrieval times (yes, huge companies can afford that many machines);
  2. caching: caches tremendously reduce execution time, whether for the crawling step, for the retrieval of pages, or for the ranking and exhibition of results;
  3. lots of tweaking: all of the above, and very efficient algorithms/solutions, are only effective if the implementation is also efficient. There are tons of (hard-coded) optimizations, such as locality of reference, compression, and caching, all of them usually applicable to different parts of the processing (one such optimization is sketched right after this list).
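
To make item 3 a bit more concrete, here is a minimal sketch of one classic posting-list optimization, variable-byte compression of document-ID gaps. It only illustrates the kind of hard-coded tweak meant above; real engines use far more elaborate schemes, and nothing here reflects Google's actual implementation.

```python
def vbyte_encode(numbers):
    """Variable-byte encode a list of non-negative integers.
    Each number is split into 7-bit groups; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk.reverse()
        chunk[-1] |= 0x80          # flag the terminating byte
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:            # last byte of the current number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers

# Posting lists are stored as gaps between sorted doc IDs, which keeps
# the numbers small and the encoding compact.
doc_ids = [3, 17, 1025, 1026, 90000]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = vbyte_encode(gaps)
assert vbyte_decode(encoded) == gaps
print(len(encoded), "bytes instead of", 4 * len(doc_ids))   # 8 bytes instead of 20
```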

Considering that, let's try to address your questions:

but I imagine it infeasible for the results of every single possible query to be indexed

Yes, it would be, and actually is, infeasible to have results for every single possible query. There is a practically unbounded number of terms in the world (even if you assume that only properly spelled terms will be entered), and the number of possible queries over these n terms grows exponentially (on the order of 2^n). So what is done? Caching. But if there are so many queries/results, which ones to cache? Caching policies. The most frequent/popular/relevant-for-the-user queries are the ones cached.
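
As a rough sketch of what such a caching policy can look like, here is a minimal least-recently-used cache for query results. The `run_full_search` callable and the capacity are hypothetical stand-ins for the expensive retrieval path and for real sizing decisions.

```python
from collections import OrderedDict

class QueryResultCache:
    """Minimal LRU cache: keep results of the most recently asked queries."""
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, query):
        if query in self._store:
            self._store.move_to_end(query)      # mark as recently used
            return self._store[query]
        return None

    def put(self, query, results):
        self._store[query] = results
        self._store.move_to_end(query)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)     # evict the least recently used entry

def search(query, cache, run_full_search):
    """run_full_search is a hypothetical stand-in for the expensive backend call."""
    cached = cache.get(query)
    if cached is not None:
        return cached                           # popular query: answered from memory
    results = run_full_search(query)            # unpopular query: hits the index
    cache.put(query, results)
    return results
```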

wouldn't the hardware latency in Google's hardware be huge? Even if the data in Google were all stored in TB/s SSDs

Nowadays, with such highly developed processors, people tend to think that any task that must finish within a second (or less), and that deals with so much data, has to be handled by extremely powerful processors with multiple cores and lots of memory. However, the one thing ruling the market is money, and investors are not interested in wasting it. So what is done?

The preference is actually for having lots of machines, each with simple/affordable processors, which lowers the cost of building up the multitude of clusters there are. And yes, it does work. The main bottleneck always boils down to disk, if you consider simple measurements of performance. But once there are so many machines, one can afford to load things into main memory instead of working off hard disks.

Memory is expensive for us, mere human beings, but it is very cheap for enterprises that buy lots of it at once. Since it's not costly, having as much memory as needed to load indexes and keep caches at hand is not a problem. And since there are so many machines, there is no need for super-fast processors, as you can direct queries to different places and have clusters of machines responsible for serving specific geographical regions, which allows for more specialized data caching and even better response times.

Does MapReduce help solve this problem?

Although I don't know whether Google's use (or not) of MapReduce is restricted information, I'm not conversant with this point. However, Google's implementation of MapReduce (which is surely not Hadoop) must have lots of optimizations, many involving the aspects discussed above. So the MapReduce architecture probably helps guide how the computations are physically distributed, but there are many other points to consider to justify such speed at query time.
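
For intuition about where MapReduce does fit, namely the offline work of building the index rather than the query path, here is a toy, in-process imitation of the map/shuffle/reduce pattern applied to inverted-index construction. It shows only the shape of the computation, not Google's implementation.

```python
from collections import defaultdict

# --- map phase: each document emits (term, doc_id) pairs ---
def map_document(doc_id, text):
    for term in set(text.lower().split()):
        yield term, doc_id

# --- shuffle phase: group all pairs by term ---
def shuffle(pairs):
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

# --- reduce phase: one sorted posting list per term ---
def reduce_term(term, doc_ids):
    return term, sorted(doc_ids)

corpus = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog tricks"}
pairs = (pair for doc_id, text in corpus.items() for pair in map_document(doc_id, text))
index = dict(reduce_term(t, ids) for t, ids in shuffle(pairs).items())
print(index["quick"])   # [1, 3]
```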

Okay, so I understand that popular searches can be cached in memory. But what about unpopular searches?

The graph below shows how the different kinds of queries are distributed. You can see that there are three main kinds of searches, each holding approximately 1/3 of the query volume (area under the curve). The plot follows a power law and reinforces the fact that the shortest queries are the most popular. The second third of queries is still feasible to process, since those queries hold few words. But the set of so-called obscure queries, which usually come from inexperienced users, is not a negligible part of the query volume.

[Figure: heavy-tailed (power-law) distribution of query frequencies]

And that is where there is room for novel solutions. Since it's not just one or two queries (but a third of them), they must have relevant results. If you type something far too obscure into a Google search, it won't take much longer to return a list of results, but it will most probably show you something it inferred you meant to say. Or it may simply state that there was no document with those terms -- or even cut your search down to 32 words (which just happened to me in a random test here).

There are dozens of applicable heuristics, which may either ignore some words or try to break the query into smaller ones and gather the most popular results. And all these solutions can be tailored and tweaked to respect feasible waiting times of, say, less than a second? :D
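
As an illustration of what one such heuristic might look like, here is a hypothetical sketch that drops the rarest terms of a query one at a time until enough results come back. The `search` callable and the `doc_frequency` table are assumptions standing in for the real retrieval backend and real term statistics.

```python
def relax_query(terms, doc_frequency, search, min_results=10):
    """Hypothetical fallback heuristic for obscure queries: if the full query
    returns too few hits, drop the rarest terms (likely typos or overly
    specific words) one at a time and retry with what remains."""
    terms = list(terms)
    results = search(terms)
    while len(results) < min_results and len(terms) > 1:
        rarest = min(terms, key=lambda t: doc_frequency.get(t, 0))
        terms.remove(rarest)                 # ignore the least common word
        results = search(terms)              # re-run the smaller query
    return results
```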

OTHER TIPS

MapReduce has nothing to do with real-time anything. It is a batch-oriented processing framework suitable for some offline tasks, like ETL and index building. Google has moved off of MapReduce for most jobs now, and even the Hadoop ecosystem is doing the same.

The answer to low latency is generally to keep precomputed indices in memory. Anything that touches disk is hard to make fast and scale. This is how newer-generation Hadoop-based SQL engines like Impala get so much speed compared to MapReduce-based infrastructure like Hive, for example.

Search infrastructure can't cache the results of every single query. But it can certainly cache intermediate results, or more complete results for the top queries. With a little caching you can serve results for a significant minority of all queries.

Search is also split across servers, so one machine can delegate to 100 others, each of which computes part of the result, and then combine their answers.
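
Here is a minimal sketch of that scatter-gather pattern, with plain dicts standing in for remote index shards and threads standing in for network calls; a real system would fan out over RPC, but the merge step has the same shape.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query, k):
    """Each shard scores only its own slice of the index.
    Here a 'shard' is just a dict {doc_id: text} standing in for a remote server."""
    scores = {doc: sum(text.count(t) for t in query) for doc, text in shard.items()}
    return heapq.nlargest(k, ((s, d) for d, s in scores.items() if s > 0))

def scatter_gather(shards, query, k=10):
    """Fan the query out to all shards in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda sh: search_shard(sh, query, k), shards))
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))

shards = [{1: "quick brown fox"}, {2: "lazy dog"}, {3: "quick quick dog"}]
print(scatter_gather(shards, ["quick", "dog"]))   # doc 3 ranks first
```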

You can also get away with some degree of approximation. Google does not literally form a thousand pages of search results; it just has to get the first page about right.

Keep in mind that Google has millions of computers around the globe. Your queries go to a data center geographically near you, one that serves only your region. This cuts out most of the latency, which is network time rather than processing time in the data center.

MapReduce is not used in searching. It was used a long time ago to build the index; but it is a batch processing framework, and most of the web does not change all the time, so the newer architectures are all incremental instead of batch oriented.
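
A tiny sketch of what "incremental instead of batch" means for an index: when one page changes, only its own postings are touched, rather than rebuilding everything the way a batch job would. This is just the idea, not any particular production design.

```python
from collections import defaultdict

class IncrementalIndex:
    """Toy incremental inverted index: updating one document does not
    require rebuilding the index for the rest of the web."""
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc_ids
        self.doc_terms = {}                # doc_id -> terms currently indexed

    def update_document(self, doc_id, text):
        new_terms = set(text.lower().split())
        old_terms = self.doc_terms.get(doc_id, set())
        for term in old_terms - new_terms:   # remove stale postings
            self.postings[term].discard(doc_id)
        for term in new_terms - old_terms:   # add new postings
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = new_terms

index = IncrementalIndex()
index.update_document(1, "old news about cats")
index.update_document(1, "fresh news about dogs")   # only doc 1's postings change
print(sorted(index.postings["dogs"]), sorted(index.postings["cats"]))  # [1] []
```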

Search in Google will largely work the same way it works in Lucene and Elasticsearch, except with a lot of fine-tuned extra weighting and optimizations. But at the very heart, they will use some form of inverted index. In other words, they do not search several terabytes when you enter a search query (even when it is not cached). They likely don't look at the actual documents at all. Instead, they use a lookup table that lists which documents match your query terms (with stemming, misspellings, synonyms, etc. all preprocessed). They probably retrieve the list of the top 10,000 documents for each word (10k integers -- just a few kB!) and compute the best matches from that. Only if there aren't good matches in these lists do they expand to the next such blocks, and so on.
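
To make the lookup-table idea concrete, here is a minimal in-memory sketch: each term maps to a precomputed, capped posting list, and a query is answered by merging those lists without ever touching the documents. The lists, the cap, and the tie-breaking rule are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical precomputed lookup table: for each term, the IDs of (at most)
# the top-scoring documents containing it, already sorted by a static rank.
TOP_K = 10_000
posting_lists = {
    "inverted": [4, 11, 42, 97],
    "index":    [4, 7, 11, 42, 300],
}

def lookup(query_terms):
    """Answer a query from the posting lists alone, never touching documents.
    Documents appearing in more of the query's lists score higher."""
    votes = defaultdict(int)
    for term in query_terms:
        for doc_id in posting_lists.get(term, [])[:TOP_K]:
            votes[doc_id] += 1
    # Rank by how many query terms matched; ties broken by doc ID, a stand-in
    # for a real static rank such as PageRank.
    return sorted(votes, key=lambda d: (-votes[d], d))[:10]

print(lookup(["inverted", "index"]))   # [4, 11, 42, 7, 97, 300]
```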

Queries for common words can easily be cached, and via preprocessing you can build a list of the top 10k results and then rerank them according to the user profile. There is also nothing to be gained by computing an "exact" answer. Looking at the top 10k results is likely enough; there is no single correct answer, and if a better result somewhere at position 10001 is missed, nobody will know or notice (or care). It was likely already ranked down in preprocessing and would not have made it into the top 10 presented to the user at the end (or the top 3 the user actually looks at).
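
A small sketch of that reranking step, assuming a made-up user profile of topic weights: only the short precomputed list is rescored, so personalization stays cheap. The fields and numbers are invented for illustration.

```python
def rerank(precomputed_hits, user_profile, page_size=10):
    """precomputed_hits: list of (doc_id, base_score, topics) from the offline top-k list.
    user_profile: dict mapping topics the user cares about to weights.
    Only the small cached list is rescored; the full index is never consulted."""
    def personalized_score(hit):
        doc_id, base_score, topics = hit
        boost = sum(user_profile.get(topic, 0.0) for topic in topics)
        return base_score + boost
    return sorted(precomputed_hits, key=personalized_score, reverse=True)[:page_size]

hits = [(1, 5.0, ["sports"]), (2, 4.8, ["cooking"]), (3, 4.5, ["cooking", "travel"])]
profile = {"cooking": 1.0, "travel": 0.5}
print([doc for doc, _, _ in rerank(hits, profile)])   # [3, 2, 1]
```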

Rare terms, on the other hand, aren't much of a challenge either: one of the lists contains only a few matching documents, and you can immediately discard all the others.

I recommend reading this article:

The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
Computer Science Department, Stanford University, Stanford, CA 94305
http://infolab.stanford.edu/~backrub/google.html

And yes, that's the Google founders who wrote this. It's not the latest state, but it will already work at a pretty large scale.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange