Question

I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:

some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)

To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.

But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:

  1. set A to be the set of documents which have word 1 (by searching the index).
  2. set B to be the set of documents which have word 2 (ditto).
  3. compute the intersection of A and B.
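
If both postings lists happen to be sorted by doc ID, step 3 can be done with a single linear merge rather than a generic set intersection. A minimal sketch, assuming plain integer doc IDs:

```python
def merge_intersect(a, b):
    # a, b: postings lists sorted ascending by doc ID (an assumption;
    # the question's lists are rank-sorted, which rules this out as-is)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])      # doc appears in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the list with the smaller ID
        else:
            j += 1
    return out
```

This runs in O(|A| + |B|), not O(n log n), which is one reason real engines keep postings ID-sorted.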

Now, step three probably takes O(n log n) time to perform. For very large A and B, that could make the query slow to answer. But search engines like Google always return their answers in a few milliseconds, so that can't be the full answer.

One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).
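
That optimization could be sketched like this (the names are illustrative; `postings` stands for the per-word doc lists already fetched from the index):

```python
def top_k_matches(postings, k=10):
    # postings: one doc list per query word
    smallest = min(postings, key=len)           # start from the shortest list
    others = [set(p) for p in postings if p is not smallest]
    results = []
    for doc in smallest:
        if all(doc in s for s in others):       # doc must appear in every list
            results.append(doc)
            if len(results) == k:               # stop once we have enough hits
                break
    return results
```

With hash sets, each membership test is O(1), so in the good case this touches only a prefix of the smallest list.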

But can't we still have the following worst case? If set A matches one common word and set B matches another common word, A ∩ B may still be very small (i.e. the combination is rare). That means the search engine has to go linearly through all elements x of B, checking whether they are also elements of A, to find the few that match both conditions.

Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?

Solution

You said some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending), but I think the search engine may not do this: the doc list should be sorted by doc ID, with each doc carrying a rank for that word.
When a query arrives, it contains several keywords. For each word, you can find a doc list. Across all keywords, you can do a merge operation and compute each document's relevance to the query. Finally, return the top-ranked documents to the user.
The query process can also be distributed to gain better performance.
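
Under that assumption (ID-sorted postings where each entry carries a per-word score), the merge-and-rank step might look roughly like this; the layout and the "sum of per-word scores" relevance are toy choices, not what any real engine necessarily does:

```python
import heapq

def answer_query(index, words, k=10):
    # index: word -> list of (doc_id, score) pairs; layout is an assumption
    per_word = [dict(index[w]) for w in words if w in index]
    if not per_word:
        return []
    # conjunctive query: keep only docs that appear in every word's list
    common = set(per_word[0]).intersection(*per_word[1:])
    # toy relevance: sum of the per-word scores for each surviving doc
    scored = ((sum(p[d] for p in per_word), d) for d in common)
    return [d for score, d in heapq.nlargest(k, scored)]
```

heapq.nlargest avoids fully sorting all matches when only the top k are needed.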

Other tips

Most systems implement TF-IDF in one way or another. TF-IDF is the product of two functions: term frequency and inverse document frequency.

The IDF function relates a term's document frequency to the total number of documents in the collection. The common intuition is that it should give a higher value to terms that appear in few documents, and a lower value to terms that appear in nearly all documents, making them irrelevant.
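
As a concrete illustration, here is one common TF-IDF variant; the exact formulas differ between systems, so treat this as one choice among many:

```python
import math

def tf_idf(term_count, doc_len, num_docs, docs_with_term):
    # tf: how often the term occurs in this document, normalised by length
    tf = term_count / doc_len
    # idf: rarer terms (small docs_with_term) score higher; the +1 avoids
    # division by zero for terms that appear in no document
    idf = math.log(num_docs / (1 + docs_with_term))
    return tf * idf
```

A term appearing in nearly every document gets an IDF near (or below) zero, which captures the "irrelevant common word" intuition above.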

You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations, computing PageRank + TF-IDF.

There's a great explanation of the theory behind this in Information Retrieval: Implementing and Evaluating Search Engines, chapter 2. Another idea to investigate further is to look at how Solr implements this.

Even without ranking, I wonder how the intersection of two sets is computed so fast by Google.

Obviously the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.

Let's try "concrete", the Chinese 位置 ("site", "location"), and the Japanese 極端な ("extreme").

A Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds)".
A Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds)".
A Google search for "極端な" returns "About 7,590,000 results (0.25 seconds)".

It is extremely improbable that all three terms would ever appear in the same document, but let's google them: a search for "concrete 位置 極端な" returns "About 174,000 results (0.13 seconds)".

Adding the Russian word "игра" ("game"): a search for игра returns "About 212,000,000 results (0.37 seconds)".

A search for all of them, "игра concrete 位置 極端な", returns "About 12,600 results (0.33 seconds)".

Of course the returned search results are nonsense and they do not contain all the search terms.

But looking at the query times for the composed queries, I wonder if any intersection is computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in under 0.5 seconds, since the data is on different machines and they have to communicate.

There must be some join computation, but at least for popular words it is surely not done on the whole word index. Adding the fact that the results are fuzzy, it seems evident that Google uses an optimization of the kind "give back some high-ranked results, and stop after 0.5 sec".

How this is implemented, I don't know. Any ideas?

Google does not need to actually find all results, only the top ones. The index can be sorted by grade first and only then by ID. Since the same ID always has the same grade, this does not hurt set-intersection time.

So Google runs the intersection until it finds 10 results, and then does a statistical estimation to tell you how many more results it found.
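
A toy sketch of that idea, assuming both postings are already sorted under the same global order (grade first, then ID), represented here as plain ascending integers; the estimation formula is purely illustrative, not Google's:

```python
def first_k_and_estimate(a, b, k=10):
    # linear merge that stops as soon as k matches are found
    i = j = 0
    hits = []
    while i < len(a) and j < len(b) and len(hits) < k:
        if a[i] == b[j]:
            hits.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    # crude statistical estimate: extrapolate the hit rate observed so far
    # over the full length of the shorter list
    scanned = i if len(a) <= len(b) else j
    shorter = min(len(a), len(b))
    estimate = round(len(hits) * shorter / scanned) if scanned else len(hits)
    return hits, estimate
```

Because the shared order puts high-grade documents first, the k hits found this way are already the top-ranked ones.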

A worst case is almost impossible. If all the words are "common", then the intersection yields the first 10 results very fast. If there is a rare word, the intersection is fast because the complexity is O(N log M), where N is the size of the smallest group.
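
The O(N log M) bound can come from binary-searching each element of the smaller list in the larger one. A sketch (real engines typically use skip pointers or galloping search over compressed postings instead):

```python
from bisect import bisect_left

def search_intersect(small, large):
    # small, large: postings sorted ascending; N = len(small), M = len(large)
    out = []
    lo = 0
    for x in small:
        # binary search for x, restricted to the not-yet-scanned tail
        pos = bisect_left(large, x, lo)
        if pos < len(large) and large[pos] == x:
            out.append(x)
        lo = pos                  # later elements of small can only be further right
    return out
```

Narrowing the search window with `lo` exploits the fact that both lists are sorted, so repeated lookups never re-scan the prefix.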

You need to remember that Google keeps its indexes in memory and uses parallel computing. For example, you can split the problem into two searches, each searching only half of the web, then merge the results and take the best. Google has millions of computers.

Licensed under: CC-BY-SA with attribution