Question

I am trying to understand what exactly is going on in the background.

Given the simplified model of an inverted index (forget about positions and scores): for each word there is a sorted list of document IDs. Multi-word queries intersect those sorted lists to yield another such list. (Ranking happens at the end.)

e.g.

word1: 1 3 7 9 10 11 ...
word2: 2 3 4 9 10 12 ...
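
To make the simplified model concrete, here is a minimal sketch of intersecting two sorted document ID lists with a two-pointer walk. The function name `intersect` is made up for illustration, and this is only the textbook technique, not a claim about how Lucene implements it.

```python
def intersect(a, b):
    """Intersect two ascending lists of document IDs with a two-pointer walk."""
    i, j = 0, 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # The document contains both words.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return result


word1 = [1, 3, 7, 9, 10, 11]
word2 = [2, 3, 4, 9, 10, 12]
print(intersect(word1, word2))  # [3, 9, 10]
```

Because both lists are sorted, each is scanned at most once, so the cost is linear in the combined list length.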

Is the following understanding of fields correct?

Different fields mean different index spaces or at least different lists. e.g. having fields abstract and body could end up in a scenario like this:

abstract:word1 7 10 ...
body:word1     1 3 9 10 11 ...
abstract:word2 3 4 ...
body:word2     1 3 9 10 12 ...

Is this understanding correct? If not, what are those fields in terms of the underlying inverted index? I could not find any documentation that explicitly states how it is done internally.

Apart from that, I wonder whether there is support for searching in all/any fields. If it is implemented the way I assume, this should be troublesome or require redundancy, since lists like the ones above would have to be kept as well. Implementing fields by means of subranges of full wordlists could certainly perform better.

Would be great to know what Lucene actually does.


Solution

Since Lucene 4.0, there is a different terms dictionary per field, so your postings lists for abstract and body will be stored separately.
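
As a rough mental model of "one terms dictionary per field", the toy structure below maps field to term to postings list. The class `FieldedIndex` and its methods are hypothetical names for illustration only; Lucene's real on-disk format is far more elaborate, and the tokenization here (lowercasing and splitting on whitespace) is an assumption.

```python
from collections import defaultdict


class FieldedIndex:
    """Toy model: one term dictionary per field, each mapping term -> sorted doc IDs."""

    def __init__(self):
        # field name -> (term -> list of document IDs)
        self.fields = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, field, text):
        # Assumes documents are added in increasing doc_id order,
        # so every postings list stays sorted.
        for term in set(text.lower().split()):
            self.fields[field][term].append(doc_id)

    def postings(self, field, term):
        # Unknown fields or terms yield an empty postings list.
        return self.fields.get(field, {}).get(term, [])


index = FieldedIndex()
index.add(1, "body", "word1 word2")
index.add(3, "abstract", "word2")
index.add(3, "body", "word1 word2")
index.add(7, "abstract", "word1")

print(index.postings("abstract", "word1"))  # [7]
print(index.postings("body", "word1"))      # [1, 3]
```

In this model the same word in two fields is effectively two unrelated terms, which matches the `abstract:word1` / `body:word1` picture from the question.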

Apart from that I wonder if there is some support for a feature like searching in all/any field. If implemented like i assume it to be, this should be troublesome or require redundancy by keeping lists like above as well. Implementing fields by means of subranges of full wordlists could certainly perform better.

I'm not sure I understand what you mean by "subranges of full wordlists", but if you run a BooleanQuery across several fields, Lucene will merge the postings lists on the fly.
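
Conceptually, an "any field" search for one word is a union of that word's per-field postings lists, computed at query time rather than stored redundantly. The sketch below shows the idea with a k-way merge; `union_postings` is a hypothetical helper, and Lucene itself works with lazy iterators over postings rather than in-memory Python lists.

```python
import heapq


def union_postings(*lists):
    """Merge several ascending doc-ID lists into one ascending list without duplicates."""
    merged = []
    for doc_id in heapq.merge(*lists):
        # heapq.merge yields IDs in ascending order, so duplicates are adjacent.
        if not merged or merged[-1] != doc_id:
            merged.append(doc_id)
    return merged


abstract_word1 = [7, 10]
body_word1 = [1, 3, 9, 10, 11]
print(union_postings(abstract_word1, body_word1))  # [1, 3, 7, 9, 10, 11]
```

With the example lists from the question, the merge reproduces the unfielded list for word1, so no redundant "all fields" list needs to be kept.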

Would be great to know what Lucene actually does.

The source code is freely accessible. :-)

Licensed under: CC-BY-SA with attribution