Document retrieval with unwanted words

https://stackoverflow.com/questions/11096589

15-06-2021
|

Question

I am building a data structure that helps indexing a collection of S documents of total length n, such that it supports the following query: Given two words P1 and P2, count all the documents that contain P1 but not P2. I want the answer to be complete (not to miss results).

I've built a generalized suffix tree and pick every sqrt(n)-th leaf and its ancestors (and delete every one-childed node). For each internal node v I pre-calculate the answer for the query against node u.

But with this, if the query contains words that appear in the tree in nodes v and u, I can have the answer in O(1), but what can I do when the words are not in one of the nodes that we picked?

I can do it easily by keeping a O(n²) data structure with pre-processing and having all the possible answers ready for O(1) time retrieval, but the goal is to build this data structure in O(n) space and make the queries as efficient as possible.

Solution

It sounds like an inverted index would still be useful to you. It's a map of words onto ordered lists of documents containing them. The documents need to have a common, total ordering, and it is in this order in which they appear in their per-word buckets.

Assuming your n is total length of the corpus in word occurrences (and not vocabulary size), it can be constructed in O(n log n) time and linear space.

Given P1 and P2, you make two separate queries to get the documents containing the two terms respectively. Since the two lists share a common ordering, you can do a linear merge-like algorithm and select just those documents containing P1 but not P2:

c1 <- cursor to first element of list of docs containing P1
c2 <- cursor to first element of list of docs containing P2
results <- [] # our return value

while c1 not exhausted
  if c2 exhausted or *c1 < *c2
    results.append(c1++)
  else if *c1 == *c2
    c1++
    c2++
  else # *c1 > *c2
    c2++

return results

Notice every pass of the loop iterates at least one cursor; it runs in linear time in the sum of the sizes of the two initial queries. Since only things from the c1 cursor enter results, we know all results contain P1.

Finally, note we always advance only the "lagging" cursor: this (and the total document ordering) guarantees that if a document appears in both initial queries, there will be a loop iteration where both cursors point to that document. When this iteration occurs, the middle clause necessarily kicks in and the document is skipped by advancing both cursors. Thus documents containing P2 necessarily do not get added to results.

This query is an example of a general class called Boolean queries; it's possible to extend this algorithm to cover most any boolean expression. Certain queries break the efficiency of the algorithm (by forcing it to walk over the entire vocabulary space) but basically so long as you don't negate each term (i.e. don't ask for not P1 and not P2) you're fine. See this for an in-depth treatment.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow