Cosine similarity between a query and document in Lucene

https://stackoverflow.com/questions/7226204

15-01-2021

Question

I want to compute the cosine similarity between a long query and each document in a collection. I'm using Lucene to index the collection and submitting the queries to retrieve documents.

However, some of the queries fail with the following error:

"Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024"

I replicated some of the terms in the query to boost their weight. But it seems Lucene is just doing simple Boolean retrieval instead of calculating the cosine similarity with tf-idf weights for both the document and the query.

Can anybody confirm this?

Solution

This page explains the scoring used in Lucene:

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

It states:

The score of query q for document d correlates to the cosine-distance or dot-product between document and query vectors in a Vector Space Model (VSM) of Information Retrieval. A document whose vector is closer to the query vector in that model is scored higher.

So no, Lucene is not just using Boolean retrieval.
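To make the vector space idea concrete, here is a minimal sketch of textbook cosine similarity between a query and a document, written in plain Java. It is not Lucene's scoring code, and Lucene's actual formula adds further factors. The class name `CosineSketch`, the whitespace tokenization, and the optional `idf` map are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: textbook VSM cosine similarity.
final class CosineSketch {

    // Builds a term-frequency vector from lowercased, whitespace-separated text.
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1.0, Double::sum);
            }
        }
        return tf;
    }

    // Multiplies each term frequency by its idf weight.
    // Terms missing from the (assumed, caller-supplied) idf map keep weight 1.0.
    static Map<String, Double> applyIdf(Map<String, Double> tf, Map<String, Double> idf) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            weighted.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 1.0));
        }
        return weighted;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two sparse vectors; 0.0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }
}
```

In this sketch, repeating a term in the query raises its term frequency and therefore its weight in the query vector, which is the effect the question was after.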

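Your exception comes from your query and the way Lucene transforms it: as the message shows, the resulting BooleanQuery may hold at most 1024 clauses, and a long query with replicated terms can exceed that limit. It would be helpful if you could give an example of a query that fails.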

Furthermore, you write:

I replicated some of the terms in the query to boost their weight.

You don't have to do that. Instead, you can assign a weight (a boost) to individual terms in your query, as described in the query parser syntax: http://lucene.apache.org/java/2_0_0/queryparsersyntax.html

E.g. to search for apple and orange, and boost orange, you can write:

apple orange^4
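If the long queries are generated by repeating terms, one option is to collapse the repeats into boosts before handing the string to the query parser. The following is a minimal sketch under stated assumptions: the class `BoostedQuerySketch` and its method are hypothetical (not a Lucene API), the input is assumed to be whitespace-separated plain terms with no query syntax characters, and a boost equal to the repeat count is only a rough stand-in for repetition, so scores are not guaranteed to match exactly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: rewrites repeated terms as term^count.
final class BoostedQuerySketch {

    // Counts each term in first-seen order, then emits "term" or "term^count".
    static String collapseRepeats(String rawQuery) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : rawQuery.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey());
            if (e.getValue() > 1) {
                sb.append('^').append(e.getValue());
            }
        }
        return sb.toString();
    }
}
```

For example, `BoostedQuerySketch.collapseRepeats("apple orange orange orange orange")` returns `apple orange^4`. Because each distinct term then appears only once, the rewritten query has fewer clauses than the replicated one.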

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow