Apache Lucene - Custom weighting for semantic analysis

https://stackoverflow.com/questions/21881997

13-10-2022
Question

I'm working on a JEE application and I'm new to Lucene (through Hibernate Search), which I use to index CV documents. I'm developing a search engine that ranks candidates by keyword (e.g. HTML5). I would like to bring some semantics into the analysis: my idea is to detect the various sections of a CV and weight the same term differently depending on the section in which it appears.

So my question is: how could I modify the Lucene core to implement my custom weighting rules, assuming I have a method that gives me the weight for a term occurrence? I would have something like:

term.setWeight(term.getSection().getWeightSection());

Here, term is a term in the Lucene sense.

PS: 1) I read the Lucene core documentation but I can't find precisely what I'm looking for. So far I have only found the Weight class, but as I understand it, that class is used to weight queries, not terms.

2) I'm not a native English speaker, so if something is not clear, please ask for details or clarification.

Thanks a lot.

Nico.

Solution

Rather than having one big body field and trying to apply weights to segments within it, you should define multiple fields for the different sections of the document. You can then apply a boost to a field at index time simply enough, with Field.setBoost.

To conveniently search over all of those fields, you can use MultiFieldQueryParser.
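To make the idea concrete, here is a minimal plain-Java sketch of what "one field per section, each with its own boost" amounts to. It is only a toy illustration and not Lucene's scoring formula: the class SectionWeightedScorer, its methods, the section names, and the boost values are all hypothetical. In Lucene itself the per-field weights would be supplied through Field.setBoost at index time (in versions that still have it; index-time boosts were removed in Lucene 7) or through the boosts map that MultiFieldQueryParser accepts at query time.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy illustration only: this is not how Lucene computes scores. */
final class SectionWeightedScorer {

    /** Counts case-insensitive occurrences of keyword among the alphanumeric tokens of text. */
    static int termFrequency(String text, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        // Split on anything that is not a lowercase letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Sums (term frequency in section) * (boost of section) over all sections.
     * Sections without an explicit boost count with a weight of 1.0.
     */
    static double score(Map<String, String> sections, Map<String, Float> boosts, String keyword) {
        double total = 0.0;
        for (Map.Entry<String, String> section : sections.entrySet()) {
            float boost = boosts.getOrDefault(section.getKey(), 1.0f);
            total += boost * termFrequency(section.getValue(), keyword);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical section weights: skills matter most, hobbies least.
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("skills", 3.0f);
        boosts.put("experience", 2.0f);
        boosts.put("hobbies", 0.5f);

        // One CV, already split into sections (one "field" per section).
        Map<String, String> cv = new LinkedHashMap<>();
        cv.put("skills", "HTML5, CSS3, Java");
        cv.put("experience", "Built an HTML5 front end for a JEE application");
        cv.put("hobbies", "Reading about HTML5 games");

        System.out.println(score(cv, boosts, "HTML5")); // prints 5.5
    }
}
```

The same keyword contributes 3.0, 2.0, and 0.5 depending on the section it appears in, which is the effect the question asks for. Splitting the CV into sections is the part you have to write yourself; once each section is its own field, the weighting no longer requires changes to the Lucene core.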

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow