Preventing certain docs from being indexed in CLucene

https://stackoverflow.com/questions/18234775

clucene
standardanalyzer

24-06-2022

Question

I am building a search index with CLucene, and I want to make sure that docs containing any offensive terms never get added to the index. Using a StandardAnalyzer with a stop list is not good enough: the stop list only drops the offensive terms themselves, so the offensive doc still gets added and would be returned for non-offensive searches.

Instead, I am hoping to build up a document, check whether it contains any offensive words, and add it only if it doesn't.

Cheers!

Solution

You can't really access that kind of data in a Document: it only holds the field values you gave it, not the tokens the analyzer will produce from them.

What you can do is run the analysis chain manually on the text and check each token individually, before the document is added. You can do this in a simple loop over the tokens, or by adding another filter to the analysis chain that just raises a flag, which you check afterwards.
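The simple-loop variant could look roughly like the sketch below. It is only an illustration, not CLucene code: tokenize() is a hypothetical stand-in for the analyzer (it just lowercases and splits on non-alphanumeric characters), and containsBlockedTerm() and shouldIndex() are made-up names. In a real index you would pull the tokens from the token stream of the same analyzer you index with, so the check sees exactly the terms that would be indexed; the exact calls for that depend on your CLucene version.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the analyzer: lowercases the text and splits it
// on any non-alphanumeric character. Real code would read the tokens from
// the CLucene analyzer's token stream instead.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Returns true if any token of the text is on the block list.
// The block list is expected to hold lowercase terms.
bool containsBlockedTerm(const std::string& text,
                         const std::unordered_set<std::string>& blocked) {
    for (const std::string& token : tokenize(text)) {
        if (blocked.count(token) > 0) {
            return true;
        }
    }
    return false;
}

// Hypothetical gate: call this before building the Document and handing
// it to the index writer, and skip the document when it returns false.
bool shouldIndex(const std::string& text,
                 const std::unordered_set<std::string>& blocked) {
    return !containsBlockedTerm(text, blocked);
}
```

With this in place, the indexing code would call shouldIndex() on each field's text and only add the document when every field passes. The check matches whole tokens only, so a blocked term embedded in a longer word is not caught unless the analyzer splits it out.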

This introduces some extra work, but it is the best way to achieve that, IMO.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow