How to validate a chapter heading for text using fuzzy logic in Java

https://stackoverflow.com/questions/20212333

05-08-2022

Question

I need a solution for identifying incorrect chapter headings in a book.

We are developing an ingestion system for books that does all sorts of validation, like spell-checking and offensive-language-filtering. Now we'd like to flag chapter headings that seem inaccurate given the chapter body. For example, if the heading was "The Function of the Spleen", I would not expect the chapter to be about the liver.

I am familiar with fuzzy string matching algorithms, but this seems more like an NLP or classification problem. If I could match (or closely match) the phrase "function of the spleen", then that's great -- high confidence. Otherwise, a high occurrence of both "function" and "spleen" in the text also yields confidence. And of course, the closer together they are, the better.
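To make those signals concrete, here is a minimal sketch of two of them: an exact, in-order phrase match, and a proximity measure defined as the shortest run of body tokens that contains every heading term. The class name `HeadingProximity` and its methods are hypothetical, tokenization is a plain lowercase split on non-alphanumerics, and the caller is assumed to have already dropped noise words from the term set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class HeadingProximity {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // True if the phrase tokens occur in the body as one contiguous, ordered run.
    static boolean containsPhrase(List<String> body, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(body, phrase) >= 0;
    }

    // Length in tokens of the shortest body span that contains every term,
    // or -1 if at least one term never occurs. Classic sliding-window scan.
    static int smallestWindow(List<String> body, Set<String> terms) {
        if (terms.isEmpty()) {
            return -1;
        }
        Map<String, Integer> counts = new HashMap<>();
        int covered = 0;
        int best = -1;
        int left = 0;
        for (int right = 0; right < body.size(); right++) {
            String in = body.get(right);
            if (terms.contains(in) && counts.merge(in, 1, Integer::sum) == 1) {
                covered++;
            }
            // Shrink from the left while the window still holds every term.
            while (covered == terms.size()) {
                int length = right - left + 1;
                if (best < 0 || length < best) {
                    best = length;
                }
                String out = body.get(left++);
                if (terms.contains(out) && counts.merge(out, -1, Integer::sum) == 0) {
                    covered--;
                }
            }
        }
        return best;
    }
}
```

A phrase hit could be treated as high confidence, and otherwise a small window relative to the number of heading terms could raise the score. How to weight the two is a design choice this sketch leaves open.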

This needs to be done in-memory, on the fly, and in Java.

My current naive approach is to simply tokenize all the words, remove noise words (like prepositions), stem what's left, and then count the number of matches. At a minimum I'd expect each word in the heading to appear at least once in the text.
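For reference, the naive approach just described might look roughly like the sketch below. It rests on several assumptions: the class `HeadingCoverage` is hypothetical, the stop-word list is a tiny illustrative sample, and `stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any library.
final class HeadingCoverage {

    // Illustrative sample only; a real system would use a fuller list.
    private static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "the", "of", "in", "on", "to", "for", "and", "or", "is", "are");

    // Crude stand-in for a real stemmer: strips a few common English suffixes.
    static String stem(String word) {
        if (word.endsWith("ss")) {
            return word; // keep "process", "glass" intact
        }
        for (String suffix : List.of("ing", "es", "ed", "s")) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Tokenize, drop noise words, and stem what is left.
    static List<String> normalize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .map(HeadingCoverage::stem)
                .collect(Collectors.toList());
    }

    // Fraction of distinct heading terms that occur at least once in the body.
    static double coverage(String heading, String body) {
        Set<String> headingTerms = new LinkedHashSet<>(normalize(heading));
        if (headingTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> bodyTerms = new HashSet<>(normalize(body));
        long hits = headingTerms.stream().filter(bodyTerms::contains).count();
        return (double) hits / headingTerms.size();
    }
}
```

With this sketch, a coverage below 1.0 means at least one heading word never appears in the chapter, which is the minimum check described above. It says nothing about proximity or ordering.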

Is there a different approach, ideally one that would take into account things like proximity and ordering?

Solution

I think this is a classification problem; as such, take a look at WEKA, a Java machine-learning toolkit.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow