Question

I need a solution for identifying incorrect chapter headings in a book.

We are developing an ingestion system for books that does all sorts of validation, like spell-checking and offensive-language-filtering. Now we'd like to flag chapter headings that seem inaccurate given the chapter body. For example, if the heading was "The Function of the Spleen", I would not expect the chapter to be about the liver.

I am familiar with fuzzy string matching algorithms, but this seems more like an NLP or classification problem. If I could match (or closely match) the phrase "function of the spleen", then that's great -- high confidence. Otherwise, a high occurrence of both "function" and "spleen" in the text also yields confidence. And of course, the closer together they are, the better.
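To make the "closer together is better" idea concrete, here is a rough sketch that finds the shortest run of body tokens containing every required heading term. The class and method names (HeadingProximity, tokenize, smallestWindow) are made up for illustration, the tokenizer assumes plain ASCII English, and word order is not taken into account.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not from any library: measures how tightly the
// heading terms cluster together in the chapter body.
final class HeadingProximity {

    // Lowercases and splits on anything that is not an ASCII letter or digit
    // (an assumption: English text only).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns the length, in tokens, of the shortest span of the body that
    // contains every required term, or -1 if at least one term never occurs.
    static int smallestWindow(List<String> body, Set<String> required) {
        if (required.isEmpty()) {
            return 0;
        }
        Map<String, Integer> counts = new HashMap<>();
        int covered = 0;
        int best = -1;
        int left = 0;
        for (int right = 0; right < body.size(); right++) {
            String in = body.get(right);
            if (required.contains(in) && counts.merge(in, 1, Integer::sum) == 1) {
                covered++;
            }
            // Shrink from the left while the window still holds every term.
            while (covered == required.size()) {
                int width = right - left + 1;
                if (best < 0 || width < best) {
                    best = width;
                }
                String out = body.get(left++);
                if (required.contains(out) && counts.merge(out, -1, Integer::sum) == 0) {
                    covered--;
                }
            }
        }
        return best;
    }
}
```

A small window relative to the number of heading terms would suggest the terms are discussed together; how to turn that into a confidence score is still an open choice.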

This needs to be done in-memory, on the fly, and in Java.

My current naive approach is to simply tokenize all the words, remove noise words (like prepositions), stem what's left, and then count the number of matches. At a minimum I'd expect each word in the heading to appear at least once in the text.
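For reference, the naive approach looks roughly like the sketch below. The stop-word list is a tiny illustrative one and the stemmer is a crude suffix-stripping placeholder, not a real Porter stemmer; the names (HeadingOverlap, terms, coverage) are only for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper illustrating the tokenize / filter / stem / count idea.
final class HeadingOverlap {

    // Tiny illustrative stop-word list (an assumption; a real one is far longer).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "in", "on", "to", "and", "or", "for", "is", "are", "with"));

    // Placeholder stemmer: strips a few common English suffixes.
    // A real system would use a proper stemming algorithm instead.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "es", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Distinct stemmed, non-stop-word terms of the text (ASCII English assumed).
    static Set<String> terms(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(stem(token));
            }
        }
        return result;
    }

    // Fraction of distinct heading terms that appear at least once in the body.
    static double coverage(String heading, String body) {
        Set<String> headingTerms = terms(heading);
        if (headingTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> bodyTerms = terms(body);
        int hits = 0;
        for (String term : headingTerms) {
            if (bodyTerms.contains(term)) {
                hits++;
            }
        }
        return (double) hits / headingTerms.size();
    }
}
```

A coverage of 1.0 corresponds to the minimum expectation that every heading word appears at least once; anything lower would be flagged. This ignores frequency, proximity, and ordering entirely.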

Is there a different approach, ideally one that would take into account things like proximity and ordering?

Solution

I think this is a classification problem, so take a look at WEKA, a machine-learning toolkit written in Java.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow