Question

I've been searching for a while now, but have found nothing that suits my needs so far. (This was helpful, but not convincing.)

From two different sources, I get two different strings. I want to check whether the shorter one is contained within the longer one. However, since both strings originate from OCR'd documents, there may be noticeable differences between them.

Example:

String textToSearch = "Recognized Headline";
String documentText = "This is the document text, spanning multiple pages" +
                      "..." +
                      "..." +
                      "This the row with my Recognizect Head1ine embedded" +
                      "..." +            // ^^^^^^^^^^^^^^^^^^^^ OCR variant of textToSearch
                      "..." +
                      "End of the document";

How can I find my string reliably in the page without using a standalone Lucene/Solr installation? (Or maybe I've just not found the tutorial/manual). There must be some library out there which can do this, right?

Solution

First of all, you need to identify your input source. A web page's markup can be processed in two ways: SAX (an event-driven model without context) or DOM (a tree-based model with context). SAX is ideal here because you don't need contextual information to retrieve a stream of tokenized text nodes from the document. Convert all the text nodes into a stream of tokens.
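The tokenizing step can be sketched with the SAX parser that ships with the JDK. This is a minimal illustration, not a complete solution: the class name `TextTokens` and the method `tokenize` are hypothetical, and the sketch assumes the input is well-formed XML or XHTML (real-world HTML usually needs a more tolerant parser first).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: turns the text nodes of a well-formed document into tokens.
final class TextTokens {

    static List<String> tokenize(String xhtml) {
        final List<String> tokens = new ArrayList<>();
        final StringBuilder pending = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so buffer them.
                pending.append(ch, start, length);
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                flush();
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flush();
            }

            // Split the buffered text on whitespace at every element boundary.
            private void flush() {
                for (String token : pending.toString().trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        tokens.add(token);
                    }
                }
                pending.setLength(0);
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return tokens;
    }
}
```

Because the buffer is flushed at every element boundary, a word that is split by inline markup ends up as two tokens. Whether that matters depends on how the OCR output is structured.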

Once you have a stream of tokens, you can do your processing on them. For large amounts of input, algorithms like Levenshtein string matching become inadequate. Instead, look into Markov chains. They can help match a set of inputs against a set of outputs fairly reliably and efficiently.
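For reference, the Levenshtein-style matching mentioned above can be sketched as an approximate substring search (often attributed to Sellers): the usual edit-distance table, except that a match is allowed to start at any position in the text. This is the baseline approach, not the Markov chain approach, and it costs time proportional to pattern length times text length, which is why it may be too slow for very large inputs. The names `FuzzyFind`, `Match`, and `bestMatch` are hypothetical.

```java
// Hypothetical helper: finds the substring of `text` closest to `pattern`
// by edit distance (insertions, deletions, substitutions).
final class FuzzyFind {

    static final class Match {
        final int end;       // exclusive end index of the best match in the text
        final int distance;  // edit distance between pattern and that match

        Match(int end, int distance) {
            this.end = end;
            this.distance = distance;
        }
    }

    static Match bestMatch(String pattern, String text) {
        int m = pattern.length();
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];

        // Against an empty text, matching the first i pattern chars costs i deletions.
        for (int i = 0; i <= m; i++) {
            prev[i] = i;
        }

        int bestEnd = 0;
        int bestDistance = m;

        for (int j = 1; j <= text.length(); j++) {
            curr[0] = 0; // a match may start at any text position for free
            for (int i = 1; i <= m; i++) {
                int cost = pattern.charAt(i - 1) == text.charAt(j - 1) ? 0 : 1;
                curr[i] = Math.min(
                        Math.min(prev[i] + 1,      // extra character in the text
                                 curr[i - 1] + 1), // pattern character missing in the text
                        prev[i - 1] + cost);       // match or substitution
            }
            // The last row holds the cost of matching the whole pattern ending at j.
            if (curr[m] < bestDistance) {
                bestDistance = curr[m];
                bestEnd = j;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return new Match(bestEnd, bestDistance);
    }
}
```

A caller would typically accept a match when `distance` is small relative to the pattern length, for example `FuzzyFind.bestMatch(textToSearch, documentText).distance <= textToSearch.length() / 5`. That threshold is an assumption and would need tuning against real OCR output.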

Licensed under: CC-BY-SA with attribution