Can OpenNLP use HTML tags as part of the training?

https://stackoverflow.com/questions/10093677

30-05-2021

Question

I'm creating a training set for the TokenNameFinder using HTML documents converted into plain text, but my precision is low, so I want to use the HTML tags as part of the training, for example words in bold and sentences with different margin sizes. Will OpenNLP accept and use those tags to create rules? Is there another way to make use of those tags to improve precision?

Solution

It is not clear what you mean by using HTML tags to train OpenNLP. The training input consists of annotated, tokenized sentences:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of <START:company> Elsevier N.V. <END> , the Dutch publishing group .

To train an OpenNLP model using the standard tooling, your annotations need to follow this convention. Note that the annotations do not follow the XML standard (for example, <END> is not a matching closing tag for <START:person>).
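To make the convention concrete, here is a minimal sketch in plain Java that reads one annotated line and recovers the tokens and the name spans. It is not OpenNLP's own parser: the class and method names (NameAnnotationReader, readSpans, Span) are made up for illustration, and it only handles the typed <START:type> ... <END> form shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: illustrates the annotation convention only.
final class NameAnnotationReader {

    /** Tokens [start, end) of the sentence (markers removed) carry the given type. */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + ".." + end + ")";
        }
    }

    /** Appends the sentence tokens to tokensOut and returns the annotated spans. */
    static List<Span> readSpans(String line, List<String> tokensOut) {
        List<Span> spans = new ArrayList<>();
        String openType = null; // type of the span currently open, if any
        int openStart = -1;

        // The format is whitespace-tokenized, so splitting on whitespace is enough.
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            if (part.startsWith("<START:") && part.endsWith(">")) {
                if (openType != null) {
                    throw new IllegalArgumentException("<START:...> inside an open span");
                }
                openType = part.substring("<START:".length(), part.length() - 1);
                openStart = tokensOut.size();
            } else if (part.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without <START:...>");
                }
                spans.add(new Span(openStart, tokensOut.size(), openType));
                openType = null;
            } else {
                tokensOut.add(part); // ordinary token, including things like <b> or </i>
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("span never closed with <END>");
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        String line = "Mr . <START:person> Vinken <END> is chairman of "
                + "<START:company> Elsevier N.V. <END> , the Dutch publishing group .";
        System.out.println(readSpans(line, tokens));
        System.out.println(tokens);
    }
}
```

Note that anything that is not a <START:...> or <END> marker is treated as a plain token here, so an HTML tag such as <b> simply ends up in the token list.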

You can embed the annotations directly in the HTML documents you will use for training. The extra context might even help the classifier, but I have not seen any experimental results on this.

Keep in mind that the training data must be tokenized. This means you should put whitespace between words and punctuation, as well as between text elements and HTML tags:

<p> <i> Mr . <START:person> Vinken <END> </i> is chairman of <b> <START:company> Elsevier N.V. <END> </b> , the Dutch publishing group .
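If your HTML has tags glued to the text, a small preprocessing step can add the whitespace for you. The following is only a sketch in plain Java, with a made-up class name (HtmlTagSpacer) and a simple regex heuristic. It assumes tags without attributes: a tag like <p style="..."> contains spaces and would still be split into several tokens, so you would need to normalize such tags first. It also does not tokenize the words and punctuation themselves.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: pads tag-like markers with whitespace.
final class HtmlTagSpacer {

    // Anything of the form <...> with no nested angle brackets: HTML tags such as
    // <b> or </p>, and also the <START:type> and <END> annotation markers.
    private static final Pattern TAG = Pattern.compile("<[^<>]+>");

    /** Surrounds every tag-like marker with single spaces and collapses extra whitespace. */
    static String spaceOutTags(String line) {
        String padded = TAG.matcher(line).replaceAll(" $0 ");
        return padded.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String raw = "<p><i>Mr . <START:person> Vinken <END></i> is chairman of "
                + "<b><START:company> Elsevier N.V. <END></b>, the Dutch publishing group .";
        System.out.println(spaceOutTags(raw));
    }
}
```

Because the pattern also matches the annotation markers, it is safe to run on lines that are already annotated; markers that are already separated by spaces are left as they are.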

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow