Question

I am doing manual tagging to train my own NER model. Do I have to include the untagged text in the sentences I am preparing for named entity recognition?

<START:person> Olivier Grisel <END> is working on the <START:software> Stanbol <END> project .

Or can I omit untagged parts like this?

<START:person> Olivier Grisel <END>
<START:software> Stanbol <END>

PS: Thanks for all the great answers. I tried omitting the untagged parts and in that case OpenNLP marked every line as an entity, so it didn't work. As the answers explain, untagged parts are necessary.

Solution

If you are doing manual tagging to train your own NER model (it's not 100% clear from your question), you should include the same kind of data you expect to tag later, most likely full sentences. The default model features (see OpenNLP docs) include a window of tokens to the left and right of the token that's currently being considered, so you want your labeled entities to appear in their normal context. You also want your model to learn which words shouldn't be tagged as entities, so they also need to appear in context in your training data.
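To make the context argument concrete, here is a minimal sketch in plain Python. It is not OpenNLP's actual feature generator: `parse_annotated`, `window`, the label "other", and the window size of 2 are all assumptions made for illustration. It parses the annotation format from the question and shows what a context window around "Stanbol" contains in each version of the training data.

```python
import re

# Matches the <START:type> markers used in the question's training format.
START = re.compile(r"<START:([^>]+)>")
END = "<END>"

def parse_annotated(line):
    """Turn one whitespace-tokenized annotated line into (token, label) pairs.

    Tokens outside any <START:type> ... <END> span get the label "other"
    (a label name chosen here purely for illustration).
    """
    pairs = []
    current = "other"
    for tok in line.split():
        m = START.fullmatch(tok)
        if m:
            current = m.group(1)
        elif tok == END:
            current = "other"
        else:
            pairs.append((tok, current))
    return pairs

def window(tokens, i, size=2):
    """Return the tokens up to `size` positions left and right of index i."""
    left = tokens[max(0, i - size):i]
    right = tokens[i + 1:i + 1 + size]
    return left, right

full = ("<START:person> Olivier Grisel <END> is working on the "
        "<START:software> Stanbol <END> project .")
bare = "<START:software> Stanbol <END>"

for line in (full, bare):
    pairs = parse_annotated(line)
    tokens = [t for t, _ in pairs]
    i = tokens.index("Stanbol")
    print(pairs[i], window(tokens, i))
```

With the full sentence, the window around "Stanbol" holds real neighbours ("on", "the", "project", ".") and several tokens carry the "other" label. With the bare line, the window is empty on both sides and no token is labeled "other", so there are no negative examples to learn from.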

See the related question: Open NLP Name Finder Training

OTHER TIPS

It depends on how you plan to use the corpus. If you omit the untagged parts, you lose the positions of the entities in the document.

In classic named entity recognition you also need the exact positions of entities. Your system will probably combine rules, dictionaries, and statistical taggers, and it will go over each token in the text, checking whether it is part of a named entity (NE recognition) and what type it is (NE classification).

If you don't know where in your manually annotated corpus an entity appears, it's not clear how you can use it. If you only need the corpus to compare lists of entities (the ones found by your system against the ones in the corpus), then you can probably go without the positions, but you won't be able to check where each entity appears.

For instance if you have a document:

"I know Barack Obama is the president but who is Michelle Obama? Barack Obama is definitely the president."

If your system finds "Barack Obama" once and your corpus lists "Barack Obama" twice, you don't know which occurrence you found. Your system could also return just "Obama" as a named entity.
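The difference between comparing entity lists and comparing positions can be sketched like this. The helper `find_spans` and the "system output" are hypothetical, made up for illustration; character offsets stand in for whatever position scheme your corpus uses.

```python
import re

doc = ("I know Barack Obama is the president but who is Michelle Obama? "
       "Barack Obama is definitely the president.")

def find_spans(text, name):
    """Return (start, end) character offsets of every occurrence of `name`."""
    return [(m.start(), m.end()) for m in re.finditer(re.escape(name), text)]

# Gold annotations with positions: both mentions of "Barack Obama".
gold_spans = find_spans(doc, "Barack Obama")

# Hypothetical system output: it only found the second mention.
predicted_spans = [gold_spans[1]]

# Comparing bare name lists hides the miss: both sides contain the same name.
gold_names = {doc[s:e] for s, e in gold_spans}
pred_names = {doc[s:e] for s, e in predicted_spans}
print("names match:", gold_names == pred_names)

# Comparing positions shows exactly which mention was missed.
missed = sorted(set(gold_spans) - set(predicted_spans))
print("missed spans:", missed)
```

The name-level comparison reports a match even though one mention was missed, while the span-level comparison pinpoints the missed occurrence.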

In short, I wouldn't do it :)

Licensed under: CC-BY-SA with attribution