Question

I'm trying to tokenise Peruvian names so that I can split them into separate name fields. My concern is how OpenNLP handles the complicated names that occur in Peru, e.g. Rafael de la Fuente Benavides. Would OpenNLP treat that whole string as one name, or would the de/la/del/los connectors confuse it? Also, how does OpenNLP determine when a name is "done"? Some Peruvian names are quite long (e.g. Jose Manuel de los Reyes Gonzalez de Prada y Ulloa), so I wonder whether OpenNLP would split such a name into 2 or 3 "names".

The goal is to use the tokenization to separate each name into the correct fields for a person database, e.g.

Rafael de la Fuente Benavides   ----> First: Rafael       Paternal Last: Benavides
Jose Carlos Mariategui La Chira ----> First: Jose Carlos  Paternal Last: Mariategui  Maternal Last: Chira

Solution

To recognize names reliably, OpenNLP's name finder must be trained. You provide a training file formatted like this (other formats are also supported):

Sé <START:person> Rafael de la Fuente Benavides <END> , que trabajan en España

The training file must contain one sentence per line, and each sentence may contain one or more names. According to the documentation, the training data should contain at least 15000 sentences for the model to perform well.
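If you already have tokenised sentences and know where each name starts and ends, the training lines can be generated rather than written by hand. The following is only a minimal sketch under that assumption: `TrainingLineBuilder` and `annotate` are hypothetical names, not part of OpenNLP, and for brevity it marks a single name per sentence.

```java
import java.util.List;

/** Hypothetical helper (not an OpenNLP class): builds one annotated training line. */
final class TrainingLineBuilder {

    /**
     * Wraps tokens[start, end) in "<START:type> ... <END>" and joins all tokens
     * with single spaces, so that every tag is surrounded by whitespace.
     */
    static String annotate(List<String> tokens, int start, int end, String type) {
        if (start < 0 || end > tokens.size() || start >= end) {
            throw new IllegalArgumentException("invalid span: " + start + ".." + end);
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i == start) {
                line.append("<START:").append(type).append("> ");
            }
            line.append(tokens.get(i));
            if (i == end - 1) {
                line.append(" <END>");
            }
            if (i < tokens.size() - 1) {
                line.append(' ');
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Punctuation is its own token, so it ends up separated from <END> by a space.
        List<String> tokens = List.of("Sé", "Rafael", "de", "la", "Fuente", "Benavides",
                ",", "que", "trabajan", "en", "España");
        System.out.println(annotate(tokens, 1, 6, "person"));
    }
}
```

Because the connectors (de, la, los, y) sit inside the annotated span, the model sees them as part of the name during training. How well it then generalises to long names depends on how many such examples the training data contains.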

Pay attention to the whitespace that precedes and follows each tag: a span written as <START:person>Rafael<END> would be rejected.
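A quick pre-check of the training file can catch spacing mistakes before training starts. The sketch below is an illustration only: `TrainingLineCheck` and `isWellFormed` are hypothetical names, and the tag pattern is an assumption about the format described above, not OpenNLP's own parser. It only verifies that tags are standalone whitespace-separated tokens and that each START is closed by an END.

```java
/** Hypothetical pre-check (not an OpenNLP class) for one line of name-finder training data. */
final class TrainingLineCheck {

    /**
     * Returns true if every tag stands alone as a whitespace-separated token
     * and START/END tags alternate correctly on the line.
     */
    static boolean isWellFormed(String line) {
        boolean open = false;
        for (String token : line.trim().split("\\s+")) {
            if (token.matches("<START(:[^>\\s]+)?>")) {
                if (open) {
                    return false; // a second START before the previous span was closed
                }
                open = true;
            } else if (token.equals("<END>")) {
                if (!open) {
                    return false; // END without a preceding START
                }
                open = false;
            } else if (token.contains("<START") || token.contains("<END")) {
                return false; // tag glued to a word or punctuation, or malformed like "<END >"
            }
        }
        return !open; // an unclosed START is also an error
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("Sé <START:person> Rafael de la Fuente Benavides <END> , que trabajan en España"));
        System.out.println(isWellFormed("Sé <START:person>Rafael<END> , que trabajan en España"));
    }
}
```

Running `main` prints `true` for the first line and `false` for the second, where the tags touch the name.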

Licensed under: CC-BY-SA with attribution