Algorithms for splitting personal names into parts

https://stackoverflow.com/questions/1633883

06-07-2019

Question

I'm looking for references on splitting a name such as "John A. Doe" into parts: first=John, middle=A., last=Doe. In Mexico, names have paternal and maternal surnames plus first and second given names, and these can be written in different permutations, so the problem is quite complex.

Because the result depends on the data, we are working with matching software that calculates a score for every word so that we can make decisions (it is based on a large database). The input data is not clean: it is imported from government web pages and filtered by hand, so it may contain junk that also has to be recognized. Any suggestions?

[Edit] Examples:

name: Javier Abdul Córdoba Gándara
common permutations (as they may appear in government data referring to the same person):
   Córdoba Gándara Javier Abdul
   Javier A. Córdoba Gándara
   Javier Abdul Córdoba G.

paternal: Córdoba
maternal: Gándara
first given: Javier
second given: Abdul

name: María de la Luz Sánchez Martínez
paternal: Sánchez
maternal: Martínez
first given: María de la Luz

name: Paloma Viridiana Alin Arias Medina
paternal: Arias
maternal: Medina
first given: Paloma
second given: Viridiana Alin

As I said, the meaning of each word depends on the score. One has no way of knowing that Viridiana and Alin are given names other than from the score.

We have a very large database (80 million records or so), so the scoring system is useful to us. I am designing an algorithm that uses it, but I am also looking for other references.
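To make the idea concrete, here is a minimal sketch of score-based splitting. It assumes the database can tell how often each word was seen as a given name and as a surname. The `COUNTS` table, its numbers, and the function names are invented for illustration and are not taken from the real system. It does not handle particles such as "de la".

```python
from typing import Dict, List, Tuple

# Hypothetical counts: word -> (times seen as given name, times seen as surname).
# In practice these would be looked up in the large reference database.
COUNTS: Dict[str, Tuple[int, int]] = {
    "javier": (9000, 20),
    "abdul": (300, 15),
    "córdoba": (10, 4000),
    "gándara": (2, 1200),
}


def given_score(word: str) -> float:
    """Estimated probability that `word` is a given name (0.5 if unseen)."""
    given, surname = COUNTS.get(word.lower(), (0, 0))
    # Add-one smoothing so that unseen words score exactly 0.5.
    return (given + 1) / (given + surname + 2)


def split_name(words: List[str]) -> Tuple[str, List[str], List[str], float]:
    """Try every split point in both orders and keep the best-scoring layout.

    Returns (order, given_words, surname_words, mean_score).
    """
    if len(words) < 2:
        raise ValueError("need at least two words to split")
    scores = [given_score(w) for w in words]
    best: Tuple[float, str, List[str], List[str]] = (-1.0, "", [], [])
    for k in range(1, len(words)):
        head, tail = scores[:k], scores[k:]
        layouts = [
            # First k words are given names, the rest are surnames.
            ("given-first", sum(head) + sum(1 - s for s in tail), words[:k], words[k:]),
            # First k words are surnames, the rest are given names.
            ("surname-first", sum(1 - s for s in head) + sum(tail), words[k:], words[:k]),
        ]
        for order, total, given, surnames in layouts:
            mean = total / len(words)
            if mean > best[0]:
                best = (mean, order, given, surnames)
    mean, order, given, surnames = best
    return order, given, surnames, mean
```

With these made-up counts, both "Javier Abdul Córdoba Gándara" and "Córdoba Gándara Javier Abdul" come out with Javier and Abdul as given names. The mean score can serve as a rough confidence value for deciding which records need manual review.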

Solution

Unfortunately - and having done quite a bit of this work myself - your ideal algorithm will be very data specific, and you will need to work this out for your particular situation.

Of the total time and effort to develop this algorithm, I'd say the time will be split roughly as follows:

10% for general string manipulation
30% for the specific nature of the data (Mexican name formats, data input quirks)
60% to cater for data quality / lack of quality

And I believe that's quite generous towards the general string manipulation. Of course, it depends on whether you need quality results for all records or only for the 'clean' ones. If you are able to ignore the 'difficult' records, the task becomes a lot simpler.

Some general tips

If they are not required, remove characters that are neither alphanumeric nor whitespace
Split on spaces
Use hyphens / punctuation to identify surnames or family names
Initials (which are generally single letters) are not surnames; i.e. they must be first / middle names
Determine the level of confidence that you have programmatically identified each name (and test this thoroughly). You may find there are subsets of data that contain similar patterns that need to be catered for individually (they may come from different sources etc)
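
The first few tips can be sketched roughly as follows. This is only an illustration: the function names and the labels ("initial", "hyphenated", "word", "junk") are made up here, and periods, hyphens and apostrophes are deliberately kept because the later tips rely on them.

```python
import re
from typing import List, Tuple


def tokenize(raw: str) -> List[str]:
    """Replace unwanted characters with spaces, then split on whitespace."""
    # Keep word characters (letters, including accented ones, and digits),
    # whitespace, periods, apostrophes and hyphens; blank out everything else.
    cleaned = re.sub(r"[^\w\s.'-]", " ", raw).replace("_", " ")
    return cleaned.split()


def is_initial(token: str) -> bool:
    """True for a single letter, optionally followed by a period (e.g. 'A.')."""
    return re.fullmatch(r"[^\W\d_]\.?", token) is not None


def rough_labels(raw: str) -> List[Tuple[str, str]]:
    """Attach a coarse label to every token of a raw name string."""
    labelled = []
    for tok in tokenize(raw):
        has_letter = any(c.isalpha() for c in tok)
        has_digit = any(c.isdigit() for c in tok)
        if not has_letter or has_digit:
            label = "junk"          # e.g. stray numbers or lone punctuation
        elif is_initial(tok):
            label = "initial"       # per the tip: not a surname
        elif "-" in tok:
            label = "hyphenated"    # per the tip: a hint towards a surname
        else:
            label = "word"          # needs further evidence (scores, position)
        labelled.append((tok, label))
    return labelled
```

The share of "junk" tokens in a record is one crude input to a confidence value. The labels alone do not decide anything for plain words, which still need data-specific evidence.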

Other tips

You may need to add some natural language processing or machine learning as a check. The problem of identifying author names (e.g. in scientific papers) is difficult, as they can be reported with differing orders, degrees of abbreviation, elisions etc. If your database is dirty, you will end up with ambiguity whatever you do.
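
As a small illustration of tolerating differing orders and abbreviations, the sketch below checks whether two spellings could refer to the same name. The function names are hypothetical and the approach is a simple heuristic, not an established algorithm: it ignores word order and lets a single-letter initial stand for any word starting with that letter. It will accept some pairs that are really different people, which is exactly the ambiguity mentioned above.

```python
from typing import List


def normalise(name: str) -> List[str]:
    """Lower-case the name, drop periods and split on whitespace."""
    return name.lower().replace(".", " ").split()


def compatible(name_a: str, name_b: str) -> bool:
    """True if the two spellings could be the same name, ignoring word order."""
    a, b = normalise(name_a), normalise(name_b)
    if len(a) != len(b):
        return False
    remaining = list(b)
    unmatched = []
    # Pass 1: exact word matches, regardless of position.
    for tok in a:
        if tok in remaining:
            remaining.remove(tok)
        else:
            unmatched.append(tok)
    # Pass 2: an initial matches any leftover word starting with that letter.
    for tok in unmatched:
        hit = next(
            (r for r in remaining
             if (len(tok) == 1 or len(r) == 1) and tok[0] == r[0]),
            None,
        )
        if hit is None:
            return False
        remaining.remove(hit)
    return True
```

The greedy second pass is kept simple on purpose. A production matcher would need to weigh competing candidates, for example with the word scores from the database.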

Licensed under: CC-BY-SA with attribution