Question

In the example below, I want to make 2 groups in a regex:

Name FirstSurname SecondSurname ..

The first group would be Name

The second would be FirstSurname SecondSurname ...

^(\w+)(.*)$   - would capture all
\w+           - would make n groups (number of words). 

I want only 2 groups: the first name in one, and everything that follows in the other.

Any help?


Solution

First, as someone with punctuation in my given name :-) PLEASE don't use \w to try to match names :-) … both - and ' are not uncommon.

Using Perl, for example:

  if ("Bruce-Robert Fenn Pocock" =~ /^(\w+)(.*)$/) { print "First: $1    Rest: $2" }

  → First: Bruce    Rest: -Robert Fenn Pocock

Perhaps just group all the leading non-space characters, then skip the whitespace that follows:

  if ("Bruce-Robert Fenn Pocock" =~ /^(\S+)\s*(.*)$/) { print "First: $1    Rest: $2" }

  → First: Bruce-Robert    Rest: Fenn Pocock
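The same comparison can be sketched in Python, whose `re` module accepts both patterns unchanged. The helper name `split_name` is made up for illustration; it is not part of the original answer.

```python
import re

# Pattern from the answer: first run of non-space characters, then the rest.
NAME_SPLIT = re.compile(r"^(\S+)\s*(.*)$")


def split_name(full_name):
    """Return (first_token, rest), or None if the input does not start with a non-space run."""
    match = NAME_SPLIT.match(full_name)
    if match is None:
        return None
    return match.group(1), match.group(2)


# The \w version from the question stops at the hyphen.
print(re.match(r"^(\w+)(.*)$", "Bruce-Robert Fenn Pocock").groups())
# ('Bruce', '-Robert Fenn Pocock')

print(split_name("Bruce-Robert Fenn Pocock"))
# ('Bruce-Robert', 'Fenn Pocock')

print(split_name("Madonna"))
# ('Madonna', '')  -- a single-token name leaves the second group empty
```

Note that the second group can be empty, so callers probably want to check for that rather than assume a surname is always present.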

Of course, if you run across people with middle names in your dataset, there's no way to tell them apart from matronym-patronym pairs or multi-part last names.

I hope/assume you don't have honorifics in your input, either.

First: Don         Rest: Juan de la Mancha
     *** wrong: Don is honorific
First: Diego       Rest: de la Vega
First: John        Rest: Jacob Smith
     *** wrong: Jacob is probably a middle name
First: De'shawna   Rest: Cummings
First: Wernher     Rest: von Braun
First: Oscar       Rest: Vazquez-Oliverez

Ultimately, the only way to accurately break down a name into an honorific, given name, middle name(s), surnames (matronym, patronym), and suffix(es), is to ask.

(E.g., in my own name, the "Fenn" is considered a "middle name" in Anglo circles; in Latino circles, it's interpreted as a matronym.)

Honorifics and suffixes can often be guessed at from a list, but the list is long (military titles and doctoral suffixes, e.g. "Dr John Doe, Pharm.D", "Maj. Gen. Thomas Ts'o") and not unambiguous (e.g. "Don" is both a short form of "Donald" and an honorific).
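A rough sketch of that list-based guessing, in Python, might look like the following. The set `KNOWN_HONORIFICS` and the function `strip_leading_honorifics` are hypothetical, the list is deliberately tiny, and "Don Smith" is a made-up input chosen to show the ambiguity described above.

```python
# Hypothetical, deliberately short list; a realistic one would be much longer.
KNOWN_HONORIFICS = {"dr", "mr", "mrs", "ms", "maj", "gen", "don"}


def strip_leading_honorifics(full_name):
    """Split off leading tokens found in KNOWN_HONORIFICS; returns (titles, remainder)."""
    tokens = full_name.split()
    titles = []
    # Always leave at least one token as the name itself.
    while len(tokens) > 1 and tokens[0].rstrip(".").lower() in KNOWN_HONORIFICS:
        titles.append(tokens.pop(0))
    return titles, " ".join(tokens)


print(strip_leading_honorifics("Maj. Gen. Thomas Ts'o"))
# (['Maj.', 'Gen.'], "Thomas Ts'o")

print(strip_leading_honorifics("Don Smith"))
# (['Don'], 'Smith')  -- a guess only: "Don" may well be the given name here
```

The second call shows why this can only ever be a guess: nothing in the string says whether "Don" is a title or a first name.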

PS. Lovely article here:

http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/

OTHER TIPS

Assuming you only want the first name in group 1 and the rest of the name in group 2:

^(\b[\w]+\b)([\w\W]+)

Assuming there is only a single space between words, this works:

(\w+) ([\w ]+)


If multiple spaces are a possibility:

(\w+) +([\w ]+)

To eliminate the spaces at the ends:

\b(\w+)\b \b([\w ]+)\b
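To see how these variants differ, here is a small Python check. The sample strings are made up; the patterns are the ones quoted above, used with `re.search` because they are not anchored.

```python
import re

# " +" tolerates several spaces between the groups, but trailing spaces stay in group 2.
loose = re.search(r"(\w+) +([\w ]+)", "John   Jacob Smith  ")
print(loose.groups())
# ('John', 'Jacob Smith  ')

# Word boundaries trim the ends when exactly one space separates the first two words.
tight = re.search(r"\b(\w+)\b \b([\w ]+)\b", "  John Jacob Smith  ")
print(tight.groups())
# ('John', 'Jacob Smith')

# Caveat: with several spaces after the first word, the unanchored \b version
# skips ahead and matches a later pair of words instead.
shifted = re.search(r"\b(\w+)\b \b([\w ]+)\b", "John   Jacob Smith")
print(shifted.groups())
# ('Jacob', 'Smith')
```

The last case suggests either anchoring the pattern with `^` or normalizing whitespace before matching, depending on how clean the input is.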

To allow dashes and apostrophes, as mentioned by @BRPocock (the dash goes last in each character class so that it is taken literally rather than as a range):

\b([\w'-]+)\b \b([\w '-]+)\b

While this forbids punctuation at the ends, it allows multiple dashes and apostrophes, including next to each other, such as: Mc'er'''doo--dl-e
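One detail is worth checking in patterns like this: inside a character class, a dash between two characters is normally read as a range. The Python sketch below therefore assumes the dash is written last in each class, and shows what a class such as `[ -']` would match instead. The sample names come from earlier in this thread, plus "Smith" added as a made-up surname.

```python
import re

# Assumption: dash placed last in each class so it is literal, not a range.
NAME_WITH_PUNCT = re.compile(r"\b([\w'-]+)\b \b([\w '-]+)\b")

for text in ["De'shawna Cummings", "Oscar Vazquez-Oliverez", "Mc'er'''doo--dl-e Smith"]:
    found = NAME_WITH_PUNCT.search(text)
    print(found.groups() if found else None)
# ("De'shawna", 'Cummings')
# ('Oscar', 'Vazquez-Oliverez')
# ("Mc'er'''doo--dl-e", 'Smith')

# By contrast, " -'" inside a class is the range from space to apostrophe:
# it accepts characters such as "#" but not the dash itself.
print(bool(re.fullmatch(r"[ -']", "#")))  # True
print(bool(re.fullmatch(r"[ -']", "-")))  # False
```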

Making it more robust than this can become a project in itself.

Licensed under: CC-BY-SA with attribution