Question

Is there a way to extract the parts of a name from a string, using a regular expression or other logic?

I would like to split names on spaces, but when a part of the name has a prefix (such as "bin" or "do"), the prefix should stay attached to the part that follows it, e.g.

Osama bin Laden bin Mohammed => Osama, bin Laden, bin Mohammed
Jorge do Pinto da Silva => Jorge, do Pinto, da Silva
John Andrew Smith => John, Andrew, Smith
José Mário dos Santos Mourinho Félix => José, Mário, dos Santos, Mourinho, Félix

Working code based on Tim's suggestion:

$str = 'Manuel D\'Souza do Pinto bin Laden Al-saud el Mecca de la Vere Na Sokakah van Der Reidejin del Monte du Pont ter Johannes';
preg_match_all( '~\b(von der|van de|van den|del la|de la|van der|vande|vanden|vander|st|der|des|dela|della|bin|dos|ur|ibn|bint|da|do|le|la|del|du|de|di|el|al|van|von|ter|na|del|san|los)\s+[^\s]+\b|\b[^\s]+~i', $str, $mat );
print_r( $mat );

Result:

Array
(
[0] => Array
    (
        [0] => Manuel
        [1] => D'Souza
        [2] => do Pinto
        [3] => bin Laden
        [4] => Al-saud
        [5] => el Mecca
        [6] => de la Vere
        [7] => Na Sokakah
        [8] => van Der Reidejin
        [9] => del Monte
        [10] => du Pont
        [11] => ter Johannes
    )

[1] => Array
    (
        [0] => 
        [1] => 
        [2] => do
        [3] => bin
        [4] => 
        [5] => el
        [6] => de la
        [7] => Na
        [8] => van Der
        [9] => del
        [10] => du
        [11] => ter
    )

)

Solution

Keeping in mind all those falsehoods programmers believe about names, you could still try:

\b\p{Lu}\p{Ll}*|\b\p{Ll}+\s+\p{Lu}\p{Ll}*

This matches either a capitalized word (a name) or a lowercase prefix followed by a capitalized word.

See it live on regex101.com.

Explanation:

\b      # Start of word
\p{Lu}  # One uppercase letter
\p{Ll}* # Any number of lowercase letters
|       # or
\b      # Start of word
\p{Ll}+ # One or more lowercase letters
\s+     # Whitespace
\p{Lu}  # One uppercase letter
\p{Ll}* # Any number of lowercase letters
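
As a rough sketch of how the pattern above could be used from PHP: the /u modifier is needed so that \p{Lu} and \p{Ll} work on UTF-8 input. The function name splitNameByCase() is only an illustrative assumption, not something from the answer.

```php
<?php
// Sketch only: splitNameByCase() is a hypothetical helper name.
// It returns every match of the pattern above, in order of appearance.
function splitNameByCase(string $name): array
{
    // /u = treat pattern and subject as UTF-8, so \p{Lu}/\p{Ll} see "é", "á", etc.
    $pattern = '/\b\p{Lu}\p{Ll}*|\b\p{Ll}+\s+\p{Lu}\p{Ll}*/u';
    preg_match_all($pattern, $name, $matches);
    return $matches[0];
}

print_r(splitNameByCase('Osama bin Laden bin Mohammed'));
print_r(splitNameByCase('José Mário dos Santos Mourinho Félix'));
```

Because the pattern allows only one lowercase prefix and no punctuation inside a word, multi-word prefixes lose their first word ("de la Vere" yields just "la Vere") and "D'Souza" is matched as two separate parts.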

Other suggestions

Since the question is about splitting, here is a regex that should work:

$re = '/\b(?<!-)(?>\p{Ll}+|\p{L}{1,3}) +(*SKIP)(*FAIL)| +/u';
$str = 'Manuel D\'Souza do Pinto bin Laden Al-saud el Mecca de la Vere Na Sokakah van Der Reidejin del Monte du Pont ter Johannes';
print_r( preg_split($re, $str) );

OUTPUT:

Array
(
    [0] => Manuel
    [1] => D'Souza
    [2] => do Pinto
    [3] => bin Laden
    [4] => Al-saud
    [5] => el Mecca
    [6] => de la Vere
    [7] => Na Sokakah
    [8] => van Der Reidejin
    [9] => del Monte
    [10] => du Pont
    [11] => ter Johannes
)
  • (*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
  • (*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
  • (*SKIP)(*FAIL) together work around the restriction that a variable-length lookbehind cannot be used in the above regex.
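
To see the effect of the verbs on shorter inputs, the same preg_split() call can be wrapped in a small function. This is only a sketch; splitNameParts() is a hypothetical name chosen for illustration.

```php
<?php
// Sketch only: splitNameParts() is a hypothetical wrapper around the
// preg_split() call shown above; the regex itself is unchanged.
function splitNameParts(string $name): array
{
    // First alternative: a lowercase word, or any word of 1-3 letters,
    // followed by spaces. (*SKIP)(*FAIL) discards it, so those spaces
    // are never used as a split point. Second alternative: split on spaces.
    $re = '/\b(?<!-)(?>\p{Ll}+|\p{L}{1,3}) +(*SKIP)(*FAIL)| +/u';
    return preg_split($re, $name);
}

print_r(splitNameParts('Jorge do Pinto da Silva'));
print_r(splitNameParts('John Andrew Smith'));
print_r(splitNameParts('Ann Lee'));
```

One consequence of the \p{L}{1,3} branch is that any word of three letters or fewer is treated as a prefix, so a short given name such as "Ann" stays attached to the word after it.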

You can use this regex:

[a-z]+\s[A-Z][a-z]+|[A-Z][a-z]+

This matches the name parts directly, so there is no need to split.

It looks for either a lowercase word followed by a whitespace character and a capitalized name, or a capitalized name on its own.

Note that it fails on accented letters, because [a-z] and [A-Z] only cover the unaccented English alphabet.
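
A minimal PHP sketch of this matching approach, assuming preg_match_all() is used to collect the parts; matchAsciiNameParts() is a hypothetical name, not part of the answer.

```php
<?php
// Sketch only: matchAsciiNameParts() is a hypothetical helper name.
// It collects every match of the ASCII-only pattern above.
function matchAsciiNameParts(string $name): array
{
    preg_match_all('/[a-z]+\s[A-Z][a-z]+|[A-Z][a-z]+/', $name, $matches);
    return $matches[0];
}

// Plain ASCII input is matched part by part.
print_r(matchAsciiNameParts('Jorge do Pinto da Silva'));

// Accented input is not matched as whole words, because "é" and "á"
// fall outside [a-z].
print_r(matchAsciiNameParts('José Mário'));
```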

Demo

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow