Regular Expression to Extract Surnames With Prefixes

Question 1

Keeping in mind all those falsehoods programmers believe about names, you could still try:

\b\p{Lu}\p{Ll}*|\b\p{Ll}+\s+\p{Lu}\p{Ll}*

This matches a capitalized word (a name), or a lowercase prefix followed by whitespace and a capitalized word.

See it live on regex101.com.

Explanation:

\b      # Start of word
\p{Lu}  # One uppercase letter
\p{Ll}* # Any number of lowercase letters
|       # or
\b      # Start of word
\p{Ll}+ # One or more lowercase letters
\s+     # One or more whitespace characters
\p{Lu}  # One uppercase letter
\p{Ll}* # Any number of lowercase letters
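
As a rough illustration of the pattern above, here is a minimal PHP sketch that applies it with preg_match_all. The sample strings are made up for this example, and the /u modifier is added on the assumption that the input is UTF-8.

```php
<?php
// Pattern from the answer, wrapped in delimiters with the /u (UTF-8) modifier.
$re = '/\b\p{Lu}\p{Ll}*|\b\p{Ll}+\s+\p{Lu}\p{Ll}*/u';

// Hypothetical sample input: one lowercase prefix before a capitalized name.
preg_match_all($re, 'Pinto van Gogh Monte', $m);
print_r($m[0]); // Pinto, van Gogh, Monte

// Only one prefix word is captured: "de" is dropped here.
preg_match_all($re, 'de la Vere', $m2);
print_r($m2[0]); // la Vere
```

The second call shows a limit of this pattern: with two lowercase prefixes in a row, only the one directly before the capitalized word ends up in the match.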

Question 2

Since the question is about splitting, here is one regex that should work:

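// Alternative 1 matches a lowercase word, or any word of up to three letters, plus its
// trailing spaces, then discards it via (*SKIP)(*FAIL).
// Alternative 2 matches the remaining runs of spaces, which become the split points.
$re = '/\b(?<!-)(?>\p{Ll}+|\p{L}{1,3}) +(*SKIP)(*FAIL)| +/u';
$str = 'Manuel D\'Souza do Pinto bin Laden Al-saud el Mecca de la Vere Na Sokakah van Der Reidejin del Monte du Pont ter Johannes';
print_r( preg_split($re, $str) );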

OUTPUT:

Array
(
    [0] => Manuel
    [1] => D'Souza
    [2] => do Pinto
    [3] => bin Laden
    [4] => Al-saud
    [5] => el Mecca
    [6] => de la Vere
    [7] => Na Sokakah
    [8] => van Der Reidejin
    [9] => del Monte
    [10] => du Pont
    [11] => ter Johannes
)

(*FAIL) behaves like a failing negative assertion and is a synonym for (?!).
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later; the next match attempt then starts at that point.
(*SKIP)(*FAIL) together provide a workaround for the restriction that you cannot use a variable-length lookbehind in the above regex.
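
To make the effect of (*SKIP)(*FAIL) more concrete, the following sketch compares a plain split on spaces with the split regex from this answer. The shorter input string is an assumption chosen for illustration; it reuses names from the example above.

```php
<?php
$re  = '/\b(?<!-)(?>\p{Ll}+|\p{L}{1,3}) +(*SKIP)(*FAIL)| +/u';
$str = 'Manuel de la Vere Al-saud el Mecca';

// Plain split: every run of spaces is a split point, so prefixes are torn off.
$naive = preg_split('/ +/', $str);
print_r($naive); // Manuel, de, la, Vere, Al-saud, el, Mecca

// With (*SKIP)(*FAIL): spaces right after a prefix word are consumed and rejected,
// so they never reach the second alternative and are not used as split points.
$names = preg_split($re, $str);
print_r($names); // Manuel, de la Vere, Al-saud, el Mecca
```

The (?<!-) lookbehind is what keeps "saud" in "Al-saud" from being treated as a lowercase prefix of the following word.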

Question 3

You can use this regex:

[a-z]+\s[A-Z][a-z]+|[A-Z][a-z]+

The above matches the names directly, so you don't need to split at all.

It looks for a lowercase word followed by a whitespace character and a capitalized name, or for a capitalized name on its own.

Note that it only handles unaccented ASCII letters, so it will fail on names containing accented or other non-English characters.
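
A small sketch of this matching approach in PHP, including the accent limitation. The sample names are assumptions for illustration, and the "\u{00E9}" escape (PHP 7 or later) is used only to make the accented letter independent of the file encoding.

```php
<?php
// Pattern from the answer; no /u modifier, so it works on bytes and ASCII ranges only.
$re = '/[a-z]+\s[A-Z][a-z]+|[A-Z][a-z]+/';

// A lowercase prefix stays attached to the capitalized name that follows it.
preg_match_all($re, 'van Gogh Pinto', $ok);
print_r($ok[0]); // van Gogh, Pinto

// An accented letter ends the match early: only "Jos" is matched out of "José".
preg_match_all($re, "Jos\u{00E9}", $accented);
print_r($accented[0]); // Jos
```

If accented names matter, Unicode property classes such as \p{Lu} and \p{Ll} with the /u modifier are one way to widen the character ranges.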
