Update
So \G is initially in a matched state at position 0, which means that in multi-line mode the beginning of string (BOS) has to be a special case.
Even though BOS is also a beginning of line (BOL), if the assertion (?= ^ .* [a-z] ) fails there, \G is still set as matched at position 0 (by default?), and UC (upper-case) words on that first line are found without being validated.
(?|(?=\A.*[a-z]).*?\b([A-Z]+)\b|(?!\A)(?:(?=^.*[a-z])|\G.*?\b([A-Z]+)\b))
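To make the BOS problem concrete, here is a minimal sketch that runs the alternation `(?=^.*[a-z])|\G` with and without a `(?!\A)` guard in front of `\G`. The two-line sample string and the variable names are made up for illustration; the first line deliberately has no lower-case letter, so nothing on it should be reported.

```perl
use strict;
use warnings;

# Hypothetical sample: line 1 has no lower-case letter, line 2 does.
my $str = "DA EFR\nab ABC\n";

# No guard: pos($str) is undefined before the first match, so \G
# succeeds at offset 0 and line 1 is scanned without being validated.
my @unguarded = $str =~ /(?:(?=^.*[a-z])|\G).*?\b([A-Z]+)\b/mg;

# Guarded: (?!\A) stops the \G branch from starting a match at BOS,
# so only the lookahead branch can open a line.
my @guarded = $str =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg;

print "unguarded: @unguarded\n";   # DA EFR ABC
print "guarded:   @guarded\n";     # ABC
```

The unguarded pattern picks up DA and EFR from the first line even though that line fails the lower-case check, which is the behaviour described above.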
Update 2 (posted for posterity)
After some discussion with @Robin, the above regex can be refactored to this:
 # (?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b

 (?:
      (?= ^ .* [a-z] )     # BOL, check if line has lower case letter
   |                       # or
      (?! \A )             # Not at BOS (beginning of string, where \G is in a matched state)
      \G                   # Start the match at the end of last match (if previous matched state)
 )
 .*? \b
 ( [A-Z]+ )                # (1), Found UC word
 \b
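The expanded layout above can be used as-is under Perl's /x modifier, which ignores unescaped whitespace and `#` comments inside the pattern. The following is only a sketch to check that the expanded and compact forms agree; the sample text and variable names are assumptions, not part of the original test.

```perl
use strict;
use warnings;

# Expanded form: /x allows whitespace and comments, /m lets ^ match at each line start.
my $expanded = qr/
    (?:
        (?= ^ .* [a-z] )   # BOL, line contains a lower case letter
      |
        (?! \A ) \G        # continue after the previous match, never at BOS
    )
    .*? \b ( [A-Z]+ ) \b   # (1) UC word
/mx;

# Compact form of the same pattern.
my $compact = qr/(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/m;

# Hypothetical sample input; the first line has no lower-case letter.
my $text = "DA EFR\nab ABC 12 CD\nABC DE t\n";

my @from_expanded = $text =~ /$expanded/g;
my @from_compact  = $text =~ /$compact/g;

print "expanded: @from_expanded\n";
print "compact:  @from_compact\n";
```

Both lists should come out identical, with nothing taken from the first line.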
Perl test case:
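use strict;
use warnings;

local $/ = undef;    # slurp mode: read everything after __DATA__ at once
my $str = <DATA>;

# List context: //g returns every capture group 1 in a single pass
my @ary = $str =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg;
print "@ary", "\n-------------\n";

# Scalar context: //g advances pos($str) one match per iteration
while ($str =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg)
{
    print "$1 ";
}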
__DATA__
DA EFR
ab ABC
ab ABC 12 CD
ABC DE t
ABC 23 DE EFG a
Output >>
ABC ABC CD ABC DE ABC DE EFG
-------------
ABC ABC CD ABC DE ABC DE EFG