Question

I stumbled upon this seemingly trivial question, and I'm stuck on it. I have a string in which I want to match, in one regex, all uppercase words, but only if there is at least one lowercase letter somewhere in the string.

Basically, I want each of these lines to produce the result shown (assume the regex is applied to each line separately, so no multiline handling is needed):

ab ABC          //matches or captures ABC
ab ABC 12 CD    //matches or captures ABC, CD
ABC DE          //matches or captures nothing (no lowercase)
ABC 23 DE EFG a //matches or captures ABC, DE, EFG
AB aF DE        //matches or captures AB, DE

I am using PCRE as the regex flavor (I know some other flavors allow variable-length look-behind).

Update after comments

Obviously, there are lots of easy solutions if I use multiple regexes or the programming language I'm calling the regex from (e.g. first validate the string by looking for a lowercase letter, then match all uppercase words, using two different regexes).

My goal here is to find a way to do it with one regex.

I have no technical imperative for this constraint. Take it as an exercise of style if you have to, or curiosity, or me trying to up my regex skills: the task seemed (at first) so simple that I'd like to know if one regex alone can achieve it. If it can't, I'd like to understand why.

Or, if it can but regexes aren't designed for this kind of task, I'd like to know why, or at least what "this kind of unsuited task" looks like, so that I can choose the right solution when I meet one.


So, is it doable in one regex?


Solution

Update
It turns out that \G starts out in a matched state at position 0.
That means that in multi-line mode, BOS (beginning of string) has to be handled as a special case.
Even though the beginning of the string is also the beginning of a line, if the assertion (?= ^ .* [a-z] ) fails there,
\G still counts as matched (by default?) and UC (uppercase) words are found without being validated.

(?|(?=\A.*[a-z]).*?\b([A-Z]+)\b|(?!\A)(?:(?=^.*[a-z])|\G.*?\b([A-Z]+)\b))

Update 2 Posted for posterity.
After some discussion with @Robin, the above regex can be refactored to this:

 #  (?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b

 (?:
      (?= ^ .* [a-z] )        # BOL, check if line has lower case letter
   |                        # or
      (?! \A )                # Not at BOS (beginning of string, where \G is in a matched state)
      \G                      # Start the match at the end of last match (if previous matched state)
 )
 .*? \b 
 ( [A-Z]+ )              # (1), Found UC word
 \b     

Perl test case:

$/ = undef;

$str = <DATA>;

@ary = $str =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg;

print "@ary", "\n-------------\n";

while ($str =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg)
{
   print "$1 ";
}

__DATA__
DA EFR
ab ABC
ab ABC 12 CD
ABC DE  t
ABC 23 DE EFG a

Output >>

ABC ABC CD ABC DE ABC DE EFG
-------------
ABC ABC CD ABC DE ABC DE EFG
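
The test case above slurps the whole input and relies on /m. Since the question applies the regex to each line separately, here is a minimal sketch of that usage with the same refactored pattern. The helper name uc_words_per_line is made up for illustration and is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): apply the refactored regex
# to a single line and return every captured uppercase word.
sub uc_words_per_line {
    my ($line) = @_;
    # Same pattern as in the test case above; in list context with /g,
    # the match returns all values of capture group 1.
    my @found = $line =~ /(?:(?=^.*[a-z])|(?!\A)\G).*?\b([A-Z]+)\b/mg;
    return @found;
}

# The five sample lines from the question.
for my $line ('ab ABC', 'ab ABC 12 CD', 'ABC DE', 'ABC 23 DE EFG a', 'AB aF DE') {
    my @found = uc_words_per_line($line);
    printf "%-16s => %s\n", $line, @found ? join(', ', @found) : '(nothing)';
}
```

Under these assumptions, the loop should report the same results the question lists, including nothing for the all-uppercase line "ABC DE".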

OTHER TIPS

Silly questions deserve silly answers.

/(?{ @matches = m{\b\p{Lu}+\b}g if m{\p{Ll}} })/;

Test:

use strict;
use warnings;
use feature qw( say );

while (<DATA>) {
   chomp;

   local our @matches;
   /(?{ @matches = m{\b\p{Lu}+\b}g if m{\p{Ll}} })/;

   say "$_: ", join ', ', @matches;
}

__DATA__
ab ABC
ab ABC 12 CD
ABC DE
ABC 23 DE EFG a

And now for the silly answer I promised:

my @matches = /
   \G
   (?: (?! ^ )
   |   (?= .* \p{Ll} )
   )
   .*? ( \b \p{Lu}+ \b )
/sg;

which condenses to

my @matches = /\G(?:(?!^)|(?=.*\p{Ll})).*?(\b\p{Lu}+\b)/sg;

At the start of the string, it looks ahead for a lower-case. Anywhere else, there's no need to check since we already checked.
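
To make the role of \G more visible, the following sketch walks through the matches one at a time and records where each one ended. It reuses the condensed pattern above unchanged; the helper name trace_matches and the "WORD@offset" output format are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): run the condensed pattern in
# scalar context and record each captured word with the offset where the
# match ended, i.e. where \G anchors the next attempt.
sub trace_matches {
    my ($line) = @_;
    my @trace;
    while ($line =~ /\G(?:(?!^)|(?=.*\p{Ll})).*?(\b\p{Lu}+\b)/sg) {
        push @trace, "$1\@" . pos($line);
    }
    return @trace;
}

print join(' ', trace_matches('ABC 23 DE EFG a')), "\n";   # ABC@3 DE@9 EFG@13

# No lower-case letter: the look-ahead fails at offset 0, and \G prevents
# the engine from retrying at any later offset, so nothing is captured.
my @none = trace_matches('ABC DE');
print scalar(@none), "\n";                                  # 0
```

Only the first attempt, at offset 0, pays for the look-ahead; every later attempt starts exactly where the previous match stopped and goes through the (?!^) branch.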

I'm not sure if that can be done, but here is some background information, explaining the "Why?" part a bit.

Regexes were designed to match regular languages, and originally, that's all they could do. In fact, regular grammars are among the simplest that aren't completely trivial; most modern computer languages use non-regular grammars, for instance. (See, especially, this section.)

So, there is a limit to what kind of languages a regex can describe, and it is far more limited than what you can describe with some simple English sentences, for example.

The Chomsky hierarchy is a way to classify languages into different levels of expressiveness. Note that regular grammars (Type-3) are all the way at the bottom, and most useful (programming) languages are either Type-2, or borderline Type-2 (i.e. with a few Type-1 parts added in). This is due to a simple fact: our brains are quite capable of processing context-sensitive (Type-1) grammars, even complex ones (so we want programming languages to be powerful). However, computer parsers for context-sensitive grammars are quite a bit slower than those for Type-2 (so we want programming languages to be limited in power!).

For regexes, which are expected to match very quickly, it's even more important to limit their overall expressiveness. But, by writing two or more regexes with some control-structure added, you are effectively expanding them to be more powerful than a regular expression parser.
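
As a rough sketch of "two regexes with some control structure", here is the two-step approach the question mentions: one simple regex validates the line, a second one collects the words. The helper name uc_words_two_step is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): two plain regexes glued
# together with ordinary control flow instead of one combined pattern.
sub uc_words_two_step {
    my ($line) = @_;

    # Step 1: bail out unless the line has at least one lowercase letter.
    return () unless $line =~ /[a-z]/;

    # Step 2: collect every run of uppercase letters that forms a whole word.
    my @found = $line =~ /\b[A-Z]+\b/g;
    return @found;
}

print join(', ', uc_words_two_step('ABC 23 DE EFG a')), "\n";   # ABC, DE, EFG
print join(', ', uc_words_two_step('ABC DE')), "\n";            # (empty line)
```

Neither regex needs look-arounds or \G here; the "only if" condition lives in the surrounding code rather than in the pattern.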

Maybe we're overthinking things:

#! /usr/bin/env perl
#
use strict;
use feature qw(say);
use autodie;
use warnings;
use Data::Dumper;

while ( my $string = <DATA> ) {
    chomp $string;
    my @array;
    say qq(String: "$string");
    if ( @array = $string =~ /(\b[A-Z]+\b)/g ) {
        say qq(String groups: ) . join( ", ", @array ) . "\n";
    }
}

__DATA__
ab ABC
ab ABC 12 CD
ABC DE
ABC 23 DE EFG a
AB aF DE
ADSD asd ADSD
asd ADSDSD
SDSD SDD SD
SSDD SDS asds

The output:

String: "ab ABC"
String groups: ABC

String: "ab ABC 12 CD"
String groups: ABC, CD

String: "ABC DE"
String groups: ABC, DE

String: "ABC 23 DE EFG a"
String groups: ABC, DE, EFG

String: "AB aF DE"
String groups: AB, DE

String: "ADSD asd ADSD"
String groups: ADSD, ADSD

String: "asd ADSDSD"
String groups: ADSDSD

String: "SDSD SDD SD"
String groups: SDSD, SDD, SD

String: "SSDD SDS asds"
String groups: SSDD, SDS

Did I miss something?

One regex:

@words = split (/[a-z]+/, $_);
Licensed under: CC-BY-SA with attribution