Question

I'm trying to write a regex that will match everything BUT an apostrophe that has not been escaped. Consider the following:

<?php $s = 'Hi everyone, we\'re ready now.'; ?>

My goal is to write a regular expression that will essentially match the string portion of that. I'm thinking of something such as

/.*'([^']).*/

in order to match a simple string, but I've been trying to figure out how to get a negative lookbehind working on that apostrophe to ensure that it is not preceded by a backslash...

Any ideas?

- JMT

Solution

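<?php
// A lone backslash, kept in a variable so the heredoc below needs no escaping.
$backslash = '\\';

// After interpolation the pattern reads: #(["'])(?:\\?+.)*?\1#
//   (["'])       opening quote, captured as group 1
//   (?:\\?+.)*?  lazily consume characters; a backslash is always consumed
//                together with the character it escapes (possessive ?+)
//   \1           the same quote character that opened the string
$pattern = <<< PATTERN
#(["'])(?:{$backslash}{$backslash}?+.)*?{$backslash}1#
PATTERN;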

foreach(array(
    "<?php \$s = 'Hi everyone, we\\'re ready now.'; ?>",
    '<?php $s = "Hi everyone, we\\"re ready now."; ?>',
    "xyz'a\\'bc\\d'123",
    "x = 'My string ends with a backslash\\\\';"
    ) as $subject) {
        preg_match($pattern, $subject, $matches);
        echo $subject , ' => ', $matches[0], "\n\n";
}

prints

<?php $s = 'Hi everyone, we\'re ready now.'; ?> => 'Hi everyone, we\'re ready now.'

<?php $s = "Hi everyone, we\"re ready now."; ?> => "Hi everyone, we\"re ready now."

xyz'a\'bc\d'123 => 'a\'bc\d'

x = 'My string ends with a backslash\\'; => 'My string ends with a backslash\\'
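For anyone who wants to try the same idea outside PHP, here is a rough JavaScript sketch. It is an adaptation, not the pattern above: JavaScript has no possessive quantifiers, so I am assuming that two mutually exclusive alternatives (an escape pair, or a single non-backslash character) are an acceptable stand-in. The names QUOTED and firstQuoted are made up for the example.

```javascript
// Sketch only: QUOTED and firstQuoted are hypothetical names, not from the answer.
// JavaScript has no possessive quantifiers, so instead of \\?+. the two
// alternatives are made mutually exclusive:
//   \\.     a backslash together with the character it escapes
//   [^\\]   any single character that is not a backslash
const QUOTED = /(["'])(?:\\.|[^\\])*?\1/;

// Returns the first quoted string (quotes included), or null if there is none.
function firstQuoted(subject) {
  const m = QUOTED.exec(subject);
  return m ? m[0] : null;
}

const subjects = [
  "<?php $s = 'Hi everyone, we\\'re ready now.'; ?>",
  '<?php $s = "Hi everyone, we\\"re ready now."; ?>',
  "xyz'a\\'bc\\d'123",
  "x = 'My string ends with a backslash\\\\';",
];

for (const subject of subjects) {
  console.log(subject, "=>", firstQuoted(subject));
}
```

Because a backslash can only ever be consumed as part of an escape pair, an unterminated string such as 'abc\' should yield no match rather than a wrong one.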

OTHER TIPS

Here's my solution with test cases:

/.*?'((?:\\\\|\\'|[^'])*+)'/

And my proof (in Perl, though I don't think I use any Perl-specific features):

use strict;
use warnings;

my %tests = ();
$tests{'Case 1'} = <<'EOF';
$var = 'My string';
EOF

$tests{'Case 2'} = <<'EOF';
$var = 'My string has it\'s challenges';
EOF

$tests{'Case 3'} = <<'EOF';
$var = 'My string ends with a backslash\\';
EOF

foreach my $key (sort (keys %tests)) {
    print "$key...\n";
    if ($tests{$key} =~ m/.*?'((?:\\\\|\\'|[^'])*+)'/) {
        print " ... '$1'\n";
    } else {
        print " ... NO MATCH\n";
    }
}

Running this shows:

$ perl a.pl
Case 1...
 ... 'My string'
Case 2...
 ... 'My string has it\'s challenges'
Case 3...
 ... 'My string ends with a backslash\\'

Note that the wildcard at the start needs to be non-greedy, so that the match begins at the first apostrophe. The possessive quantifier (*+) then gobbles up \\ and \' pairs, and anything else that is not a standalone quote character, without backtracking.
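To see why the leading wildcard has to be lazy, here is a small JavaScript sketch comparing the two. Assumptions: JavaScript does not support possessive quantifiers, so the trailing + is dropped (for well-formed input like the test cases this should not change the result), and LAZY, GREEDY and capture are names invented for the example.

```javascript
// Sketch only: the patterns are the one above minus the possessive "+",
// because JavaScript has no possessive quantifiers.
const LAZY = /.*?'((?:\\\\|\\'|[^'])*)'/;   // leading .*? stops at the first apostrophe
const GREEDY = /.*'((?:\\\\|\\'|[^'])*)'/;  // leading .* runs on to the last string

// Returns the captured string body (group 1), or null when nothing matches.
function capture(re, line) {
  const m = re.exec(line);
  return m ? m[1] : null;
}

const line = "$a = 'one'; $b = 'two';";
console.log(capture(LAZY, line));   // body of the first string on the line
console.log(capture(GREEDY, line)); // body of the last string on the line
```

With a greedy prefix the engine backtracks from the end of the line, so it finds the last quoted string first, which is usually not what you want.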

I think this one probably mimics how the compiler itself scans a string literal, which should make it pretty bullet-proof.

/.*'([^'\\]|\\.)*'.*/

The parenthesized portion matches either a single character that is neither an apostrophe nor a backslash, or a backslash followed by any character. If only certain characters can be escaped, change the \\. to \\['\\a-z], or whatever set applies.
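A quick JavaScript sketch of both variants, in case it helps. Assumptions on my part: the leading and trailing .* are dropped (exec searches the whole line anyway), the repeated group is wrapped in an outer capture so the whole body is returned, and ANY_ESCAPE, LIMITED_ESCAPE and body are hypothetical names.

```javascript
// Sketch only: names are hypothetical; the alternation is the one described above.
// Any character may follow a backslash:
const ANY_ESCAPE = /'((?:[^'\\]|\\.)*)'/;
// Only ', \ and lowercase letters may follow a backslash:
const LIMITED_ESCAPE = /'((?:[^'\\]|\\['\\a-z])*)'/;

// Returns the text between the apostrophes, or null when nothing matches.
function body(re, src) {
  const m = re.exec(src);
  return m ? m[1] : null;
}

console.log(body(ANY_ESCAPE, "x = 'it\\'s';"));     // escaped apostrophe stays inside the body
console.log(body(ANY_ESCAPE, "x = 'end\\\\';"));    // escaped backslash, then the closing quote
console.log(body(LIMITED_ESCAPE, "x = 'a\\9b';"));  // \9 is not an allowed escape here
```

Since the two alternatives can never match the same character, there is only one way to read any given string, which is what keeps this form predictable.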

Via negative look behind:

/
.*?'              #Match until '
(
 .*?              #Lazy match & capture of everything after the first apostrophe
)    
(?<!(?<!\\)\\)'   #Match first apostrophe that isn't preceded by \, but accept \\
.*                #Match remaining text
/x

In C#:

Regex reg = new Regex("(?<!\\\\)'(?<string>.*?)(?<!\\\\)'");
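To compare the single lookbehind with the nested one, here is a JavaScript sketch. It assumes an engine with lookbehind support (recent Node.js and current browsers), and it only checks the closing apostrophe; SIMPLE, NESTED and captureBody are names made up for the example.

```javascript
// Sketch only: assumes a JavaScript engine that supports lookbehind assertions.
// Closing apostrophe must not be preceded by a backslash:
const SIMPLE = /'(.*?)(?<!\\)'/;
// Closing apostrophe must not be preceded by a lone backslash (\\ is accepted):
const NESTED = /'(.*?)(?<!(?<!\\)\\)'/;

// Returns the text between the apostrophes, or null when nothing matches.
function captureBody(re, src) {
  const m = re.exec(src);
  return m ? m[1] : null;
}

const escapedQuote = "'we\\'re ready'";               // text: 'we\'re ready'
const trailingBackslash = "'ends with a backslash\\\\'"; // text ends with \\'

console.log(captureBody(SIMPLE, escapedQuote));
console.log(captureBody(NESTED, escapedQuote));
console.log(captureBody(SIMPLE, trailingBackslash)); // null: the real closing quote is rejected
console.log(captureBody(NESTED, trailingBackslash));

// Limitation: the nested form only looks two characters back, so with three
// backslashes (an escaped backslash followed by an escaped quote) it stops early.
console.log(captureBody(NESTED, "'a\\\\\\'b'"));
```

The last case is why lookbehind is a shaky fit here: whether a quote is escaped depends on whether the run of backslashes before it is odd or even, and a fixed lookbehind cannot count.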

This is for JavaScript:

/('|")(?:\\\\|\\\1|[\s\S])*?\1/

It:

  • matches single or double quoted strings
  • matches empty strings (length 0)
  • matches strings with embedded whitespace (\n, \t, etc.)
  • skips inner escaped quotes (single or double)
  • skips single quotes within double quotes and vice versa

Only the first quote is captured. You can capture the unquoted string in $2 with:

/('|")((?:\\\\|\\\1|[\s\S])*?)\1/
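A short usage sketch of the capturing version, for reference. STRING_LITERAL and unquote are hypothetical names; the regex itself is the one just above.

```javascript
// Sketch only: STRING_LITERAL and unquote are hypothetical names.
// Group 1 is the opening quote, group 2 the text between the quotes.
const STRING_LITERAL = /('|")((?:\\\\|\\\1|[\s\S])*?)\1/;

// Returns { quote, body } for the first string literal in src, or null.
function unquote(src) {
  const m = STRING_LITERAL.exec(src);
  return m ? { quote: m[1], body: m[2] } : null;
}

console.log(unquote("var s = 'we\\'re ready';"));   // escaped single quote stays in the body
console.log(unquote('var s = "it\'s fine";'));      // single quote inside double quotes
console.log(unquote("var s = '';"));                // empty string
console.log(unquote("plain text"));                 // no string literal at all
```

Escaped characters come back exactly as written, backslash included, so any unescaping would be a separate step.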

Licensed under: CC-BY-SA with attribution