Question

I'm trying to split a UTF-8 string on a quote character (") with delimiter capture, except where that quote is followed by a second quote (""), so that, for example,

"A ""B"" C" & "D ""E"" F"

will split into three elements

"A ""B"" C"
&
"D ""E"" F"

I've been attempting to use:

$string = '"A ""B"" C" & "D ""E"" F"';
$temp = preg_split(
    '/"[^"]/mui',
    $string,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

but without success as it gives me

array(7) {
  [0]=>
  string(2) " ""
  [1]=>
  string(1) """
  [2]=>
  string(1) "C"
  [3]=>
  string(2) "& "
  [4]=>
  string(2) " ""
  [5]=>
  string(1) """
  [6]=>
  string(2) "F""
}

So it's losing any character that immediately follows a quote, unless that character is also a quote.

In this example the first and last characters of the string are quotes, though that may not always be the case, e.g.

{ "A ""B"" C" & "D ""E"" F" }

needs to split into five elements

{
"A ""B"" C"
&
"D ""E"" F"
}

Can anybody help me get this working?


Solution

Since you said you don't mind the quotes being consumed by the split, you can use this expression:

(?<!")\s?"\s?(?!")

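It uses two negative lookarounds: the lookbehind (?<!") rejects a quote that is preceded by another quote, and the lookahead (?!") rejects a quote that is followed by one. The output for your second sample will be: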

{
A ""B"" C
&
D ""E"" F
}

The \s? on each side of the quote consumes one adjacent whitespace character, if present; remove them if you want to keep the spaces.
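For reference, here is a minimal sketch of how that expression might be plugged into preg_split. The function name splitOnLoneQuotes is hypothetical and only for illustration; the /u modifier is added on the assumption that the input is valid UTF-8, as the question states.

```php
<?php
// Hypothetical helper: split on quotes that are neither preceded
// nor followed by another quote. The delimiter quotes are consumed.
function splitOnLoneQuotes(string $input): array
{
    // (?<!")  the quote must not be preceded by a quote
    // \s?     optionally swallow one whitespace character on each side
    // (?!")   the quote must not be followed by a quote
    $pattern = '/(?<!")\s?"\s?(?!")/u';

    // -1 means "no limit"; PREG_SPLIT_NO_EMPTY drops the empty pieces
    // produced when the string starts or ends with a delimiter.
    return preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitOnLoneQuotes('{ "A ""B"" C" & "D ""E"" F" }'));
```

With the second sample from the question this should give the five pieces {, A ""B"" C, &, D ""E"" F and }. PREG_SPLIT_DELIM_CAPTURE is left out because the pattern has no capturing group, so the flag would have no effect.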

OTHER TIPS

I think it would probably be easier to use preg_match_all:

preg_match_all('/"([^"]|"")+"|[^"]+/', $string, $matches);

The regular expression matches either a quoted string or a run of text containing no quotes. If the last part doesn’t have a closing quote, the unmatched quote character is skipped and the text after it is matched as unquoted text; that might need changing, depending on your situation.
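As a rough sketch of how this could be used when the surrounding quotes should be kept, the snippet below wraps the same pattern in a function. The name tokenizeQuoted and the trimming step are assumptions added for illustration; the pattern itself is the one from the answer.

```php
<?php
// Hypothetical helper: return quoted strings (quotes included) and the
// unquoted text between them as separate tokens.
function tokenizeQuoted(string $input): array
{
    // First alternative: a quoted string in which "" stands for a literal quote.
    // Second alternative: a run of characters that contains no quote.
    preg_match_all('/"([^"]|"")+"|[^"]+/', $input, $matches);

    // $matches[0] holds the full matches. Unquoted runs keep their
    // surrounding spaces (e.g. " & "), so trim every token and drop
    // any that are empty afterwards.
    $tokens = array_map('trim', $matches[0]);
    return array_values(array_filter($tokens, fn($t) => $t !== ''));
}

print_r(tokenizeQuoted('{ "A ""B"" C" & "D ""E"" F" }'));
```

For the second sample this should yield {, "A ""B"" C", &, "D ""E"" F" and }, which is the five-element result the question asks for. Note that an empty quoted string ("") is not matched by this pattern because of the + quantifier.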

Licensed under: CC-BY-SA with attribution