PHP Regex delimiter

Question

The first thing to correct: if your regular expression and/or input data is encoded in UTF-8 (which it is here, since it comes straight from a UTF-8 encoded file), you must use the u modifier on your regular expression.

The second issue is that the copyright character should not be used as a delimiter in UTF-8, because the PCRE functions treat the first byte of your pattern, not the first character, as the delimiter (this could plausibly be called a bug in PHP).

When you attempt to use the copyright sign as a delimiter in UTF-8, what actually gets saved into the file is the byte sequence 0xC2 0xA9. preg_match looks at the first byte 0xC2 and decides that it is an alphanumeric character because in your current locale that byte corresponds to the character Latin capital letter A with circumflex Â (see extended ASCII table). Therefore a warning is generated and processing is immediately aborted.
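To make the byte-level problem concrete, here is a minimal sketch. It spells the copyright sign out as explicit bytes so that it does not depend on how this snippet itself is saved. The variable names are only illustrative, and the exact warning text can differ with locale and PHP version, which is why the sketch suppresses it and checks only the return value.

```php
<?php
// U+00A9 (copyright sign) encoded as UTF-8 is two bytes, not one.
$delimiter = "\xC2\xA9";
echo bin2hex($delimiter), "\n"; // c2a9
echo strlen($delimiter), "\n";  // 2

// Same pattern as in the question, with the copyright sign as delimiter.
$pattern = $delimiter . '<.*?>' . $delimiter . 'u';

// preg_match only looks at the first byte (0xC2) when picking the delimiter,
// so the pattern is rejected. @ hides the warning so the result can be inspected.
$result = @preg_match($pattern, '<something string>');
var_dump($result); // bool(false)
```

A return value of false (as opposed to 0) means the call failed, not merely that nothing matched.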

Given these facts, the ideal solution is to choose an unusual delimiter from within the ASCII character set, because such a character encodes to the same single byte both in single-byte encodings and in UTF-8.

I would not consider printable ASCII characters unusual enough for this purpose, so a good choice would be one of the control characters (ASCII codes 1 to 31), excluding the whitespace ones such as tab and newline, since a delimiter must not be whitespace. For example, STX (\x02) would fit the bill.

Together with the u regex modifier, this means you should write the regex like this (note the double quotes, which PHP needs in order to expand the \x02 escape):

$result = preg_match("\x02<.*?>\x02u", '<something string>');
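For completeness, a slightly fuller sketch of the same call follows. It assumes a subject string that contains a literal copyright sign (written as explicit UTF-8 bytes) and captures the match; the variable names and the sample subject are made up for illustration.

```php
<?php
// STX (\x02) as delimiter, plus the u modifier for UTF-8 patterns and input.
$pattern = "\x02<.*?>\x02u";

// Hypothetical subject: valid UTF-8 that includes the copyright sign itself.
$subject = "Copyright \xC2\xA9 <something string> trailing text";

$matches = [];
$result = preg_match($pattern, $subject, $matches);

if ($result === 1) {
    echo $matches[0], "\n"; // <something string>
} elseif ($result === 0) {
    echo "No match\n";
} else {
    // false: the call itself failed; preg_last_error() returns the PCRE error code.
    echo "preg_match failed with error code ", preg_last_error(), "\n";
}
```

Because the delimiter is a control character, the copyright sign, slashes, and any other printable character can appear in the pattern or the subject without escaping.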