Question

I have a method that checks whether a string contains special Unicode characters. I found a regular expression for this at http://w3.org/International/questions/qa-forms-utf-8.html, but it does not work for strings longer than 307 characters. Here is some sample code:

$regex = '%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs';

$matches = null;
if (preg_match($regex,
    '..........<p>
...............<a href="#tag"></a><br>
...............<a href="#tag"></a><br>
...............<a href="#tag"></a><br>
...............<a href="#tag"></a><br>
...............<a href="#tag"></a><br>
...............<a href="#tag"></a><br>
...............<a href="#tag"></a><br>
.........</p>', $matches)) {
    echo 'IN';
} else {
    echo 'OUT!';
}

I've used dots instead of spaces just for this test. When I run the script, I get no response from the server (not even an error, although I have set all errors to be displayed). However, if I remove just one character from the subject string, it works as expected (IN is echoed). I can't seem to find anything online that helps with this. Debugging doesn't help either, because the debugging session simply stops at the if statement.

Here are my pcre settings from php.ini: pcre.backtrack_limit: 1000000, pcre.recursion_limit: 100000

I've tried online tools for checking regex and none produces this error (it works just fine).

Anyone? Thanks.


Solution

The cause seems to be that the configured backtracking limit exceeds what the server can actually handle, which is why you get no error message.

You can limit the backtracking by using:

$regex = '%^(?>
      [\x09\x0A\x0D\x20-\x7E]++                # ASCII
    | (?>[\xC2-\xDF][\x80-\xBF])++             # non-overlong 2-byte
    | (?>\xE0[\xA0-\xBF][\x80-\xBF])++         # excluding overlongs
    | (?>[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2})++  # straight 3-byte
    | (?>\xED[\x80-\x9F][\x80-\xBF])++         # excluding surrogates
    | (?>\xF0[\x90-\xBF][\x80-\xBF]{2})++      # planes 1-3
    | (?>[\xF1-\xF3][\x80-\xBF]{3})++          # planes 4-15
    | (?>\xF4[\x80-\x8F][\x80-\xBF]{2})++      # plane 16
    )*+$%xs';
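As a minimal sketch of how this pattern might be used, the snippet below wraps it in a helper. The name isValidUtf8 and the sample strings are assumptions for illustration, not part of the original answer. It also checks for a false return value, since preg_match() returns false (rather than 0) when a PCRE limit is reached.

```php
<?php
// Hypothetical helper: the function name is an assumption for illustration.
function isValidUtf8(string $text): bool
{
    $regex = '%^(?>
          [\x09\x0A\x0D\x20-\x7E]++                # ASCII
        | (?>[\xC2-\xDF][\x80-\xBF])++             # non-overlong 2-byte
        | (?>\xE0[\xA0-\xBF][\x80-\xBF])++         # excluding overlongs
        | (?>[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2})++  # straight 3-byte
        | (?>\xED[\x80-\x9F][\x80-\xBF])++         # excluding surrogates
        | (?>\xF0[\x90-\xBF][\x80-\xBF]{2})++      # planes 1-3
        | (?>[\xF1-\xF3][\x80-\xBF]{3})++          # planes 4-15
        | (?>\xF4[\x80-\x8F][\x80-\xBF]{2})++      # plane 16
        )*+$%xs';

    $result = preg_match($regex, $text);
    if ($result === false) {
        // false means PCRE gave up (for example, a limit was hit), not "no match".
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $result === 1;
}

var_dump(isValidUtf8(str_repeat('.', 5000))); // long ASCII input
var_dump(isValidUtf8("caf\xC3\xA9"));         // valid 2-byte sequence
var_dump(isValidUtf8("\xC3\x28"));            // lead byte without a continuation byte
```

The first two calls should report true and the last one false, because \x28 is not a valid continuation byte.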

About backtracking, atomic groups and possessive quantifiers:

Backtracking is the mechanism the regex engine uses, when a subpattern fails at some position in the string, to go back and explore other possible matches from an earlier position.

Let's consider the string aaabcccb and the pattern ^.+cb$:

 string     |  pattern   |   state
------------+------------+--------------------------
 aaabcccb   |  ^.+cb$    |   BEGIN
 aaabcccb   |  ^.+cb$    |   OK
 aaabcccb   |  ^.+cb$    |   FAIL
 aaabcccb   |  ^.+cb$    |   BACKTRACK
 aaabcccb   |  ^.+cb$    |   FAIL
 aaabcccb   |  ^.+cb$    |   BACKTRACK
 aaabcccb   |  ^.+cb$    |   OK
 aaabcccb   |  ^.+cb$    |   OK
 aaabcccb   |  ^.+cb$    |   OK, SUCCEED
------------+------------+--------------------------

This is the default behavior of the regex engine: the subpattern with the greedy quantifier, .+, takes as much as possible (the whole string in this example), and the engine must then go back character by character until the subpattern cb can succeed. A greedy quantifier allows this behavior and may give characters back.

You can forbid backtracking using a possessive quantifier. Example with ^.++cb$:

 string     |  pattern   |   state
------------+------------+--------------------------
 aaabcccb   |  ^.++cb$   |   BEGIN
 aaabcccb   |  ^.++cb$   |   OK
 aaabcccb   |  ^.++cb$   |   FAIL
 aaabcccb   |  ^.++cb$   |   NO MATCH
------------+------------+--------------------------

The regex engine can't backtrack into the substring matched by .++, so the whole pattern fails immediately because c is not found.
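The two traces above can be checked with a short sketch, using the same string and patterns as in the tables (only the / delimiters are added):

```php
<?php
$subject = 'aaabcccb';

// Greedy: .+ first takes the whole string, then gives characters back
// until "cb" can match at the end.
var_dump(preg_match('/^.+cb$/', $subject));  // int(1)

// Possessive: .++ keeps everything it matched, so "c" is never found.
var_dump(preg_match('/^.++cb$/', $subject)); // int(0)
```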

An atomic group defines a subpattern into which the regex engine is not allowed to backtrack. In other words, possessive quantifiers and atomic groups are two notations for the same feature: (?>a+) <=> a++

Note: Keep in mind, however, that the regex engine can always backtrack inside an atomic group as long as the group is not closed: ^(?>.+c)b$ will succeed with the previous string, but ^(?>.+)cb$ will fail.

Once an atomic group is closed, or when you use a possessive quantifier, the matched substring is an atom in the etymological sense (i.e. something that can't be divided). However, the regex engine can still backtrack atom by atom. For example, ^(?>ab)+abc$ will match abababc, whereas ^(?>ab)++abc$ (or ^(?>(?>ab)+)abc$) will fail.
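The atomic-group examples above can be verified with a small sketch like this one (the patterns and strings are the ones from the text; the loop and output format are just for illustration):

```php
<?php
// Each pattern is paired with the subject string used in the text.
$cases = [
    '/^(?>.+c)b$/'       => 'aaabcccb', // backtracking happens inside the open group: match
    '/^(?>.+)cb$/'       => 'aaabcccb', // the group is closed after taking everything: no match
    '/^(?>ab)+abc$/'     => 'abababc',  // the engine gives back one whole "ab" atom: match
    '/^(?>ab)++abc$/'    => 'abababc',  // possessive repetition gives nothing back: no match
    '/^(?>(?>ab)+)abc$/' => 'abababc',  // same as the previous one, written with atomic groups
];

foreach ($cases as $pattern => $subject) {
    printf("%s on %s => %d\n", $pattern, $subject, preg_match($pattern, $subject));
}
```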

One of the main advantages of atomic groups and possessive quantifiers (that is, of forbidding backtracking) is that they reduce the number of steps needed for a pattern to succeed or fail.

Improvements:

Since possessive quantifiers and atomic groups are used everywhere, each substring is matched once and for all, and when a character doesn't fit any of these groups, the pattern fails immediately.

Another improvement is to add a quantifier to each element of the alternation. An example with the string: zzzzzzzzzzzzza

with the pattern: (?:a|b|c|...x|y|z)+

The regex engine must try each branch of the alternation until it finds the right letter, and it does so for every letter (13 x 26 = 338 tests to consume all the z's).

with the pattern: (?>a+|b+|c+|...x+|y+|z+)+

The regex engine needs only 26 tests: once it reaches z+, it consumes all the z's at once.
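The arithmetic can be illustrated with a rough model. This is only a simulation of a naive engine trying the branches a to z in order; it is not a measurement of PCRE itself, which applies its own optimizations. The function name countTests is an assumption for illustration.

```php
<?php
// Rough model: count how many branches a naive engine tries while
// consuming the leading run of z's in the subject.
function countTests(string $subject, bool $quantifiedBranches): int
{
    $letters = range('a', 'z');
    $tests = 0;
    $pos = 0;
    $length = strlen($subject);

    while ($pos < $length && $subject[$pos] === 'z') {
        // Try a, b, c, ... until the branch for the current character is found.
        foreach ($letters as $letter) {
            $tests++;
            if ($letter === $subject[$pos]) {
                break;
            }
        }
        // a+|b+|...|z+ : the matching branch consumes the whole run at once.
        // a|b|...|z    : the matching branch consumes a single character.
        $pos += $quantifiedBranches ? strspn($subject, 'z', $pos) : 1;
    }

    return $tests;
}

$subject = str_repeat('z', 13) . 'a';
echo countTests($subject, false), "\n"; // 338 (13 x 26)
echo countTests($subject, true), "\n";  // 26
```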

Licensed under: CC-BY-SA with attribution