Merging regular expressions in PHP

https://stackoverflow.com/questions/244959

05-07-2019

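Question

Suppose I have the following two strings containing regular expressions. How do I coalesce them? More specifically, I want to have the two expressions as alternatives.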

$a = '# /[a-z] #i';
$b = '/ Moo /x';
$c = preg_magic_coalesce('|', $a, $b);
// Desired result should be equivalent to:
// '/ \/[a-zA-Z] |Moo/'

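Of course, doing this with string operations isn't practical, because it would involve parsing the expressions, constructing syntax trees, coalescing the trees and then outputting another regular expression equivalent to the tree. I would be completely happy without this last step. Unfortunately, PHP doesn't have a RegExp class (or does it?).

Is there any way to achieve this? Incidentally, does any other language offer a way? Isn't this a pretty normal scenario? I guess not. :-(

Alternatively, is there a way to check efficiently whether either of the two expressions matches, and which one matches earlier (and, if they match at the same position, which match is longer)? This is what I'm doing at the moment. Unfortunately, I do this on long strings, very often, for more than two patterns. The result is slow (and yes, this is definitely the bottleneck).

Edit:

I should have been more specific, sorry. $a and $b are variables; their content is outside of my control! Otherwise, I would just coalesce them manually. Therefore, I can't make any assumptions about the delimiters or regex modifiers used. Notice, for example, that my first expression uses the i modifier (ignore case) while the second uses x (extended syntax). Therefore, I can't just concatenate the two, because the second expression does not ignore case and the first doesn't use the extended syntax (so any whitespace in it is significant!).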

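Solution

I see that porneL actually described a bunch of this, but this handles most of the problem. It cancels modifiers set in previous sub-expressions (which the other answer missed) and sets the modifiers specified in each sub-expression. It also handles non-slash delimiters (I could not find a specification of which characters are allowed here, so I used ., which you may want to narrow further).

One weakness is that it doesn't handle back-references within expressions. My biggest concern with that is the limitations of back-references themselves. I'll leave that as an exercise to the reader/questioner.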

// Pass as many expressions as you'd like
function preg_magic_coalesce() {
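    $active_modifiers = array();
    // Defined up front so the count() check below never sees an unset variable.
    $cancel_modifiers = array();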

    $expression = '/(?:';
    $sub_expressions = array();
    foreach(func_get_args() as $arg) {
        // Determine modifiers from sub-expression
        if(preg_match('/^(.)(.*)\1([eimsuxADJSUX]+)$/', $arg, $matches)) {
            $modifiers = preg_split('//', $matches[3]);
            if($modifiers[0] == '') {
                array_shift($modifiers);
            }
            if($modifiers[(count($modifiers) - 1)] == '') {
                array_pop($modifiers);
            }

            $cancel_modifiers = $active_modifiers;
            foreach($cancel_modifiers as $key => $modifier) {
                if(in_array($modifier, $modifiers)) {
                    unset($cancel_modifiers[$key]);
                }
            }
            $active_modifiers = $modifiers;
        } elseif(preg_match('/(.)(.*)\1$/', $arg)) {
            $cancel_modifiers = $active_modifiers;
            $active_modifiers = array();
        }

        // If expression has modifiers, include them in sub-expression
        $sub_modifier = '(?';
        $sub_modifier .= implode('', $active_modifiers);

        // Cancel modifiers from preceding sub-expression
        if(count($cancel_modifiers) > 0) {
            // A single hyphen introduces all unset options, e.g. (?i-xs).
            $sub_modifier .= '-' . implode('', $cancel_modifiers);
        }

        $sub_modifier .= ')';

        $sub_expression = preg_replace('/^(.)(.*)\1[eimsuxADJSUX]*$/', $sub_modifier . '$2', $arg);

        // Properly escape slashes
        $sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);

        $sub_expressions[] = $sub_expression;
    }

    // Join expressions
    $expression .= implode('|', $sub_expressions);

    $expression .= ')/';
    return $expression;
}

Edit: I've rewritten this (because I'm OCD) and ended up with:

function preg_magic_coalesce($expressions = array(), $global_modifier = '') {
    if(!preg_match('/^((?:-?[eimsuxADJSUX])+)$/', $global_modifier)) {
        $global_modifier = '';
    }

    $expression = '/(?:';
    $sub_expressions = array();
    foreach($expressions as $sub_expression) {
        $active_modifiers = array();
        // Determine modifiers from sub-expression
        if(preg_match('/^(.)(.*)\1((?:-?[eimsuxADJSUX])+)$/', $sub_expression, $matches)) {
            $active_modifiers = preg_split('/(-?[eimsuxADJSUX])/',
                $matches[3], -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
        }

        // If expression has modifiers, include them in sub-expression
        if(count($active_modifiers) > 0) {
            $replacement = '(?';
            $replacement .= implode('', $active_modifiers);
            $replacement .= ':$2)';
        } else {
            $replacement = '$2';
        }

        $sub_expression = preg_replace('/^(.)(.*)\1(?:(?:-?[eimsuxADJSUX])*)$/',
            $replacement, $sub_expression);

        // Properly escape slashes if another delimiter was used
        $sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);

        $sub_expressions[] = $sub_expression;
    }

    // Join expressions
    $expression .= implode('|', $sub_expressions);

    $expression .= ')/' . $global_modifier;
    return $expression;
}

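It now uses (?modifiers:sub-expression) rather than (?modifiers)sub-expression|(?cancel-modifiers)sub-expression, but I've noticed that both have some odd modifier side effects. For instance, in both cases a sub-expression with the /u modifier will fail to match (but if you pass 'u' as the second argument of the new function, it matches just fine).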
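The difference between the two forms can be illustrated with a small sketch. It is not part of the original answer; it only relies on the documented PCRE rule that an inline option set inside a group stays active for the later branches of that same group, while (?flags:...) limits the option to its own parentheses. The variable names are made up for illustration.

```php
<?php
// (?i) is set in the first branch but, per the PCRE documentation, it still
// applies to the later branch "b" of the same group.
$leaky = preg_match('/(?:(?i)a|b)/', 'B');

// Scoped form: the option ends together with its own group.
$scoped = preg_match('/(?i:a)|b/', 'B');

// Wrapping each sub-expression in its own group has the same effect,
// so no explicit cancelling is needed.
$wrapped = preg_match('/(?:(?i)a)|b/', 'B');

var_dump($leaky, $scoped, $wrapped);
```

This is why a version that wraps every sub-expression in its own group does not need to cancel the modifiers of the preceding sub-expression.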

Other tips

Strip the delimiters and flags from each expression. This regex should do it:
```
/^(.)(.*)\1([imsxeADSUXJu]*)$/
```
Join the expressions together. You'll need non-capturing parentheses to inject the flags:
```
"(?$flags1:$regexp1)|(?$flags2:$regexp2)"
```
If there are any back-references, count the capturing parentheses and update the back-references accordingly (e.g. properly concatenated, /(.)x\1/ and /(.)y\1/ become /(.)x\1|(.)y\2/).
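The first two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the answerer's code: split_pattern and alternate are hypothetical names, only the flags i, m, s and x are accepted (flags such as u or e cannot be injected with (?flags:...)), the delimiter must be the same single character on both ends, and back-references are not renumbered. It needs PHP 7.4 or later for the arrow function.

```php
<?php
/**
 * Hypothetical helper: splits a delimited pattern into [body, flags].
 * Assumes a single-character delimiter and only the flags i, m, s, x.
 */
function split_pattern(string $pattern): array
{
    if (preg_match('/^(.)(.*)\1([imsx]*)$/s', $pattern, $m) !== 1) {
        throw new InvalidArgumentException("Unsupported pattern: $pattern");
    }
    return [$m[2], $m[3]];
}

/** Hypothetical helper: joins patterns as alternatives, each in (?flags:...). */
function alternate(array $patterns): string
{
    $parts = [];
    foreach ($patterns as $pattern) {
        [$body, $flags] = split_pattern($pattern);
        // "~" becomes the new delimiter: escape bare "~" characters and
        // skip over existing escape pairs so they are left untouched.
        $body = preg_replace_callback(
            '/\\\\.|~/s',
            fn(array $m) => $m[0] === '~' ? '\\~' : $m[0],
            $body
        );
        $parts[] = "(?$flags:$body)";
    }
    return '~' . implode('|', $parts) . '~';
}

$c = alternate(['# /[a-z] #i', '/ Moo /x']);
echo $c, "\n";
```

With the two patterns from the question, each sub-expression keeps its own modifier: the first alternative ignores case and treats its spaces literally, the second is case-sensitive and ignores whitespace.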

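EDIT

I've rewritten the code! It now contains the changes listed below. In addition, I ran extensive tests (which I won't post here because there are too many) to look for errors. So far, I haven't found any.

The function is now split into two parts: there's a separate function preg_strip which takes a regular expression and returns an array containing the bare expression (without delimiters) and an array of modifiers. This might come in handy (it already has, in fact; that is why I made this change).
The code now handles back-references correctly. This was necessary for my purpose after all. It wasn't difficult to add, but the regular expression used to capture the back-references looks weird (and may actually be extremely inefficient; it looks NP-hard to me, but that's only an intuition and only applies in weird edge cases). By the way, does anyone know a better way of checking for an odd number of matches than mine? Negative lookbehinds won't work here because they only accept fixed-length strings rather than regular expressions. However, I need the regex here to test whether the preceding backslash is itself escaped.

Also, I don't know how good PHP is at caching the anonymous function created with create_function. Performance-wise, this might not be the best solution, but it seems good enough.
I've fixed a bug in the sanity check.
I've removed the cancellation of obsolete modifiers, since my tests show that it isn't necessary.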
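On the question of checking for an odd number of backslashes, one possible alternative (a sketch under assumptions, not the code used in this answer) is to let the pattern consume every escape pair from left to right. Any backslash the scan then meets directly before a digit is unescaped by construction, so no lookbehind is needed. shift_backrefs is a hypothetical name, and the sketch assumes that a backslash followed by a number starting with 1-9 is always a back-reference, which is not true inside character classes or for octal escapes.

```php
<?php
/**
 * Hypothetical helper: adds $offset to every numbered back-reference
 * (\1, \2, ...) in $expr and leaves all other escape sequences untouched.
 */
function shift_backrefs(string $expr, int $offset): string
{
    return preg_replace_callback(
        // Either a backslash plus a number starting with 1-9,
        // or any other escape pair (which is consumed and kept as-is).
        '/\\\\([1-9]\d*)|\\\\./s',
        function (array $m) use ($offset) {
            if (isset($m[1]) && $m[1] !== '') {
                return '\\' . ((int)$m[1] + $offset);
            }
            return $m[0];
        },
        $expr
    );
}

echo shift_backrefs('(.)y\1', 1), "\n";
```

Because every escape pair is consumed in a single pass, an escaped backslash followed by a digit is skipped, and the scan does not backtrack over runs of backslashes.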

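By the way, this code is at the core of a syntax highlighter for various languages that I'm working on in PHP, since I'm not satisfied with the alternatives listed elsewhere.

Thanks!

porneL, eyelidlessness, amazing work! Many, many thanks. I had actually given up.

I've built upon your solution and I'd like to share it here. ~~I didn't implement renumbering of back-references since this isn't relevant in my case (I think…). Perhaps this will become necessary later, though.~~

Some questions…

One thing, @eyelidlessness: why do you feel the need to cancel old modifiers? As far as I can see, this isn't necessary, since the modifiers are only applied locally anyway. Ah yes, one other thing: your escaping of the delimiter seems overly complicated. Care to explain why you think it's needed? I believe my version should work as well, but I could be wrong.

Also, I've changed the signature of your function to match my needs. I also think that my version is more generally useful. Again, I might be wrong.

BTW, you should now realize the importance of real names on SO. ;-) I can't give you real credit in the code. :-/

Code

Anyway, I'd like to share my result so far, because I can't believe that nobody else ever needs something like this. The code seems to work very well. ~~Extensive tests are yet to be done, though.~~ Please comment!

And without further ado…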

/**
 * Merges several regular expressions into one, using the indicated 'glue'.
 *
 * This function takes care of individual modifiers so it's safe to use
 * <em>different</em> modifiers on the individual expressions. The order of
 * sub-matches is preserved as well. Numbered back-references are adapted to
 * the new overall sub-match count. This means that it's safe to use numbered
 * back-references in the individual expressions!
 * If {@link $names} is given, the individual expressions are captured in
 * named sub-matches using the contents of that array as names.
 * Matching pair-delimiters (e.g. <code>"{…}"</code>) are currently
 * <strong>not</strong> supported.
 *
 * The function assumes that all regular expressions are well-formed.
 * Behaviour is undefined if they aren't.
 *
 * This function was created after a {@link https://stackoverflow.com/questions/244959/
 * StackOverflow discussion}. Much of it was written or thought of by
 * “porneL” and “eyelidlessness”. Many thanks to both of them.
 *
 * @param string $glue  A string to insert between the individual expressions.
 *      This should usually be either the empty string, indicating
 *      concatenation, or the pipe (<code>|</code>), indicating alternation.
 *      Notice that this string might have to be escaped since it is treated
 *      like a normal character in a regular expression (i.e. <code>/</code>)
 *      will end the expression and result in an invalid output.
 * @param array $expressions    The expressions to merge. The expressions may
 *      have arbitrary different delimiters and modifiers.
 * @param array $names  Optional. This is either an empty array or an array of
 *      strings of the same length as {@link $expressions}. In that case,
 *      the strings of this array are used to create named sub-matches for the
 *      expressions.
 * @return string A string representing a regular expression equivalent to the
 *      merged expressions. Returns <code>FALSE</code> if an error occurred.
 */
function preg_merge($glue, array $expressions, array $names = array()) {
    // … then, a miracle occurs.

    // Sanity check …

    $use_names = ($names !== null and count($names) !== 0);

    if (
        $use_names and count($names) !== count($expressions) or
        !is_string($glue)
    )
        return false;

    $result = array();
    // For keeping track of the names for sub-matches.
    $names_count = 0;
    // For keeping track of *all* captures to re-adjust backreferences.
    $capture_count = 0;

    foreach ($expressions as $expression) {
        if ($use_names)
            $name = str_replace(' ', '_', $names[$names_count++]);

        // Get delimiters and modifiers:

        $stripped = preg_strip($expression);

        if ($stripped === false)
            return false;

        list($sub_expr, $modifiers) = $stripped;

        // Re-adjust backreferences:

        // We assume that the expression is correct and therefore don't check
        // for matching parentheses.

        $number_of_captures = preg_match_all('/\([^?]|\(\?[^:]/', $sub_expr, $_);

        if ($number_of_captures === false)
            return false;

        if ($number_of_captures > 0) {
            // NB: This looks NP-hard. Consider replacing.
            $backref_expr = '/
                (                # Only match when not escaped:
                    [^\\\\]      # guarantee an even number of backslashes
                    (\\\\*?)\\2  # (twice n, preceded by something else).
                )
                \\\\ (\d)        # Backslash followed by a digit.
            /x';
            $sub_expr = preg_replace_callback(
                $backref_expr,
                // Closure in place of create_function(), which PHP 8.0 removed.
                function ($m) use ($capture_count) {
                    return $m[1] . '\\' . ((int)$m[3] + $capture_count);
                },
                $sub_expr
            );
            $capture_count += $number_of_captures;
        }

        // Last, construct the new sub-match:

        $modifiers = implode('', $modifiers);
        $sub_modifiers = "(?$modifiers)";
        if ($sub_modifiers === '(?)')
            $sub_modifiers = '';

        $sub_name = $use_names ? "?<$name>" : '?:';
        $new_expr = "($sub_name$sub_modifiers$sub_expr)";
        $result[] = $new_expr;
    }

    return '/' . implode($glue, $result) . '/';
}

/**
 * Strips a regular expression string off its delimiters and modifiers.
 * Additionally, normalize the delimiters (i.e. reformat the pattern so that
 * it could have used '/' as delimiter).
 *
 * @param string $expression The regular expression string to strip.
 * @return array An array whose first entry is the expression itself, the
 *      second an array of modifiers. If the argument is not a valid regular
 *      expression, returns <code>FALSE</code>.
 *
 */
function preg_strip($expression) {
    if (preg_match('/^(.)(.*)\\1([imsxeADSUXJu]*)$/s', $expression, $matches) !== 1)
        return false;

    $delim = $matches[1];
    $sub_expr = $matches[2];
    if ($delim !== '/') {
        // Replace occurrences of the escaped delimiter with its unescaped
        // version and escape the new delimiter.
        $sub_expr = str_replace("\\$delim", $delim, $sub_expr);
        $sub_expr = str_replace('/', '\\/', $sub_expr);
    }
    $modifiers = $matches[3] === '' ? array() : str_split(trim($matches[3]));

    return array($sub_expr, $modifiers);
}

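P.S. I've made this post community wiki, so it is editable. You know what this means…!

I'm pretty sure it's not possible to just put regexps together like that in any language: they could have incompatible modifiers.

I'd probably just put them in an array and loop through them, or combine them by hand.

Edit: If you're running them one at a time as described in your edit, you may be able to run the second one on a substring (from the start up to the earliest match). That might help.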

function preg_magic_coalesce($split, $re1, $re2) {
  $re1 = rtrim($re1, "\/#is");
  $re2 = ltrim($re2, "\/#");
  return $re1.$split.$re2;
}

You could do it the alternative way, like this:

$a = '# /[a-z] #i';
$b = '/ Moo /x';

// PREG_OFFSET_CAPTURE stores each match as array(text, byte offset).
$a_matched = preg_match($a, $text, $a_matches, PREG_OFFSET_CAPTURE);
$b_matched = preg_match($b, $text, $b_matches, PREG_OFFSET_CAPTURE);

if ($a_matched && $b_matched) {
    // Index 0 is the whole match; neither pattern has a capture group.
    list($a_text, $a_pos) = $a_matches[0];
    list($b_text, $b_pos) = $b_matches[0];

    if ($a_pos == $b_pos) {
        if (strlen($a_text) == strlen($b_text)) {
            // $a and $b matched the exact same string
        } else if (strlen($a_text) > strlen($b_text)) {
            // $a and $b started matching at the same spot but $a is longer
        } else {
            // $a and $b started matching at the same spot but $b is longer
        }
    } else if ($a_pos < $b_pos) {
        // $a matched first
    } else {
        // $b matched first
    }
} else if ($a_matched) {
    // $a matched, $b didn't
} else if ($b_matched) {
    // $b matched, $a didn't
} else {
    // neither one matched
}
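The comparison above generalizes to any number of patterns. The following is a minimal sketch, assuming each pattern is a complete delimited regex; earliest_match is a hypothetical name, and the function still runs every pattern once, so it tidies up the logic without addressing the performance concern from the question.

```php
<?php
/**
 * Hypothetical helper: returns [index, offset, text] for the pattern that
 * matches earliest in $text (ties go to the longest match), or null if no
 * pattern matches.
 */
function earliest_match(array $patterns, string $text): ?array
{
    $best = null;
    foreach ($patterns as $index => $pattern) {
        if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) !== 1) {
            continue; // no match, or the pattern failed to compile
        }
        [$matched, $offset] = $m[0];
        $better = $best === null
            || $offset < $best[1]
            || ($offset === $best[1] && strlen($matched) > strlen($best[2]));
        if ($better) {
            $best = [$index, $offset, $matched];
        }
    }
    return $best;
}

$result = earliest_match(['# /[a-z] #i', '/ Moo /x'], 'say Moo, then /Q !');
var_dump($result);
```

Offsets and lengths here are byte counts, as reported by PREG_OFFSET_CAPTURE and strlen.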

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow