Merging regular expressions in PHP

https://stackoverflow.com/questions/244959

05-07-2019

Question

Suppose I have the following two strings containing regular expressions. How do I coalesce them? More specifically, I want to have the two expressions as alternatives.

$a = '# /[a-z] #i';
$b = '/ Moo /x';
$c = preg_magic_coalesce('|', $a, $b);
// Desired result should be equivalent to:
// '/ \/[a-zA-Z] |Moo/'

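Of course, doing this with string operations isn't practical, because it would involve parsing the expressions, constructing syntax trees, coalescing the trees and then outputting another regular expression equivalent to the tree. I'd be completely happy without this last step. Unfortunately, PHP doesn't have a RegExp class (or does it?).

Is there any way to achieve this? Incidentally, does any other language offer a way? Isn't this a pretty normal scenario? Guess not. :-(

Alternatively, is there a way to check efficiently whether either of the two expressions matches, and which one matches earlier (and, if they match at the same position, which match is longer)? This is what I'm doing at the moment. Unfortunately, I do this on long strings, very often, for more than two patterns. The result is slow (and yes, this is definitely the bottleneck).

Edit:

I should have been more specific, sorry. $a and $b are variables; their content is outside my control! Otherwise, I would just coalesce them manually. Therefore, I can't make any assumptions about the delimiters or regex modifiers used. Notice, for example, that my first expression uses the i modifier (ignore casing) while the second uses x (extended syntax). Therefore, I can't just concatenate the two, because the second expression does not ignore casing and the first doesn't use the extended syntax (and any whitespace in there is significant!).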

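Solution

I see that porneL actually described a bunch of this, but this handles most of the issues. It cancels modifiers set in previous sub-expressions (which the other answer missed) and sets modifiers as specified in each sub-expression. It also handles non-slash delimiters (I could not find a specification of which characters are allowed here, so I used ., you may want to narrow it further).

One weakness is that it doesn't handle back-references within expressions. My biggest concern with that is the limitations of back-references themselves. I'll leave that as an exercise for the reader/questioner.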

// Pass as many expressions as you'd like
function preg_magic_coalesce() {
    $active_modifiers = array();

    $expression = '/(?:';
    $sub_expressions = array();
    foreach(func_get_args() as $arg) {
        // Determine modifiers from sub-expression
        if(preg_match('/^(.)(.*)\1([eimsuxADJSUX]+)$/', $arg, $matches)) {
            $modifiers = preg_split('//', $matches[3]);
            if($modifiers[0] == '') {
                array_shift($modifiers);
            }
            if($modifiers[(count($modifiers) - 1)] == '') {
                array_pop($modifiers);
            }

            $cancel_modifiers = $active_modifiers;
            foreach($cancel_modifiers as $key => $modifier) {
                if(in_array($modifier, $modifiers)) {
                    unset($cancel_modifiers[$key]);
                }
            }
            $active_modifiers = $modifiers;
        } elseif(preg_match('/(.)(.*)\1$/', $arg)) {
            $cancel_modifiers = $active_modifiers;
            $active_modifiers = array();
        }

        // If expression has modifiers, include them in sub-expression
        $sub_modifier = '(?';
        $sub_modifier .= implode('', $active_modifiers);

        // Cancel modifiers from preceding sub-expression
        if(count($cancel_modifiers) > 0) {
            // A single hyphen introduces all modifiers to unset, e.g. (?x-im).
            $sub_modifier .= '-' . implode('', $cancel_modifiers);
        }

        $sub_modifier .= ')';

        $sub_expression = preg_replace('/^(.)(.*)\1[eimsuxADJSUX]*$/', $sub_modifier . '$2', $arg);

        // Properly escape slashes
        $sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);

        $sub_expressions[] = $sub_expression;
    }

    // Join expressions
    $expression .= implode('|', $sub_expressions);

    $expression .= ')/';
    return $expression;
}

Edit: I've rewritten this (because I'm OCD) and ended up with:

function preg_magic_coalesce($expressions = array(), $global_modifier = '') {
    if(!preg_match('/^((?:-?[eimsuxADJSUX])+)$/', $global_modifier)) {
        $global_modifier = '';
    }

    $expression = '/(?:';
    $sub_expressions = array();
    foreach($expressions as $sub_expression) {
        $active_modifiers = array();
        // Determine modifiers from sub-expression
        if(preg_match('/^(.)(.*)\1((?:-?[eimsuxADJSUX])+)$/', $sub_expression, $matches)) {
            $active_modifiers = preg_split('/(-?[eimsuxADJSUX])/',
                $matches[3], -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
        }

        // If expression has modifiers, include them in sub-expression
        if(count($active_modifiers) > 0) {
            $replacement = '(?';
            $replacement .= implode('', $active_modifiers);
            $replacement .= ':$2)';
        } else {
            $replacement = '$2';
        }

        $sub_expression = preg_replace('/^(.)(.*)\1(?:(?:-?[eimsuxADJSUX])*)$/',
            $replacement, $sub_expression);

        // Properly escape slashes if another delimiter was used
        $sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);

        $sub_expressions[] = $sub_expression;
    }

    // Join expressions
    $expression .= implode('|', $sub_expressions);

    $expression .= ')/' . $global_modifier;
    return $expression;
}

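It now uses (?modifiers:sub-expression) rather than (?modifiers)sub-expression|(?cancel-modifiers)sub-expression, but I've noticed that both have some weird modifier side-effects. For instance, in both cases, if a sub-expression has a /u modifier, it will fail to match (but if you pass 'u' as the second argument of the new function, it will match just fine).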
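The side effect described above can be illustrated with a small sketch. It rests on an assumption not confirmed in this thread: that PCRE accepts i and x as scoped inline options but treats u as an option for the whole pattern, which would explain why it only works as a global modifier. The patterns and variable names below are made up for illustration.

```php
<?php
// 'i' and 'x' can be scoped to a single alternative with (?flags:...).
$scoped = '/(?i:moo)|(?x: b a r )/';
var_dump(preg_match($scoped, 'MOO'));  // first alternative ignores case
var_dump(preg_match($scoped, 'BAR'));  // second alternative stays case-sensitive

// 'u' is written once, as a trailing modifier for the merged pattern.
$with_u = '/^(?i:caf.)$/u';
var_dump(preg_match($with_u, "CAF\u{C9}"));          // '.' consumes the two-byte character as one
var_dump(preg_match('/^(?i:caf.)$/', "CAF\u{C9}"));  // without 'u', '.' consumes a single byte
```

If that assumption holds, modifiers such as u would have to be collected from the sub-expressions and hoisted to the end of the merged pattern rather than injected per alternative.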

Other tips

Strip the delimiters and flags from each expression. This regex should do it:
```
/^(.)(.*)\1([imsxeADSUXJu]*)$/
```
Join the expressions together. You'll need non-capturing parentheses to inject the flags:
```
"(?$flags1:$regexp1)|(?$flags2:$regexp2)"
```
If there are any back-references, count the capturing parentheses and update the back-references accordingly (e.g. properly joined, /(.)x\1/ and /(.)y\1/ become /(.)x\1|(.)y\2/ ).
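The first two steps above can be sketched as follows. This is a minimal illustration, not the answerer's code: split_pattern and join_alternatives are hypothetical names, only the flags i, m, s and x are accepted (an assumption, since those can certainly be set inline), the output uses ~ as its delimiter and assumes no sub-expression contains an unescaped ~, and back-references are not renumbered.

```php
<?php
// Hypothetical helper: split a delimited pattern into [body, flags].
// Assumes the same character opens and closes the pattern (no {...} pairs).
function split_pattern(string $pattern): ?array {
    if (preg_match('/^(.)(.*)\1([imsx]*)$/s', $pattern, $m) !== 1) {
        return null;
    }
    return [$m[2], $m[3]];
}

// Hypothetical helper: join patterns as alternatives, keeping flags local
// to each alternative via (?flags:...).
function join_alternatives(array $patterns): ?string {
    $parts = [];
    foreach ($patterns as $pattern) {
        $split = split_pattern($pattern);
        if ($split === null) {
            return null; // not a pattern this sketch understands
        }
        [$body, $flags] = $split;
        $parts[] = $flags === '' ? "(?:$body)" : "(?$flags:$body)";
    }
    // Assumption: '~' does not occur unescaped in any body.
    return '~' . implode('|', $parts) . '~';
}

$merged = join_alternatives(['# /[a-z] #i', '/ Moo /x']);
var_dump($merged);
var_dump(preg_match($merged, 'say Moo'));
```

Choosing a delimiter other than / sidesteps re-escaping slashes in the bodies, at the cost of the assumption about ~ noted above.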

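EDIT

I've rewritten the code! It now contains the changes listed below. Additionally, I've done extensive tests (too many to post here) to look for errors. So far, I haven't found any.

The function is now split into two parts: there is a separate function, preg_strip, which takes a regular expression and returns an array containing the bare expression (without delimiters) and an array of modifiers. This might come in handy (it already has, in fact; that is why I made this change).
The code now handles back-references correctly. This turned out to be necessary for my purpose after all. It wasn't difficult to add; the regular expression used to capture the back-references just looks weird (and may actually be extremely inefficient. It looks NP-hard to me, but that's only an intuition and only applies in weird edge cases). By the way, does anyone know a better way of checking for an uneven number of matches than mine? Negative lookbehinds won't work here because they only accept fixed-length strings rather than regular expressions. However, I need the regex here to test whether the preceding backslash is itself escaped.

Additionally, I don't know how good PHP is at caching the anonymous create_function usage. Performance-wise, this might not be the best solution, but it seems good enough.
I've fixed a bug in the sanity check.
I've removed the cancellation of obsolete modifiers, since my tests show that it isn't necessary.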
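On the question above about detecting an odd number of backslashes: one possible approach, offered as a sketch rather than a tested replacement, is to anchor the start of the backslash run with a fixed-length lookbehind and then consume the escaped backslashes in pairs. The function name shift_backrefs is hypothetical, a closure stands in for create_function, and the sketch assumes plain numbered back-references of the form \N (it does not distinguish \10 from \1 followed by a literal 0, and it ignores character classes).

```php
<?php
// Hypothetical helper: add $offset to every numbered back-reference in $body.
function shift_backrefs(string $body, int $offset): string {
    // (?<!\\)      the backslash run starts here (nothing escaped before it)
    // ((?:\\\\)*)  an even number of backslashes, i.e. escaped backslashes
    // \\(\d+)      one more backslash plus digits: a real back-reference
    $backref = <<<'RE'
/(?<!\\)((?:\\\\)*)\\(\d+)/
RE;
    return preg_replace_callback(
        $backref,
        function (array $m) use ($offset) {
            return $m[1] . '\\' . ((int) $m[2] + $offset);
        },
        $body
    );
}

// Joining /(.)x\1/ and /(.)y\1/: the second body follows one earlier capture.
$merged = '/(.)x\1|' . shift_backrefs('(.)y\1', 1) . '/';
var_dump($merged);
var_dump(preg_match($merged, 'aya'));
```

Because the lookbehind only has to see a single character, it stays fixed-length, and the pairing is handled by the ordinary part of the pattern.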

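By the way, this code is at the core of a syntax highlighter for various languages that I'm working on in PHP, since I'm not satisfied with the alternatives listed elsewhere.

Thanks!

porneL, eyelidlessness, amazing work! Many, many thanks. I had actually given up.

I've built upon your solution and I'd like to share it here. ~~I didn't implement renumbering of back-references since this isn't relevant in my case (I think…). Perhaps this will become necessary later, though.~~

Some questions…

One thing, @eyelidlessness: why do you feel it is necessary to cancel old modifiers? As far as I can see, this isn't necessary, since the modifiers are only applied locally anyway. Ah yes, one other thing. Your escaping of the delimiter seems overly complicated. Care to explain why you think it is needed? I believe my version should work as well, but I could be very wrong.

Also, I've changed the signature of your function to match my needs. I also think that my version is more generally useful. Again, I might be wrong.

BTW, you should now realize the importance of real names. ;-) I can't give you real credit in the code. :-/

The code

Anyway, I'd like to share my result so far. The code seems to work very well. ~~Extensive tests have yet to be done, though.~~ Please comment!

Without further ado…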

/**
 * Merges several regular expressions into one, using the indicated 'glue'.
 *
 * This function takes care of individual modifiers so it's safe to use
 * <em>different</em> modifiers on the individual expressions. The order of
 * sub-matches is preserved as well. Numbered back-references are adapted to
 * the new overall sub-match count. This means that it's safe to use numbered
 * back-references in the individual expressions!
 * If {@link $names} is given, the individual expressions are captured in
 * named sub-matches using the contents of that array as names.
 * Matching pair-delimiters (e.g. <code>"{…}"</code>) are currently
 * <strong>not</strong> supported.
 *
 * The function assumes that all regular expressions are well-formed.
 * Behaviour is undefined if they aren't.
 *
 * This function was created after a {@link https://stackoverflow.com/questions/244959/
 * StackOverflow discussion}. Much of it was written or thought of by
 * “porneL” and “eyelidlessness”. Many thanks to both of them.
 *
 * @param string $glue  A string to insert between the individual expressions.
 *      This should usually be either the empty string, indicating
 *      concatenation, or the pipe (<code>|</code>), indicating alternation.
 *      Notice that this string might have to be escaped since it is treated
 *      like a normal character in a regular expression (i.e. <code>/</code>)
 *      will end the expression and result in an invalid output.
 * @param array $expressions    The expressions to merge. The expressions may
 *      have arbitrary different delimiters and modifiers.
 * @param array $names  Optional. This is either an empty array or an array of
 *      strings of the same length as {@link $expressions}. In that case,
 *      the strings of this array are used to create named sub-matches for the
 *      expressions.
 * @return string A string representing a regular expression equivalent to the
 *      merged expressions. Returns <code>FALSE</code> if an error occurred.
 */
function preg_merge($glue, array $expressions, array $names = array()) {
    // … then, a miracle occurs.

    // Sanity check …

    $use_names = ($names !== null and count($names) !== 0);

    if (
        $use_names and count($names) !== count($expressions) or
        !is_string($glue)
    )
        return false;

    $result = array();
    // For keeping track of the names for sub-matches.
    $names_count = 0;
    // For keeping track of *all* captures to re-adjust backreferences.
    $capture_count = 0;

    foreach ($expressions as $expression) {
        if ($use_names)
            $name = str_replace(' ', '_', $names[$names_count++]);

        // Get delimiters and modifiers:

        $stripped = preg_strip($expression);

        if ($stripped === false)
            return false;

        list($sub_expr, $modifiers) = $stripped;

        // Re-adjust backreferences:

        // We assume that the expression is correct and therefore don't check
        // for matching parentheses.

        $number_of_captures = preg_match_all('/\([^?]|\(\?[^:]/', $sub_expr, $_);

        if ($number_of_captures === false)
            return false;

        if ($number_of_captures > 0) {
            // NB: This looks NP-hard. Consider replacing.
            $backref_expr = '/
                (                # Only match when not escaped:
                    [^\\\\]      # guarantee an even number of backslashes
                    (\\\\*?)\\2  # (twice n, preceded by something else).
                )
                \\\\ (\d)        # Backslash followed by a digit.
            /x';
            $sub_expr = preg_replace_callback(
                $backref_expr,
                create_function(
                    '$m',
                    'return $m[1] . "\\\\" . ((int)$m[3] + ' . $capture_count . ');'
                ),
                $sub_expr
            );
            $capture_count += $number_of_captures;
        }

        // Last, construct the new sub-match:

        $modifiers = implode('', $modifiers);
        $sub_modifiers = "(?$modifiers)";
        if ($sub_modifiers === '(?)')
            $sub_modifiers = '';

        $sub_name = $use_names ? "?<$name>" : '?:';
        $new_expr = "($sub_name$sub_modifiers$sub_expr)";
        $result[] = $new_expr;
    }

    return '/' . implode($glue, $result) . '/';
}

/**
 * Strips a regular expression string off its delimiters and modifiers.
 * Additionally, normalize the delimiters (i.e. reformat the pattern so that
 * it could have used '/' as delimiter).
 *
 * @param string $expression The regular expression string to strip.
 * @return array An array whose first entry is the expression itself, the
 *      second an array of modifiers. If the argument is not a valid regular
 *      expression, returns <code>FALSE</code>.
 *
 */
function preg_strip($expression) {
    if (preg_match('/^(.)(.*)\\1([imsxeADSUXJu]*)$/s', $expression, $matches) !== 1)
        return false;

    $delim = $matches[1];
    $sub_expr = $matches[2];
    if ($delim !== '/') {
        // Replace occurrences by the escaped delimiter by its unescaped
        // version and escape new delimiter.
        $sub_expr = str_replace("\\$delim", $delim, $sub_expr);
        $sub_expr = str_replace('/', '\\/', $sub_expr);
    }
    $modifiers = $matches[3] === '' ? array() : str_split(trim($matches[3]));

    return array($sub_expr, $modifiers);
}

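PS: I've made this post community wiki, so it can be edited. You know what this means…!

I'm pretty sure it isn't possible to simply put regexps together like that in any language. They could have incompatible modifiers.

I'd probably just put them in an array and loop through them, or combine them by hand.

Edit: If you're running them one at a time as described in your edit, you may be able to run the second one on a substring (from the start of the text up to the earliest match so far). That might help.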
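A rough sketch of that substring idea follows. The function name earliest_by_prefix is hypothetical, and the approach carries real limitations: once the text is truncated, a later pattern only wins if its whole match ends before the current best match starts, so ties and overlapping matches are ignored, and anchors or lookaheads see the shortened string.

```php
<?php
// Hypothetical helper, not from the original answer.
// Returns [pattern index, byte offset, matched text], or null if nothing matches.
function earliest_by_prefix(array $patterns, string $text): ?array {
    $best = null;
    $haystack = $text;
    foreach ($patterns as $index => $pattern) {
        if (preg_match($pattern, $haystack, $m, PREG_OFFSET_CAPTURE) !== 1) {
            continue;
        }
        [$matched, $offset] = $m[0];
        // Any match found in the truncated haystack lies before the old best.
        $best = [$index, $offset, $matched];
        if ($offset === 0) {
            break; // nothing can start earlier than the beginning
        }
        // Later patterns only need to search what precedes this match.
        $haystack = substr($text, 0, $offset);
    }
    return $best;
}

var_dump(earliest_by_prefix(['/Moo/', '/[0-9]+/'], 'abc 42 Moo'));
```

The gain comes from each successive pattern scanning a shorter string, so ordering the patterns so that the likeliest early matcher runs first should help most.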

function preg_magic_coalasce($split, $re1, $re2) {
  $re1 = rtrim($re1, "\/#is");
  $re2 = ltrim($re2, "\/#");
  return $re1.$split.$re2;
}

다음과 같은 대안적인 방법으로 할 수 있습니다.

$a = '# /[a-z] #i';
$b = '/ Moo /x';

$a_matched = preg_match($a, $text, $a_matches);
$b_matched = preg_match($b, $text, $b_matches);

if ($a_matched && $b_matched) {
    // Index 0 holds the whole match; neither pattern has a capture group 1.
    $a_pos = strpos($text, $a_matches[0]);
    $b_pos = strpos($text, $b_matches[0]);

    if ($a_pos == $b_pos) {
        if (strlen($a_matches[0]) == strlen($b_matches[0])) {
            // $a and $b matched the exact same string
        } else if (strlen($a_matches[0]) > strlen($b_matches[0])) {
            // $a and $b started matching at the same spot but $a is longer
        } else {
            // $a and $b started matching at the same spot but $b is longer
        }
    } else if ($a_pos < $b_pos) {
        // $a matched first
    } else {
        // $b matched first
    }
} else if ($a_matched) {
    // $a matched, $b didn't
} else if ($b_matched) {
    // $b matched, $a didn't
} else {
    // neither one matched
}
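The same comparison can be written for any number of patterns by asking preg_match for offsets directly instead of searching for the matched text afterwards. This is a sketch under assumptions: first_longest is a hypothetical name, and "longer" refers to the match each pattern actually returns, which is not necessarily the longest string that pattern could match at that position.

```php
<?php
// Hypothetical helper: index of the pattern whose match starts earliest in
// $text; ties on position go to the longer match. Null if nothing matches.
function first_longest(array $patterns, string $text): ?int {
    $winner = null;
    $best_pos = null;
    $best_len = null;
    foreach ($patterns as $index => $pattern) {
        // PREG_OFFSET_CAPTURE turns each match into [text, byte offset].
        if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) !== 1) {
            continue;
        }
        [$matched, $pos] = $m[0];
        $len = strlen($matched);
        if ($winner === null
            || $pos < $best_pos
            || ($pos === $best_pos && $len > $best_len)) {
            $winner = $index;
            $best_pos = $pos;
            $best_len = $len;
        }
    }
    return $winner;
}

var_dump(first_longest(['# /[a-z] #i', '/ Moo /x'], 'a Moo /q b'));
```

Each pattern still scans the text once, so this does not remove the bottleneck described in the question; it only makes the position and length comparison exact.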

License: CC-BY-SA with attribution

Not affiliated with StackOverflow