Regex replacing: &#58; to ":" etc.

https://stackoverflow.com/questions/428013

06-07-2019
Question

I have a large number of strings like the following:

"Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"

which I would like to replace with

"Hello, here's a test colon:. Here's a test semi-colon;"

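and likewise for all printable ASCII values.

At present I am using boost::regex_search to match &#(\d+);, building up the output string as I process each match in turn (including appending the unmatched substring since the last match I found).

Can anyone think of a better way of doing it? I'm open to non-regex methods, but a regex seemed a reasonably sensible approach in this case.

Thanks,

Dom

Solution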

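The big advantage of using a regex is that it deals with tricky cases like &#38; correctly: entity replacement is not iterative, it is a single step. The regex is also going to be fairly efficient. The two lead characters are fixed, so it will quickly skip anything that does not start with &#. Finally, the regex solution holds few surprises for future maintainers.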
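To make the single-step point concrete, here is a minimal Python sketch (Python because it already appears in this thread, not because the question uses it). It uses the `&#38;#38;` input from the SNOBOL comments further down. The function names `decode_once` and `decode_repeatedly` are made up for illustration.

```python
import re

NCR = re.compile(r"&#(\d+);")

def decode_once(text):
    """Replace every numeric character reference in one left-to-right pass."""
    return NCR.sub(lambda m: chr(int(m.group(1))), text)

def decode_repeatedly(text):
    """Keep decoding until nothing changes (shown only as a contrast)."""
    previous = None
    while previous != text:
        previous, text = text, decode_once(text)
    return text

print(decode_once("&#38;#38;"))        # &#38;
print(decode_repeatedly("&#38;#38;"))  # &
```

The single pass leaves the freshly produced `&` alone, while the repeated version decodes it a second time, which is the surprise a one-step replacement avoids.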

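I'd say a regex is the right choice.

Is it the best regex, though? You know you need at least two digits, and if there are three digits, the first one will be a 1. Printable ASCII is, after all, the range from space to tilde ( -~). For that reason, you could consider &#1?\d\d;.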
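As a quick check of what that tighter pattern buys, the Python sketch below lists the codes that `&#1?\d\d;` accepts but that fall outside printable ASCII (32 to 126). The helper name is hypothetical. It only tests canonical decimal spellings without leading zeros.

```python
import re

LOOSE = re.compile(r"&#(1?\d\d);")

def loose_matches_outside_printable():
    """Codes accepted by LOOSE that are not printable ASCII (32..126)."""
    accepted = [n for n in range(1000) if LOOSE.fullmatch(f"&#{n};")]
    return [n for n in accepted if not 32 <= n <= 126]

extra = loose_matches_outside_printable()
print(extra[:3], extra[-3:])  # [10, 11, 12] [197, 198, 199]
```

So the pattern rejects single-digit codes and anything from 200 up, but a range check on the captured number is still needed for 10 to 31 and 127 to 199.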

As for replacing the content, I'd use the basic algorithm described for boost::regex::replace:

For each match // Using regex_iterator<>
    Print the prefix of the match
    Remove the first 2 and last character of the match (&#;)
    lexical_cast the result to int, then truncate to char and append.

Print the suffix of the last match.
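The steps above can be sketched in Python with `re.finditer` standing in for `regex_iterator<>`. This is an illustration, not the Boost code itself. The function name `replace_ncrs` is hypothetical, and the range check is an added assumption, because `&#1?\d\d;` still matches some non-printable codes.

```python
import re

NCR = re.compile(r"&#(1?\d\d);")

def replace_ncrs(text):
    out = []
    last_end = 0
    for m in NCR.finditer(text):
        out.append(text[last_end:m.start()])  # prefix before this match
        code = int(m.group(0)[2:-1])          # drop the leading "&#" and trailing ";"
        # Assumption: leave references outside printable ASCII untouched.
        out.append(chr(code) if 32 <= code <= 126 else m.group(0))
        last_end = m.end()
    out.append(text[last_end:])               # suffix after the last match
    return "".join(out)

print(replace_ncrs("Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"))
```

Collecting the pieces in a list and joining once avoids repeatedly copying a growing string.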

Other tips

This will probably earn me some downvotes, since it is not a C++, Boost, or regex answer, but here is a SNOBOL solution. This one works for ASCII. I am working on something for Unicode.

        NUMS = '1234567890'
MAIN    LINE = INPUT                                :F(END)
SWAP    LINE ?  '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
        OUTPUT = LINE                               :(MAIN)
END

* Repaired SNOBOL4 Solution
* &#38;#38; -> &#38;
     digit = '0123456789'
main line = input                        :f(end)
     result = 
swap line arb . l
+    '&#' span(digit) . n ';' rem . line :f(out)
     result = result l char(n)           :(swap)
out  output = result line                :(main)
end

The existing SNOBOL solutions do not handle the multiple-pattern case properly, because there is only one "&". The following solution ought to work better:

        dd = "0123456789"
        ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
   rdl  line = input                              :f(done)
   repl line "&" *?(s = ) ccp = s                 :s(repl)
        output = line                             :(rdl)
   done
   end

I don't know much about the regex support in Boost, but check whether it has a replace() method that supports callbacks, lambdas, or the like. That is the usual way to do this with regexes in other languages.

Here's a Python implementation:

import re
s = "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)

Producing:

"Hello, here's a test colon:. Here's a test semi-colon;"

I've had a look at Boost now and I see it has a regex_replace function. But C++ really confuses me, so I can't figure out whether you can use a callback for the replacement part. If I read the Boost docs correctly, though, the string matched by the (\d\d) group should be available in $1. I'd check it out if I were using Boost.

As long as we're off-topic here, Perl substitution has an 'e' option, as in "evaluate expression". For example:

echo "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;
Further test &#65;. abc.&#126;.def." |
perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) )
{ return sprintf("%c",$x); } else { return "&#".$x.";"; } }
while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'

Pretty-printed:

#!/usr/bin/perl -w

sub translate
{
  my $x=$_[0];

  if ( ($x >= 32) && ($x <= 126) )
  {
    return sprintf( "%c", $x );
  }
  else
  {
    return "&#" . $x . ";" ;
  }
}

while (<>)
{
  s/&#(1?\d\d);/&translate($1)/ge;
  print;
}

Perl being Perl, though, I'm sure there is a much better way to write that...

Back to C code:

You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
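For a sense of how much bookkeeping a hand-rolled state machine involves, here is a rough Python sketch (the same logic carries over to C). The state names and the function name `decode_ncrs_fsm` are invented for illustration. It assumes any run of ASCII digits is accepted and then range-checked, which is slightly more permissive than the `1?\d\d` pattern.

```python
def decode_ncrs_fsm(text):
    out = []
    state = "TEXT"   # TEXT -> AMP (saw "&") -> DIGITS (saw "&#")
    digits = ""
    for ch in text:
        if state == "TEXT":
            if ch == "&":
                state = "AMP"
            else:
                out.append(ch)
        elif state == "AMP":
            if ch == "#":
                state, digits = "DIGITS", ""
            elif ch == "&":
                out.append("&")        # earlier "&" was literal; stay in AMP
            else:
                out.append("&" + ch)
                state = "TEXT"
        else:  # DIGITS
            if ch in "0123456789":
                digits += ch
            elif ch == ";" and digits and 32 <= int(digits) <= 126:
                out.append(chr(int(digits)))
                state = "TEXT"
            else:
                out.append("&#" + digits)  # not a printable reference: emit as read
                if ch == "&":
                    state = "AMP"
                else:
                    out.append(ch)
                    state = "TEXT"
    # Flush whatever was pending when the input ended.
    if state == "AMP":
        out.append("&")
    elif state == "DIGITS":
        out.append("&#" + digits)
    return "".join(out)

print(decode_ncrs_fsm("Hello, &#12; here's a test colon&#58;."))
```

Every failure path has to re-emit what was already consumed, which is exactly the kind of detail that makes this approach harder to maintain than a regex.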

Here's another Perl one-liner (see @Mrree's answer):

Test file:

$ cat ent.txt 
Hello, &#12; here's a test colon&#58;. 
Here's a test semi-colon&#59; '&#131;'

The one-liner:

$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt

Or using a more specific regex:

$ perl -pe"s~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg" ent.txt

Both one-liners produce the same output:

Hello, &#12; here's a test colon:. 
Here's a test semi-colon; '&#131;'
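One way to convince yourself that the more specific regex covers exactly the printable range is to test it against every candidate number. The Python sketch below does that. The names `STRICT`, `matched`, and `decode` are illustrative only, and only canonical decimal spellings are tried.

```python
import re

STRICT = re.compile(r"&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);")

# Try the canonical decimal spelling of every number below 1000.
matched = [n for n in range(1000) if STRICT.fullmatch(f"&#{n};")]
print(matched == list(range(32, 127)))  # True

def decode(text):
    """With the strict pattern no range check is needed in the callback."""
    return STRICT.sub(lambda m: chr(int(m.group(1))), text)

print(decode("Hello, &#12; here's a test colon&#58;."))
```

The trade-off is readability: the range logic now lives inside the pattern instead of in an explicit comparison.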

The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.

// spirit_ncr2a.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>

int main() {
  using namespace BOOST_SPIRIT_CLASSIC_NS; 

  std::string line;
  while (std::getline(std::cin, line)) {
    // Parse outside assert() so the work still happens when NDEBUG is defined.
    const bool ok = parse(line.begin(), line.end(),
         // match "&#(\d+);" where 32 <= $1 <= 126 or any char
         *(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
           | anychar_p[&putchar])).full;
    assert(ok);
    (void)ok; // avoid an unused-variable warning in NDEBUG builds
    putchar('\n');
  }
}

Compile:

    $ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp

Run:

    $ echo "Hello, &#12; here's a test colon&#58;." | spirit_ncr2a

Output:

    Hello, &#12; here's a test colon:.

I thought I was pretty good at regex, but I have never seen lambdas used with a regex.

I'm currently using Python and would have solved it with this one-liner:

import re
''.join([chr(int(x)) if i % 2 else x for i, x in enumerate(re.split(r'&#(\d+);', THESTRING))])  # odd indices hold the captured digits

Does that make sense?

Here's an NCR scanner created using Flex:

/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
  /**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
  fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}

To make an executable:

$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl

Example:

$ echo "Hello, &#12; here's a test colon&#58;. 
> Here's a test semi-colon&#59; '&#131;'
> &#38;#59; <-- may be recursive" \
> | ncr2a

The non-recursive version prints:

Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
&#59; <-- may be recursive

And the recursive one produces:

Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
; <-- may be recursive

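This is one of those cases where the original problem statement apparently was not complete, but if you only want to trigger on cases that produce characters between 32 and 126, that is a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-pattern case (although this first version does not handle the case where some of the adjacent patterns are in range and others are not).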

      dd = "0123456789"
      ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = line                             :(rdl)
 done
 end

It would not be particularly difficult to handle that case as well (e.g. #131;#58; producing ";#131; :"):

      dd = "0123456789"
      ccp = "#" (span(dd) $ n ";") $ enc
 +      *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = replace(line,char(10),"#")       :(rdl)
 done
 end

Here's a version based on boost::regex_token_iterator. The program reads decimal NCRs from stdin, replaces them with the corresponding ASCII characters, and prints the result to stdout.

#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>

int main()
{
  boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
  const int subs[] = {-1, 1}; // non-match & subexpr
  boost::sregex_token_iterator end;
  std::string line;

  while (std::getline(std::cin, line)) {
    boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);

    for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
      if (isncr) { // convert NCR e.g., '&#58;' -> ':'
        const int d = boost::lexical_cast<int>(*tok);
        assert(32 <= d && d < 127);
        std::cout << static_cast<char>(d);
      }
      else
        std::cout << *tok; // output as is
    }
    std::cout << '\n';
  }
}

License: CC-BY-SA with attribution

Not affiliated with StackOverflow