Regex replace: &#58; to ":" etc.

https://stackoverflow.com/questions/428013

06-07-2019

Question

I have a bunch of strings like:

"Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"

that I would like to replace with

"Hello, here's a test colon:. Here's a test semi-colon;"

and so on for all printable ASCII values.

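At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring that contains no matches since the last match I found).

Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.

Thanks,

Dom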

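Solution

The big advantage of using a regex is that it deals with tricky cases like &#38;#38;. Entity replacement isn't iterative, it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything that does not start with &#. Finally, the regex solution holds few surprises for future maintainers.

I'd say a regex was the right choice.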
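To illustrate the single-step point, here is a minimal Python sketch (not part of the original answer; the names `decode_once` and `decode_repeatedly` are made up for illustration). It contrasts one left-to-right pass with the mistaken approach of substituting until nothing changes, using the tricky input mentioned above.

```python
import re

# Any decimal numeric character reference, e.g. "&#58;"
NCR = re.compile(r"&#(\d+);")

def decode_once(text):
    """Replace each decimal NCR in a single left-to-right pass."""
    return NCR.sub(lambda m: chr(int(m.group(1))), text)

def decode_repeatedly(text):
    """Keep substituting until nothing changes (the iterative pitfall)."""
    previous = None
    while previous != text:
        previous, text = text, decode_once(text)
    return text

tricky = "&#38;#38;"
print(decode_once(tricky))        # &#38;
print(decode_repeatedly(tricky))  # &
```

The single pass leaves the text produced by a replacement alone, which is the behaviour the answer describes; the repeated version decodes the result a second time.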

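Is it the best regex, though? You know you need two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32; to &#126;. For that reason, you could consider &#1?\d\d;.

As for replacing the content, I'd use the basic algorithm described for boost::regex::replace: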

For each match // Using regex_iterator<>
    Print the prefix of the match
    Remove the first 2 and last character of the match (&#;)
    lexical_cast the result to int, then truncate to char and append.

Print the suffix of the last match.
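The steps above can be sketched in Python as follows. This is only an illustration under assumptions: `re.finditer` stands in for `regex_iterator<>`, `chr(int(...))` stands in for the lexical_cast and char truncation, and the function name `replace_ncrs` is hypothetical. Note that the pattern `&#1?\d\d;` accepts 00 to 199, so this sketch does not restrict the result to printable ASCII.

```python
import re

# The pattern suggested in the answer: two digits, optionally preceded by a 1.
NCR = re.compile(r"&#1?\d\d;")

def replace_ncrs(text):
    pieces = []
    last_end = 0
    for match in NCR.finditer(text):
        # Emit the text between the previous match and this one.
        pieces.append(text[last_end:match.start()])
        # Drop the first two characters ("&#") and the last one (";").
        digits = match.group(0)[2:-1]
        pieces.append(chr(int(digits)))
        last_end = match.end()
    # Emit whatever follows the last match.
    pieces.append(text[last_end:])
    return "".join(pieces)

s = "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
print(replace_ncrs(s))
# Hello, here's a test colon:. Here's a test semi-colon;
```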

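Other answers

This may earn me some down votes, seeing as this is not a C++, Boost, or regex response, but here's a SNOBOL solution. This one works for ASCII. I'm working on something for Unicode.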

        NUMS = '1234567890'
MAIN    LINE = INPUT                                :F(END)
SWAP    LINE ?  '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
        OUTPUT = LINE                               :(MAIN)
END

* Repaired SNOBOL4 Solution
* &#38;#38; -> &#38;
     digit = '0123456789'
main line = input                        :f(end)
     result = 
swap line arb . l
+    '&#' span(digit) . n ';' rem . line :f(out)
     result = result l char(n)           :(swap)
out  output = result line                :(main)
end

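The existing SNOBOL solutions don't handle the multiple-pattern case properly, because there is only one "&". The following solution ought to work better: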

        dd = "0123456789"
        ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
   rdl  line = input                              :f(done)
   repl line "&" *?(s = ) ccp = s                 :s(repl)
        output = line                             :(rdl)
   done
   end

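I don't know about the regex support in Boost, but check whether it has a replace() method that supports callbacks, lambdas, or something similar. That is the usual way to do this with regexes in other languages.

Here's a Python implementation: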

import re

s = "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)

Producing:

"Hello, here's a test colon:. Here's a test semi-colon;"

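Looking at Boost now, I see it has a regex_replace function. But C++ really confuses me, so I can't figure out whether you could use a callback for the replace part. If I read the Boost docs correctly, though, the string matched by the (\d\d) group should be available in $1. I'd check it out if I were using Boost.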

Ya know, as long as we're off topic here, Perl substitution has an 'e' option, as in "evaluate expression". E.g.

echo "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;
Further test &#38;#65;. abc.&#126;.def." | perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) )
{ return sprintf("%c",$x); } else { return "&#".$x.";"; } }

while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'

Pretty-printed:

#!/usr/bin/perl -w

sub translate
{
  my $x = $_[0];

  if ( ($x >= 32) && ($x <= 126) )
  {
    return sprintf( "%c", $x );
  }
  else
  {
    return "&#" . $x . ";" ;
  }
}

while (<>)
{
  s/&#(1?\d\d);/&translate($1)/ge;
  print;
}

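Though Perl being Perl, I'm sure there's a much better way to write that...

Back to C code:

You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.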
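For comparison, here is roughly what a hand-rolled state machine might look like, written in Python rather than C to keep it short. It is a sketch under assumptions, not the answerer's code: the state names and the function `decode_ncrs_fsm` are invented, and it accepts any run of ASCII digits whose value lies in 32..126 (so, unlike the `1?\d\d` regex, leading zeros are tolerated).

```python
def decode_ncrs_fsm(text):
    TEXT, AMP, DIGITS = range(3)   # scanner states
    out = []
    pending = ""                   # candidate "&#ddd" seen but not yet emitted
    state = TEXT
    for ch in text:
        if state == TEXT:
            if ch == "&":
                pending, state = "&", AMP
            else:
                out.append(ch)
        elif state == AMP:
            if ch == "#":
                pending, state = "&#", DIGITS
            elif ch == "&":
                out.append(pending)          # flush the old "&", keep the new one
                pending = "&"
            else:
                out.append(pending + ch)     # not an NCR after all
                pending, state = "", TEXT
        else:  # DIGITS
            if ch in "0123456789":
                pending += ch
            elif ch == ";" and len(pending) > 2 and 32 <= int(pending[2:]) <= 126:
                out.append(chr(int(pending[2:])))
                pending, state = "", TEXT
            elif ch == "&":
                out.append(pending)          # abandon candidate, start a new one
                pending, state = "&", AMP
            else:
                out.append(pending + ch)     # out of range or malformed: copy as is
                pending, state = "", TEXT
    out.append(pending)                      # unterminated candidate at end of input
    return "".join(out)

print(decode_ncrs_fsm("&#12; a&#58;b &#131; &&#59;"))
# &#12; a:b &#131; &;
```

Even this small version needs several branches per state to handle malformed input, which gives a sense of the maintenance cost the answer warns about.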

Here's another Perl one-liner (see @mrree's answer):

A test file:

$ cat ent.txt 
Hello, &#12; here's a test colon&#58;. 
Here's a test semi-colon&#59; '&#131;'

The one-liner:

$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt

Or using a more specific regex:

$ perl -pe"s~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg" ent.txt

Both one-liners produce the same output:

Hello, &#12; here's a test colon:. 
Here's a test semi-colon; '&#131;'
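As a quick sanity check of the more specific regex, the following Python sketch (an illustration only; the names `SPECIFIC` and `decode` are made up) enumerates which numbers the pattern accepts and applies it to the test file's contents.

```python
import re

# Same pattern as in the second one-liner above.
SPECIFIC = re.compile(r"&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);")

# Which values of n make "&#n;" a full match?
accepted = [n for n in range(1000) if SPECIFIC.fullmatch(f"&#{n};")]
print(accepted == list(range(32, 127)))  # True

def decode(text):
    """Replace only the NCRs the specific pattern accepts."""
    return SPECIFIC.sub(lambda m: chr(int(m.group(1))), text)

sample = "Hello, &#12; here's a test colon&#58;. \nHere's a test semi-colon&#59; '&#131;'"
print(decode(sample))
```

Because the range check lives in the pattern itself, out-of-range references such as &#12; and &#131; are never matched and pass through unchanged.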

The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.

// spirit_ncr2a.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>

int main() {
  using namespace BOOST_SPIRIT_CLASSIC_NS;

  std::string line;
  while (std::getline(std::cin, line)) {
    // Parse outside assert() so the work still happens when NDEBUG is defined.
    const bool full = parse(line.begin(), line.end(),
         // match "&#(\d+);" where 32 <= $1 <= 126 or any char
         *(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
           | anychar_p[&putchar])).full;
    assert(full);
    (void)full;  // avoid an unused-variable warning in release builds
    putchar('\n');
  }
}

Compile:

    $ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp

Run:

    $ echo "Hello, &#12; here's a test colon&#58;." | spirit_ncr2a

Output:

    "Hello, &#12; here's a test colon:."

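I did think I was pretty good at regex, but I have never seen lambdas used with a regex. Please enlighten me!

I'm currently using Python and would have solved it with this one-liner:

''.join([x.isdigit() and chr(int(x)) or x for x in re.split(r'&#(\d+);', THESTRING)])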

Does that make any sense?
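One caveat with the split-based one-liner: `re.split` with a capture group returns literal text and captured digits alternately, and the `isdigit()` test cannot tell them apart when a literal segment happens to consist only of digits. A hedged variant (the name `decode_by_split` is hypothetical) that relies on position instead:

```python
import re

NCR_SPLIT = re.compile(r"&#(\d+);")

def decode_by_split(text):
    parts = NCR_SPLIT.split(text)
    # With one capture group, even indexes hold literal text and
    # odd indexes hold the captured digits of each match.
    return "".join(
        chr(int(part)) if index % 2 else part
        for index, part in enumerate(parts)
    )

print(NCR_SPLIT.split("2019&#58; a test"))   # ['2019', '58', ' a test']
print(decode_by_split("2019&#58; a test"))   # 2019: a test
```

With the `isdigit()` version, the leading "2019" in this input would be converted with `chr(2019)` as well.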

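This is one of those cases where the original problem statement apparently isn't very complete, but if you really want to trigger only on cases that produce characters between 32 and 126, that is a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-pattern case (although this first version wouldn't handle cases where some of the adjacent patterns are in range and others are not).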

      dd = "0123456789"
      ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = line                             :(rdl)
 done
 end

It would not be particularly difficult to handle that case as well (e.g. &#131;&#58; produces "&#131;:"):

      dd = "0123456789"
      ccp = "#" (span(dd) $ n ";") $ enc
 +      *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = replace(line,char(10),"#")       :(rdl)
 done
 end

Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin with the corresponding ASCII characters and prints the result to stdout.

#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>

int main()
{
  boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
  const int subs[] = {-1, 1}; // non-match & subexpr
  boost::sregex_token_iterator end;
  std::string line;

  while (std::getline(std::cin, line)) {
    boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);

    for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
      if (isncr) { // convert NCR e.g., '&#58;' -> ':'
        const int d = boost::lexical_cast<int>(*tok);
        assert(32 <= d && d < 127);
        std::cout << static_cast<char>(d);
      }
      else
        std::cout << *tok; // output as is
    }
    std::cout << '\n';
  }
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow