Regex question - one or more spaces outside of a quote-enclosed block of text

https://stackoverflow.com/questions/263985

06-07-2019

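Question

I want to replace every occurrence of multiple spaces with a single space, but leave text between quotes untouched.

Is there a way to do this with a Java regular expression? If so, could you give it a try or give me a hint?

Solution

Here's another approach. It uses a lookahead to check that all quotation marks after the current position come in matched pairs.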

text = text.replaceAll("  ++(?=(?:[^\"]*+\"[^\"]*+\")*+[^\"]*+$)", " ");
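To see why the lookahead works, here is a minimal sketch that wraps the one-liner above in a small class (the class and method names are made up for illustration). A run of spaces is only replaced when an even number of quotes follows it, which means the run itself sits outside any quoted section. This assumes the quotes in the input are balanced.

```java
public class LookaheadDemo {
    // Same regex as above: two or more spaces, followed only by
    // complete "..." pairs up to the end of the input.
    static String collapse(String text) {
        return text.replaceAll("  ++(?=(?:[^\"]*+\"[^\"]*+\")*+[^\"]*+$)", " ");
    }

    public static void main(String[] args) {
        String text = "blah    blah  \"boo   boo boo\"  blah  blah";
        // Spaces inside the quoted section are followed by one (odd) quote,
        // so the lookahead fails there and they are left alone.
        System.out.println(collapse(text));
    }
}
```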

If needed, the lookahead can be adapted to handle escaped quotes inside the quoted sections.
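One possible adaptation, as a sketch rather than the answerer's own version: treat a backslash plus the following character as a single unit, so that `\"` no longer counts as a quote. The names `FILLER`, `REGEX` and `collapse` are hypothetical, and the sketch assumes backslash is the escape character both inside and outside quotes.

```java
public class EscapedQuoteLookahead {
    // Anything that is not a quote or backslash, or a backslash escape pair.
    private static final String FILLER = "(?:[^\"\\\\]|\\\\.)*+";

    // Two or more spaces, followed only by complete quoted sections
    // (escaped quotes are swallowed by FILLER and never end a section).
    private static final String REGEX =
        "  ++(?=(?:" + FILLER + "\"" + FILLER + "\")*+" + FILLER + "$)";

    static String collapse(String text) {
        return text.replaceAll(REGEX, " ");
    }

    public static void main(String[] args) {
        // Input as seen by the regex:  a   b "c   \"  d"   e
        System.out.println(collapse("a   b \"c   \\\"  d\"   e"));
    }
}
```

The alternation inside a possessive group can recurse deeply on very long inputs in `java.util.regex`, so for large texts an unrolled form of the filler may be the safer choice.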

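Other tips

When trying to match something that may be contained inside something else, it helps to build a regular expression that matches both, like this: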

("[^"\\]*(?:\\.[^"\\]*)*")|(  +)

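This matches either a quoted string or a run of two or more spaces. Because the two alternatives are combined in one expression, a quoted string is consumed as a whole, so spaces inside quotes are never matched by the second alternative. With this expression you need to check each match to determine whether it is a quoted string or a run of spaces, and act accordingly: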

// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern spaceOrStringRegex = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );
StringBuffer replacementBuffer = new StringBuffer();
Matcher spaceOrStringMatcher = spaceOrStringRegex.matcher( text );

while ( spaceOrStringMatcher.find() ) 
{
    // if the space group is the match
    if ( spaceOrStringMatcher.group( 2 ) != null ) 
    {
        // replace with a single space
        spaceOrStringMatcher.appendReplacement( replacementBuffer, " " );
    }
    // quoted-string matches are skipped here; their text is copied
    // unchanged by the next appendReplacement or by appendTail
}

spaceOrStringMatcher.appendTail( replacementBuffer );
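The same "check each match and act accordingly" step can also be written with a replacement callback. This is only a sketch, assuming Java 9 or later, where `Matcher.replaceAll` accepts a `Function<MatchResult, String>`; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CallbackReplace {
    // Group 1: a quoted string (with backslash escapes); group 2: two or more spaces.
    private static final Pattern SPACE_OR_STRING =
        Pattern.compile("(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)");

    static String collapse(String text) {
        return SPACE_OR_STRING.matcher(text).replaceAll(match ->
            match.group(2) != null
                ? " "                                       // run of spaces: collapse it
                : Matcher.quoteReplacement(match.group())); // quoted string: keep verbatim
    }

    public static void main(String[] args) {
        System.out.println(collapse("blah    blah  \"boo   boo boo\"  blah  blah"));
    }
}
```

`Matcher.quoteReplacement` matters here because the callback's return value is still interpreted as a replacement string, so a `$` or backslash inside a quoted section would otherwise be treated as a group reference or escape.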

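Text between quotes: are the quotes on the same line, or can they span multiple lines?

Tokenize it and emit a single space between tokens. A quick google for "java tokenizer that handles quotes" turned up: this link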

YMMV

Edit: SO didn't like that link. Here's the Google search link: google. It's the first result.
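Since the linked tokenizer is not available here, the following is a hand-rolled sketch of the tokenizing idea, not the library the answer pointed to. It assumes quotes are plain `"` characters without escapes, and all names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareTokenizer {
    // Splits on spaces that are outside quotes; quoted sections stay in one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;          // entering or leaving a quoted section
                current.append(c);
            } else if (c == ' ' && !inQuotes) {
                if (current.length() > 0) {    // a space outside quotes ends the token
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Emits the tokens with a single space between them.
    static String normalize(String text) {
        return String.join(" ", tokenize(text));
    }

    public static void main(String[] args) {
        System.out.println(normalize("blah    blah  \"boo   boo boo\"  blah  blah"));
    }
}
```

Unlike the regex answers, joining tokens this way also drops leading and trailing spaces, which may or may not be what you want.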

Personally, I don't use Java, but this RegExp should do the trick:

([^\" ])*(\\\".*?\\\")*

I tried the expression in RegexBuddy, and it generates this code, which looks fine to me:

try {
    Pattern regex = Pattern.compile("([^\" ])*(\\\".*?\\\")*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        for (int i = 1; i <= regexMatcher.groupCount(); i++) {
            // matched text: regexMatcher.group(i)
            // match start: regexMatcher.start(i)
            // match end: regexMatcher.end(i)

            // I suppose here you must use something like
            // sstr += regexMatcher.group(i) + " "
        }
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}

At least, it seems to work fine in Python:

import re

text = """
este  es   un texto de   prueba "para ver  como se comporta  " la funcion   sobre esto
"para ver  como se comporta  " la funcion   sobre esto  "o sobre otro" lo q sea
"""

ret = ""
print(text)

reobj = re.compile(r'([^\" ])*(\".*?\")*', re.IGNORECASE)

for match in reobj.finditer(text):
    # skip the empty matches found at each space
    if match.group() != "":
        ret = ret + match.group() + "|"

print(ret)
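The loop body left open in the Java snippet above ("I suppose here you must use something like...") could be filled in roughly as follows. This is a guess at what the answer intends, with hypothetical names: keep every non-empty match and join the pieces with single spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JoinMatches {
    // The answer's regex: a run of non-space, non-quote characters,
    // optionally followed by one or more quoted strings.
    private static final Pattern TOKEN = Pattern.compile("([^\" ])*(\\\".*?\\\")*");

    static String collapse(String subjectString) {
        List<String> pieces = new ArrayList<>();
        Matcher m = TOKEN.matcher(subjectString);
        while (m.find()) {
            // The pattern can match the empty string at every space; skip those.
            if (!m.group().isEmpty()) {
                pieces.add(m.group());
            }
        }
        return String.join(" ", pieces);
    }

    public static void main(String[] args) {
        System.out.println(collapse("este  es   un \"para ver  como\" la   funcion"));
    }
}
```

Some caveats with this approach: leading and trailing spaces are dropped, an unbalanced quote character is silently lost, and text that directly follows a closing quote ends up separated from it by a space.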

After you've parsed out the quoted content, run this on the rest, either in bulk or piece by piece as needed:

String text = "ABC   DEF GHI   JKL";
text = text.replaceAll("( )+", " ");
// text: "ABC DEF GHI JKL"
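A piece-by-piece version might look like the sketch below. It assumes quoted sections are simple `"..."` spans without escaped quotes; the pattern `QUOTED` and the method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SqueezeOutsideQuotes {
    // A simple quoted section: opening quote, anything but a quote, closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    // The replacement from the answer, applied to one unquoted piece.
    static String squeeze(String piece) {
        return piece.replaceAll("( )+", " ");
    }

    static String collapse(String text) {
        StringBuilder out = new StringBuilder();
        Matcher m = QUOTED.matcher(text);
        int last = 0;
        while (m.find()) {
            out.append(squeeze(text.substring(last, m.start()))); // unquoted piece
            out.append(m.group());                                // quoted piece, unchanged
            last = m.end();
        }
        out.append(squeeze(text.substring(last)));                // trailing unquoted piece
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("ABC   DEF \"GHI   JKL\"  MNO"));
    }
}
```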

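Jeff, you're on the right track, but there are a few errors in your code, namely: (1) you forgot to escape the quotation marks inside the negated character classes; (2) the parens inside the first capturing group should be the non-capturing kind; (3) if the second set of capturing parens doesn't participate in the match, group(2) returns null, and you aren't testing for that; and (4) if you test for two or more spaces in the regex instead of one or more, you don't need to check the length of the match later. Here's the revised code: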

import java.util.regex.*;

public class Test
{
  public static void main(String[] args) throws Exception
  {
    String text = "blah    blah  \"boo   boo boo\"  blah  blah";
    Pattern p = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );
    StringBuffer sb = new StringBuffer();
    Matcher m = p.matcher( text );
    while ( m.find() ) 
    {
      if ( m.group( 2 ) != null ) 
      {
        m.appendReplacement( sb, " " );
      }
    }
    m.appendTail( sb );
    System.out.println( sb.toString() );
  }
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow