Parsing CSV input with a RegEx in Java

https://stackoverflow.com/questions/1441556

10-07-2019

Question

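I know, now I have two problems. But I'm having fun!

Following the advice not to try to split, but instead to match on what counts as an acceptable field, I expanded from there to this expression.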

final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

Without the annoying escaped quotes, the expression looks like this:

"([^"]*)"|(?<=,|^)([^,]*)(?=,|$)

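This is working well for me: either it matches on "two quotes and whatever is between them", or on "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,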

the quick,"brown, fox jumps",over,"the",,"lazy dog"

breaks down into

the quick
"brown, fox jumps"
over
"the"

"lazy dog"

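A minimal sketch of that iteration, for anyone who wants to reproduce the breakdown above. The class name `MatchFieldsDemo` and the helper `fields` are made up for illustration; only the pattern itself comes from the question, and the input is the line without spaces after the commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchFieldsDemo {

    // The expression from the question: a quoted run, or text delimited
    // by commas and/or the start and end of the line.
    static final Pattern FIELD =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

    // Hypothetical helper: collects the full text of every match, quotes included.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        // Brackets make the empty field visible in the output.
        for (String field : fields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

The empty field between the two adjacent commas shows up as a zero-length match, which is why the loop adds `m.group()` unconditionally.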
Great! Now I want to drop the quotes, so I added lookahead and lookbehind non-capturing groups, like I was already doing for the commas.

final Pattern pattern = Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");

Again, the expression is:

(?<=")([^"]*)(?=")|(?<=,|^)([^,]*)(?=,|$)

Instead of the desired result

the quick
brown, fox jumps
over
the

lazy dog

I now get this breakdown:

the quick
"brown
 fox jumps"
,over,
"the"
,,
"lazy dog"

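What am I missing?

Solution

Operator precedence. Basically there is none. It's all left to right. So the or (|) is applying to the closing-quote lookahead and the comma lookahead.

Try: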

(?:(?<=")([^"]*)(?="))|(?<=,|^)([^,]*)(?=,|$)
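It may be worth checking this suggestion before relying on it. As far as I can tell, `|` already has the lowest precedence in `java.util.regex`, so wrapping the first alternative in `(?:...)` should not change what is matched. The sketch below runs the lookaround expression from the question and the grouped variant side by side; the class name `GroupingCheck` and the helper `matches` are hypothetical names used only for this comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingCheck {

    // Lookaround expression from the question.
    static final String FROM_QUESTION =
            "(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)";

    // Variant with the first alternative wrapped in a non-capturing group.
    static final String GROUPED =
            "(?:(?<=\")([^\"]*)(?=\"))|(?<=,|^)([^,]*)(?=,|$)";

    // Hypothetical helper: returns the full text of every match of regex in line.
    static List<String> matches(String regex, String line) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        System.out.println(matches(FROM_QUESTION, line));
        System.out.println(matches(GROUPED, line));
    }
}
```

When I trace both expressions by hand on this input, they produce the same seven matches as the breakdown in the question. The cause appears to be that lookarounds do not consume the quotes: a field starting at a quote still satisfies the comma-delimited alternative, and the text between a closing quote and the next opening quote (such as `,over,`) satisfies the quote-delimited one.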

Other tips

(?:^|,)\s*(?:(?:(?=")"([^"].*?)")|(?:(?!")(.*?)))(?=,|$)

This should do what you want.

Explanation:

(?:^|,)\s*

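The pattern should start at a comma or at the beginning of the string. Also, ignore any whitespace at the beginning.

Look ahead and see whether the rest starts with a quote: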

(?:(?=")"([^"].*?)")

If it does, match non-greedily up to the next quote.

(?:(?!")(.*?))

If it does not start with a quote, match non-greedily up to the next comma or the end of the string.

(?=,|$)

The pattern should end at a comma or at the end of the string.
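Put together, the pieces above could be used roughly as follows. This is only a sketch: the class name `LazyFieldDemo` and the helper `fields` are invented for the example, and it assumes the field value is read from group 1 (quoted) or group 2 (unquoted), whichever took part in the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyFieldDemo {

    // The full expression from this answer, escaped for a Java string literal.
    static final Pattern FIELD = Pattern.compile(
            "(?:^|,)\\s*(?:(?:(?=\")\"([^\"].*?)\")|(?:(?!\")(.*?)))(?=,|$)");

    // Hypothetical helper: returns the field values without surrounding quotes.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Group 1 is set for quoted fields, group 2 for unquoted ones.
            String quoted = m.group(1);
            result.add(quoted != null ? quoted : m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : fields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Because the leading comma is consumed by `(?:^|,)` rather than checked with a lookbehind, the whole match includes that comma; reading the groups instead of `m.group()` is what keeps it out of the result.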

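As I started to understand what I had done wrong, I also started to understand how convoluted the lookarounds were making this. I finally realized that I didn't want all of the matched text; I wanted specific groups inside it. I ended up using something very similar to my original RegEx, except that I didn't do a lookahead on the closing comma, which I think should be a little more efficient. Here is my final code.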

package regex.parser;

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CSVParser {

    /*
     * This Pattern will match on either quoted text or text between commas, including
     * whitespace, and accounting for beginning and end of line.
     * Group 1 holds the content of a quoted field (quotes excluded);
     * group 2 holds an unquoted field, which may be empty.
     */
    private static final Pattern CSV_PATTERN =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");

    public String[] parse(String csvLine) {
        Matcher matcher = CSV_PATTERN.matcher(csvLine);
        ArrayList<String> allMatches = new ArrayList<String>();
        while (matcher.find()) {
            // Only one of the two groups takes part in any given match.
            String quoted = matcher.group(1);
            allMatches.add(quoted != null ? quoted : matcher.group(2));
        }
        // toArray returns an empty array when nothing matched.
        return allMatches.toArray(new String[0]);
    }

    public static void main(String[] args) {        
        String lineinput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";

        CSVParser myCSV = new CSVParser();
        System.out.println("Testing CSVParser with: \n " + lineinput);
        for (String s : myCSV.parse(lineinput)) {
            System.out.println(s);
        }
    }

}

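I know this isn't what the OP wants, but for other readers: one of the String.replace methods could be used to strip the quotes from each element in the result array of the OP's current regex.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow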
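A minimal sketch of that idea, assuming the first expression from the question is kept as-is and the quotes are removed afterwards. The class name `StripQuotesDemo` and the helper `fields` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripQuotesDemo {

    // First expression from the question; quoted matches still include their quotes.
    static final Pattern FIELD =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

    // Hypothetical helper: matches each field, then strips the quote characters.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // String.replace(CharSequence, CharSequence) removes every quote,
            // not only the surrounding pair.
            result.add(m.group().replace("\"", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : fields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Since `replace` drops all quote characters in the element, this is only suitable when fields are not expected to contain quotes of their own.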