Question

I want to parse delimiter-separated values with quoting characters and an escape character for quotes.

For example: a, "b""c""", d -> Expected to parse as three columns (a), (b"c"), (d), assuming the comma is the delimiter and the double quote is both the quoting character and the escape character.

I also want to support multiple delimiters and enclosing characters.

For example: a, "b""c"""|d -> Expected to parse as three columns if both comma and | are used as delimiters.

Another example: a, <b\<c\>>|d -> Expected to parse as three columns if both comma and | are delimiters, < is the left enclosure, > is the right enclosure, and \ is the escape character.

Is it possible to create such a parser combinator using JParsec?

After spending some time with the API, I expected the code below to work, but it does not parse the examples above as expected.

Parser<?> quote_content = Scanners.notAmong(rightEnclose).many();
Parser<?> quoted = Scanners.nestableBlockComment(Scanners.among(leftEnclose),
            Scanners.among(rightEnclose), quote_content);
Parser<?> unquoted = Scanners.notAmong(delimiter + leftEnclose);
Parser<?> chunk =  Parsers.or(escapedSequence(), unquoted);

Parser<?>  all = chunk().many1().source().sepBy(Scanners.among(delimiter));

Is this possible using JParsec, or is there a better alternative?

No correct solution

OTHER TIPS

Here is a basic working example that uses double quotes to enclose strings and doubled double quotes to escape a double quote (SQL-like strings):

   @Test public void test() throws Exception {
     Parser<Void> escapingDoubleQuotesString = pattern(regex("((\"\")|[^\",])*"), "string");
     Parser<String> quoted = escapingDoubleQuotesString //
       .between(isChar('"'), isChar('"')).source() //
       .map(unquoteString());

     assertThat(quoted.parse("\"\"\"c\"")).isEqualTo("\"c");

     Parser<String> unquoted = escapingDoubleQuotesString.source().map(unescapeQuotes());

     assertThat(unquoted.parse("\"\"c")).isEqualTo("\"c");

     Parser<List<String>> separated = quoted.or(unquoted).sepBy(pattern(regex("\\s*,\\s*"), "comma"));

     assertThat(separated.parse("a,\"b\"\"c\"\"\", d")).containsExactly("a", "b\"c\"", "d");
   }

   private Map<? super String, ? extends String> unescapeQuotes() {
     return new Map<String, String>() {
         @Override public String map(String s) {
           return s.replace("\"\"", "\"");
         }
       };
   }

   private Map<String, String> unquoteString() {
     return new Map<String, String>() {
         @Override public String map(String s) {
           return unescapeQuotes().map(s.substring(1, s.length() - 1));
         }
       };
   }

This could be improved by distinguishing quoted-string content from unquoted-string content, so that commas are allowed inside quoted strings (the single regex shared by both parsers above excludes commas everywhere). From this base it should be fairly easy to add more separators or change the way strings are quoted or bracketed.
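To make the quoted/unquoted distinction concrete, here is a minimal sketch in plain Java. It is not JParsec code: it is a hand-written scanner, and the class name DsvSketch and its constructor parameters are hypothetical names chosen for illustration. It assumes one record per line, that whitespace around fields is not significant, and that an escape character only applies inside an enclosed field, where it may precede the left enclosure, the right enclosure, or itself. Setting the escape character equal to the right enclosure gives the "doubling" behaviour.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of JParsec: splits one line of delimiter-separated values. */
final class DsvSketch {
    private final String delimiters; // every char in here separates fields, e.g. ",|"
    private final char open;         // left enclosure, e.g. '"' or '<'
    private final char close;        // right enclosure, e.g. '"' or '>'
    private final char escape;       // escape char; may equal close (doubling)

    DsvSketch(String delimiters, char open, char close, char escape) {
        this.delimiters = delimiters;
        this.open = open;
        this.close = close;
        this.escape = escape;
    }

    List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (true) {
            i = skipBlanks(line, i);
            StringBuilder field = new StringBuilder();
            if (i < n && line.charAt(i) == open) {
                // Enclosed field: delimiters are ordinary characters in here.
                i++;
                boolean closed = false;
                while (i < n && !closed) {
                    char c = line.charAt(i);
                    if (c == escape && i + 1 < n && isEscapable(line.charAt(i + 1))) {
                        field.append(line.charAt(i + 1)); // keep only the escaped char
                        i += 2;
                    } else if (c == close) {
                        closed = true;
                        i++;
                    } else {
                        field.append(c);
                        i++;
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("unterminated enclosure");
                }
                i = skipBlanks(line, i);
            } else {
                // Bare field: runs up to the next delimiter or the end of the line.
                int start = i;
                while (i < n && !isDelimiter(line.charAt(i))) {
                    i++;
                }
                field.append(line.substring(start, i).trim());
            }
            fields.add(field.toString());
            if (i >= n) {
                return fields;
            }
            if (!isDelimiter(line.charAt(i))) {
                throw new IllegalArgumentException("expected delimiter at index " + i);
            }
            i++; // consume the delimiter
        }
    }

    private boolean isDelimiter(char c) {
        return delimiters.indexOf(c) >= 0;
    }

    private boolean isEscapable(char c) {
        return c == open || c == close || c == escape;
    }

    /** Skips whitespace that is not itself configured as a delimiter (e.g. tab). */
    private int skipBlanks(String line, int i) {
        while (i < line.length() && Character.isWhitespace(line.charAt(i)) && !isDelimiter(line.charAt(i))) {
            i++;
        }
        return i;
    }
}
```

With new DsvSketch(",|", '"', '"', '"'), the input a, "b""c"""|d splits into a, b"c" and d. With new DsvSketch(",|", '<', '>', '\\'), the input a, <b\<c\>>|d splits into a, b<c> and d. The same split between "enclosed content" and "bare content" could be expressed in JParsec by giving the quoted and unquoted parsers separate patterns instead of sharing one.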

As a general guideline, Test Driven Development pairs well with writing JParsec parsers. At the very least, write unit tests so that you understand how each parser works and how they combine.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow