Question

I'm writing a Scala parser for the following grammar:

expr := "<" anyString ">" "<" anyString ">"
anyString := // any string

For example, "<foo> <bar>" is a valid string, as is "<http://www.example.com/example> <123>", and "<1> <_hello>"

So far, I have the following:

object MyParser extends JavaTokenParsers {

  override def skipWhitespace = false

  def expr: Parser[Any] = "<" ~ anyString ~ ">" ~ whiteSpace ~ "<" ~ anyString ~ ">"

  def anyString = ???

}

My questions are the following (I've included my suspected answer, but please confirm anyway, if I'm correct!):

  1. How do I implement a regex parser which accepts any string? This must have an almost trivial answer, like def anyString = """\a*""".r, where \a is the symbol which represents any character (although \a is probably not the droid I'm looking for).

  2. If I set anyString to accept any string, will it stop before the > symbol or will it run until the end of the string and fail? I believe it will run until the end of the string and fail, and then it will eventually find the > and consume up to there. This seems to result in a very inefficient parser, and any comments on this would be appreciated!

  3. What if the string within < and > contains a > symbol (e.g. <fo>o> <bar>)? Will anyString consume until the first > or the last one? Is there any way to specify whether it consumes the least it can, or the most?

  4. In order to fix the previous point, I'd like to forbid < and > in anyString. How do I write that?

Thank you!


Solution

I'm currently researching my own question, and I'll try to answer it myself here.

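  1. The Java Pattern documentation specifies that . matches any character (except line terminators, unless DOTALL mode is enabled, e.g. with the embedded flag (?s)). Therefore, the regex which accepts any single-line string would be: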

    def anyString = ".*".r
    

    To accept any non-empty string, we can use ".+".r.

  2. To understand this, consider the following toy example:

     object MyParser1 extends JavaTokenParsers {
       override def skipWhitespace = false
       def expr = "<" ~ anyString ~ ">"
       def anyString = ".*".r
     }
    

    Here, the string <> is rejected. To test this, use:

    println(  MyParser1.parseAll(MyParser1.expr, "<>")  )
    

    This indicates that the .* parser consumes everything up to the end of the string, so no > is left for the final parser. The regex match is greedy, and the combinator does not backtrack into it to try a shorter match. It therefore seems necessary to forbid < and > from appearing in anyString.

  3. As in the previous point, the .* parser consumes the whole string, and therefore consumes all > symbols.

  4. In the same documentation, a negation operator is given. To exclude < and >, we can write:

    def almostAnyString = "[^<>]*".r
    

    In general, the construct [^abc] will match any character except a, b, and c.
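
The behaviour described in points 2 to 4 can be checked with a small sketch. It assumes the scala-parser-combinators module is on the classpath; the object GreedyDemo and its parser names are made up for illustration and are not part of the original code. It also shows one possible answer to the "least or most" part of question 3: a reluctant quantifier (.*?) stops at the first >, but only when the closing > is part of the same regex, so that the regex engine itself can do the backtracking.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical demo object; all names here are illustrative only.
object GreedyDemo extends RegexParsers {
  override def skipWhitespace = false

  // Greedy: ".*" swallows the closing ">" and never gives it back,
  // so the trailing ">" literal has nothing left to match.
  def greedy: Parser[String] = "<" ~> ".*".r <~ ">"

  // Negated character class: stops at the first "<" or ">".
  def bounded: Parser[String] = "<" ~> "[^<>]*".r <~ ">"

  // Reluctant quantifier inside ONE regex: matches up to the first ">",
  // then the surrounding brackets are stripped from the matched text.
  def reluctant: Parser[String] =
    "<.*?>".r ^^ (s => s.substring(1, s.length - 1))

  // True if the parser accepts the whole input.
  def accepts[T](p: Parser[T], input: String): Boolean =
    parseAll(p, input).successful
}
```

With these definitions, GreedyDemo.accepts(GreedyDemo.greedy, "<>") is false while GreedyDemo.accepts(GreedyDemo.bounded, "<>") is true, and GreedyDemo.parse(GreedyDemo.reluctant, "<fo>o>").get yields "fo", leaving "o>" unconsumed. Note that a standalone ".*?".r parser would not help, since on its own it is satisfied by the empty match.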

To conclude, the best implementation I've found so far is the following:

import scala.util.parsing.combinator.JavaTokenParsers

object MyParser extends JavaTokenParsers {
  override def skipWhitespace = false // don't allow whitespace between parsers by default

  def expr: Parser[Any] = "<" ~ almostAnyString ~ ">" ~
                          whiteSpace ~ // this parser is defined in JavaTokenParsers
                          "<" ~ almostAnyString ~ ">"

  def almostAnyString = "[^<>]*".r

}
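
Since Parser[Any] gives back a nested ~ structure that still contains the bracket literals, it may be more convenient to return just the two strings. The following is a sketch of one way to do that, not the only one; MyTypedParser and bracketed are hypothetical names introduced here, and the scala-parser-combinators module is assumed to be on the classpath.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical variant of MyParser that returns the two bracketed strings.
object MyTypedParser extends JavaTokenParsers {
  override def skipWhitespace = false

  def almostAnyString: Parser[String] = "[^<>]*".r

  // One bracketed item; ~> and <~ drop the "<" and ">" literals.
  def bracketed: Parser[String] = "<" ~> almostAnyString <~ ">"

  // Two items separated by mandatory whitespace, returned as a pair.
  def expr: Parser[(String, String)] =
    (bracketed <~ whiteSpace) ~ bracketed ^^ { case a ~ b => (a, b) }
}
```

For example, MyTypedParser.parseAll(MyTypedParser.expr, "<foo> <bar>").get evaluates to ("foo", "bar"), while "<foo><bar>" is rejected because whiteSpace requires at least one whitespace character.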
Licensed under: CC-BY-SA with attribution