Question

I have been trying to get my head around Scala's parser combinators. They seem pretty powerful, but the only tutorial examples I can find deal with mathematical expressions. There are very few real-world examples with DSLs that need to be parsed and mapped to different entities.

For the sake of this example, let's say I have a BNF with an entity named Model, which is written as a string like this: [model [name <name> ]]. This is a simplified piece of a much larger BNF; in reality there are more entities.

So I defined my own class Model, which takes the name as a constructor argument, and then defined my own ModelParser object, which extends JavaTokenParsers. I then defined the following parsers, following the BNF (I know a single regex matcher would be simpler, but I preferred to follow the BNF exactly for other reasons).

def model : Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ ( Model(_) )
def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case text => text.toString() }
def anyChar = letter | digit | "_".r | "-".r
def letter = """[a-zA-Z]""".r
def digit = """\d""".r

The toString of Model looks like this:

override def toString : String = "[model " + name + "]"

When I run it on a string like [model [name helloWorld]], I get [model [h~List(e, l, l, o, W, o, r, l, d)]] instead of the [model helloWorld] I am expecting.

How do I join those individual characters back into the string they originally came from?

I am also confused by the individual parsers and the use of .r. Some examples I have seen use just the following as a parser (to parse "hello"):

def hello = "hello"

Isn't that just a String? How on Earth did it suddenly become a parser that can be combined with other parsers? And what is the .r actually doing? I have read at least 3 tutorials but I am still totally lost as to what is actually happening.

Solution

The problem is that anyChar* parses a List[String] (where in this case each string is a single character), and the result of calling toString on a list of strings is "List(...)", not the string you'd get by concatenating the contents. In addition, the case text => pattern is matching on the entire letter ~ (anyChar*), not just the anyChar* part.
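The difference between toString and mkString on a list can be seen without any parser at all. The following is only a small standalone sketch; the object name JoinDemo and the sample values are made up for illustration:

```scala
object JoinDemo {
  // What `anyChar*` yields for the tail of "hello":
  // one single-character string per match.
  val rest: List[String] = List("e", "l", "l", "o")

  // toString describes the collection itself.
  val described: String = rest.toString        // "List(e, l, l, o)"

  // mkString with no separator concatenates the elements.
  val joined: String = rest.mkString           // "ello"

  // Prepending the first letter before joining restores the whole name.
  val name: String = ("h" :: rest).mkString    // "hello"
}
```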

It's possible to address both of these issues pretty straightforwardly:

case class Model(name: String) {
  override def toString : String = "[model " + name + "]"
}

import scala.util.parsing.combinator._

object ModelParser extends RegexParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  def name: Parser[String] = letter ~ (anyChar*) ^^ {
    case first ~ rest => (first :: rest).mkString
  }

  def anyChar = letter | digit | "_".r | "-".r
  def letter = """[a-zA-Z]""".r
  def digit = """\d""".r
}

We just append the first character string to the list of the rest, and then call mkString on the entire list, which will concatenate the contents. This works as expected:

scala> ModelParser.parseAll(ModelParser.model, "[model [name helloWorld]]")
res0: ModelParser.ParseResult[Model] = [1.26] parsed: [model helloWorld]
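
In a real program you will usually want the Model itself rather than the printed ParseResult. One way to get it, sketched here under the assumption that the scala-parser-combinators module is on the classpath, is to pattern match on the result. The parseModel helper and the choice of Either[String, Model] as its return type are hypothetical additions, not part of the library:

```scala
import scala.util.parsing.combinator._

case class Model(name: String)

object ModelParser extends RegexParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  def name: Parser[String] = letter ~ (anyChar*) ^^ {
    case first ~ rest => (first :: rest).mkString
  }

  def anyChar = letter | digit | "_".r | "-".r
  def letter = """[a-zA-Z]""".r
  def digit = """\d""".r

  // Hypothetical helper: unwrap the ParseResult into an Either so that
  // callers do not need to know about the parser's result types.
  def parseModel(input: String): Either[String, Model] =
    parseAll(model, input) match {
      case Success(result, _) => Right(result)       // the parsed Model
      case failure: NoSuccess => Left(failure.msg)   // Failure or Error
    }
}
```

With this in place, ModelParser.parseModel("[model [name helloWorld]]") gives a Right containing Model("helloWorld"), while an input whose name starts with a digit gives a Left carrying the parser's error message.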

As you note, it would be possible (and possibly clearer and more performant) to let the regular expressions do more of the work:

object ModelParser extends RegexParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  // A letter first, then any mix of letters, digits, '_' and '-', as in the grammar above.
  def name: Parser[String] = """[a-zA-Z][a-zA-Z\d_-]*""".r
}

This example also illustrates how the parser combinator library uses implicit conversions to cut down on some of the verbosity of writing parsers. As you say, def hello = "hello" defines a String, and "[a-zA-Z]+".r defines a Regex (via the r method on StringOps), but either can be used wherever a parser is expected because RegexParsers defines implicit conversions from String (this one is named literal) and from Regex (named regex) to Parser[String].
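
To make the mechanism less magical, here is a toy model of it that does not use the library at all. It is not how RegexParsers is actually implemented: ToyParsers and ToyParser are hypothetical names, the parser type is deliberately simplistic, and only the conversion names literal and regex are borrowed from the real thing.

```scala
import scala.language.implicitConversions
import scala.util.matching.Regex

object ToyParsers {
  // A toy parser: given some input, return the matched text and the
  // remaining input, or None if there is no match at the start.
  final case class ToyParser(run: String => Option[(String, String)]) {
    // Sequence two parsers and concatenate what they matched.
    def ~(next: ToyParser): ToyParser = ToyParser { in =>
      run(in).flatMap { case (a, rest1) =>
        next.run(rest1).map { case (b, rest2) => (a + b, rest2) }
      }
    }
  }

  // Whenever a String appears where a ToyParser is expected,
  // the compiler inserts a call to this method.
  implicit def literal(s: String): ToyParser = ToyParser { in =>
    if (in.startsWith(s)) Some((s, in.drop(s.length))) else None
  }

  // Likewise for a Regex, which is what calling .r on a String produces.
  implicit def regex(r: Regex): ToyParser = ToyParser { in =>
    r.findPrefixOf(in).map(m => (m, in.drop(m.length)))
  }

  val hello: ToyParser = "hello"         // really literal("hello")
  val word: ToyParser = "[a-zA-Z]+".r    // really regex("[a-zA-Z]+".r)
  val greeting: ToyParser = hello ~ word
}
```

Here ToyParsers.greeting.run("helloWorld") returns Some(("helloWorld", "")). So "hello" on its own really is just a String; it only becomes a parser at the point where the compiler needs one and finds a conversion in scope.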

Licensed under: CC-BY-SA with attribution