Parsing sentences using Scala parser combinators

https://stackoverflow.com/questions/21355513

02-10-2022

Question

I just started playing with parser combinators in Scala, but got stuck writing a parser for sentences such as "I like Scala." (a word ends at whitespace or at a period (.)).

I started with the following implementation:

package example

import scala.util.parsing.combinator._

object Example extends RegexParsers {
  override def skipWhitespace = false

  def character: Parser[String] = """\w""".r

  def word: Parser[String] =
    rep(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))

  def sentence: Parser[List[String]] = rep(word) <~ "."
}

object Test extends App {
  val result = Example.parseAll(Example.sentence, "I like Scala.")

  println(result)
}

The idea behind using guard() is to let a period mark the end of a word without consuming it, so that the sentence parser can consume it instead. However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parsers).

If I change the word and sentence definitions as follows, it parses the sentence, but the grammar description doesn't look right, and it will not work if I try to add a parser for paragraphs (rep(sentence)) and so on.

def word: Parser[String] =
  rep(character) <~ (whiteSpace | literal(".")) ^^ (_.mkString(""))

def sentence: Parser[List[String]] = rep(word) <~ opt(".")

Any ideas what may be going on here?

Solution

Regarding "the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parser)":

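The rep combinator corresponds to * in Perl-style regex notation: it matches zero or more repetitions. At the final period, rep(character) therefore matches zero characters, guard(literal(".")) succeeds without consuming anything, and word succeeds with an empty string. Because word consumed no input, rep(word) keeps applying it at the same position forever. I think you want word to match one or more characters. Changing rep(character) to rep1(character) (corresponding to + in Perl-style regex notation) should fix the problem: word then fails at the period, rep(word) stops, and sentence can consume the ".".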
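As a minimal sketch of that one-word change, here is the grammar from the question with rep swapped for rep1 inside word. The object names Rep1Example and Rep1Test are made up for illustration, and the snippet assumes the scala-parser-combinators module is on the classpath.

```scala
import scala.util.parsing.combinator._

// Hypothetical name; same grammar as in the question, except for rep1 in `word`.
object Rep1Example extends RegexParsers {
  // Whitespace is significant here, so do not skip it automatically.
  override def skipWhitespace = false

  def character: Parser[String] = """\w""".r

  // rep1 needs at least one character, so `word` fails at the final "."
  // instead of succeeding on empty input.
  def word: Parser[String] =
    rep1(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))

  // rep(word) now stops at the period, which is then consumed here.
  def sentence: Parser[List[String]] = rep(word) <~ "."
}

object Rep1Test extends App {
  // Expected to succeed with the three words I, like, Scala.
  println(Rep1Example.parseAll(Rep1Example.sentence, "I like Scala."))
}
```

The guard() is still what lets the last word end at the period without consuming it; rep1 only removes the case where word succeeds without reading anything.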

However, your definition still seems a little verbose to me. Why are you parsing individual characters instead of just using \w+ as the pattern for a word? Here's how I'd write it:

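import scala.util.parsing.combinator._

object Example extends RegexParsers {
  // Whitespace separates words, so it must not be skipped automatically.
  override def skipWhitespace = false

  // A word is one or more word characters.
  def word: Parser[String] = """\w+""".r

  // A sentence is one or more whitespace-separated words followed by a period.
  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."
}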

Notice that I use rep1sep to parse a non-empty list of words separated by whitespace. There's a repsep combinator as well, but I think you'd want at least one word per sentence.
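To make the rep1sep/repsep difference concrete, and to show how this grammar might extend to the paragraphs mentioned in the question, here is a small sketch. The names ParagraphExample, looseSentence and paragraph are made up for illustration, and treating a paragraph as whitespace-separated sentences is an assumption rather than something the question specifies.

```scala
import scala.util.parsing.combinator._

// Hypothetical name; builds on the word/sentence rules above.
object ParagraphExample extends RegexParsers {
  override def skipWhitespace = false

  def word: Parser[String] = """\w+""".r

  // rep1sep: at least one word, so a lone "." is rejected.
  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."

  // Hypothetical variant with repsep: zero words are allowed,
  // so a lone "." parses as an empty sentence.
  def looseSentence: Parser[List[String]] = repsep(word, whiteSpace) <~ "."

  // Assumed paragraph rule: one or more sentences separated by whitespace.
  def paragraph: Parser[List[List[String]]] = rep1sep(sentence, whiteSpace)
}

object ParagraphTest extends App {
  import ParagraphExample._

  println(parseAll(sentence, "."))       // fails: no word before the period
  println(parseAll(looseSentence, "."))  // succeeds with an empty list
  println(parseAll(paragraph, "I like Scala. It is fun."))
}
```

Because each sentence consumes its own period and nothing else, the same separator-based pattern can be reused one level up without any guard().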

Licensed under: CC-BY-SA with attribution
