Question

Since this is my first post I'd like to take the opportunity to say: What a great site SO is!

Anyway, to the question:

I'm somewhat of a Scala newbie, and I'm trying to solve a data extraction and parsing problem with Scala's parser combinators, but I'm getting java.lang.StackOverflowError exceptions.

My real-world example is too big to include, so I'm reusing code from another SO question with the same problem. The code is slightly modified, though. I tried to solve the problem using the PackratParsers but did not succeed.

import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.util.parsing.combinator.PackratParsers

object ArithmeticParser1 extends StandardTokenParsers with PackratParsers {
  lexical.delimiters ++= List("(", ")", "+", "-", "*", "/")

  lazy val reduceList: Int ~ List[String ~ Int] => Int = {
    case i ~ ps => (i /: ps)(reduce)
  }

  def reduce(x: Int, r: String ~ Int) = (r: @unchecked) match {
    case "+" ~ y => x + y
    case "-" ~ y => x - y
    case "*" ~ y => x * y
    case "/" ~ y => x / y
  }

  lazy val expr  : PackratParser[Int] = term ~ rep ("+" ~ term | "-" ~ term) ^^ reduceList
  lazy val term  : PackratParser[Int] = factor ~ rep ("*" ~ factor | "/" ~ factor) ^^ reduceList
  lazy val factor: PackratParser[Int] = "(" ~> expr <~ ")" | numericLit ^^ (_.toInt)

  def main(args: Array[String]) {
    val times = 500
    val s = "(" * times + "1 + 1" + ")" * times
    val tokens = new PackratReader(new lexical.Scanner(s))
    println(phrase(expr)(tokens))
  }
}

I've mixed in the PackratParsers trait, changed the defs to lazy vals, and I'm using a PackratReader. What am I misunderstanding here? From reading Daniel C. Sobral's comments on his answer to the SO question How can I ignore non-matching preceding text when using Scala's parser combinators?, it seems like PackratParsers should do the trick.

Ref: PackratParsers paper


Solution

The problem

The problem is that you are indeed filling the stack. Your expression consists of 500 opening brackets, then "1 + 1", then 500 closing brackets. In terms of your grammar, you have 500 terms of type "factor" nested inside each other, with one term of type "expr" at the innermost level.

For each start of a (nested) term, the parser has to push something onto a stack (in this case, a function call). When the nested term is finished, the parser pops it from the stack (in this case, the function returns). If the stack is empty after the last token, and it never goes negative along the way (pops too much), then your term is well formed (in your case: the parentheses are balanced).

In simple terms: the parser uses the stack to check whether the parentheses are balanced.
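
This behavior can be sketched without any parser library (a minimal illustration, not the combinators' actual implementation): a naive recursive-descent check for nested parentheses uses one JVM stack frame per nesting level, so deep enough input overflows the stack exactly like your "factor" rule does.

```scala
object StackDepthDemo {
  // Returns the index just past the matched group, or -1 on a mismatch.
  // Every '(' triggers one more nested call, i.e. one more stack frame.
  def parse(s: String, i: Int): Int =
    if (i < s.length && s(i) == '(') {
      val j = parse(s, i + 1)                                   // push: descend into the group
      if (j >= 0 && j < s.length && s(j) == ')') j + 1 else -1  // pop: group must close
    } else i                                                    // non-'(' ends the descent

  def balanced(s: String): Boolean = parse(s, 0) == s.length
}
```

For small inputs this works fine; for an input like `"(" * 1000000 + ")" * 1000000` the call to `balanced` dies with a StackOverflowError, because the recursion depth equals the nesting depth.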

You are using several tools to speed up the parsing. None of these tools helps with stack consumption.

Your helpers:

Using a packrat parser

The packrat parser caches already-parsed parts, so they don't need to be parsed again. This can give a nice speed-up when your grammar has lots of alternatives with common parts. It doesn't help in your case, because your alternatives don't share common parts. And it doesn't help with stack consumption.
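
What the memoization buys (and what it doesn't) can be sketched in a few lines. This is a hypothetical toy cache, not the real PackratParsers internals: results are keyed by (rule, input position), so a second alternative that re-enters the same rule at the same position does no new work — but every cache miss is still an ordinary, stack-consuming call.

```scala
object MemoSketch {
  // A packrat-style cache: (rule name, input position) -> parse result,
  // where the result is Some(position after the match) or None.
  private val cache = scala.collection.mutable.Map.empty[(String, Int), Option[Int]]
  var uncachedCalls = 0  // counts how often a rule body actually runs

  // Memoize a rule's result at a given input position.
  def memo(rule: String, pos: Int)(body: => Option[Int]): Option[Int] =
    cache.getOrElseUpdate((rule, pos), { uncachedCalls += 1; body })
}
```

Two alternatives that both try the same rule at the same position run its body only once. Speed improves when backtracking revisits positions; the recursion depth of the first (uncached) descent is unchanged.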

Omitting parts of the result

You use <~ and ~> to drop some parts of the parse result. But this only affects the result built inside a term. When the corresponding rule is invoked, the parser must still push something onto the stack.

Using tokens

You break the input stream up into tokens before you parse it. This usually speeds up the parsing, because tokenizing (non-recursive, usually regular expressions) is much cheaper than parsing (recursive). But in your case the problem lies in the depth of the nested terms (a recursive problem), so all the pressure is in the parsing part, which is what uses up the stack. Tokenizing does not help with that.
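
The difference is easy to see in code. Here is a sketch of a tokenizer for this grammar (an assumed token set, not StandardTokenParsers' actual lexer): it is one flat pass over the input, so its stack depth is constant no matter how deeply the parentheses nest.

```scala
object TokenizerSketch {
  // One regex alternative per token; findAllIn walks the input iteratively
  // and skips characters (like whitespace) that match no alternative.
  private val token = """\(|\)|\+|-|\*|/|\d+""".r

  def tokenize(s: String): List[String] =
    token.findAllIn(s).toList  // no recursion: nesting depth is irrelevant here
}
```

Tokenizing a million nested parentheses is no problem at all — the trouble only starts once the parser has to match them up.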

Solution

I don't think you can solve this easily. Parsers have to use some kind of stack data structure, and usually the built-in call stack is used for performance reasons. You would have to use a stack structure on the heap instead. That is usually much slower and is not compatible with the Scala parser combinators.

You could try to rewrite "factor" in continuation-passing style (CPS). The calls would then be tracked on the heap instead of on the stack.
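
As a sketch of that idea (using the standard library's scala.util.control.TailCalls trampoline rather than the parser combinators themselves), a recursive check for nested parentheses can be made stack-safe: each recursive step is deferred into a TailRec value that lives on the heap, and the `result` driver loop evaluates the chain with constant stack depth.

```scala
import scala.util.control.TailCalls._

object HeapStackParen {
  // Same shape as a recursive-descent "factor" rule, but each recursive
  // step becomes a TailRec value on the heap instead of a JVM stack frame.
  def parse(s: String, i: Int): TailRec[Int] =
    if (i < s.length && s(i) == '(')
      tailcall(parse(s, i + 1)).flatMap { j =>
        if (j >= 0 && j < s.length && s(j) == ')') done(j + 1) else done(-1)
      }
    else done(i)

  // result runs the trampoline loop; nesting depth costs heap, not stack
  def balanced(s: String): Boolean = parse(s, 0).result == s.length
}
```

A nesting depth that would overflow plain recursion is handled fine here, at the cost of allocating one closure per level — which is also why such heap-based schemes are slower than using the built-in stack.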

Out of the box there is no simple solution for this.

Licensed under: CC-BY-SA with attribution