Question

I need to parse files that have millions of lines. I've noticed that my combinator parser gets slower and slower as it parses more and more lines. The problem seems to lie in Scala's "rep" or regex parsers, because this behaviour occurs even for the simple example parser shown below:

    def file: Parser[Int] = rep(line) ^^^ 1  // a file is a repetition of lines

    def line: Parser[Int] = """(?m)^.*$""".r ^^^ 0 // reads a line and returns 0

When I parse a file with millions of equal-length lines using this simple parser, it starts out at 46 lines/ms. After 370,000 lines the speed drops to 20 lines/ms; after 840,000 lines it is down to 10 lines/ms; after 1,790,000 lines, 5 lines/ms...

My questions are:

  • Why does this happen?

  • What can I do to prevent this?


Solution

This is probably a result of the change in Java 7u6: since then, a substring no longer shares the original string's backing character array but gets its own copy. So big strings get copied over and over, causing lots and lots of memory churn (among other things). And as the amount of stuff you've parsed grows (I'm assuming you're storing at least some of it), the garbage collector has more and more work to do, so creating all that extra garbage carries a steeper and steeper penalty.
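To see why this gets quadratic: each regex match works on the unconsumed remainder of the input, and if obtaining that remainder copies it (as String.subSequence does since 7u6), the copying cost grows with every token matched. A rough self-contained sketch (mine, purely illustrative; not the library's actual code):

    // Mimics a parser that re-slices the unconsumed input at every token.
    // On Java 7u6+ each String.subSequence is a full copy, so total work
    // grows quadratically in the size of the input.
    object QuadraticCopyDemo {
      def main(args: Array[String]): Unit = {
        val input  = "x" * 100000 // 100,000 characters
        var offset = 0
        var copied = 0L
        while (offset < input.length) {
          val rest = input.subSequence(offset, input.length) // full copy on 7u6+
          copied += rest.length()
          offset += 40 // pretend each matched line is 40 characters long
        }
        println(s"characters copied: $copied") // ~125 million for 100,000 chars
      }
    }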

There is a ticket to fix the memory usage, and code from Zach Moazeni there that lets you wrap your strings in a construct that takes substrings properly, i.e. as views onto the original characters rather than copies (you can pass the wrapper into the parser in place of a plain string).
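The idea behind that construct is roughly the following (a minimal sketch in the spirit of the code on the ticket; the class name and details here are mine, and bounds checks are omitted for brevity):

    // A CharSequence whose subSequence returns a view over the same
    // character array instead of a copy, so slicing stays O(1).
    class FastCharSequence(chars: Array[Char], start: Int, end: Int) extends CharSequence {
      def this(s: String) = this(s.toCharArray, 0, s.length)
      def length(): Int = end - start
      def charAt(i: Int): Char = chars(start + i)
      def subSequence(from: Int, to: Int): CharSequence =
        new FastCharSequence(chars, start + from, start + to) // view, no copy
      override def toString: String = new String(chars, start, length())
    }

Since RegexParsers.parseAll accepts any java.lang.CharSequence, you can pass new FastCharSequence(input) to the parser instead of the raw string.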

This won't necessarily change the overall result (parsing may still slow down eventually), but it should help reduce the total time considerably.

Also, I wouldn't advise making a file be a repetition of lines. That forces the parser to keep track of the entire file when it really doesn't need to. I'd feed it a line at a time, as sketched below. (And then, if the lines are short, you may not even need the above fix.)
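For example, something along these lines (a sketch; the object name, per-line grammar, and file name are all illustrative):

    import scala.io.Source
    import scala.util.parsing.combinator.RegexParsers

    object LineByLine extends RegexParsers {
      // The per-line grammar; no (?m)^...$ anchors needed, since each
      // string handed to the parser is already exactly one line.
      def record: Parser[Int] = """.*""".r ^^^ 0

      def main(args: Array[String]): Unit = {
        val source = Source.fromFile("data.txt")
        try {
          for (line <- source.getLines()) {
            parseAll(record, line) match {
              case Success(result, _) => () // use the parsed result here
              case failure: NoSuccess => sys.error(failure.msg)
            }
          }
        } finally source.close()
      }
    }

Because the parser only ever sees one short string at a time, nothing large gets re-sliced or retained between lines.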

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow