Performance of uu-parsinglib compared to "try" in Parsec

Question 1

To understand the subtleties it is important to understand the differences between the try construct in Haskell and the non-greedy parsing strategy used in uu-parsinglib. Effectively the latter is a try which just looks ahead one symbol. In that respect it is less powerful than the try construct from Parsec, in which you specify that a specific construct has to be present completely. And then there is the underlying different overall strategy. Parsec uses a back-tracking strategy with explicit tries to commit, whereas uu-parsinglib uses a breadth-first strategy with an occasional single symbol look-ahead.

So it does not come as a surprise that there is a time difference between the two. In the Parsec case it is decided that the tried alternative does apply after having seen the complete construct (two symbols), whereas the greedy uu-parsinglib decides that this must be the right alternative after having successfully seen the first symbol. And this conclusion may be unjustified.

If one moves to the breadth-first strategy uu-parsinglib uses one has to keep track of several alternatives at the same time, and this take time. Two alternative, twice the time, etc.

The advantage of Parsec is that you can tune a back-tracking parser by liberal use of try constructs and by placing alternatives in the right order, but you are also more likely to ask questions on mailing list about why your parsers do not work as expected. You are not so much writing a grammar as writing parser. The uu-parsinglib starts from the other end of the spectrum: we try to accept quite a large collection of grammars (and the parsers implied by them).

My feeling is also that in the presence of try constructs having excellent error-repairing parsers is much more complicated. Once a try construct fails it is not possible to go back there and decide that, with a small correction, it is much better than the alternatives that come after it.

Question 2

The behavior you're describing (using pList_ng) does apply to other parsers (such as the simple list-of-successes method described in, for example, Jeroen Fokker's Functional Parsers) combinators), but not to uu-parsinglib. The library uses a breadth-first strategy to avoid space leaks (as a result of hanging on to the entire input, as would be the case when using a depth-first strategy). That is why I asked whether you had either created a test case or looked at the internals at all.

For a more technical description, see the paper in Text.ParserCombinators.UU.README (and maybe the source code after that). Here I'll use pExample2 to sketch the parsing process. Branching happens in pList_ng (in pTest), which uses <|> to recognize either the empty string or another element. Because pTest is followed by pa, instead of the empty string, the alternative to parsing another element is actually recognizing a single 'a'.

When we see the first 'a' in the input, both alternatives can successfully parse this character. Next, we see a 'b'. Now the alternative that parses only a single 'a' cannot make any further progress, so we drop that one. There is one alternative left: the one that recognizes (a list of) 'a' followed by 'b' (pTest). Next up is another 'a', and there are again two alternatives to consider. But then we see a 'b' and, again, we can immediately drop the second alternative. Then there is one last characer, an 'a', which once more means two alternatives. But now we get to the end of the string and only the alternative obtained by letting pa recognize the final a leads to a succesful parse.

Considering the alternative input "ababab", we see that the pa alternative fails again when we get to the final 'b', so only the pTest alternative remains. That one finishes because we get to the end and then pa (following pTest in pExample2) fails, so we get an error.

At any point, uu-parsinglib only needs to keep alternatives in memory that have not yet failed, and the breadth-first strategy ensures that all alternatives are evaluated in lockstep, so there is no need to hang on to the first 'a' and 'b' until the end of the string is reached and it does not match the whole string first before backtracking.

Edit:

From what I gather about Parsec, this is indeed less efficient, because Parsec does not consider pa until pTest finishes. Section 5.1 in the paper says a few words about something like try for uu-parsinglib, including some objections. Raw speed is not a primary goal and I once saw a presentation of some benchmarks where uu-parsinglib also did not come out on top, but overall it performed reasonably well. If speed is so important for this compiler you mentioned and if you don't need any extra features such as online results or error correction, maybe you should just stick to Parsec? (Or look for more comprehensive benchmarks first.)

There are clearly significant differences between the two libraries, so I'm not surprised to see that Parsec is faster in some cases, although the difference in this case is indeed pretty big. Maybe there is a way to tweak the uu-parsinglib version further (without changing internals as hinted at by that section about greedy parsing in the paper), but it's not obvious (to me anyway).

Well, one thing you could do is rewrite the grammar:

pTest' = pList $ pa *> pList ((\x y -> [x,y]) <$$> pb <*> pa)

For Parsec that would the following I think (but this seems to make it slower):

pTest' = many $ pa *> many (flip (\x y -> [x,y]) <$> pb <*> pa)

This helps, but not enough to beat the Parsec version. Using your benchmark, I get the following results:

 uu-parsinglib (pTest)  - 1.98s
 uu-parsinglib (pTest') - 1.11s
 Parsec (pTest)         - 0.59s
 Parsec (pTest')        - 0.67s