Question

I'm trying to augment Haskell's Attoparsec parser library with a function

takeRegex :: Regex -> Parser ByteString

using one of the regexp implementations.

(Motivation: Good regex libraries can provide performance that is linear to the length of the input, while attoparsec needs to backtrack. A portion of my input is particularly amenable to parsing using regexps, and even the backtracking Text.Regex.PCRE library gets me 4x speedup over attoparsec code for that piece.)

Attoparsec used to have a getInput :: Parser ByteString function to get (without consuming) the remaining input; that would probably have been quite perfect for my purposes, as my input is non-incremental, strict and reasonably small – I run the parser on one line of a log file at a time. With it, it seems I could have done something like

takeRegex re = do
  input <- getInput
  m <- matchM re input   -- matchM from regex-base; fails in Parser on no match
  take (B.length m)      -- attoparsec's take; B is Data.ByteString

Unfortunately recent versions of attoparsec lack this function. Is there some way to achieve the same? Why has the function been removed?

Now there is the takeByteString :: Parser ByteString function, which takes and consumes the rest of the input. If there were a function to attempt a parse and return its result without actually consuming anything, it could be used in conjunction with takeByteString, but I cannot seem to find (or figure out how to implement) such a function either.
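For what it's worth, recent attoparsec versions export lookAhead from Data.Attoparsec.Combinator, which runs a parser and then rewinds, so lookAhead takeByteString can stand in for the old getInput. A minimal sketch (getInput' is an illustrative name, not part of the library):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8
import qualified Data.Attoparsec.Combinator as C
import qualified Data.ByteString.Char8 as B

-- Peek at the remaining input without consuming any of it.
getInput' :: Parser B.ByteString
getInput' = C.lookAhead takeByteString

main :: IO ()
main =
  -- consume "ab", then peek at what is left; the leftover stays unconsumed
  print (parseOnly (string "ab" *> getInput') "abcd")
```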

Is there some way to achieve this with the current version of attoparsec?


Solution

There are a few solutions to this, but none are great....


Method 1- Fast to implement, but not so fast to run

Well, (according to http://hackage.haskell.org/package/attoparsec-0.10.1.1/docs/Data-Attoparsec-ByteString.html), attoparsec always backtracks on failure, so you can always do something like this-

parseLine1 = do
  line <- takeTill (== '\n')
  _ <- char '\n'
  case <some sort of test on line, e.g. a regex> of
    Just result -> return <some sort of data type>
    Nothing     -> fail "Parse Error"

Then many of these chained together will work as expected:

parseLine = parseLine1 <|> parseLine2

The problem with this solution is, as you can see, you are still doing a bunch of backtracking, which can really slow things down.
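Concretely, here is a runnable sketch of this pattern. A plain prefix check stands in for the "regex test", and Entry, parseError, and parseInfo are illustrative names:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as B

data Entry = Error B.ByteString | Info B.ByteString
  deriving (Show, Eq)

-- Take a whole line, then test it; on failure attoparsec backtracks,
-- so the next alternative re-reads the same line.
parseError :: Parser Entry
parseError = do
  line <- takeTill (== '\n')
  _ <- char '\n'
  if "ERROR" `B.isPrefixOf` line
    then return (Error line)
    else fail "not an ERROR line"

parseInfo :: Parser Entry
parseInfo = do
  line <- takeTill (== '\n')
  _ <- char '\n'
  return (Info line)

parseEntry :: Parser Entry
parseEntry = parseError <|> parseInfo

main :: IO ()
main = do
  print (parseOnly parseEntry "ERROR disk full\n")
  print (parseOnly parseEntry "all good\n")
```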


Method 2- The traditional method

The usual way to handle this type of thing is to rewrite the grammar, or in the case of a parser combinator, move stuff around, to make the full algorithm need only one character of lookahead. This can almost always be done in practice, although it sometimes makes the logic much harder to follow....

For example, suppose you have a grammar production rule like this-

pet = "dog" | "dolphin"

This would need three characters of lookahead before either path could be resolved ("dog" and "dolphin" differ only at the third character). Instead you can left factor the whole thing like this

pet => "do" halfpet
halfpet => "g" | "lphin"

No backtracking over the shared prefix is needed, but the grammar is much uglier. (Although I wrote this as a production rule, there is a one-to-one mapping to a similar parser combinator.)
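That mapping to combinators might look like the following sketch (pet and halfpet mirror the production rules above; the names are illustrative):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as B

-- Left-factored grammar: the shared prefix "do" is consumed exactly once,
-- and the alternatives in halfpet diverge on their first character.
pet :: Parser B.ByteString
pet = do
  _ <- string "do"
  rest <- halfpet
  return ("do" <> rest)

halfpet :: Parser B.ByteString
halfpet = string "g" <|> string "lphin"

main :: IO ()
main = do
  print (parseOnly pet "dog")
  print (parseOnly pet "dolphin")
```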


Method 3- The correct way, but involved to write.

The true way that you want to do this is to directly compile a regex to a parser combinator.... Once you compile any regular language, the resulting algorithm only ever needs one character of lookahead, so the resulting attoparsec code should be pretty simple (like the routine in method 1, but reading a single character at a time); the real work will be in compiling the regex.

Compiling a regex is covered in many textbooks, so I won't go into detail here, but it basically amounts to replacing all the ambiguous paths in the regex state machine with new states. Or to put it differently, it automatically "left factors" all the cases that would need backtracking.
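As a toy illustration of the "compile a regex to a combinator" idea, a tiny regex AST can be translated into an attoparsec parser. Note this naive translation is not the real NFA-to-DFA compilation the answer describes: Alt still backtracks. The Rx type and compile function are made up for the sketch:

```haskell
import Control.Applicative ((<|>), many)
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as B

-- A minimal regex AST: literals, sequencing, alternation, Kleene star.
data Rx = Lit Char | Seq Rx Rx | Alt Rx Rx | Star Rx

-- Naive translation: each constructor maps to one combinator.
compile :: Rx -> Parser String
compile (Lit c)   = (: []) <$> char c
compile (Seq a b) = (++) <$> compile a <*> compile b
compile (Alt a b) = compile a <|> compile b        -- still backtracks!
compile (Star a)  = concat <$> many (compile a)

main :: IO ()
main = print (parseOnly (compile re) (B.pack "abbb"))
  where re = Seq (Lit 'a') (Star (Lit 'b'))        -- the regex ab*
```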

(I once wrote a library that automatically "left factors" many cases in context free grammars, turning almost any context free grammar into a linear parser, but I haven't yet made it available.... some day, when I have cleaned it up, I will.)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow