Using PEG Parser for BBCode Parsing: pegjs or … what?

https://stackoverflow.com/questions/11268376

18-06-2021
|

Question

I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.

My specific questions are:

As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible because I am relying on creating the HTML from a parse tree and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:

[
       [
          "[",
          "img",
          "",
          [
             "/",
             "f",
             "o",
             "o",
             "/",
             "b",
             "a",
             "r"
          ],
          "",
          "]"
       ],
       [
          "[/",
          "img",
          "]"
       ]
    ]

given a grammar like:

document
   = (open_tag / close_tag / new_line / text)*

open_tag
   = ("[" tag_name "="? tag_data? tag_attributes? "]")


close_tag
   = ("[/" tag_name "]") 

text
   = non_tag+

non_tag
   = [\n\[\]]

new_line
   = ("\r\n" / "\n")

I'm abbreviating the grammar, of course, but you get the idea. So, if you notice, there is no contextual information in the array of arrays that tells me what kind of a node I have and I'm left to do the string comparisons again even thought the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.

Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?

Thanks

Solution

Regarding your first question I have tosay that a live preview is going to be difficult. The problems you pointed out regarding that the parser won't understand that the input is "work in progress" are correct. Peg.js tells you at which point the error is, so maybe you could take that info and go a few words back and parse again or if an end tag is missing try adding it at the end.

The second part of your question is easier but your grammar won't look so nice afterwards. Basically what you do is put callbacks on every rule, so for example

text
   = text:non_tag+ {
     // we captured the text in an array and can manipulate it now
     return text.join("");
   }

At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pullrequest to peg.js to fix that. But I'm not sure when I find the time to do this.

OTHER TIPS

First question (grammar for incomplete texts):

You can add

incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
//                         the closing bracket is omitted ---^

after open_tag and change document to include an incomplete tag at the end. The trick is that you provide the parser with all needed productions to always parse, but the valid ones come first. You then can ignore incomplete_tag during the live preview.

Second question (how to include actions):

You write socalled actions after expressions. An action is Javascript code enclosed by braces and are allowed after a pegjs expression, i. e. also in the middle of a production!

In practice actions like { return result.join("") } are almost always necessary because pegjs splits into single characters. Also complicated nested arrays can be returned. Therefore I usually write helper functions in the pegjs initializer at the head of the grammar to keep actions small. If you choose the function names carefully the action is self-documenting.

For an examle see PEG for Python style indentation. Disclaimer: this is an answer of mine.

Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.

text = result:non_tag+ { return result.join(''); }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow