Markdown blockquote parsing with ANTLR

https://stackoverflow.com/questions/2046080

20-09-2019

Question

This has been bothering me for a while. How would one go about parsing the following Markdown into the HTML below using ANTLR? I can't seem to wrap my head around it.

Any ideas?

Markdown:

> first line
> second line
> > nested quote

Output HTML:

<blockquote>
  <p>first line
  second line</p>
  <blockquote>
    <p>nested quote</p>
  </blockquote>
</blockquote>

Solution

Funny that you mention that because I was tackling just this problem last week. See JMD, Markdown and a Brief Overview of Parsing and Compilers. I'm working on a true Markdown parser and I tried it with ANTLR.

There are a couple of ways you can deal with this.

First, you could simply match the quote marker with a lexer rule:

BLOCK_QUOTE : '>' (' ' | '\t')? ;

and work it out in the parsing step, possibly as a rewrite rule.

The thing is, these markers only matter when they appear at the beginning of a line, so here is another approach:

@members {
  // Quote depth of the previous line
  int quoteDepth = 0;
}

BLOCK_QUOTE : '\n' (q+='>' (' ' | '\t')?)+
  {
    if ($q.size() > quoteDepth) { /* emit one or more START_QUOTE tokens */ }
    else if ($q.size() < quoteDepth) { /* emit one or more END_QUOTE tokens */ }
    quoteDepth = $q.size();
  }
  ;

The above may need to be a parser rule rather than a lexer rule, too. I forget.
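To make the depth-tracking idea concrete outside of ANTLR, here is a minimal plain-Java sketch. It is not the answer's grammar and uses no ANTLR API: the class `QuoteDepthSketch`, its methods, and the `TEXT(...)` marker are hypothetical names chosen for illustration, while `START_QUOTE` and `END_QUOTE` mirror the token names in the comments above. It assumes the input has already been split into lines, which is exactly the line-oriented view this approach forces on you.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of ANTLR: maps '>' prefixes to quote markers. */
final class QuoteDepthSketch {

    /** Index of the first character after the leading '>' markers and their optional space or tab. */
    static int prefixEnd(String line) {
        int i = 0;
        while (i < line.length() && line.charAt(i) == '>') {
            i++;
            if (i < line.length() && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) {
                i++;
            }
        }
        return i;
    }

    /** Number of '>' markers in the line's prefix, i.e. its quote depth. */
    static int depthOf(String line) {
        int depth = 0;
        int end = prefixEnd(line);
        for (int i = 0; i < end; i++) {
            if (line.charAt(i) == '>') {
                depth++;
            }
        }
        return depth;
    }

    /** Emits one marker per level of depth change, then the line's text without its prefix. */
    static List<String> tokenize(List<String> lines) {
        List<String> out = new ArrayList<>();
        int quoteDepth = 0; // depth of the previous line, as in the grammar's @members
        for (String line : lines) {
            int depth = depthOf(line);
            for (; quoteDepth < depth; quoteDepth++) {
                out.add("START_QUOTE");
            }
            for (; quoteDepth > depth; quoteDepth--) {
                out.add("END_QUOTE");
            }
            out.add("TEXT(" + line.substring(prefixEnd(line)) + ")");
        }
        // Close any quotes still open at end of input.
        for (; quoteDepth > 0; quoteDepth--) {
            out.add("END_QUOTE");
        }
        return out;
    }
}
```

For the three Markdown lines in the question, `tokenize` yields START_QUOTE, TEXT(first line), TEXT(second line), START_QUOTE, TEXT(nested quote), END_QUOTE, END_QUOTE, which lines up with the nesting of the desired HTML.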

But even this is unsatisfying because it sort of forces you to treat the Markdown source as a sequence of lines, which isn't really what you want in other parts.

Also, each lexer rule can normally produce only one token, so you have to override part of another class (its name escapes me) to allow a rule to emit multiple tokens. There is an example of this in the (excellent and almost required) The Definitive ANTLR Reference: Building Domain-Specific Languages.
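The usual shape of that workaround is a queue of pending tokens: a rule pushes as many tokens as it needs, and the "next token" call drains the queue before matching anything new. The following is a plain-Java sketch of that pattern only. `QueueingLexerSketch`, its `emit` and `nextToken` methods, and the string tokens `LINE` and `EOF` are hypothetical stand-ins, not ANTLR's classes, and the constructor takes precomputed per-line quote depths to keep the example short. My recollection is that in ANTLR 3 the equivalent is done by overriding `emit` and `nextToken` in the generated lexer, but treat that as an assumption and check the book.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a lexer that may emit several tokens per rule; not the ANTLR API. */
final class QueueingLexerSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final int[] lineDepths; // assumed input: quote depth of each source line
    private int next = 0;           // index of the next line to match
    private int quoteDepth = 0;     // depth of the previously matched line

    QueueingLexerSketch(int... lineDepths) {
        this.lineDepths = lineDepths;
    }

    /** Buffers a token instead of returning it, so a single rule can produce many. */
    private void emit(String token) {
        pending.addLast(token);
    }

    /** The "rule": emits one marker per level of depth change, then a LINE token. */
    private void matchLine(int depth) {
        for (; quoteDepth < depth; quoteDepth++) {
            emit("START_QUOTE");
        }
        for (; quoteDepth > depth; quoteDepth--) {
            emit("END_QUOTE");
        }
        emit("LINE");
    }

    /** Drains buffered tokens first and runs the rule only when the buffer is empty. */
    String nextToken() {
        while (pending.isEmpty()) {
            if (next < lineDepths.length) {
                matchLine(lineDepths[next++]);
            } else if (quoteDepth > 0) {
                // End of input: close every quote that is still open.
                for (; quoteDepth > 0; quoteDepth--) {
                    emit("END_QUOTE");
                }
            } else {
                return "EOF";
            }
        }
        return pending.removeFirst();
    }
}
```

With depths 1, 1, 2 (the question's three lines), successive `nextToken()` calls return START_QUOTE, LINE, LINE, START_QUOTE, LINE, END_QUOTE, END_QUOTE and then EOF. The consumer still sees one token per call, which is why the queue has to sit between the rule and the caller.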

Ultimately I abandoned ANTLR as the tool of choice for this. My own hand-coded solution should hopefully be appearing in the next week or two.

Licensed under: CC-BY-SA with attribution
