Matching tag pairs in a Treetop grammar

https://stackoverflow.com/questions/4150860

08-10-2019

Question

I don't want a repeat of the Cthulhu answer, but I want to match up pairs of opening and closing HTML tags using Treetop. With this grammar, I can match opening tags and closing tags individually, but now I want a rule that ties them together. I've tried the following, but it makes my parser run forever (infinite loop):

rule html_tag_pair
  html_open_tag (!html_close_tag (html_tag_pair / '' / text / newline /
    whitespace))+ html_close_tag <HTMLTagPair>
end

I based this on the recursive parentheses example and the negative lookahead example on the Treetop GitHub page. The other rules I reference are as follows:

rule newline
  [\n\r] {
    def content
      :newline
    end
  }
end

rule tab
  "\t" {
    def content
      :tab
    end
  }
end

rule whitespace
  (newline / tab / [\s]) {
    def content
      :whitespace
    end
  }
end

rule text
  [^<]+ {
    def content
      [:text, text_value]
    end
  }
end

rule html_open_tag
  "<" html_tag_name attribute_list ">" <HTMLOpenTag>
end

rule html_empty_tag
  "<" html_tag_name attribute_list whitespace* "/>" <HTMLEmptyTag>
end

rule html_close_tag
  "</" html_tag_name ">" <HTMLCloseTag>
end

rule html_tag_name
  [A-Za-z0-9]+ {
    def content
      text_value
    end
  }
end

rule attribute_list
  attribute* {
    def content
      elements.inject({}){ |hash, e| hash.merge(e.content) }
    end
  }
end

rule attribute
  whitespace+ html_tag_name "=" quoted_value {
    def content
      {elements[1].content => elements[3].content}
    end
  }
end

rule quoted_value
  ('"' [^"]* '"' / "'" [^']* "'") {
    def content
      elements[1].text_value
    end
  }
end

I know I'll need to allow for matching single opening or closing tags, but if a pair of HTML tags exists, I'd like to get them together as a pair. It seemed cleanest to do this by matching them with my grammar, but perhaps there's a better way?

Solution

You can only do this in one of two ways: with a separate rule for each HTML tag pair, or with semantic predicates (sempreds). With sempreds, you save the opening tag in one predicate, then accept a closing tag in another predicate only if it names the same tag. This is much harder to do in Treetop than it should be, because there's no convenient place to save the context and you can't peek up the parser stack, but it is possible.

BTW, the same problem occurs in parsing MIME boundaries (and in Markdown). I haven't checked Mikel's implementation in ActionMailer (he probably uses a nested MIME parser for that), but it is possible in Treetop.

In http://github.com/cjheath/activefacts/blob/master/lib/activefacts/cql/parser.rb I save context in a fake input stream - you can see what methods it has to support - because "input" is available on all SyntaxNodes. I have a different kind of reason for using sempreds there, but some of the techniques are applicable.
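
The save-then-compare idea from this answer can be sketched outside Treetop in plain Ruby. This is only an illustration of the logic, not the answerer's implementation: the TagPairParser class and its Node struct are hypothetical names, and the sketch assumes tags without attributes.

```ruby
require 'strscan'

# Hypothetical helper, not part of Treetop: a hand-written recursive parser
# that accepts a closing tag only if its name equals the saved opening name.
class TagPairParser
  Node = Struct.new(:name, :children)

  def initialize(source)
    @scanner = StringScanner.new(source)
  end

  # Returns an array of text strings and Nodes, or nil if the input
  # contains an unmatched, mismatched, or malformed tag.
  def parse
    nodes = parse_content
    return nil if nodes.nil? || !@scanner.eos?
    nodes
  end

  private

  # Content is any mix of text and tag pairs, up to a closing tag or the end.
  def parse_content
    nodes = []
    until @scanner.eos? || @scanner.check(%r{</})
      if (text = @scanner.scan(/[^<]+/))
        nodes << text
      else
        pair = parse_pair
        return nil if pair.nil?
        nodes << pair
      end
    end
    nodes
  end

  def parse_pair
    return nil unless @scanner.scan(/<([A-Za-z0-9]+)>/)
    open_name = @scanner[1]                      # save the opening tag name
    children = parse_content
    return nil if children.nil?
    return nil unless @scanner.scan(%r{</([A-Za-z0-9]+)>})
    return nil unless @scanner[1] == open_name   # plays the role of the predicate
    Node.new(open_name, children)
  end
end
```

For example, TagPairParser.new("<a>hi<b>x</b></a>").parse returns a single Node named "a" whose children are "hi" and a Node named "b", while TagPairParser.new("<a>hi</b>").parse returns nil because the closing name differs from the saved one.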

Other tips

Here is a really simple grammar that uses a semantic predicate to match the closing tag to the starting tag.

grammar SimpleXML
  rule document
    (text / tag)*
  end

  rule text
    [^<]+
  end

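  # In the predicate, seq holds the elements matched so far:
  # seq[1] is the opening tag's name, seq[5] is the closing tag's name.
  rule tag
    "<" [^>]+ ">" (text / tag)* "</" [^>]+ &{|seq| seq[1].text_value == seq[5].text_value } ">"
  end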
end
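
One detail worth checking in the question's rule: it lists '' among the alternatives inside ( ... )+, whereas the grammar above has no empty alternative. An empty string matches without consuming any input, so a repetition wrapped around it can keep succeeding at the same position, which would explain the reported infinite loop. The plain-Ruby model below is a sketch of that behaviour, assuming the repetition retries for as long as its sub-expression succeeds; it is not Treetop's generated code, and one_or_more, choice, EMPTY, TEXT, and the limit guard are hypothetical names introduced here.

```ruby
# Simplified model of a PEG one-or-more loop (an assumption, not Treetop code).
# A matcher returns the number of characters consumed, or nil on failure.
def one_or_more(input, matcher, limit: 50)
  index = 0
  iterations = 0
  while (consumed = matcher.call(input, index))   # 0 is truthy in Ruby
    index += consumed
    iterations += 1
    return :stuck if iterations >= limit          # guard so the demo terminates
  end
  iterations.zero? ? nil : index
end

# Ordered choice: the first alternative that succeeds wins.
def choice(*alternatives)
  lambda do |input, index|
    found = nil
    alternatives.each do |alt|
      found = alt.call(input, index)
      break if found
    end
    found
  end
end

# Models '': always succeeds and consumes nothing.
EMPTY = ->(_input, _index) { 0 }

# Models [^<]+: consumes a run of non-'<' characters, or fails.
TEXT = lambda do |input, index|
  match = input[index..-1][/\A[^<]+/]
  match && match.length
end

one_or_more("hello", choice(TEXT))          # => 5
one_or_more("hello", choice(EMPTY, TEXT))   # => :stuck
one_or_more("hello", choice(TEXT, EMPTY))   # => :stuck
```

In this model, moving the empty alternative to the end does not help, because it still succeeds once the other alternatives fail. Dropping '' from the question's rule (and using * instead of + if empty elements should be allowed) avoids the zero-width match.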

Licensed under: CC-BY-SA with attribution
