Treetop infinite loop when parsing Latex document

https://stackoverflow.com/questions/14313346

ruby
treetop

15-01-2022

Question

I'm trying to write a parser with Treetop that turns some LaTeX commands into HTML markup. With the grammar below, the generated code goes into an infinite loop. I've built the parser source with tt and stepped through it, but that doesn't really show what the underlying issue is (it just spins in _nt_paragraph).

Test input: "\emph{hey} and some more text."

grammar Latex
  rule document
    (paragraph)* {
      def content
        [:document, elements.map { |e| e.content }]
      end
    }
  end

  # Example: These aren't the \emph{droids you're looking for} \n\n.
  rule paragraph
    ( text / tag )* eop {
      def content
        [:paragraph, elements.map { |e| e.content } ]
      end
    }
  end

  rule text
    ( !( tag_start / eop) . )* {
      def content
        [:text, text_value ]
      end
    }
  end

  # Example: \tag{inner_text}
  rule tag
    "\\emph{" inner_text '}' {
      def content
        [:tag, inner_text.content]
      end
    }
  end 

  # Example: \emph{inner_text}
  rule inner_text
    ( !'}' . )* {
      def content
        [:inner_text, text_value]
      end
    }
  end

  # End of paragraph.
  rule eop
    newline 2.. {
      def content
        [:newline, text_value]
      end
    }
  end

  rule newline
    "\n"
  end

  # You know, what starts a tag
  rule tag_start
    "\\"
  end

end

Solution

For anyone curious, Clifford over at the Treetop dev Google Group figured this out.

The problem was with paragraph and text.

The text rule matches 0 or more characters, and a paragraph can contain 0 or more texts. A text can therefore succeed while consuming nothing, so the parser kept matching an unbounded number of zero-length texts before the first \n without ever advancing, which is the dead spin. The fix was to change text to:

( !( tag_start / eop) . )+

So that it must have at least one character to match.
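
To see why the zero-length match matters, here is a minimal plain-Ruby sketch of the same situation. It is not Treetop's generated code: match_text and repeat_text are hypothetical helpers that only imitate the text rule and an outer ( text )* loop, and the iteration cap exists purely so the demonstration terminates.

```ruby
# Hypothetical model of the `text` rule: consume characters until a backslash
# (tag_start) or a double newline (eop). `min` is 0 for the `*` version and
# 1 for the `+` version. Returns the new position, or nil if the rule fails.
def match_text(input, pos, min)
  start = pos
  pos += 1 while pos < input.length && input[pos] != "\\" && input[pos, 2] != "\n\n"
  (pos - start) >= min ? pos : nil
end

# Hypothetical model of the outer `( text )*` loop. A real parser has no cap;
# the cap is an assumption added here so the stalled case can be reported.
def repeat_text(input, pos, min, cap = 1000)
  iterations = 0
  while (nxt = match_text(input, pos, min))
    iterations += 1
    return :stuck if iterations >= cap # succeeded `cap` times without finishing
    pos = nxt
  end
  pos
end

input = "\\emph{hey} and some more text."
puts repeat_text(input, 0, 0).inspect # `*` version: prints :stuck
puts repeat_text(input, 0, 1).inspect # `+` version: prints 0
```

With min set to 0, the inner rule succeeds at position 0 without consuming the backslash, so the outer loop keeps succeeding at the same position. With min set to 1, the inner rule fails there, the loop ends, and in the real grammar the parser would go on to try the tag alternative.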

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow