recognize Ruby code in Treetop grammar

https://stackoverflow.com/questions/4054761

27-09-2019

Question

I'm trying to use Treetop to parse an ERB file. I need to be able to handle lines like the following:

<% ruby_code_here %>
<%= other_ruby_code %>

Since Treetop is written in Ruby, and you write Treetop grammars in Ruby, is there already some existing way in Treetop to say "hey, look for Ruby code here, and give me its breakdown" without me having to write out separate rules to handle all parts of the Ruby language? I'm looking for a way, in my .treetop grammar file, to have something like:

rule erb_tag
  "<%" ruby_code "%>" {
    def content
      ...
    end
  }
end

Where ruby_code is handled by some rules that Treetop provides.

Edit: someone else parsed ERB using Ruby-lex, but I got errors trying to reproduce what he did. The rlex program did not produce a full class when it generated the parser class.

Edit: right, so you lot are depressing, but thanks for the info. :) For my Master's project, I'm writing a test case generator that needs to work with ERB as input. Fortunately, for my purposes, I only need to recognize a few things in the ERB code, such as if statements and other conditionals as well as loops. I think I can come up with Treetop grammar to match that, with the caveat that it isn't complete for Ruby.

Solution

As far as I know, nobody has yet created a Treetop grammar for Ruby. (In fact, nobody has ever been able to create any grammar for Ruby other than the YACC grammar that ships with MRI and YARV.) I know that the author of Treetop has been working on one for several years, but it's not a trivial undertaking. Getting the ANTLR grammar which is used in XRuby right took about 5 years, and it is still not fully compliant.

Ruby's syntax is insanely, mindbogglingly complex.

OTHER TIPS

No

I don't think so. Specifying the complex and subtle Ruby grammar in Treetop would be a major accomplishment, but it should be possible.

The actual Ruby grammar is written in yacc. Now, yacc is a legendary tool, but Treetop generates a more powerful class of parsers, so it should be possible, and perhaps someone has done it.

It's not an afternoon project.

Maybe I'm kidding, but if yacc is less complex than Ruby, then you could implement yacc in Treetop and have it use the Ruby grammar that was written for yacc.

For your purposes, you can probably get away without parsing all of Ruby. What you actually need is a way to detect the %> that closes off a Ruby block. If you never want to fail when the Ruby code itself contains those closing characters, you must detect every place they can occur inside the Ruby text, which means you need to detect all forms of literals.

However, for your purposes you can probably get away with recognising the most likely cases where %> would occur in Ruby text, and skipping over just those cases. This assumes, of course, that any remaining failure can be handled by getting your user to write the ERB a little differently.
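To make that concrete, here is a minimal sketch in plain Ruby (not Treetop) of the "skip the likely cases" idea. The method name extract_erb_tags is made up for illustration, and it assumes the only literals worth skipping are single- and double-quoted strings. Heredocs, %q{} strings, regexes and ERB comment tags are deliberately not handled, so treat it as a starting point rather than a complete solution.

```ruby
require 'strscan'

# Hypothetical helper: returns [kind, ruby_source] pairs for each ERB tag.
# kind is :output for <%= ... %> and :code for <% ... %>.
# Assumption: only '...' and "..." literals can hide a %> sequence.
def extract_erb_tags(text)
  scanner = StringScanner.new(text)
  tags = []
  while scanner.skip_until(/<%/)
    kind = scanner.scan(/=/) ? :output : :code
    body = +""
    until scanner.eos?
      if scanner.scan(/%>/)
        # Real closing delimiter: record the tag and look for the next one.
        tags << [kind, body.strip]
        break
      elsif (literal = scanner.scan(/"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/m))
        # Quoted string: copy it whole so a %> inside it is not seen as a close.
        body << literal
      else
        body << scanner.getch
      end
    end
    # A tag that is never closed is silently dropped in this sketch.
  end
  tags
end

tags = extract_erb_tags(%q{<p><%= "100 %> done" %></p><% if x %>yes<% end %>})
tags.each { |kind, source| puts "#{kind}: #{source}" }
```

Once you have the Ruby source of each tag as a string, you can match just the constructs you care about (if, unless, while, each and so on) without needing a grammar for the rest of the language.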

For what it's worth, Treetop itself "parses" Ruby blocks this way; it just counts { and } characters until the closing one is found. So if your block contains a } in a literal string, you're broken (but you can work around this by including the matching one in a comment).
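The brace-counting approach is easy to sketch in a few lines of Ruby. This is an illustration of the technique as described, not Treetop's actual source, and matching_brace_index is a hypothetical name.

```ruby
# Hypothetical helper: given the index of an opening "{", return the index
# of the "}" that brings the nesting depth back to zero, or nil if none does.
# It knows nothing about strings or comments; it only counts characters.
def matching_brace_index(text, open_index)
  depth = 0
  (open_index...text.length).each do |i|
    case text[i]
    when "{" then depth += 1
    when "}"
      depth -= 1
      return i if depth.zero?
    end
  end
  nil
end

balanced = '{ def content; elements.map { |e| e.text_value }; end }'
broken   = '{ puts "}" }'

puts matching_brace_index(balanced, 0) == balanced.length - 1  # true
puts matching_brace_index(broken, 0) == broken.length - 1      # false: stops at the } inside the string
```

In this sketch, putting an extra { in a comment ahead of the stray } rebalances the count, which is the kind of workaround mentioned above.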

Licensed under: CC-BY-SA with attribution
