Question

Taking as a starting point the code example from Parslet's own creator (available in this link), I need to extend it to retrieve all the non-commented text from a file written in a C-like syntax.

The provided example successfully parses C-style comments, treating them as ordinary whitespace. However, it only expects 'a' characters in the non-commented areas of the file, as in this input example:

         a
      // line comment
      a a a // line comment
      a /* inline comment */ a 
      /* multiline
      comment */

The rule used to detect the non-commented text is simply:

   rule(:expression) { (str('a').as(:a) >> spaces).as(:exp) }

What I need, therefore, is to generalize this rule so that it captures all the other (non-commented) text from a more general file such as:

     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */

I am new to Parsing Expression Grammars, and none of my previous attempts succeeded.

Solution

The general idea is that everything is code (aka non-comment) until one of the sequences // or /* appears. You can reflect this with a rule like this:

rule(:code) {
  (str('/*').absent? >> str('//').absent? >> any).repeat(1).as(:code)
}
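
For readers new to PEGs, it may help to see that absent? is a negative lookahead: it tests what comes next without consuming anything. The following is only a rough regular-expression analogue in plain Ruby, not Parslet code; the names CODE_RUN and leading_code are made up for this illustration.

```ruby
# Rough regex analogue of the :code rule (illustration only, not Parslet).
# (?!...) is a negative lookahead, playing the role of absent?.
# The m flag lets . match newlines, as Parslet's any does.
CODE_RUN = %r{\A(?:(?!/\*|//).)+}m

# Hypothetical helper: returns the run of code at the start of text,
# or nil when the text starts with a comment marker (or is empty).
def leading_code(text)
  text[CODE_RUN]
end

puts leading_code("  word1 // line comment").inspect
puts leading_code("// only a comment").inspect
```

The first call returns the text up to, but not including, the comment marker; the second returns nil because at least one code character is required, just like repeat(1) in the rule.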

As mentioned in my comment, there is a small problem with strings, though. A comment marker that occurs inside a string literal is part of the string, so removing it as if it were a comment would alter the meaning of the code. The parser therefore has to know what a string is, and that every character inside it belongs to it. Escape sequences are a further complication. For example, the string "foo \" bar /*baz*/" contains an escaped double quote; a naive string rule would read it as "foo \", followed by some code again. This needs to be addressed as well. I have written a complete parser that handles all of the above cases:

require 'parslet'

class CommentParser < Parslet::Parser
  rule(:eof) { 
    any.absent? 
  }

  rule(:block_comment_text) {
    (str('*/').absent? >> any).repeat.as(:comment)
  }

  rule(:block_comment) {
    str('/*') >> block_comment_text >> str('*/')
  }

  rule(:line_comment_text) {
    (str("\n").absent? >> any).repeat.as(:comment)
  }

  rule(:line_comment) {
    str('//') >> line_comment_text >> (str("\n").present? | eof)
  }

  rule(:string_text) {
    (str('"').absent? >> str('\\').maybe >> any).repeat
  }

  rule(:string) {
    str('"') >> string_text >> str('"')
  }

  rule(:code_without_strings) {
    (str('"').absent? >> str('/*').absent? >> str('//').absent? >> any).repeat(1)
  }

  rule(:code) {
    (code_without_strings | string).repeat(1).as(:code)
  }

  rule(:code_with_comments) {
    (code | block_comment | line_comment).repeat
  }

  root(:code_with_comments)
end

It will parse your input

     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */

to this AST

[{:code=>"\n   word0\n "@0},
 {:comment=>" line comment"@13},
 {:code=>"\n  word1 "@26},
 {:comment=>" line comment"@37},
 {:code=>"\n phrase "@50},
 {:comment=>" inline comment "@61},
 {:code=>" something \n "@79},
 {:comment=>" multiline\n comment "@94},
 {:code=>"\n"@116}]

To extract everything except the comments you can do:

input = <<-CODE
     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */
CODE

ast = CommentParser.new.parse(input)
puts ast.map{|node| node[:code] }.join

which will produce

   word0

  word1
 phrase  something
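
As a dependency-free cross-check of the string handling described above, the same decision order (string literal first, then comments, then any other character) can be sketched with StringScanner from Ruby's standard library. This is an analogue, not the Parslet parser itself; strip_comments and the three pattern constants are hypothetical names. Unlike CommentParser, this sketch treats an unterminated string or block comment as ordinary code instead of failing.

```ruby
require 'strscan'

STRING_LITERAL = /"(?:\\.|[^"\\])*"/m  # double-quoted string, escapes included
BLOCK_COMMENT  = %r{/\*.*?\*/}m        # /* ... */, may span several lines
LINE_COMMENT   = %r{//[^\n]*}          # // up to, but not including, the newline

# Hypothetical helper: returns source with comments removed and strings intact.
def strip_comments(source)
  scanner = StringScanner.new(source)
  out = +''
  until scanner.eos?
    if scanner.scan(STRING_LITERAL)
      out << scanner.matched           # keep the whole string, markers and all
    elsif scanner.skip(BLOCK_COMMENT) || scanner.skip(LINE_COMMENT)
      next                             # comment text is dropped
    else
      out << scanner.getch             # any other character is code
    end
  end
  out
end

puts strip_comments('x = "foo \" bar /*baz*/"; // tail')
```

The string literal survives unchanged, including the /*baz*/ inside it, while the trailing line comment is removed.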

OTHER TIPS

Another way to handle comments is to consider them white space. For example:

rule(:space?) do
  space.maybe
end

rule(:space) do
  (block_comment | line_comment | whitespace).repeat(1)
end

rule(:whitespace) do
  match('\s')
end

rule(:block_comment) do
  str('/*') >>
  (str('*/').absent? >> any).repeat(0) >>
  str('*/')
end

rule(:line_comment) do
  # Consume the whole comment, then a newline or the end of input.
  str('//') >> match('[^\n]').repeat >> (str("\n") | any.absent?)
end

Then, when you are writing rules with white-space, such as this entirely off-the-cuff and probably wrong rule for C,

rule(:assignment_statement) do
  lvalue >> space? >> str('=') >> space? >> rvalue >> str(';')
end

comments get "eaten" by the parser without any fuss. Anywhere white space can or must appear, comments of any kind are allowed, and are treated as white space.

This approach is not as suitable for your exact problem, which is to recognize non-comment text in a C program, but it works very well in a parser which must recognize the full language.
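
To make the "comments are white space" idea concrete without depending on Parslet, here is a small sketch using StringScanner from Ruby's standard library. It mirrors the shape of the assignment rule above; SKIP and parse_assignment are hypothetical names, and treating lvalue as an identifier and rvalue as an integer is an assumption made only for this example.

```ruby
require 'strscan'

# Whitespace, block comments and line comments are interchangeable filler.
SKIP = %r{(?:\s+|/\*.*?\*/|//[^\n]*)+}m

# Hypothetical helper mirroring:
#   lvalue >> space? >> str('=') >> space? >> rvalue >> str(';')
# Returns a hash on success, nil if the input does not match.
def parse_assignment(source)
  scanner = StringScanner.new(source)
  lvalue = scanner.scan(/[A-Za-z_]\w*/) or return nil  # assumed: identifier
  scanner.skip(SKIP)                                   # space?
  scanner.scan(/=/) or return nil
  scanner.skip(SKIP)                                   # space?
  rvalue = scanner.scan(/\d+/) or return nil           # assumed: integer
  scanner.scan(/;/) or return nil
  { lvalue: lvalue, rvalue: rvalue }
end

p parse_assignment("x /* why */ = // note\n 42;")
```

Because the filler pattern is skipped wherever white space is allowed, the block comment and the line comment in the sample input have no effect on the result.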

Licensed under: CC-BY-SA with attribution