How do I handle C-style comments in Ruby using Parslet?

Question 1

The general idea is that everything is code (aka non-comment) until one of the sequences // or /* appears. You can reflect this with a rule like this:

rule(:code) {
  (str('/*').absent? >> str('//').absent? >> any).repeat(1).as(:code)
}

As mentioned in my comment, there is a small problem with strings, though. When a comment marker occurs inside a string literal, it is part of the string, not a comment. If you stripped such "comments" from your code, you would alter its meaning. Therefore, the parser has to know what a string is, and that every character inside it belongs to it. Escape sequences are another issue. For example, the string "foo \" bar /*baz*/" contains an escaped double quote; a rule that ends the string at the first double quote would parse it as "foo \", followed by some code again. The string rule therefore has to consume a backslash together with the character that follows it. I have written a complete parser that handles all of the above cases:

require 'parslet'

class CommentParser < Parslet::Parser
  rule(:eof) { 
    any.absent? 
  }

  rule(:block_comment_text) {
    (str('*/').absent? >> any).repeat.as(:comment)
  }

  rule(:block_comment) {
    str('/*') >> block_comment_text >> str('*/')
  }

  rule(:line_comment_text) {
    (str("\n").absent? >> any).repeat.as(:comment)
  }

  rule(:line_comment) {
    str('//') >> line_comment_text >> (str("\n").present? | eof)
  }

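  # A backslash is consumed together with the character after it,
  # so an escaped quote (\") does not end the string.
  rule(:string_text) {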
    (str('"').absent? >> str('\\').maybe >> any).repeat
  }

  rule(:string) {
    str('"') >> string_text >> str('"')
  }

  rule(:code_without_strings) {
    (str('"').absent? >> str('/*').absent? >> str('//').absent? >> any).repeat(1)
  }

  rule(:code) {
    (code_without_strings | string).repeat(1).as(:code)
  }

  rule(:code_with_comments) {
    (code | block_comment | line_comment).repeat
  }

  root(:code_with_comments)
end

It will parse your input

     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */

to this AST

[{:code=>"\n   word0\n "@0},
 {:comment=>" line comment"@13},
 {:code=>"\n  word1 "@26},
 {:comment=>" line comment"@37},
 {:code=>"\n phrase "@50},
 {:comment=>" inline comment "@61},
 {:code=>" something \n "@79},
 {:comment=>" multiline\n comment "@94},
 {:code=>"\n"@116}]

To extract everything except the comments you can do:

input = <<-CODE
     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */
CODE

ast = CommentParser.new.parse(input)
puts ast.map{|node| node[:code] }.join

which will produce

   word0

  word1
 phrase  something

Question 2

Another way to handle comments is to consider them white space. For example:

rule(:space?) do
  space.maybe
end

rule(:space) do
  (block_comment | line_comment | whitespace).repeat(1)
end

rule(:whitespace) do
  match('\s')
end

rule(:block_comment) do
  str('/*') >>
  (str('*/').absent? >> any).repeat(0) >>
  str('*/')
end

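rule(:line_comment) do
  # Consume everything up to the newline; also accept a comment that ends the input.
  str('//') >> match('[^\n]').repeat(0) >> (str("\n") | any.absent?)
end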

Then, when you are writing rules with white-space, such as this entirely off-the-cuff and probably wrong rule for C,

rule(:assignment_statement) do
  lvalue >> space? >> str('=') >> space? >> rvalue >> str(';')
end

comments get "eaten" by the parser without any fuss. Anywhere white space can or must appear, comments of any kind are allowed, and are treated as white space.

This approach is not as suitable for your exact problem, which is to recognize non-comment text in a C program, but it works very well in a parser which must recognize the full language.