Question

I'm working on a parser for a C-style language, and for that parser I need the regular expression that matches C-style /**/ comments. Now, I've found this expression on the web:

/\*([^\*]*\*+[^\*/])*([^\*]*\*+|[^\*]*\*/

However, as you can see, this is a rather messy expression, and I have no idea whether it actually matches exactly what I want it to match.

Is there a different, rigorous way of defining regular expressions so that they are easy to check by hand for correctness and can then be converted ('compiled') to an expression like the one above?

Solution

I can think of four ways:

  1. Define an automaton for the language you are interested in. Convert the regular expression to an automaton (using Brzozowski's derivatives). Check that both automata accept the same language (determinize and minimize or use a bisimulation argument).

  2. Write loads of test cases and apply your regular expression to them.

  3. Convert the automaton defined in point 1 to a regular expression, using standard techniques.

  4. A combination of the above.
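As a rough sketch of point 2, the test cases can be generated exhaustively instead of written by hand: enumerate every string up to some length over a tiny alphabet and compare the regular expression against a direct definition of "block comment". Everything named here is hypothetical (CANDIDATE, is_block_comment, find_disagreements). The pattern is a commonly cited one used as a stand-in, not the expression from the question, and the character "a" is assumed to stand for any character other than / and *.

```python
import re
from itertools import product

# Stand-in pattern under test (an assumption, not the expression from the question).
CANDIDATE = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")


def is_block_comment(s):
    """Reference definition: s starts with /* and the first */ found
    after the opener is exactly the last two characters of s."""
    return s.startswith("/*") and s.find("*/", 2) == len(s) - 2


def find_disagreements(pattern, alphabet="/*a", max_len=7):
    """Return every string up to max_len on which the pattern and the
    reference definition give different answers."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(pattern.fullmatch(s)) != is_block_comment(s):
                bad.append(s)
    return bad
```

An empty result is evidence, not proof, since only strings up to max_len are checked. A greedy pattern such as /\*.*\*/ is caught quickly, because it also accepts strings like /**/*/ that contain an earlier terminator.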

OTHER TIPS

If you want to be sure that you're parsing C comments, you need to confront your model with the C specification. C99 §6.4.9 defines the syntax of comments as follows:

1. Except within a character constant, a string literal, or a comment, the characters /* introduce a comment. The contents of such a comment are examined only to identify multibyte characters and to find the characters */ that terminate it.

2. Except within a character constant, a string literal, or a comment, the characters // introduce a comment that includes all multibyte characters up to, but not including, the next new-line character. The contents of such a comment are examined only to identify multibyte characters and to find the terminating new-line character.

This is English prose, not a formal definition, but there is a reasonably clear interpretation in terms of a nondeterministic finite automaton (NFA) that consumes a comment:

  • From the initial state, / followed by * enters the in-multiline-comment state, and / followed by / enters the in-single-line-comment state.
  • From the in-multiline-comment state, * followed by / enters the post-comment state.
  • From the in-single-line-comment state, a newline enters the post-comment state.
  • Any other character leaves the state unchanged.
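The transitions above can be sketched as a small hand-written state machine. This is one possible reading of the prose, not a definitive implementation: the function name comment_end and the state names are made up for illustration, the two-character lookahead replaces the nondeterminism, and treating end of input as the end of a single-line comment is an assumption that the text above does not state.

```python
def comment_end(text, start=0):
    """Return the index just past the comment that begins at text[start],
    or None if no complete comment begins there."""
    state = "initial"
    i = start
    while i < len(text):
        two = text[i:i + 2]
        if state == "initial":
            if two == "/*":
                state, i = "multiline", i + 2
            elif two == "//":
                state, i = "single_line", i + 2
            else:
                return None  # no comment starts here
        elif state == "multiline":
            if two == "*/":
                return i + 2  # post-comment state reached
            i += 1  # any other character: state unchanged
        else:  # single_line
            if text[i] == "\n":
                return i  # the newline itself is not part of the comment
            i += 1
    # Assumption: end of input terminates a single-line comment;
    # an unterminated multiline comment is rejected.
    return len(text) if state == "single_line" else None
```

Note that the opener is consumed as a unit, so "/*/" is not accepted: the * of the opener cannot also serve as the * of the terminator.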

Note that to know whether the initial state applies, you have to perform a little more analysis to detect string and character literals.

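Once you have an NFA, you can use standard techniques to build a regular expression from it, such as state elimination or solving the automaton's system of language equations with Arden's lemma (I don't see them in the Wikipedia articles, but they are discussed in automata theory textbooks).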

If you already have a regular expression and would like to test it, you can compare the language it denotes with the language of the NFA deduced from the specification: equality of regular languages is decidable. One way to decide equality is to build a minimal deterministic automaton for each; if the languages are equal, the minimal DFAs will be isomorphic, that is, identical up to renaming of states.
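A minimal sketch of such an equality check follows. It does not minimize anything; it swaps in a simpler technique, exploring the reachable pairs of states of two DFAs and failing as soon as one side accepts and the other does not. The DFA encoding (start state, accepting set, transition dictionary, with missing transitions leading to an implicit dead state None) and all names are assumptions made for this example, and the alphabet is reduced to /, * and a stand-in character "a".

```python
from collections import deque


def equivalent(dfa_a, dfa_b, alphabet):
    """Decide whether two DFAs accept the same language by walking
    every reachable pair of states in their product."""
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if (p in accept_a) != (q in accept_b):
            return False  # some string reaches p and q and separates them
        for c in alphabet:
            # A missing transition (including any from None) yields None.
            nxt = (delta_a.get((p, c)), delta_b.get((q, c)))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


ALPHABET = "/*a"

# Five-state DFA for /* ... */ comments.
REFERENCE = (0, {4}, {
    (0, "/"): 1, (1, "*"): 2,
    (2, "/"): 2, (2, "a"): 2, (2, "*"): 3,
    (3, "*"): 3, (3, "a"): 2, (3, "/"): 4,
})

# Same language, but with a needlessly duplicated "body" state.
REDUNDANT = ("s", {"end"}, {
    ("s", "/"): "slash", ("slash", "*"): "body",
    ("body", "/"): "body2", ("body", "a"): "body2", ("body", "*"): "star",
    ("body2", "/"): "body", ("body2", "a"): "body", ("body2", "*"): "star",
    ("star", "*"): "star", ("star", "a"): "body", ("star", "/"): "end",
})

# Deliberately wrong: after a star, "a" fails to return to the body state.
BROKEN = (0, {4}, {**REFERENCE[2], (3, "a"): 3})
```

Here equivalent(REFERENCE, REDUNDANT, ALPHABET) holds even though the two automata have different numbers of states, while BROKEN is rejected because it also accepts strings such as /**a/.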

If you are writing a parser, comments are normally handled by the lexical analyzer. There you can express them with regular expressions, or (as the flex examples I've seen do) just "escape into the underlying language" and finish the job there: on seeing /*, skip ahead until you find */ (a DFA for this is easy to build, and from there a C fragment is simple to write).
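The "skip ahead" approach might look like the following, written in Python rather than C to keep it short. The function name skip_block_comment is hypothetical, and raising an error for an unterminated comment is an assumption about how the lexer should react.

```python
def skip_block_comment(source, pos):
    """Given that source[pos:pos+2] is '/*', return the index of the
    first character after the closing '*/'."""
    if not source.startswith("/*", pos):
        raise ValueError("no block comment starts at this position")
    # Search begins after the opener, so '/*/' is not a complete comment.
    end = source.find("*/", pos + 2)
    if end == -1:
        raise ValueError("unterminated block comment")
    return end + 2
```

A lexer would call this when it sees /* outside a string or character literal and resume scanning at the returned index.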

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange