Does this require a 2-pass parse: comments embedded within tokens?

https://stackoverflow.com/questions/16990139

31-05-2022

Question

Using a parser generator I want to create a parser for "From headers" in email messages. Here is an example of a From header:

From: "John Doe" <john@doe.org>

I think it will be straightforward to implement a parser for that.

However, there is a complication in the "From header" syntax: comments may be inserted just about anywhere. For example, a comment may be inserted within "john":

From: "John Doe" <jo(this is a comment)hn@doe.org>

And comments may be inserted in many other places.

How should I handle this complication? Does it require a "2-pass" parser: one pass to remove all comments and a second pass to create the parse tree for the From header? Do modern parser generators support multiple passes over the input? Can it be parsed in a single pass? If so, would you sketch the approach, please?

Solution

I'm not convinced that your interpretation of email addresses is correct; my reading of RFC-822 leads me to believe that a comment can only come before or after a "word", and that "word"s in the local-part of an addr-spec need to be separated by dots ("."). Section 3.1.4 gives a pretty good hint on how to parse: you need a lexical analyzer which feeds syntactic symbols into the parser; the lexical analyzer is expected to unfold headers, ignore whitespace, and identify comments, quoted strings, atoms, and special characters.
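As a small illustration of the first job given to the lexical analyzer above, here is a minimal Python sketch of header unfolding. It assumes the RFC-822 rule that a line break immediately followed by a space or tab is treated as just that space or tab; the function name `unfold` is made up for this example.

```python
import re

def unfold(header: str) -> str:
    """Join a folded header into one logical line.

    Assumption: a CRLF (or bare LF) directly followed by a space or tab
    is a fold, so the line break is dropped and the whitespace is kept.
    """
    return re.sub(r'\r?\n(?=[ \t])', '', header)

folded = 'From: "John Doe"\r\n <john@doe.org>'
print(unfold(folded))
```

The lookahead leaves the leading whitespace of the continuation line in place, so the tokenizer that runs afterwards still sees a separator between the two parts.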

Of course, RFC-822 has long been obsoleted, and I think that email headers with embedded comments are anachronistic.

Nonetheless, it seems like you could easily achieve the analysis you wish using flex and bison. As indicated, flex would identify the comments. Strictly speaking, you cannot identify comments with a regular expression, since comments nest. But you can recognize simple nested structures using a start condition stack, or even more economically by maintaining a counter (since the flex action won't return until the closing parenthesis of the outermost comment is found, the counter doesn't need to be global).
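The counter idea does not depend on flex. The following is a rough single-pass sketch in Python rather than flex, hand-written and not a complete RFC-822 lexer: the names `skip_comment` and `tokenize` and the token labels are invented for the example, and control characters and folded lines are not handled. Comments are dropped inside the tokenizer, so the parser never sees them and no separate comment-stripping pass is needed.

```python
# RFC-822 "specials": each one is a token on its own and ends an atom.
SPECIALS = set('()<>@,;:\\".[]')

def skip_comment(text: str, i: int) -> int:
    """text[i] must be '('. Return the index just past the matching ')'."""
    depth = 0                      # local counter; no global state needed
    while i < len(text):
        ch = text[i]
        if ch == '\\':             # backslash escape: skip the next character
            i += 2
            continue
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth == 0:         # outermost comment closed
                return i + 1
        i += 1
    raise ValueError("unterminated comment")

def tokenize(text: str) -> list:
    """Split a header body into (kind, value) tokens, discarding comments."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in ' \t':            # whitespace only separates tokens
            i += 1
        elif ch == '(':            # comment: consume it, emit nothing
            i = skip_comment(text, i)
        elif ch == '"':            # quoted string, with backslash escapes
            j, buf = i + 1, []
            while j < len(text) and text[j] != '"':
                if text[j] == '\\' and j + 1 < len(text):
                    j += 1
                buf.append(text[j])
                j += 1
            if j >= len(text):
                raise ValueError("unterminated quoted string")
            tokens.append(('QUOTED', ''.join(buf)))
            i = j + 1
        elif ch in SPECIALS:
            tokens.append(('SPECIAL', ch))
            i += 1
        else:                      # atom: run of non-special, non-space chars
            j = i
            while j < len(text) and text[j] not in SPECIALS and text[j] not in ' \t':
                j += 1
            tokens.append(('ATOM', text[i:j]))
            i = j
    return tokens

print(tokenize('"John Doe" <john(a (nested) comment)@doe.org>'))
```

With this sketch, the question's `jo(this is a comment)hn` comes out as two adjacent atoms, `jo` and `hn`. Under the reading of RFC-822 given above, the grammar would then reject the address, because words in the local-part must be separated by dots.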

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow