Lexing compressed data

Question

Although the question is very open, and an accurate answer is probably "it depends", it is generally the case that lexers work best when lexing is independent of context. In other words, if a field like I-40 should be treated as an interstate reference regardless of where it appears, then it is probably OK to interpret it in the lexer. On the other hand, if some fields need to be interpreted and others do not, it may be more appropriate to handle the interpretation at a different level. So, for example:

M-Lee,"New York",I-40
I-40,Chicago,US-66

Is the I-40 in the first field of the second line a highway, or just some code which happens to look like a highway? In the second case, where it is just a code, it might be more appropriate to leave the field uninterpreted in the lexer and use parser rules like these: [1]

data: code ',' city ',' highway '\n' { $$ = MakeData($1,$3,$5); }
code: FIELD { $$ = MakeCode($1); }
city: FIELD { $$ = $1; }
highway: FIELD { $$ = MakeHighway($1); }
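
The answer does not say what MakeHighway does. As a rough sketch only, a helper that such an action might call could look like the following C. The Highway struct, the name parse_highway, and the PREFIX-NUMBER shape it accepts are all assumptions made for illustration, not part of the original grammar.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interpreted form of a field such as "I-40" or "US-66". */
typedef struct {
    char system[8];   /* route system prefix, e.g. "I" or "US" */
    int  number;      /* route number, e.g. 40 */
} Highway;

/* Returns 1 and fills *out if field has the assumed shape
   UPPERCASE-LETTERS '-' DIGITS; returns 0 otherwise. */
int parse_highway(const char *field, Highway *out) {
    size_t n = 0;
    while (isupper((unsigned char)field[n]))
        n++;
    if (n == 0 || n >= sizeof out->system || field[n] != '-')
        return 0;

    const char *digits = field + n + 1;
    if (!isdigit((unsigned char)digits[0]))
        return 0;

    char *end;
    long num = strtol(digits, &end, 10);
    if (*end != '\0' || num > 9999)   /* trailing junk or implausible number */
        return 0;

    memcpy(out->system, field, n);
    out->system[n] = '\0';
    out->number = (int)num;
    return 1;
}
```

Because the parser decides when to call the helper, only the third field of a line is ever interpreted this way; a first field such as I-40 reaches the code rule as plain text.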

In the first case, where it really is a highway and the lexer produces distinct token types, you might have:

coded_data: CODE ',' FIELD ',' HIGHWAY '\n'
path_data:  HIGHWAY ',' FIELD ',' HIGHWAY '\n'

where it is assumed that a FIELD can never be confused with either a CODE or a HIGHWAY. (Alternatively, you could try to put the parsed HIGHWAY back together into a simple field, but that's getting somewhat ugly, too.)

So, on the whole, I'd opt for one of the following strategies:

Handle the lexical interpretation in a separate function called by the parser (as in my first example above).
Do the lexical interpretation in the lexer, taking advantage of the regular expression language provided by flex, but also retain the uninterpreted string, and decide which of the two you need in the parser. (In this case, the semantic value is a more complicated struct.)
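
To make the second strategy a little more concrete, here is a minimal C sketch of what the "more complicated struct" might look like. Everything in it (FieldValue, plain_value, highway_value, as_code, the 32-byte buffer) is hypothetical; it only assumes that a flex rule with a highway-shaped pattern such as [A-Z]+-[0-9]+ calls highway_value on the matched text and that every other field rule calls plain_value.

```c
#include <stdlib.h>
#include <string.h>

enum { RAW_MAX = 32 };

/* Hypothetical semantic value: the lexer's interpretation plus the raw text. */
typedef struct {
    char raw[RAW_MAX];  /* uninterpreted lexeme, always filled in */
    int  is_highway;    /* nonzero if the highway-shaped pattern matched */
    char system[8];     /* e.g. "I" or "US"; empty if not a highway */
    int  number;        /* e.g. 40; 0 if not a highway */
} FieldValue;

/* What the action for an ordinary field pattern might call. */
FieldValue plain_value(const char *text) {
    FieldValue v;
    memset(&v, 0, sizeof v);
    strncpy(v.raw, text, RAW_MAX - 1);  /* memset above keeps it terminated */
    return v;
}

/* What the action for a highway-shaped pattern might call.
   Assumes text has already matched something like [A-Z]+-[0-9]+. */
FieldValue highway_value(const char *text) {
    FieldValue v = plain_value(text);
    const char *dash = strchr(v.raw, '-');
    if (dash != NULL) {
        size_t n = (size_t)(dash - v.raw);
        if (n < sizeof v.system) {
            memcpy(v.system, v.raw, n);  /* system was zeroed, so terminated */
            v.number = atoi(dash + 1);
            v.is_highway = 1;
        }
    }
    return v;
}

/* Parser side: a rule that expects a code ignores the interpretation
   and takes the raw text instead. */
const char *as_code(const FieldValue *v) {
    return v->raw;
}
```

With a value like this, a rule for the first field can use as_code and never look at the highway members, while a rule for the third field can check is_highway and read system and number. The cost is that every field carries both forms, whether or not the parser ends up needing them.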