Question

I am working on learning to use lex and yacc. This is a philosophical question about lexing and defining rules for lex.

Say that you want to produce a compiler for tabular data in a CSV file. One of the fields has abbreviated and concatenated data.

VALUE1,VALUE2,I-40
VALUE3,VALUE4,US-66

Ultimately, you care that the road is an interstate or a US highway. When you lex these values, should you tokenize the road identifier, then have the compiler split I/US off from the number and deal with it, or should the lexer do that on the frontend?

Solution

Although the question is very open, and an accurate answer is probably "it depends", it is generally the case that lexers work best when lexing is independent of context. In other words, if a field I-40 should be treated as an interstate reference regardless of where it appears, then it is probably OK to interpret it in the lexer. On the other hand, if some fields need to be interpreted and others do not, it may be more appropriate to handle the interpretation at a different level. So, for example:

M-Lee,"New York",I-40
I-40,Chicago,US-66

Is the I-40 in the first field of the second line a highway, or just some code which happens to look like a highway? In the second case (it is just a code), it might be more appropriate to use parser rules like these:

data: code ',' city ',' highway '\n' { $$ = MakeData($1,$3,$5); }
code: FIELD { $$ = MakeCode($1); }
city: FIELD { $$ = $1; }
highway: FIELD { $$ = MakeHighway($1); }
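The answer does not show what MakeHighway does. As a rough sketch only, here is one way the interpretation step could look in C, assuming a highway field always has the form PREFIX-NUMBER with a prefix of I or US. The names Highway, HighwayKind and parse_highway are hypothetical and chosen for illustration; MakeHighway could be a thin wrapper around something like this.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical types: the answer does not say what MakeHighway returns. */
typedef enum { HWY_INTERSTATE, HWY_US } HighwayKind;

typedef struct {
    HighwayKind kind;
    long number;
} Highway;

/* Interpret a raw field such as "I-40" or "US-66".
 * Returns 1 and fills *out on success, 0 if the text is not
 * in the assumed PREFIX-NUMBER form. */
int parse_highway(const char *text, Highway *out)
{
    const char *dash = strchr(text, '-');
    if (dash == NULL || dash == text)
        return 0;

    size_t prefix_len = (size_t)(dash - text);
    HighwayKind kind;
    if (prefix_len == 1 && text[0] == 'I')
        kind = HWY_INTERSTATE;
    else if (prefix_len == 2 && strncmp(text, "US", 2) == 0)
        kind = HWY_US;
    else
        return 0;

    /* The part after the dash must start with a digit, so that
     * strtol cannot silently accept a sign or leading whitespace. */
    const char *digits = dash + 1;
    if (*digits < '0' || *digits > '9')
        return 0;

    char *end;
    long n = strtol(digits, &end, 10);
    if (*end != '\0')
        return 0;   /* trailing junk, e.g. "I-40x" */

    out->kind = kind;
    out->number = n;
    return 1;
}
```

Because this runs only when the parser has already decided that the field sits in the highway position, a value like I-40 in the code position is never interpreted as a road.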

In the first case, you might have:

coded_data: CODE ',' FIELD ',' HIGHWAY '\n'
path_data:  HIGHWAY ',' FIELD ',' HIGHWAY '\n' 

where it is assumed that a FIELD can never be confused with either a CODE or a HIGHWAY. (Alternatively, you could try to put the parsed HIGHWAY back together into a simple field, but that's getting somewhat ugly, too.)

So, on the whole, I'd opt for one of the following strategies:

  1. Handle the lexical interpretation in a separate function called by the parser (as in my first example above)

  2. Do the lexical interpretation in the lexer, taking advantage of the regular expression language provided by flex, but also retain the uninterpreted string, and decide which of the two you need in the parser. (In this case, the semantic value is a more complicated struct.)
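To make the second strategy a bit more concrete, here is a minimal sketch in C of what the "more complicated struct" might contain. Everything named here (FieldValue, match_route, make_field_value) is hypothetical and not taken from the question. It assumes highways look like I-<digits> or US-<digits> and that route numbers are short enough to fit in an int. In a real flex scanner, the matching would more likely be done by the patterns themselves, with an action filling in a struct like this.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical semantic value: the raw text plus an optional interpretation. */
typedef struct {
    char *raw;        /* uninterpreted field text (heap copy, caller frees) */
    int   is_highway; /* 1 if the text matched I-<digits> or US-<digits> */
    char  system[3];  /* "I" or "US" when is_highway is set, else "" */
    int   number;     /* route number when is_highway is set, else 0 */
} FieldValue;

/* Return 1 and store the route number if text is exactly prefix-<digits>. */
static int match_route(const char *text, const char *prefix, int *number)
{
    size_t plen = strlen(prefix);
    if (strncmp(text, prefix, plen) != 0 || text[plen] != '-')
        return 0;

    const char *p = text + plen + 1;
    if (*p == '\0')
        return 0;   /* nothing after the dash */

    int n = 0;
    for (; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return 0;
        n = n * 10 + (*p - '0');   /* assumes short route numbers */
    }
    *number = n;
    return 1;
}

/* Build the value a lexer action could hand to the parser for one field. */
FieldValue make_field_value(const char *text)
{
    FieldValue v = { NULL, 0, "", 0 };
    size_t len = strlen(text);

    /* Always keep the uninterpreted string. */
    v.raw = malloc(len + 1);
    if (v.raw != NULL)
        memcpy(v.raw, text, len + 1);

    /* Record the interpretation too, when there is one. */
    if (match_route(text, "I", &v.number)) {
        strcpy(v.system, "I");
        v.is_highway = 1;
    } else if (match_route(text, "US", &v.number)) {
        strcpy(v.system, "US");
        v.is_highway = 1;
    }
    return v;
}
```

The parser then picks whichever view fits the position: a rule for the code column would read only raw, while a rule for the highway column would check is_highway and use system and number.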

Licensed under: CC-BY-SA with attribution