Question

I am making a basic lexical analyser in Java for my semester project, and I disagree with my subject teacher about a concept.

My view is that, in general, if an input like "1a" is given to a lexical analyser, it should produce this output:

"<Number><Identifier>"

But my teacher says that, instead of treating it as a number and an identifier, the lexical analyser should flag the whole string (i.e. "1a") as an error. This is because (as he says) identifiers cannot start with a number.

I think, on the contrary, that it should be the responsibility of the next stage of the compiler (the syntax analyser) to decide whether something is a valid identifier. I know he is right that identifiers cannot start with a number, but I need closure on whether the lexical analyser should be the one deciding that.

I would really appreciate your help. Thank you.


Solution

A lexical analyzer is responsible for dividing the text into tokens and for deciding which kinds of tokens are legal. It reports an error if a string cannot form a valid token.

The syntax analyzer only deals with the structure of the program once the tokens have been determined. It will give an error if the tokens cannot be parsed according to the given grammar.

So your teacher is correct. Determining whether an identifier is legal falls under lexical analysis.
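As a rough illustration of that division of labour, here is a minimal sketch of a lexer that rejects "1a" on its own. The class name TinyLexer, the token kinds, and the rule that a token is a maximal run of letters, digits and underscores are all assumptions made for this example, not something taken from the question's project.

```java
import java.util.ArrayList;
import java.util.List;

public class TinyLexer {
    // Hypothetical token kinds, for illustration only.
    public enum Kind { NUMBER, IDENTIFIER, ERROR }

    public static final class Token {
        public final Kind kind;
        public final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    // Splits the input into maximal runs of letters, digits and underscores,
    // then classifies each run as a whole (assumed rule, not a standard).
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (isWordChar(c)) {
                while (i < input.length() && isWordChar(input.charAt(i))) i++;
                String word = input.substring(start, i);
                tokens.add(new Token(classify(word), word));
            } else {
                // Anything else is outside this sketch's tiny alphabet.
                i++;
                tokens.add(new Token(Kind.ERROR, String.valueOf(c)));
            }
        }
        return tokens;
    }

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    private static Kind classify(String word) {
        if (word.chars().allMatch(Character::isDigit)) return Kind.NUMBER;
        if (!Character.isDigit(word.charAt(0))) return Kind.IDENTIFIER;
        return Kind.ERROR; // starts with a digit but is not all digits, e.g. "1a"
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1a"));
        System.out.println(tokenize("1 a"));
    }
}
```

With this rule, "1a" comes out as a single ERROR token, while "1 a" comes out as a NUMBER followed by an IDENTIFIER, so the parser never has to know what a legal identifier looks like.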

OTHER TIPS

I agree with your teacher: deciding what counts as a valid identifier is the lexical analyser's job. See http://en.wikipedia.org/wiki/Lexical_analysis

Detecting this in the parser would only work for grammars where a number followed by an identifier happens to be syntactically invalid. If 1 a was valid syntax in your language, you would have to handle this in the lexer because the parser can't distinguish between 1a (no whitespace) and 1 a (with whitespace).
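To make the whitespace point concrete, here is a small sketch of the lexer the question proposes, one that happily splits "1a" into two tokens. The class name SplittingLexer and the regular expression are assumptions for this example. Because whitespace is discarded, both inputs yield exactly the same token list, so nothing downstream can tell them apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplittingLexer {
    // Optional leading whitespace, then either a digit run (group 1)
    // or an identifier (group 2). Hypothetical token rules for this sketch.
    private static final Pattern TOKEN =
            Pattern.compile("\\s*(?:(\\d+)|([A-Za-z_]\\w*))");

    public static List<String> tokenize(String input) {
        String s = input.trim();
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        while (pos < s.length()) {
            m.region(pos, s.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
            out.add(m.group(1) != null ? "Number" : "Identifier");
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        // Both lines print the same token list.
        System.out.println(tokenize("1a"));
        System.out.println(tokenize("1 a"));
    }
}
```

A parser that receives only this list has no way to reject "1a" without also rejecting "1 a", unless the lexer is changed to pass along position or whitespace information.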

Why not do this in the lexer? The lexer's job is to make the parser's job easier. Any work it can do to simplify your parser without adding a lot of complexity to the lexer itself is a good idea.

Another reason to handle this in the lexer is that languages often use suffixes on numbers: in C, 1L is the value 1 of type long instead of the default type int. You may also want to be able to add suffixes to the language later. Consider your 1a. At first it would be lexed as the int value 1 followed by an identifier a. But suppose the language designer later decides to use a as a suffix on numbers. Suddenly 1a becomes a single token.

For 1a there is also a special case: it could be meant as a hexadecimal number where the writer forgot the required prefix or suffix, such as 0x1a in C or 1ah in certain assemblers.
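The effect of adding a suffix can be sketched in a few lines. The class SuffixedNumber, the method classify, and the idea of passing the allowed suffix letters as a string are all hypothetical choices for this example; a real lexer would more likely bake the suffix set into its token rules.

```java
public class SuffixedNumber {
    // Classifies a word that is expected to start with digits.
    // allowedSuffixes holds the single-letter suffixes the (hypothetical)
    // language accepts after a digit run, e.g. "lL" for C-style long literals.
    public static String classify(String word, String allowedSuffixes) {
        int i = 0;
        while (i < word.length() && Character.isDigit(word.charAt(i))) i++;
        if (i == 0) return "NOT_A_NUMBER";
        String suffix = word.substring(i);
        if (suffix.isEmpty()) return "NUMBER";
        if (suffix.length() == 1 && allowedSuffixes.indexOf(suffix.charAt(0)) >= 0) {
            return "NUMBER_WITH_SUFFIX";
        }
        return "ERROR"; // digits followed by something that is not a known suffix
    }

    public static void main(String[] args) {
        System.out.println(classify("1L", "lL"));  // suffix L is known
        System.out.println(classify("1a", "lL"));  // a is not a suffix yet
        System.out.println(classify("1a", "lLa")); // a added as a suffix later
    }
}
```

If the lexer had been splitting "1a" into a number and an identifier, introducing the a suffix would silently change the meaning of existing input. Treating it as an error from the start leaves that spelling free for later use.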

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow