Problem

I'm currently implementing a lexer for a simple programming language. So far, I can tokenize identifiers, assignment symbols, and integer literals correctly; in general, whitespace is insignificant.

For the input foo = 42, three tokens are recognized:

  1. foo (identifier)
  2. = (symbol)
  3. 42 (integer literal)

So far, so good. However, consider the input foo = 42bar, which is invalid due to the (significant) missing space between 42 and bar. My lexer incorrectly recognizes the following tokens:

  1. foo (identifier)
  2. = (symbol)
  3. 42 (integer literal)
  4. bar (identifier)

Once the lexer sees the digit 4, it keeps reading until it encounters a non-digit. It therefore consumes the 2 and stores 42 as an integer literal token. Because whitespace is insignificant, the lexer discards any whitespace (if there is any) and starts reading the next token: It finds the identifier bar.
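
For reference, here is a minimal sketch of what my number rule effectively does (Python; the token names and regexes are made up, and my real lexer is more involved). The greedy match is exactly what splits 42bar into 42 and bar:

    import re

    # Hypothetical token rules; the real lexer has more of them.
    TOKEN_SPEC = [
        ("INT",   re.compile(r"[0-9]+")),    # greedy: consumes digits until a non-digit
        ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
        ("EQ",    re.compile(r"=")),
        ("WS",    re.compile(r"\s+")),
    ]

    def tokenize(source):
        pos, tokens = 0, []
        while pos < len(source):
            for name, pattern in TOKEN_SPEC:
                match = pattern.match(source, pos)
                if match:
                    if name != "WS":         # whitespace is insignificant
                        tokens.append((name, match.group()))
                    pos = match.end()
                    break
            else:
                raise SyntaxError(f"unexpected character {source[pos]!r}")
        return tokens

    print(tokenize("foo = 42bar"))
    # [('IDENT', 'foo'), ('EQ', '='), ('INT', '42'), ('IDENT', 'bar')]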

Now, here's my question: Is it still the lexer's responsibility to recognize that an identifier is not allowed at that position? Or does that check belong to the responsibilities of the parser?


Solution

I don't think there is any consensus on the question of whether 42foo should be recognised as an invalid number or as two tokens. It's a question of style, and both usages are common in well-known languages.

For example:

$ python -c 'print 42and False'
False

$ lua -e 'print(42and false)'
lua: (command line):1: malformed number near '42a'

$ perl -le 'print 42and 0'
42

# Not an idiosyncrasy of tcc; it's defined by the standard
$ tcc -D"and=&&" -run - <<<"main(){return 42and 0;}"
stdin:1: error: invalid number

# gcc has better error messages
$ gcc -D"and=&&" -x c - <<<"main(){return 42and 0;}" && ./a.out
<stdin>: In function ‘main’:
<stdin>:1:15: error: invalid suffix "and" on integer constant
<stdin>:1:21: error: expected ‘;’ before numeric constant

$ ruby -le 'print 42and 1'
42

# And now for something completely different (explained below)
$ awk 'BEGIN{print 42foo + 3}'
423

So, both possibilities are in common use.

If you're going to reject it because you think a number and a word should be separated by whitespace, you should reject it in the lexer. The parser cannot (or should not) know whether whitespace separates two tokens. Independent of the validity of 42and, the fragments 42 + 1, 42+1, and 42 +1 should all be parsed identically. (Except, perhaps, in Fortress. But that was an anomaly.) If you don't mind shoving numbers and words together, then let the parser reject it if (and only if) it is a syntax error.
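
For concreteness, here is a minimal sketch of the "reject it in the lexer" option, assuming a hypothetical hand-written number rule: after consuming the digits, peek at the next character and fail if an identifier character follows, much like Lua's "malformed number" error above:

    def lex_number(source, pos):
        """Consume digits greedily, then insist on a token boundary (sketch)."""
        start = pos
        while pos < len(source) and source[pos].isdigit():
            pos += 1
        # Boundary check: a letter or underscore right after the digits means
        # we are inside something like "42bar", so reject it here in the lexer.
        if pos < len(source) and (source[pos].isalpha() or source[pos] == "_"):
            raise SyntaxError(f"malformed number near {source[start:pos + 1]!r}")
        return ("INT", source[start:pos]), pos

    print(lex_number("42 + 1", 0))   # (('INT', '42'), 2)
    print(lex_number("42bar", 0))    # SyntaxError: malformed number near '42b'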

As a side note, in C and C++, 42and is initially lexed as a "preprocessor number". After preprocessing, it needs to be relexed and it is at that point that the error message is produced. The reason for this odd behaviour is that it is completely legitimate to paste together two fragments to produce a valid number:

$ gcc -D"c_(x,y)=x##y" -D"c(x,y)=c_(x,y)"  -x c - <<<"int main(){return c(12E,1F);}"
$ ./a.out; echo $?
120

Both 12E and 1F would be invalid integers, but pasted together with the ## operator, they form a perfectly legitimate float. The ## operator only works on single tokens, so 12E and 1F both need to be lexed as single tokens. c(12E,+1F) wouldn't work, because there the + and the 1F are two separate tokens; but c(12E+,1F) and c(12E0,1F) are also fine, since 12E+ is itself a single preprocessor number.

This is also why you should always put spaces around the + operator in C. It's the basis of a classic trick question: "What is the value of 0x1E+2?" (Answer: it's an error. 0x1E+2 is lexed as a single preprocessor number, because the preprocessor-number grammar lets a sign follow an E, and it then fails to convert into a valid token. With spaces, 0x1E + 2 is just 32.)

Finally, the explanation for the awk line:

$ awk 'BEGIN{print 42foo + 3}'
423

That's lexed by awk as BEGIN{print 42 foo + 3} which is then parsed as though it had been written BEGIN{print (42)(foo + 3);}. In awk, string concatenation is written without an operator, but it binds less tightly than any arithmetic operator. Consequently, the usual advice is to use explicit parentheses in expressions which involve concatenation, unless they are really simple. (Also, undefined variables are assumed to have the value 0 if used arithmetically and "" if used as strings.)

Other Tips

I disagree with other answers here. It should be done by the lexer. If the character following the digits isn't whitespace or a special character, you're in the middle of an illegal token, specifically an identifier that doesn't start with a letter.

Or else just return the 42 and the bar separately and let the parser handle it as a syntax error.

Yes, contextual checks like this belong in the parser.

Also, you say that foo = 42bar is invalid. From the lexer's perspective, it is not, though. The 4 tokens recognized by your lexer are (probably) correct (you didn't post your token definitions).

foo = 42bar may or may not be a valid statement in your language.

Edit: I just realized that that's actually an invalid token for your language. So yes, it would fail the lexer at that point, because you have no rule matching it. Otherwise, what would it be, InvalidTokenToken?

But let's say that it was a valid token. Say you write a lexer rule saying that id = <number> is ok... what do you do about id = <number> + <number> - <number>, and all of the various combinations that that leads to? How is the lexer going to give you an AST for any of those? This is where the parser comes in.

Are you using a parser combinator framework? I ask because with those, the distinction between parser rules and lexer rules sometimes starts to seem arbitrary, especially since you may not have an explicit grammar in front of you. But the language you're parsing still has a grammar, and what counts as a parser rule is each production of the grammar. At the very "bottom" you have rules which describe a single terminal, like "a number is one or more digits", and this, and this alone, is what the lexer gets used for, the reason being that it can speed up the parser and simplify its implementation.
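
To make the division of labour concrete, here is a sketch of the "let the parser reject it" option: the lexer emits 42 and bar as separate tokens, and a tiny recursive-descent parser (for an assumed grammar stmt := IDENT "=" expr, expr := INT (("+" | "-") INT)*) rejects foo = 42bar because bar shows up where an operator or the end of input is expected:

    def parse_stmt(tokens):
        pos = 0

        def expect(kind):
            nonlocal pos
            if pos >= len(tokens) or tokens[pos][0] != kind:
                found = tokens[pos][1] if pos < len(tokens) else "end of input"
                raise SyntaxError(f"expected {kind}, found {found!r}")
            pos += 1
            return tokens[pos - 1][1]

        name = expect("IDENT")
        expect("EQ")
        expect("INT")                       # expr: first INT ...
        while pos < len(tokens) and tokens[pos][1] in ("+", "-"):
            pos += 1                        # ... then (op INT)* pairs
            expect("INT")
        if pos != len(tokens):              # leftover token, e.g. the stray bar
            raise SyntaxError(f"unexpected {tokens[pos][1]!r} after expression")
        return name

    print(parse_stmt([("IDENT", "foo"), ("EQ", "="), ("INT", "42")]))   # foo
    parse_stmt([("IDENT", "foo"), ("EQ", "="), ("INT", "42"),
                ("IDENT", "bar")])   # SyntaxError: unexpected 'bar' after expression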

License: CC-BY-SA with attribution
Not affiliated with StackOverflow