Question

I have been reading about the differences between syntax and semantics, but I am still unsure about this case.

Let's assume we have a language that only allows integers in the range 0-255 (i.e. 8 bits), and the parser receives 256 or 257.

Should that be an error of syntax or semantics?



Solution

It depends on the grammar of the language.

If the grammar precludes recognizing 256 or 257, then those character sequences are syntax errors: the input is simply outside what the grammar can recognize. For example, the grammar might admit one or two digits, as well as 1[0-9][0-9], 2[0-4][0-9], and 25[0-5], and nothing else.
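
As a rough sketch (the class name and layout are invented for this illustration), such a token rule can be written down directly, for example as a Java regular expression:

    import java.util.regex.Pattern;

    public class ByteLiteralToken {
        // The token grammar itself only admits 0-255: one or two digits,
        // or 1[0-9][0-9], or 2[0-4][0-9], or 25[0-5], and nothing else.
        static final Pattern BYTE_LITERAL =
                Pattern.compile("[0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]");

        public static void main(String[] args) {
            System.out.println(BYTE_LITERAL.matcher("255").matches()); // true
            System.out.println(BYTE_LITERAL.matcher("256").matches()); // false: a syntax error
        }
    }

With a grammar like this, 256 never becomes a legal token at all, so the error surfaces at the syntax level.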

When we get a syntax error, the parser cannot find a legal parse for the input, so it has to be an error. A good language implementation might still try its best to build parse output while knowing it is accepting an illegal program, e.g. in order to report further syntax errors instead of simply stopping at the first one.

However, if the grammar allows arbitrary digit strings while the language precludes values above 255, these have to be caught as semantic or even runtime errors (e.g. 128 + 128 could be either a semantic or a runtime error).
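
Here is a minimal sketch of that looser style, using a hypothetical checkLiteral helper: the grammar-level rule accepts any digit string, and the 0-255 restriction is enforced afterwards as a semantic rule:

    public class SemanticRangeCheck {
        // The token is just "a string of digits"; the 0-255 restriction is
        // enforced afterwards, as a semantic rule, not by the grammar.
        static int checkLiteral(String digits) {
            int value = Integer.parseInt(digits); // lexically fine: it is a digit string
            if (value > 255) {
                throw new IllegalStateException("semantic error: " + digits + " is outside 0-255");
            }
            return value;
        }

        public static void main(String[] args) {
            System.out.println(checkLiteral("255")); // 255
            System.out.println(checkLiteral("256")); // rejected here, not by the parser
        }
    }

An overflowing expression such as 128 + 128 would then be caught either by a similar rule during constant evaluation or only at run time, depending on the language.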

There is thus some latitude in how these things are handled, and it comes down to whether parsing can proceed: a more general grammar might accept the program containing 256 so that the compiler can continue past parsing and report other errors in the program as well.

Some languages publish their official grammar; others have unofficial grammars.

OTHER TIPS

I am pretty sure (though I don't have a proof for this) that you can turn any static error into a syntax error by making the grammar arbitrarily more complex.

Historically, we have distinguished between lexical analysis (lexing / scanning), syntactic analysis (parsing), and semantic analysis (type checking, type inference, name resolution, etc.) for performance reasons: it was not possible to fit an entire compiler into the memory of a computer, so we broke it down into several components.

  1. The reader turns a stream of bits into a stream of characters (this is usually a simple mapping table, but it might be more complex in the case of variable-length encodings such as UTF-8)
  2. The lexer turns a stream of characters into a stream of tokens (typically, a lexer recognizes a regular language, i.e. it can be implemented as a finite automaton; see the sketch after this list)
  3. The parser turns a stream of tokens into a parse tree (concrete syntax tree) (typically, a parser recognizes a context-free language)
  4. The semantic analyzer first turns a parse tree into an abstract syntax tree (using arbitrary Turing-capable computation)
  5. The various further components of the compiler annotate the AST with additional information, transform it into a different representation, etc.
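
As a toy illustration of stage 2 (and of the token stream it hands to stage 3), here is a hand-written Java lexer for a tiny expression language; the token names and the Token record are invented for this sketch (Java 16+ for the record syntax):

    import java.util.ArrayList;
    import java.util.List;

    public class TinyLexer {
        // Stage 2 in miniature: characters in, tokens out. The loop below is
        // effectively a finite automaton over digits, '+' and whitespace.
        record Token(String kind, String text) {}

        static List<Token> lex(String source) {
            List<Token> tokens = new ArrayList<>();
            int i = 0;
            while (i < source.length()) {
                char c = source.charAt(i);
                if (Character.isDigit(c)) {
                    int start = i;
                    while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
                    tokens.add(new Token("NUMBER", source.substring(start, i)));
                } else if (c == '+') {
                    tokens.add(new Token("PLUS", "+"));
                    i++;
                } else if (Character.isWhitespace(c)) {
                    i++; // skip whitespace between tokens
                } else {
                    throw new IllegalStateException("lexical error at position " + i + ": '" + c + "'");
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Produces three tokens: NUMBER "128", PLUS "+", NUMBER "128".
            System.out.println(lex("128 + 128"));
        }
    }

A parser (stage 3) would then consume this token list and build a tree for NUMBER PLUS NUMBER.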

Nowadays, this is no longer strictly necessary. The reader is typically supplied by the operating system or the language's standard library. Scannerless parsers are a thing, i.e. parsers that integrate scanning into the parsing process.

And, if we want to, we can build arbitrarily complex parsers that recognize much more than just context-free languages. Modern lexer and parser frameworks let you build lexers and parsers that are more powerful than the traditional regular / context-free ones, and they let you attach arbitrary code in a Turing-complete language. For example, when you generate C code with GNU Bison, you can embed arbitrary C code of your own into the generated parser, which can interoperate with and influence the generated part of the code.

In your example, it would be easy to modify the grammar to only allow integers from 0–255. In a language like Java, where you can use bytes in some places and longs in others, expressing this as a grammar would be much more complex. So what the designers of Java did was to define just one form of integer literal, and then define semantic rules for how that literal is interpreted in different contexts.
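
For instance, the snippet below shows the same kind of decimal literal being accepted by the parser in every position, with a later semantic rule deciding whether it fits the target type; the commented-out line is the one javac rejects (the exact error wording varies by compiler version):

    public class LiteralContexts {
        public static void main(String[] args) {
            byte small = 127;      // OK: the constant 127 fits in a byte, so narrowing is allowed
            long wide  = 128;      // OK: the same kind of literal, in a context where it fits
            // byte tooBig = 128;  // rejected during semantic analysis, not by the parser:
            //                     // "incompatible types: possible lossy conversion from int to byte"
            System.out.println(small + ", " + wide);
        }
    }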

However, I do believe it would be possible to check this in the parsing stage and thus make it a syntax error; it would just make the parser more complex. In fact, it would make the parser as complex as the whole analysis part of the compiler (everything except the optimizer and code generator).

You decide as the compiler writer. If you are a user of a compiler, it doesn't really make much difference.

But there are different situations in which different ranges of integer literals are acceptable, so producing different scanner tokens for different ranges and then handling this in the syntax would be an absolute pain for the compiler writer. And what if you declare "int a[100];", where the assignment a[99] = 0; is legal but a[100] = 0; is not? This can only reasonably be checked with a semantic rule (or at run time).
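
By way of illustration, the corresponding Java program parses and compiles without complaint; the out-of-bounds index is only caught at run time:

    public class ArrayBounds {
        public static void main(String[] args) {
            int[] a = new int[100];
            a[99] = 0;   // legal: last valid index
            a[100] = 0;  // parses and compiles, but throws ArrayIndexOutOfBoundsException at run time
        }
    }

A language could of course reject the constant index 100 with a compile-time semantic rule, but no practical grammar is going to express "the index must be smaller than the declared length".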

Licensed under: CC-BY-SA with attribution