Constants and compile time evaluation - Why change this behaviour

https://stackoverflow.com/questions/8990722

19-04-2021

Question

If you skip forward to approximately 13 minutes into this video by Eric Lippert, he describes a change that was made to the C# compiler that renders the following code invalid (apparently, up to and including .NET 2, this code would have compiled).

int y;
int x = 10;
if (x * 0 == 0)
    y = 123;

Console.Write(y);

Now I understand that clearly any execution of the above code actually evaluates to

int y;
int x = 10;
y = 123;
Console.Write(y);

But what I don't understand is why it is considered "desirable" to make the above code uncompilable. That is: what are the risks of allowing such inferences to run their course?

Solution

I'm still finding this question a bit confusing but let me see if I can rephrase the question into a form that I can answer. First, let me re-state the background of the question:

In C# 2.0, this code:

int x = 123;
int y;
if (x * 0 == 0) 
    y = 345;
Console.WriteLine(y);

was treated as though you'd written

int x = 123;
int y;
if (true) 
    y = 345;
Console.WriteLine(y);

which in turn is treated as:

int x = 123;
int y;
y = 345;
Console.WriteLine(y);

Which is a legal program.

But in C# 3.0 we took the breaking change to prevent this. The compiler no longer treats the condition as being "always true" despite the fact that you and I both know that it is always true. We now make this an illegal program, because the compiler reasons that it does not know that the body of the "if" is always executed, and therefore does not know that the local variable y is always assigned before it is used.

Why is the C# 3.0 behaviour correct?

It is correct because the specification states that:

a constant expression must contain only constants. x * 0 == 0 is not a constant expression because it contains a non-constant term, x.
the consequence of an if is only known to be always reachable if the condition is a constant expression equal to true.

Therefore, for the code given, the compiler should not classify the consequence of the conditional statement as always reachable, and therefore should not classify the local y as definitely assigned.
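To make the distinction concrete, here is a minimal sketch contrasting a genuine constant expression with the non-constant one from the question. The class and method names are invented for illustration; the rejected variant is left in a comment so that the block still compiles.

```csharp
using System;

static class ConstantConditionDemo
{
    // 'c' is a compile-time constant, so 'c * 0 == 0' contains only constants.
    // It is a constant expression with value true, and 'y' counts as
    // definitely assigned after the if statement.
    public static int WithConstant()
    {
        const int c = 10;
        int y;
        if (c * 0 == 0)
            y = 123;
        return y; // accepted
    }

    // The same shape with a non-constant local is rejected from C# 3.0 on:
    //
    //     int x = 10;
    //     int y;
    //     if (x * 0 == 0)
    //         y = 123;
    //     return y; // error CS0165: Use of unassigned local variable 'y'

    static void Main()
    {
        Console.WriteLine(WithConstant()); // prints 123
    }
}
```

The only difference between the two variants is the const modifier, which is exactly what the specification's definition of a constant expression looks at.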

Why is it desirable that a constant expression contain only constants?

We want the C# language to be clearly understandable by its users, and correctly implementable by compiler writers. Requiring that the compiler make all possible logical deductions about the values of expressions works against those goals. It should be simple to determine whether a given expression is a constant, and if so, what its value is. Put simply, the constant evaluation code should have to know how to perform arithmetic, but should not need to know facts about arithmetical manipulations. The constant evaluator knows how to multiply 2 * 1, but it does not need to know the fact that "1 is the multiplicative identity on integers".
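The difference between "performing arithmetic" and "knowing facts about arithmetic" can be sketched with a toy expression model. Everything here (the Expr, Const, Var and Mul types and the Fold and Optimize passes) is hypothetical and much simpler than a real compiler; it only illustrates the division of labour described above.

```csharp
using System;

// Hypothetical mini-AST, for illustration only.
abstract class Expr { }
sealed class Const : Expr { public readonly int Value; public Const(int v) { Value = v; } }
sealed class Var : Expr { public readonly string Name; public Var(string n) { Name = n; } }
sealed class Mul : Expr
{
    public readonly Expr Left, Right;
    public Mul(Expr l, Expr r) { Left = l; Right = r; }
}

static class Passes
{
    // Constant folding: performs the multiplication, but only when both
    // operands are already constants. It knows no algebraic identities.
    public static Expr Fold(Expr e)
    {
        if (e is Mul m)
        {
            Expr l = Fold(m.Left), r = Fold(m.Right);
            if (l is Const a && r is Const b) return new Const(a.Value * b.Value);
            return new Mul(l, r);
        }
        return e;
    }

    // Arithmetic optimization: additionally knows the fact
    // "anything times zero is zero" (top-level node only, to keep the sketch short).
    public static Expr Optimize(Expr e)
    {
        Expr f = Fold(e);
        if (f is Mul m && (IsZero(m.Left) || IsZero(m.Right))) return new Const(0);
        return f;
    }

    static bool IsZero(Expr e) => e is Const c && c.Value == 0;
}

static class FoldDemo
{
    static void Main()
    {
        Expr twoTimesOne = new Mul(new Const(2), new Const(1));
        Expr xTimesZero = new Mul(new Var("x"), new Const(0));

        Console.WriteLine(Passes.Fold(twoTimesOne) is Const);    // True: 2 * 1 is plain arithmetic
        Console.WriteLine(Passes.Fold(xTimesZero) is Const);     // False: x is not a constant
        Console.WriteLine(Passes.Optimize(xTimesZero) is Const); // True: the optimizer knows x * 0 == 0
    }
}
```

In this sketch only Fold would be allowed to influence whether a program is legal; Optimize is the "clever" part that should see only programs already known to be legal.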

Now, it is possible that a compiler writer might decide that there are areas in which they can be clever, and thereby generate more optimal code. Compiler writers are permitted to do so, but not in a way that changes whether code is legal or illegal. They are only allowed to make optimizations that make the output of the compiler better when given legal code.

How did the bug happen in C# 2.0?

What happened was the compiler was written to run the arithmetic optimizer too early. The optimizer is the bit that is supposed to be clever, and it should have run after the program was determined to be legal. It was running before the program was determined to be legal, and was therefore influencing the result.

This was a potential breaking change: though it brought the compiler into line with the specification, it also potentially turned working code into error code. What motivated the change?

LINQ features, and specifically expression trees. If you said something like:

(int x)=>x * 0 == 0

and converted that to an expression tree, do you expect that to generate the expression tree for

(int x)=>true

? Probably not! You probably expected it to produce the expression tree for "multiply x by zero and compare the result to zero". Expression trees should preserve the logical structure of the expression in the body.

When I wrote the expression tree code, it was not yet clear what the design committee was going to decide: whether

()=>2 + 3

was going to generate the expression tree for "add two to three" or the expression tree for "five". We decided on the latter -- constants are folded before expression trees are generated, but arithmetic should not be run through the optimizer before expression trees are generated.
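Both decisions can be observed by inspecting the trees the compiler builds. This is a small sketch using System.Linq.Expressions; the variable names are mine, and the node types in the comments assume the default unchecked arithmetic context.

```csharp
using System;
using System.Linq.Expressions;

static class ExpressionTreeDemo
{
    static void Main()
    {
        // Not run through the optimizer: the tree keeps the multiplication
        // and the comparison as written.
        Expression<Func<int, bool>> keepsShape = x => x * 0 == 0;
        Console.WriteLine(keepsShape.Body.NodeType);   // Equal
        var comparison = (BinaryExpression)keepsShape.Body;
        Console.WriteLine(comparison.Left.NodeType);   // Multiply

        // Constant-folded: 2 + 3 is a constant expression, so the tree
        // holds a single constant node rather than an addition.
        Expression<Func<int>> folded = () => 2 + 3;
        Console.WriteLine(folded.Body.NodeType);       // Constant
    }
}
```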

So, let's consider now the dependencies that we've just stated:

Arithmetic optimization has to happen before codegen.
Expression tree rewriting has to happen before arithmetic optimizations.
Constant folding has to happen before expression tree rewriting.
Constant folding has to happen before flow analysis.
Flow analysis has to happen before expression tree rewriting (because we need to know if an expression tree uses an uninitialized local).

We've got to find an order to do all this work in that honours all those dependencies. The compiler in C# 2.0 did them in this order:

constant folding and arithmetic optimization at the same time
flow analysis
codegen

Where can expression tree rewriting go in there? Nowhere! And clearly this is buggy, because flow analysis is now taking into account facts deduced by the arithmetic optimizer. We decided to rework the compiler so that it did things in the order:

constant folding
flow analysis
expression tree rewriting
arithmetic optimization
codegen

Which obviously necessitates the breaking change.
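As a cross-check, the five dependencies listed earlier can be fed to a simple topological sort. The sketch below is not how the compiler is organized; PhaseOrder and its members are names made up for the example, and the phase names are just strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PhaseOrder
{
    // Each pair (Before, After) restates one dependency from the list above.
    static readonly (string Before, string After)[] Dependencies =
    {
        ("arithmetic optimization", "codegen"),
        ("expression tree rewriting", "arithmetic optimization"),
        ("constant folding", "expression tree rewriting"),
        ("constant folding", "flow analysis"),
        ("flow analysis", "expression tree rewriting"),
    };

    // Repeatedly pick a phase that no remaining dependency forces to wait.
    // Assumes the dependencies are acyclic; First() throws otherwise.
    public static List<string> Order()
    {
        var remaining = Dependencies.ToList();
        var phases = Dependencies
            .SelectMany(d => new[] { d.Before, d.After })
            .Distinct()
            .ToList();
        var order = new List<string>();
        while (phases.Count > 0)
        {
            string next = phases.First(p => remaining.All(d => d.After != p));
            order.Add(next);
            phases.Remove(next);
            remaining.RemoveAll(d => d.Before == next);
        }
        return order;
    }

    static void Main()
    {
        // prints: constant folding -> flow analysis -> expression tree rewriting -> arithmetic optimization -> codegen
        Console.WriteLine(string.Join(" -> ", Order()));
    }
}
```

The dependencies form a single chain, so this is the only order that satisfies all of them. An order that runs arithmetic optimization together with constant folding, as C# 2.0 did, cannot satisfy the requirement that expression tree rewriting come after folding but before optimization.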

Now, I did consider preserving the existing broken behaviour, by doing this:

constant folding
arithmetic optimization
flow analysis
arithmetic de-optimization
expression tree rewriting
arithmetic optimization again
codegen

Where the optimized arithmetic expression would contain a pointer back to its unoptimized form. We decided that this was too much complexity in order to preserve a bug. We decided that it would be better to instead fix the bug, take the breaking change, and make the compiler architecture more easily understood.

OTHER TIPS

The specification states that the definite assignment of something that is only assigned inside an if block is undetermined. The spec says nothing about compiler magic that removes the unnecessary if block. In particular, relying on such magic would make for a very confusing error message: you change the if condition and suddenly get an error about y not being assigned ("huh? I haven't changed when y is assigned!").

The compiler is free to perform any obvious code removal it wants to, but first it needs to follow the specification for the rules.

Specifically, section 5.3.3.5 (MS 4.0 spec):

5.3.3.5 If statements

For an if statement stmt of the form:

if ( expr ) then-stmt else else-stmt

v has the same definite assignment state at the beginning of expr as at the beginning of stmt.

If v is definitely assigned at the end of expr, then it is definitely assigned on the control flow transfer to then-stmt and to either else-stmt or to the end-point of stmt if there is no else clause.

If v has the state “definitely assigned after true expression” at the end of expr, then it is definitely assigned on the control flow transfer to then-stmt, and not definitely assigned on the control flow transfer to either else-stmt or to the end-point of stmt if there is no else clause.

If v has the state “definitely assigned after false expression” at the end of expr, then it is definitely assigned on the control flow transfer to else-stmt, and not definitely assigned on the control flow transfer to then-stmt. It is definitely assigned at the end-point of stmt if and only if it is definitely assigned at the end-point of then-stmt.

Otherwise, v is considered not definitely assigned on the control flow transfer to either the then-stmt or else-stmt, or to the end-point of stmt if there is no else clause.

For an initially unassigned variable to be considered definitely assigned at a certain location, an assignment to the variable must occur in every possible execution path leading to that location.

Technically, an execution path exists where the if condition is false. If y were also assigned in the else, that would be fine, but the specification makes no demand that the compiler spot that the if condition is always true.
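For completeness, here is a minimal sketch of the "also assigned in the else" case, which compiles without the compiler having to reason about the condition at all. The class and method names are illustrative only.

```csharp
using System;

static class ElseBranchDemo
{
    public static int AssignedOnBothPaths(int x)
    {
        int y;
        if (x * 0 == 0)
            y = 123;
        else
            y = -1; // never runs in practice, but covers the path the compiler must assume exists
        return y;   // accepted: y is assigned on every path the compiler considers
    }

    static void Main()
    {
        Console.WriteLine(AssignedOnBothPaths(10)); // prints 123
    }
}
```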

Licensed under: CC-BY-SA with attribution
