Conditional Dependencies in Compiler Semantic Analysis Passes

Question 1

Why aren't standard optimization techniques adequate?

Presumably, the Excel formulas form a DAG with the leaves being primitive values and the nodes being computations/assignments. (If the Excel computation forms a cycle, then you need some kind of iterative solver assuming you want a fixpoint).

If you simply propagate the conditional by lifting it (a class compiler optimization), we start with your original equations, where each computation is evaluated in any order WRT to others, such that the result computes dag-like (that "anyorder" is an operator intending to model that):

X = if(COND == TRUE) then 0 else Y;
     anyorder
Y = if(COND == TRUE) then X else 1;

then lifting the conditional:

if (COND)  { X=0; } else { X = 1; }
      anyorder
if (COND)   { Y=X; } else { Y = 1; }

then

if (COND)  { X=0; anyorder Y=X; } else { X = Y; anyorder Y = 1; }

Each of the arms must be dag-like. The first arm is daglike evaluating the X=0 assignment first. The second arm is daglike evaluating Y=1 first. So, we get the answer you wanted:

if (COND)  { X=0; Y=X; } else { Y = 1; X = Y; }

So conventional transformations and knowledge about anyorder-if-daglike knowledges seems to give the right effect.

I'm not sure what you do if COND is computed as a function of the cells.

I suspect the way to do this is to generate a dependency graph of computations with with conditionals on the dependencies. You probably have to propagate/group those conditionals over the arcs more as less as I did over the syntax.

Question 2

Yes, literature exists, sorry I cannot quote any, I simply don't remember and would it just google up just as you can..

Basic algos for dependency and cycle analysis are really simple. I.e. detect symbols in the expression, build a set of expressions and dependencies in form:

    inps             expr      outs
cell_A6, cell_B7 -> expr3 -> cell_A7
cell_A1, cell_B4 -> expr1 -> cell_A5
cell_A1, cell_A5 -> expr2 -> cell_A6

and then by comparing and iteratively expanding/replacing sets of inputs/outputs:

step0:
cell_A6, cell_B7 -> expr3 -> cell_A7
cell_A1, cell_B4 -> expr1 -> cell_A5   <--1 note that cell_A5 ~ (A1,B4)
cell_A1, cell_A5 -> expr2 -> cell_A6      <--1 apply that knowledge here

so dependency
cell_A1, cell_A5 -> expr2 -> cell_A6
morphs into
cell_A1, cell_B4 -> expr2 -> cell_A6   <--2 note that cell_A6 ~ (A1,B4) and so on

Finally, you will get either a set of full dependencies, where you can easily detect circular dependencies, like for example:

cell_A1, cell_D6, cell_F7 -> exprN -> cell_D6

or, if none found - you will be able to determine a safe, incremental order of the execution.

If the expressions contain branches or sideeffects other than the 'returned value', you can apply various transformations to reduce/expand the expressions into new ones, or into groups of new expressions that will be of the form above. For example:

B5 = { if(A5 + A3 > 0) A3-1 else A5+1 }

so

   inps    ...       outs
A3, A5 -> theExpr -> B5

the condition can be 'lifted' and form two conditional rules:

A5 + A3 > 0  : A3 -> reducedexpr "A3-1" -> B5
A5 + A3 <= 0 : A5 -> reducedexpr "A5-1" -> B5

but now, your execution/analysis must also take care of the conditions before applying the rules. Lifting is only one of possible transformations.

However, you stil need something more than that, at least some an 'extension' for it. The hard part of your problem is that your expressions are complex, have branches, and you need to include user-random input to resolve branches to eliminate the dead branches and break dead dependencies.

Since the key is elimination of dead dependencies, you have to somehow detect dead branches. Conditions can be of any arbitrary complexity, and user-input is random, so you cannot work it out completely statically, really. After playing with transformations, you would still have to analyze the conditions and generate code accordingly. To do so, you would need to generate code for all possible combinations of the outcomes of the conditions, and all resulting branching and rule combinations, which is simply infeasible except for some trivial cases. With number of unknown the number of leafs can grow exponentially (2^N) which is a huge bloat after crossing some threshold.

Of course while analyzing conditions based on Bools, you can analyze, group and eliminate conflicting conditions like (a & b & !a)..

..but if your input values and conditions include NON-BOOL data, like integers or floating or strings, just imagine your condition is have a condition that executes some external weird statistical function and checks its result.. Ignore the 'weird' part and focus on 'external'. If you meet some expressions that use complex functions like AVG or MAX, you cannot chew through something like that statically(*). Even simple arithmetic is hard to analyze: (a+b)*(c+d) - you could derive a fact that c+d can be ignored when a+b==0, but this a really tough task to cover fully..

IIRC, doing a satisfiability analysis (SAT) for boolean expressions with basic operators is an NP-hard problem, not mentioning integers or floating points with all their math.. Calculating the result of expression is much easier than telling which values does it really depend on!!

So, since input values may be either hardcoded (cool) or user-supplied at runtime (doh!), your compler most probably will not be able to fully analyze it up front. Now link it with the fact marked as (*) and it's quite obvious that you can include some static analysis and try to eliminate some branches at 'compilation time', but still there might be some parts that must be delayed until the user provides the actual input.

So, if part of the analysis must be done at runtime, all the branch elimination is just an optional optimisation and I think you should focus on the runtime part now.

At minimal unoptimized version, your generated program could simply remember all the excel-expressions and wait for input data. Once the program is run and input is given, the program has to substitute the input in the expressions, and then try to iteratively reduce them to output values.

Writing such algo in imperative language is completely possible. Actually, you'd need to write it once, and later you'd just merge it with a different sets of rules derived from cell-formulas and done. Runtime part of the program would be the same, formulas would change.

You could then expand the 'compiler' side to try to help by i.e. preliminarily partially analyzing the dependencies and trying to reorder the rules so later they will be checked in a "better order", or by precalculating constants, or inlining some expressions and so on but as I said, it's all optimizations, not core feature.

Sadly, I cannot really tell you much anything serious about the "functional languages", but since usually their runtimes are 'very dynamic' and sometimes they even execute the code in terms of symbols and transformations, it could reduce the complexity of your 'compiler' and 'engine' part. The most valuable asset here is the dynamism. So, even a Ruby would do much better than C - but in no way it's a "compiled" language as you'd say.

For example, you could try to transform excel rules directly into functions:

def cell_A5 = expr1(cell_A1, cell_B4)
def cell_A7 = expr3(cell_A6, cell_B7)
def cell_A6 = expr2(cell_A1, cell_A5)

write it down as part of the program, then when at runtime when the user provides some values, you'd those would just redefine some of the parts of the program

cell_B7 = 11.2  // filling up undefined variable
cell_A1 = 23    // filling up undefined variable
cell_A5 = 13    // overwriting the function with a value

That's the power of dynamic platforms, nothing very 'functional' here. Dynamic platforms make it easy to fill/override bits. But then, once the user provided some bits and once the program has been "corrected on the fly", which one function would you call first?

The answer is somewhat sad.. You don't know.

If your dynamic language has some rule-engine built into it, you can try generating rules instead of functions and later rely on that engine to "fill up" everything that is possible to calculate.

But if it doesn't have rule engine, you are back to point one..

afterthought:

Hm.. sorry, I think I just wrote too much and too vaguely/chatty. If you think it's helpful, please drop me a comment. Otherwise I'll delete it after few days or a week.