Writing a compiler - bound-checked arrays with integer expression bounds (best practice)

Question 1

It's almost impossible to tell in EVERY case, for example, consider this:

void foo(int &x, int y)
{
   switch(y)
   {
      case 1:
         x = 11;
         break;
      case 2:
         x = 42;
         break;
      ...  // numbers 3-9 elided for brevity
      case 10:
         x = 97;
         break;
   }
}

int bar(int z)
{
    int a;
    foo(a, z);
    return a;   // a is still uninitialized here if no case in foo() matched z
}

Is a initialized or not? That depends on the value of z. If you have ALL the code available to you, you could, at least in theory, follow every path to see what values z can take. That assumes, of course, that z doesn't come as input from an external source. Code that accepts such input without range-checking it is potentially badly written, but a lot of code will just happily assume that input is "ok".

So, you have to take one of two routes:

1. Choose to assume that variables that are "plausibly initialized" are indeed initialized, and only give a warning when it's certain that something is not initialized.
2. Choose to assume that variables that are "plausibly initialized" are indeed NOT initialized, and give a warning.
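The two routes can be pictured as two warning policies over the same analysis result. The following is only a minimal sketch: the names InitState, join, warnLenient and warnStrict are made up for illustration, and it assumes the compiler can see the body of foo() when analysing bar().

```cpp
// Hypothetical three-point lattice for one variable at one program point.
enum class InitState { No, Maybe, Yes };

// Merge the states arriving from two control-flow paths.
InitState join(InitState a, InitState b) {
    return a == b ? a : InitState::Maybe;
}

// Route 1: warn only when the variable is certainly uninitialized.
bool warnLenient(InitState s) { return s == InitState::No; }

// Route 2: warn whenever the variable is not provably initialized.
bool warnStrict(InitState s) { return s != InitState::Yes; }

// State of `a` in bar() after foo(a, z): ten cases assign x, but the
// switch has no default, so one path leaves x untouched.
InitState stateAfterSwitch() {
    InitState s = InitState::Yes;        // case 1 assigns
    for (int c = 2; c <= 10; ++c)
        s = join(s, InitState::Yes);     // cases 2..10 assign
    return join(s, InitState::No);       // "no case matched" path
}
```

For the example above the result is Maybe, so route 1 stays silent and route 2 warns.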

GCC does, at least sometimes, warn in cases where a human can deduce that the variable cannot possibly be uninitialized (because the code always takes one of several paths). We have seen this at work: a piece of code compiles just fine in one configuration (with a low optimisation level) but triggers the warning at a higher optimisation level, and since we use -Werror, the build fails. In this particular case there wasn't much overhead in adding an extra initialization, but sometimes that can get annoying or inefficient too. You're never going to please everyone (but you could perhaps offer a "be extra paranoid" option that warns whenever a variable is plausibly uninitialized).

Of course, if it's your own language, and you don't care that much about performance (perhaps only "when checking is enabled"), you could attach an extra flag to each variable to record whether it has been initialized, and check that flag during expression evaluation. This does, however, cost quite a few extra instructions, since a boolean has to be checked every time you use a variable [or only on the first use, if you can determine which use that is - but bear in mind that there may be branches!]
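As a rough sketch of what such a per-variable flag could look like in generated code (the Tracked wrapper and its store/load names are hypothetical, not taken from any existing compiler):

```cpp
#include <stdexcept>

// Hypothetical wrapper a compiler could emit for each variable when
// checking is enabled: the value plus one "has been written" flag.
template <typename T>
class Tracked {
    T value_{};
    bool initialized_ = false;
public:
    // Every store also sets the flag.
    void store(const T& v) { value_ = v; initialized_ = true; }

    // Every load pays for one extra test and branch.
    const T& load() const {
        if (!initialized_)
            throw std::logic_error("read of uninitialized variable");
        return value_;
    }
};
```

The cost is visible in load(): one comparison and one branch per use, plus the extra storage for the flag.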

Or always give variables that have no explicit initializer a "crazy value" (e.g. 0xdeaddead or some such) - this will nearly always lead to a crash when the value is used as an array index.

Of course, it's always better to catch as much as possible through the compile phase - it's just a matter of whether you can reliably do that (and how much effort/time it takes). Anything that you detect during testing after the code has been compiled "costs" more to find and fix.

Question 2

I think trying to initialize-check everything leads down the road to pain.

Consider what happens when elements of a[b*m] are conditionally initialized, i.e. which elements of a are initialized depends on the input arguments. You'd have to track not only an initialized bit in your "shadow array copy", but an entire conditional execution graph to make sure all execution paths are covered. And ultimately that is undecidable in general: to solve it you'd have to decide the halting problem (to tell whether any particular subgraph of your execution graph even finishes executing).

You could use some heuristics to warn about only a subset of the uninitialized cases, but that would just mean your compiler emits warnings sometimes and not other times. As a programmer you should know how infuriating seemingly nondeterministic behavior like that is.

Question 3

Your title doesn't match your question text.

Assuming your question is: "How to detect use of undefined variables", if your language isn't intended for extreme performance you can always define a bit pattern for values that means "undefined" (-2^31 is great for 32 bit signed integers), and generate code that checks for undefined values on a fetch. This is pretty easy.
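A minimal sketch of the check-on-fetch idea, assuming the language reserves -2^31 as the "undefined" pattern for 32-bit integers (kUndefined and checkedFetch are illustrative names):

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Assumption: the language reserves INT32_MIN (-2^31) as "undefined",
// so a user program can never legitimately hold that value.
constexpr std::int32_t kUndefined = std::numeric_limits<std::int32_t>::min();

// Code the compiler would emit around each fetch of an integer variable.
std::int32_t checkedFetch(const std::int32_t& slot) {
    if (slot == kUndefined)
        throw std::runtime_error("use of undefined value");
    return slot;
}
```

Every variable slot would start out as kUndefined. The trade-off is that the reserved value drops out of the usable integer range, so arithmetic that could produce it needs its own handling.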

If your question is, "How to detect out-of-bound array accesses", and especially given that your language doesn't have pointers, each array can carry its own array bounds, and array accesses can check that the indexes are within limits. This is pretty easy.
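For illustration, an array that carries its own bounds might look roughly like this at run time. BoundedArray is a hypothetical name, and the sketch assumes inclusive lower and upper bounds that were computed from integer expressions when the array was declared:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical runtime representation: storage plus inclusive bounds.
class BoundedArray {
    long lo_, hi_;
    std::vector<int> data_;
public:
    // An empty range (hi < lo) yields an array with no elements.
    BoundedArray(long lo, long hi)
        : lo_(lo), hi_(hi),
          data_(hi >= lo ? static_cast<std::size_t>(hi - lo + 1) : 0) {}

    // Every indexed access is checked against the stored bounds.
    int& at(long i) {
        if (i < lo_ || i > hi_)
            throw std::out_of_range("array index out of bounds");
        return data_[static_cast<std::size_t>(i - lo_)];
    }
};
```

With BoundedArray a(1, 10), the accesses a.at(1) and a.at(10) succeed, while a.at(0) and a.at(11) throw.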

If you want a high performance language, then the two techniques above are likely too expensive. You'll need to implement range analysis on expressions in the compiler, to estimate the range of values each array index can take. This is pretty hard.
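The simplest form of range analysis is interval arithmetic over index expressions. The sketch below is only a starting point: Range, add, mul and provablyInBounds are hypothetical names, and it ignores overflow, loops and branch conditions, which is where the real difficulty lies.

```cpp
#include <algorithm>

// Hypothetical closed interval [lo, hi] of values an expression may take.
struct Range { long lo, hi; };

Range add(Range a, Range b) { return {a.lo + b.lo, a.hi + b.hi}; }

// The extremes of a product are always among the four corner products.
Range mul(Range a, Range b) {
    long p[] = {a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi};
    return {*std::min_element(p, p + 4), *std::max_element(p, p + 4)};
}

// The runtime check can be dropped only if the whole index range
// fits inside the array bounds.
bool provablyInBounds(Range index, Range bounds) {
    return bounds.lo <= index.lo && index.hi <= bounds.hi;
}
```

For an access like a[b*m] with b known to lie in [0, 3] and m in [0, 4], mul gives [0, 12], so the check can be removed for an array with bounds 0..15 but has to stay for one with bounds 0..10.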