Question

I am hand-coding a compiler in C++ (C++98), and am looking for feedback/ideas how the following case should be handled (see title for what the grammar is supposed to be like for arrays):

{
    int b;
    int m, n;
    int a[n];

    b = a[b*m]; (1)
}

Bound-checking, run-time stack extension (and shrinking when leaving scope), etc, can be handled well in my framework.

My question relates to emitting warnings in the case of using uninitialized variables, as in (1) above. If a is an integer variable that is used before it is initialized, I allow it, but emit a warning message. As I want to make the compiler be ready to possibly have a linker later, calculating b and m backward and actually checking if this index of a has been initialized cannot be done (eg, in the most general case, a variable could be defined in a different file). As said, I know how to emit code to do the bound-checks at run-time; but...

...what is a best-practice way to extend emitting a warning when a variable is used uninitialized to the case of this variable being of the form a[(expr)]? As (expr) doesn't have to be an integer value (a mere number; it only needs to be of integer type), without evaluating (expr) (which I don't want to do as said just above), I cannot keep, say, a shadow array with entries marked initialized. In (gcc) C the case is simply ignored: a warning that both b and m are uninitialized is emitted, but none relating to using an uninitialised variable a[b*m]. This is obviously in line with C's view of arrays, and in C (but not the language I am working on, which has no concept of pointers) the expression is well-defined, other than b and m not being initialised, and the access being (probably) out of bound (an imminent stack overflow).

Is it best practice to simply not check if a[(expr)] has been initialized before usage; emit code; and wait for a run-time error, if any? Or...?

Was it helpful?

Solution

It's almost impossible to tell in EVERY case, for example, consider this:

void foo(int &x, int y)
{
   switch(y)
   {
      case 1:
         x = 11;
         break;
      case 2:
         x = 42;
         break;
      ...  // numbers 3-9 elided for brevity
      case 10:
         x = 97;
         break;
   }
}

int bar(int z)
{
    int a;
    foo(a, z);
}

Is a initialized or not? Well, depends on the value of z. If you have ALL the code available to you, you could, at least in theory, follow all paths to see what possible values z will have (assuming, of course, z doesn't come as input from an external source - in which case it's potentially badly written code if it doesn't range-check, but a lot of code will just happily accept that input is "ok").

So, you have to take one of two routes:

  1. Choose to assume that variables that are "plausibly initialized" are indeed initialized, and only give a warning when it's certain that something is not initialized.

  2. Choose to assume that variables that are "plausibly initialized" are indeed NOT initialized, and give a warning.

GCC does, at least sometimes, warn for things where you can, as a human deduce that it can not possibly be uninitialized (because it always takes one of several paths). We have found this at work at times, where a certain piece of code will compile just fine in one configuration (with low optimisation levels) and fails at a higher optimisation level, and given that we use -Werror, the build fails. In this particular case, there wasn't much overhead with using an extra initialization, but sometimes that can get annoying/inefficient too. You're never going to please everyone (but you could perhaps allow for a option of "be extra paranoid" and warn at all times it's plausibly uninitialized).

Of course, if it's your own language, and you don't care that much about performance (perhaps "when checking is enabled"), you could add an extra element for each variable to indicate if it has been initialized, and during expression evaluation determine if it's been initialized or not. It does however require a fair bit more instructions to check a boolean every time you use a variable [or on the first use, if you can determine that - but bear in mind that there may be branches!]

Or always initialize variables that aren't initialized to a "crazy value" (e.g. 0xdeaddead or some such) - this will nearly always lead to a crash when used in an array.

Of course, it's always better to catch as much as possible through the compile phase - it's just a matter of whether you can reliably do that (and how much effort/time it takes). Anything that you detect during testing after the code has been compiled "costs" more to find and fix.

OTHER TIPS

I think trying to initialize-check everything leads down the road to pain.

Consider what happens when elements of a[b*m] are conditionally initialized, i.e. the elements of a of which are initialized depends on the input arguments. You'd have to track not only an initialized bit in your "shadow array copy", but an entire conditional execution graph to make sure all execution paths are covered. And ultimately that's an undecidable problem on a Turing machine, even; to solve this you'd have to decide the halting problem (to tell if any particular subgraph of your execution graph even finishes executing).

You could do some heuristics to warn only over some subset of uninitialized cases, but that would just mean your compiler emits warnings sometimes and not other times. As a programmer you should know how infuriating seemingly undeterministic behavior like that is.

Your title doesn't match your question text.

Assuming your question is: "How to detect use of undefined variables", if your language isn't intended for extreme performance you can always define a bit pattern for values that means "undefined" (-2^31 is great for 32 bit signed integers), and generate code that checks for undefined values on a fetch. This is pretty easy.

If your question is, "How to detect out-of-bound array accesses", and especially given that your language doesn't have pointers, each array can carry its own array bounds, and array accesses can check that the indexes are within limits. This is pretty easy.

If you want a high performance language, then the other two techniques are likely too expensive. You'll need to implement range analyses on expressions in the compiler, to estimate what the range of an indexed access is. This is pretty hard.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top