Question

If every path through a program is tested, does that guarantee finding all bugs?

If not, why not? How could you go through every possible combination of program flow and not find the problem if one exists?

I hesitate to suggest that "all bugs" can be found, but maybe that is because path coverage isn't practical (as it is combinatorial) so it isn't ever experienced?

Note: this article gives a quick summary of coverage types as I think about them.


Solution

If every path through a program is tested, does that guarantee finding all bugs?

No.

If not, why not? How could you go through every possible combination of program flow and not find the problem if one exists?

Because even if you test all possible paths, you still haven't tested them with all possible values or all possible combinations of values. For example (pseudocode):

def Add(x as Int32, y as Int32) as Int32:
   return x + y

Test.Assert(Add(2, 2) == 4) //100% test coverage
Add(MAXINT, 5) //Throws an exception, despite 100% test coverage
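The same point in runnable Python (an illustrative sketch; `divide` is a made-up example, chosen because Python integers don't overflow):

```python
def divide(x: int, y: int) -> int:
    return x // y

# A single test executes every statement -- 100% coverage:
assert divide(4, 2) == 2

# Yet the function still fails for values the test never tried:
# divide(1, 0) raises ZeroDivisionError, despite 100% coverage.
```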

It is now two decades since it was pointed out that program testing may convincingly demonstrate the presence of bugs, but can never demonstrate their absence. After quoting this well-publicized remark devoutly, the software engineer returns to the order of the day and continues to refine his testing strategies, just like the alchemist of yore, who continued to refine his chrysocosmic purifications.

-- E. W. Dijkstra (Emphasis added. Written in 1988. It's been considerably more than 2 decades now.)

Other answers

In addition to Mason's answer, there is also another problem: coverage does not tell you what code was tested, it tells you what code was executed.

Imagine you have a test suite with 100% path coverage. Now remove all assertions and run the suite again. Voilà: it still has 100% path coverage, but it tests absolutely nothing.
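A minimal sketch of that idea (the function and test are hypothetical):

```python
def is_even(n: int) -> bool:
    return n % 2 == 1  # bug: this actually tests for odd

def test_is_even():
    # Executes every line of is_even, so a coverage tool reports 100%,
    # but asserts nothing -- the bug goes completely unnoticed.
    is_even(2)
    is_even(3)

test_is_even()  # "passes"
```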

Here's a simpler example to round things off. Consider the following sorting algorithm (in Java):

int[] sort(int[] x) { return new int[] { x[0] }; }

Now, let's test:

sort(new int[] { 0xCAFEBABE });

Now observe that (A) this particular call to sort returns the correct result, and (B) all code paths have been covered by this test.

But, obviously, the program does not actually sort.

It follows that coverage of all code paths is not sufficient to guarantee that the program has no bugs.

Consider the abs function, which returns the absolute value of a number. Here is a test (Python; imagine some test framework):

def test_abs_of_neg_number_returns_positive():
    assert abs(-3) == 3

This implementation is correct, but it only gets 60% code coverage:

def abs(x):
    if x < 0:
        return -x
    else:
        return x

This implementation is wrong, but it gets 100% code coverage:

def abs(x):
    return -x
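Running both versions against that single test makes the gap concrete (a sketch; the names `abs_correct` and `abs_wrong` are mine):

```python
def abs_correct(x):
    if x < 0:
        return -x
    else:
        return x

def abs_wrong(x):
    return -x  # wrong for x > 0, yet fully covered by the test below

# The only test in the suite -- both implementations pass it:
assert abs_correct(-3) == 3
assert abs_wrong(-3) == 3

# The bug only shows up for inputs the suite never tries:
# abs_wrong(3) returns -3.
```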

In yet another addition to Mason's answer: a program's behavior may depend on the runtime environment.

The following code contains a Use-After-Free:

#include <stdlib.h>

int main(void)
{
    int* a = malloc(sizeof(*a));
    int* b = a;
    *a = 0;
    free(a);
    *b = 12; /* use-after-free */
    return 0;
}

This code has undefined behavior: depending on the configuration (release or debug), the OS, and the compiler, it will behave differently. Not only will path coverage not guarantee that you find the use-after-free, but your test suite will typically not cover the various possible behaviors of the UAF, which depend on the configuration.

On another note, even if path coverage were to guarantee finding all bugs, it is unlikely that it can be achieved in practice on any program. Consider the following one:

int check(int a, int b)
{
    if (a != b) {
        if (cryptohash(a) == cryptohash(b)) {
            return ERROR;
        }
    }
    return 0;
}

If your test suite can generate all paths for this, then congratulations: you're a cryptographer.
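A Python rendering of the same idea, using SHA-256 from the standard library (the function name `check` and the `ERROR` constant are mine):

```python
import hashlib

ERROR = 1  # illustrative error code

def check(a: bytes, b: bytes) -> int:
    if a != b:
        if hashlib.sha256(a).digest() == hashlib.sha256(b).digest():
            # Covering this path requires two distinct inputs with the
            # same SHA-256 digest -- i.e. finding a hash collision.
            return ERROR
    return 0

assert check(b"x", b"x") == 0  # outer branch not taken
assert check(b"x", b"y") == 0  # outer branch taken; inner branch unreachable in practice
```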

It's clear from the other answers that 100% code coverage in tests does not mean 100% code correctness, or even that all bugs that could be found by testing, will be found (never mind bugs that no test could catch).

Another way of answering this question is one from practice:

There are, in the real world, and indeed on your own computer, many pieces of software that are developed using a set of tests that give 100% coverage and which yet still have bugs, including bugs that better testing would identify.

An entailed question therefore, is:

What is the point of code coverage tools?

Code coverage tools help to identify areas one has neglected to test. That can be fine (the code is demonstrably correct even without testing), it can be impossible to resolve (for some reason a path cannot be hit), or it can be the location of a great stinking bug, either now or following future modifications.

In some ways spell-check is comparable: Something can "pass" spell-check and be misspelled in such a way as to match a word in the dictionary. Or it can "fail" because correct words are not in the dictionary. Or it can pass and be utter nonsense. Spell-check is a tool that helps you identify places you may have missed in your proof-reading, but just as it cannot guarantee complete and correct proof-reading, so code-coverage cannot guarantee complete and correct testing.

And of course the incorrect way to use spell-check is famously to go with every suggestion ewe sea it suggest so the ducking thing becomes worse then if ewe left it a loan.

With code coverage it can be tempting, especially if you've a near-perfect 98%, to fill in cases so that the remaining paths are hit.

That is the equivalent of righting with spell-check sew that it's all words weather or knot it's all the appropriate words. The result is a ducking mess.

However, if you consider what tests the non-covered paths really need, the code-coverage tool will have done its job: not by promising you correctness, but by pointing out some of the work that needs to be done.

Path coverage cannot tell you whether all the required features have been implemented. Leaving out a feature is a bug, but path coverage will not detect it.

Part of the issue is that 100% coverage only tells you how the code behaves over a single execution. Some bugs, like memory leaks, may not be apparent or cause issues after a single execution, but will cause problems for the application over time.

For example, say you have an application which connects to a database. Perhaps in one method the programmer forgets to close the connection when they are done with their query. You could run several tests over this method and find no errors in its functionality, but your database server may run out of available connections, because this particular method never closed its connection and the open connections must now time out.
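The connection-leak scenario can be sketched with a fake pool that counts open connections (everything here is hypothetical, standing in for a real database driver):

```python
class FakePool:
    """Stand-in for a database connection pool (illustrative only)."""
    def __init__(self):
        self.open_connections = 0

    def connect(self):
        self.open_connections += 1
        return self

    def query(self, sql):
        return []  # pretend result set

    def close(self):
        self.open_connections -= 1

pool = FakePool()

def fetch_users():
    conn = pool.connect()
    return conn.query("SELECT * FROM users")  # bug: conn is never closed

# The functional test passes -- and gives full coverage of fetch_users:
assert fetch_users() == []

# But repeated calls leak connections, which no single-run test notices:
for _ in range(100):
    fetch_users()
assert pool.open_connections == 101
```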

If every path through a program is tested, does that guarantee finding all bugs?

As already said, the answer is NO.

If not, why not?

Besides what has already been said, there are bugs that appear at different levels and can't be caught by unit tests. Just to mention a few:

  • bugs caught with integration tests (unit tests shouldn't use real resources after all)
  • bugs in requirements
  • bugs in design and architecture

What does it mean for every path to be tested?

The other answers are great, but I just want to add that the condition "every path through a program is tested" is itself vague.

Consider this method:

$global = 0

def add(num1, num2)
  foo = "bar"  # useless statement
  $global += 1 # side effect
  num1 + num2  # actual work
end

If you write a test that asserts add(1, 2) == 3, a code coverage tool will tell you that every line is exercised. But you haven't actually asserted anything about the global side effect or the useless assignment. Those lines executed, but haven't really been tested.

Mutation testing would help find issues like this. A mutation testing tool would have a list of pre-determined ways to "mutate" the code and see if the tests still pass. For example:

  • One mutation might change the += to -=. That mutation would not cause a test failure, so it would prove that your test doesn't assert anything meaningful about the global side effect.
  • Another mutation might delete the first line. That mutation would not cause a test failure, so it would prove that your test doesn't assert anything meaningful about the assignment.
  • Still another mutation might delete the third line. That would cause a test failure, which in this case, shows that your test does assert something about that line.

In essence, mutation tests are a way to test your tests. But just like you'll never test the actual function with every possible set of inputs, you'll never run every possible mutation, so again, this is limited.

Every test we can do is a heuristic to move toward bug-free programs. Nothing is perfect.

Well... yes actually, if every path “through” the program is tested. But that means every possible path through the entire space of all possible states the program can have, including all variables. Even for a very simple statically compiled program – say, an old Fortran number cruncher – that's not feasible, though it can at least be imagined: if you have just two integer variables, you're basically dealing with all possible ways to connect points on a two-dimensional grid; it actually looks a lot like Travelling Salesman. For n such variables, you're dealing with an n-dimensional space, so for any real program the task is completely intractable.

Worse: for serious stuff, you have not just a fixed number of primitive variables; you create variables on the fly in function calls, or have variable-size variables... or anything like that, as is possible in a Turing-complete language. That makes the state space infinite-dimensional, shattering all hopes of full coverage, even given absurdly powerful testing equipment.


That said... actually, things aren't quite so bleak. It is possible to prove entire programs correct, but you'll have to give up a few ideas.

First: it's highly advisable to switch to a declarative language. Imperative languages, for some reason, have always been by far the most popular, but the way they mix algorithms with real-world interactions makes it extremely difficult to even say what you mean by “correct”.

Much easier in purely functional programming languages: these have a clear distinction between the real interesting properties of mathematical functions, and the fuzzy real-world interactions you can't really say anything about. For the functions, it is very easy to specify “correct behavior”: if for all possible inputs (from the argument types) the corresponding desired result comes out, then the function behaves correctly.

Now, you say that's still intractable... after all, the space of all possible arguments is in general also infinite-dimensional. True – though for a single function, even naïve coverage testing takes you far further than you could ever hope for in an imperative program. However, there is an incredibly powerful tool that changes the game: universal quantification / parametric polymorphism. Basically, this allows you to write functions on very general kinds of data, with the guarantee that if it works for a simple example of the data, it will work for any possible input at all.

At least theoretically. It's not easy to find types that are really so general that you can completely prove this – usually, you need a dependently-typed language, and these tend to be rather difficult to use. But writing in a functional style with parametric polymorphism alone already boosts your “security level” enormously – you won't necessarily find all bugs, but you'll have to hide them quite well so the compiler doesn't spot them!

Licensed under: CC-BY-SA with attribution