Question

I've started using googletest to implement tests and stumbled across this quote in the documentation regarding value-parameterized tests:

  • You want to test your code over various inputs (a.k.a. data-driven testing). This feature is easy to abuse, so please exercise your good sense when doing it!

I think I'm indeed "abusing" the system when doing the following and would like to hear your input and opinions on this matter.

Assume we have the following code:

template<typename T>
struct SumMethod {
     T op(T x, T y) { return x + y; }   
};

// optimized function to handle different input array sizes 
// in the most efficient way
template<typename T, class Method> 
T f(T input[], int size) {
    Method m;
    T result = (T) 0;
    if(size <= 128) {
        // use m.op() to compute result etc.
        return result;
    }
    if(size <= 256) {
        // use m.op() to compute result etc.
        return result;
    }
    // ...
}

// naive and correct, but slow alternative implementation of f()
template<typename T, class Method>
T f_alt(T input[], int size);

OK, so with this code it certainly makes sense to test f() (by comparison with f_alt()) with different input array sizes of randomly generated data, to check the correctness of all the branches. On top of that, I have several structs like SumMethod, MultiplyMethod, etc., so I'm also running quite a large number of tests for different types:

typedef MultiplyMethod<int> MultInt;
typedef SumMethod<int> SumInt;
typedef MultiplyMethod<float> MultFlt;
// ...
ASSERT_EQ((f<int, MultInt>(int_in, 128)), (f_alt<int, MultInt>(int_in, 128)));
ASSERT_EQ((f<int, MultInt>(int_in, 256)), (f_alt<int, MultInt>(int_in, 256)));
// ...
ASSERT_EQ((f<int, SumInt>(int_in, 128)), (f_alt<int, SumInt>(int_in, 128)));
ASSERT_EQ((f<int, SumInt>(int_in, 256)), (f_alt<int, SumInt>(int_in, 256)));
// ...
const float ep = 1e-6f;
ASSERT_NEAR((f<float, MultFlt>(flt_in, 128)), (f_alt<float, MultFlt>(flt_in, 128)), ep);
ASSERT_NEAR((f<float, MultFlt>(flt_in, 256)), (f_alt<float, MultFlt>(flt_in, 256)), ep);
// ...

Now of course my question is: does this make any sense and why would this be bad?

In fact, I found a "bug" when running tests with floats, where f() and f_alt() gave different values with SumMethod due to rounding, which I could improve by presorting the input array, etc. From this experience I consider this to actually be somewhat good practice.
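(For what it's worth, here is a minimal sketch of what that presorting step could look like; the helper name and the ascending-magnitude ordering are illustrative assumptions, not part of the original code.)

#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: summing values in ascending order of magnitude tends
// to reduce rounding error, so f() and f_alt() are more likely to agree.
template<typename T>
void presort_for_sum(std::vector<T>& data) {
    std::sort(data.begin(), data.end(),
              [](T a, T b) { return std::abs(a) < std::abs(b); });
}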


Solution

I think the main problem is testing with "randomly generated data". It is not clear from your question whether this data is re-generated each time your test harness is run. If it is, then your test results are not reproducible. If some test fails, it should fail every time you run it, not once in a blue moon, upon some weird random test data combination.

So in my opinion you should pre-generate your test data and keep it as a part of your test suite. You also need to ensure that the dataset is large enough and diverse enough to offer sufficient code coverage.
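As an illustration, here is a sketch of how fixed data could be combined with gtest's value-parameterized tests from the question. It assumes f, f_alt and SumInt are visible; the fixture name, the dataset and the chosen sizes are all placeholders, and the dataset is built deterministically here only to keep the example self-contained (in practice it would be checked in alongside the tests).

#include <gtest/gtest.h>
#include <vector>

// Fixed dataset, part of the test suite: it never changes between runs,
// so any failure is reproducible.
static std::vector<int> FixedIntInput() {
    std::vector<int> v(512);
    for (int i = 0; i < static_cast<int>(v.size()); ++i)
        v[i] = (i * 37) % 101 - 50;  // arbitrary but fixed pattern
    return v;
}

// The test parameter is the array size, chosen to exercise f()'s branches.
class SumIntTest : public ::testing::TestWithParam<int> {};

TEST_P(SumIntTest, MatchesNaiveImplementation) {
    std::vector<int> input = FixedIntInput();
    const int size = GetParam();
    const int expected = f_alt<int, SumInt>(input.data(), size);
    const int actual   = f<int, SumInt>(input.data(), size);
    EXPECT_EQ(expected, actual);
}

// Sizes around the thresholds in f(), plus a trivially small one.
INSTANTIATE_TEST_SUITE_P(BranchSizes, SumIntTest,
                         ::testing::Values(1, 127, 128, 129, 255, 256, 257));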

Moreover, as Ben Voigt commented below, testing with random data alone is not enough. You need to identify the corner cases in your algorithms and test them separately, with data tailored specifically to those cases. However, in my opinion, additional testing with random data is also beneficial when/if you are not sure that you know all your corner cases; you may hit them by chance using random data.
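For the code in the question, one tailored corner case would be float inputs of wildly different magnitudes, which is exactly where the summation started to diverge. A rough sketch, assuming f and f_alt are visible and that SumFlt is a typedef for SumMethod<float> (the tolerance is arbitrary):

#include <gtest/gtest.h>

// Hand-picked data instead of random data: the large and small values cancel,
// so the result is very sensitive to summation order.
TEST(SumFltCornerCases, MixedMagnitudes) {
    float input[4] = {1.0e8f, 1.0f, -1.0e8f, 1.0f};
    const float expected = f_alt<float, SumFlt>(input, 4);
    const float actual   = f<float, SumFlt>(input, 4);
    EXPECT_NEAR(expected, actual, 1e-3f);
}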

OTHER TIPS

The problem is that you can't assert correctness on floats the same way you do ints.

Check correctness within a certain epsilon, i.e. allow a small difference between the calculated and expected values. That's the best you can do, and it applies to all floating-point comparisons.
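In googletest terms that means ASSERT_NEAR/EXPECT_NEAR, as already used in the question, or the ULP-based EXPECT_FLOAT_EQ/EXPECT_DOUBLE_EQ assertions. A minimal sketch:

#include <gtest/gtest.h>

TEST(FloatComparison, Example) {
    float expected = 0.3f;
    float actual = 0.1f + 0.2f;

    // Absolute tolerance chosen by you:
    EXPECT_NEAR(expected, actual, 1e-6f);

    // Or googletest's ULP-based "almost equal" comparison:
    EXPECT_FLOAT_EQ(expected, actual);
}

Note that a fixed absolute epsilon such as 1e-6 only makes sense when the expected magnitudes are small; for larger values a relative tolerance, or the ULP-based assertions, is usually the safer choice.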

I think I'm indeed "abusing" the system when doing the following

Did you think this was bad before you read that article? Can you articulate what's bad about it?

You have to test this functionality sometime. You need data to do it. Where's the abuse?

One of the reasons why it could be bad is that data-driven tests are harder to maintain, and over a longer period of time it becomes easier to introduce bugs into the tests themselves. For details, look here: http://googletesting.blogspot.com/2008/09/tott-data-driven-traps.html

Also, from my point of view, unit tests are most useful when you are doing serious refactoring and are not sure whether you have changed the logic in the wrong way. If your random-data test fails after that kind of change, you are left guessing: is it because of the data or because of your changes?

However, I think it can still be useful (much like stress tests, which are also not 100% reproducible). But if you are using a continuous integration system, I'm not sure whether data-driven tests with a huge amount of randomly generated data should be included in it. I would rather set up a separate job that periodically runs a large batch of random tests at once (so the chance of discovering something bad is quite high on every run), because it is too resource-heavy to be part of the normal test suite.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow