Question

I am adding unit tests to some existing code which initially was not tested at all.

I have set up my CI system to fail builds where the code coverage percentage is reduced compared to the previous build - mainly to set a path of continual improvement.

I then encountered an unexpected and annoying (although mathematically correct) situation that can be reduced to the following:

  • Before refactor:
    Method 1: 100 lines of code, 10 covered --> 10% coverage
    Method 2: 20 lines of code, 15 covered --> 75% coverage

    Total: 25 / 120 --> ~20% coverage

  • After refactor:
    Method 1: 100 lines of code, 10 covered --> 10% coverage (untouched)
    Method 2: 5 lines of code, 5 covered --> 100% coverage

    Total: 15 / 105 --> ~14% coverage

So even though IMO the situation was improved, my CI system obviously does not agree.

Admittedly, this is a very esoteric problem, and it will probably disappear once the bulk of the code is covered better, but I would appreciate insights and ideas (or tools/configurations) that might allow me to keep enforcing an "improvement" path for coverage with my CI system.

The environment is Java, Jenkins, JaCoCo.

Solution

You can mitigate the effect to some degree by allowing the relative code coverage to drop when the total number of uncovered lines also drops, or when the total number of lines drops, since these are pretty clear signs of a refactoring which sets a new baseline for your coverage metrics.

In your example, the total number of uncovered lines drops from 95 to 90, and the total number of lines from 120 to 105. That should give you enough confidence that the relative coverage is quite unimportant in this situation. However, if you add new code, the metric still reflects the expectation that the new code should not have lower relative coverage than the code you had before.
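
For illustration, here is a minimal sketch of that gate in Java. The class and method names are made up, and in practice the numbers would come from whatever your CI stores about the previous build; the figures below are simply the ones from the question.

    // Illustrative sketch only; not an existing JaCoCo or Jenkins API.
    public final class CoverageGate {

        // Fail only when relative coverage drops AND neither of the
        // "refactoring signs" (fewer uncovered lines, fewer lines overall) is present.
        public static boolean shouldFail(long prevCovered, long prevTotal,
                                         long currCovered, long currTotal) {
            boolean coverageDropped =
                    (double) currCovered / currTotal < (double) prevCovered / prevTotal;
            boolean fewerUncoveredLines = (currTotal - currCovered) < (prevTotal - prevCovered);
            boolean fewerTotalLines = currTotal < prevTotal;
            return coverageDropped && !fewerUncoveredLines && !fewerTotalLines;
        }

        public static void main(String[] args) {
            // 25/120 before the refactoring, 15/105 after it:
            System.out.println(shouldFail(25, 120, 15, 105)); // false -> build passes
        }
    }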

Side note: be aware that none of these metrics tells you whether the tests cover the most sensible parts of the code base.

OTHER TIPS

The problem I see here is that you have made code coverage a trigger for build failure. I do believe that code coverage should be reviewed routinely, but as you have experienced, you can have temporary reductions in your pursuit of higher code coverage.

In general, build failures should be predictable. The following make good build failure events:

  • Code will not compile
  • Tests will not run
  • Tests fail
  • Post-build packaging fails (e.g. can't build containers)
  • Packages cannot be published to repositories

All of these are pass/fail, quantitative measures: they work (binary value 1) or they don't (binary value 0).

Code quality measures should be monitored because they are qualitative. NOTE: percentages are a qualitative measure even though they have a number associated with them. The following are qualitative measures:

  • Cyclomatic complexity (a number associated with a qualitative concept, i.e. the number has a qualitative meaning)
  • Maintainability index (a number associated with a qualitative concept, i.e. the number has a qualitative meaning)
  • Percentage covered lines/branches/methods (qualitative summary of quantitative results)
  • Static analysis results (no numbers involved)

As with any trend, when you look over past releases you can find momentary reductions while the overall trend improves or stays the same (100% remains 100%). If the long term trend is toward less tested or maintainable code then you have a team problem you need to address. If the long term trend is toward higher quality code based on your measurements, then the team is doing their jobs.

Have you considered not using code coverage metrics?

I'm not going to argue that code coverage isn't something that you should look at. It absolutely is. It's good to keep track of what was covered before a build and after a build. It's also good to make sure that you're providing coverage over new and modified lines of code in a change (and, depending on your environment, there may be tools that can facilitate this).

But as you're seeing right now, you can refactor and end up removing lines that were covered, such that the percentage of lines covered goes down. I'd agree with your reasoning - the logical coverage has not changed. You've also probably simplified other aspects of your code (I'd be curious how various complexity measures have changed before and after your refactoring, for example). But the math is correct - your percentage of lines covered has, in fact, gone down.

Additionally, code coverage says absolutely nothing about the effectiveness or quality of the tests. It also says nothing about the risk behind that part of the code. It's relatively easy to write tests that cover trivial lines or check the most simple logic while not giving much confidence in the quality of the system.

This is called Simpson’s paradox, and it is a well-known statistical issue with your approach.
You could even construct cases where you refactor and afterwards every single method has a higher coverage, but the overall coverage still went down.

You will need to find other ways to catch 'regressions' of the type you wanted to catch, not with overall percentages (although I liked the approach when reading it).

While all the answers say not to use code coverage as a quality metric, none really answered the question: what to do with reduced coverage?
The answer is quite simple: use absolute numbers, not percentages. What matters more than the coverage percentage is the number of untested functions and the number of untested branches. It would also be great to get a list of the untested functions and branches.
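
As a sketch of how you could get those absolute numbers out of JaCoCo: the XML report can be walked with the standard DOM API. The class below is only an illustration and assumes the default jacoco.xml layout (report > package > class > method > counter); the class name and output format are made up.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    // Illustrative sketch: prints absolute numbers instead of percentages.
    public final class UncoveredReport {

        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The report references a DTD; don't try to download it.
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder().parse(new File(args[0]));

            int untestedMethods = 0;
            long missedBranches = 0;

            NodeList methods = doc.getElementsByTagName("method");
            for (int i = 0; i < methods.getLength(); i++) {
                Element method = (Element) methods.item(i);
                String className = ((Element) method.getParentNode()).getAttribute("name");
                NodeList counters = method.getElementsByTagName("counter");
                for (int j = 0; j < counters.getLength(); j++) {
                    Element counter = (Element) counters.item(j);
                    int missed = Integer.parseInt(counter.getAttribute("missed"));
                    if ("METHOD".equals(counter.getAttribute("type")) && missed > 0) {
                        untestedMethods++;
                        System.out.println("untested: " + className + "#" + method.getAttribute("name"));
                    }
                    if ("BRANCH".equals(counter.getAttribute("type"))) {
                        missedBranches += missed;
                    }
                }
            }
            System.out.println("untested methods: " + untestedMethods);
            System.out.println("missed branches:  " + missedBranches);
        }
    }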

One option that might help with this is to switch from line coverage to branch coverage. I say 'might' help because you can run into a similar situation with uncovered branches if you reduce the number of branches in a function.

Overall, though, branch coverage is a more meaningful metric than line coverage. And I'll hazard a guess that when you refactor a long method into a short one, you usually reduce lines of code more than you reduce branches. Of course, if it's really nasty code, then you might end up pruning a lot of branches. Even in that case, I think you are still better off counting branches.
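
To make the difference concrete, here is a small made-up example where a single test gives 100% line coverage but only 50% branch coverage. The class, method, and values are purely illustrative.

    public final class Discounts {

        // Orders of 100 or more get a 10% discount.
        public static double priceAfterDiscount(double price) {
            double result = price;
            if (price >= 100) {      // one line, two branches
                result = price * 0.9;
            }
            return result;
        }

        // A single test such as
        //     assertEquals(135.0, Discounts.priceAfterDiscount(150.0), 0.001);
        // executes every line above, so line coverage is 100%, but the
        // "price < 100" branch is never taken, so branch coverage is only 50%.
    }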

What you are doing is not wrong. While code coverage is not an amazing metric, it is a useful one. It does not tell you that your code is perfectly tested. But not having some level of standard you expect from your tests is also a bad idea.

One way to solve this problem is to simply change the limit if you believe the change is justified. The whole point of this kind of hard limit is to notify you quickly when something bad happens. In many cases, the reduction will be unintentional and it needs to be fixed by introducing more and better tests. But if you identify the failure as a false positive, and you can prove to yourself and your team that the reduction is expected, then it is perfectly fine to lower the limit.

Another way is to turn that reduction into a "soft" failure. For example, Jenkins can mark a build as "unstable". In an unstable build, the final artifacts are still usable, but there might be some degradation of functionality. That gives you time to decide what to do: whether tests need to be expanded or limits lowered. This could also be implemented using two different builds: one that produces the build artifacts with looser checks and limits, where a failure means not even the artifacts are usable, and a second one with much stricter limits that fails if any of your quality metrics is broken.
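
A rough sketch of that two-limit idea in Java (the class name, result values, and thresholds are all made up; how you feed the verdict back to Jenkins depends on your setup):

    // Illustrative sketch only; names and thresholds are hypothetical.
    public final class CoverageVerdict {

        public enum Result { SUCCESS, UNSTABLE, FAILURE }

        // Hard limit -> fail the build outright; soft limit -> mark it unstable,
        // so the artifacts stay usable while you decide whether to add tests
        // or lower the bar.
        public static Result evaluate(double coveragePercent,
                                      double unstableBelow,   // e.g. the previous build's percentage
                                      double failBelow) {     // e.g. an agreed hard floor
            if (coveragePercent < failBelow) {
                return Result.FAILURE;
            }
            if (coveragePercent < unstableBelow) {
                return Result.UNSTABLE;
            }
            return Result.SUCCESS;
        }
    }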

In addition to the answer from Berin Loritsch, I would like to highlight a technique called Mutation Testing (see https://pitest.org/ for a Java tool).

Mutation testing provides metrics on which lines of code could be mutated to behave differently without causing any unit test to fail.

In your case, if the refactoring was done well, you should see the mutation testing metrics improve (when expressed as a percentage of unit-test line coverage).

This link describes a couple of (java-related) mutation testing tools: https://pitest.org/java_mutation_testing_systems/
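
As a small, made-up illustration of what a surviving mutant looks like: PIT's conditionals-boundary mutator can change "<=" into "<", and a weak test will not notice. The class and values below are hypothetical.

    public final class Limits {

        public static boolean withinLimit(int value, int limit) {
            return value <= limit;   // mutant: value < limit
        }

        // A test like assertTrue(withinLimit(3, 10)) executes this line, so line
        // coverage is 100%, but it does not kill the mutant. Adding
        // assertTrue(withinLimit(10, 10)) kills it - that is the improvement
        // mutation testing measures and plain coverage does not.
    }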

Before refactor: 95 untested lines.

After refactor: 90 untested lines.

You should always consider how many untested lines you have!

The code coverage percentage is a good measure of this only if your code always grows, which is unfortunate.

Another way to put it: whenever you manage to reduce the number of lines in your code without breaking any tests, you have every right to ignore the code coverage percentage!

The problem is: You have one huge function with very little code coverage, so in order to get the percentage up, you need to add covered lines of code. If you add 20 lines / 10 covered that improves the percentage more than 5 lines / 5 covered. Even though the second is most likely better (unless speed was important, and your first function was "if (easy case) { handle easy case quickly } else { handle complex case slowly }" and the second was "handle all cases slowly" ).

So we conclude: you have a measurable thing where improving the measurement doesn't improve the quality of your code. That means you should only use it as a target when common sense is applied at the same time. From a book about "motivation": the most important thing is to motivate people in such a way that following the motivation produces better results.

Imagine there's a function with a line "int x = 100;". To improve code coverage I change it to "int x = 0; x++; x++; (repeat 100 times)". Instead of one covered line I have 101 covered lines. Total code coverage goes up, and I get a bonus.
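
Spelled out as code (illustration only; both methods do the same thing, the names are made up):

    public final class CoverageGaming {

        static int straightforward() {
            return 100;              // one covered line
        }

        static int inflated() {
            int x = 0;
            x++;                     // every one of these increments counts as
            x++;                     // another covered line when any test runs it
            // ... imagine 98 more of these ...
            return x + 98;           // same result, much higher "coverage"
        }
    }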

Here's a solution that doesn't involve changing the definition of the code coverage metric or requesting / giving yourself an exemption from the policy. That said, the policy as stated is arguably flawed, and you might be better off counting the total number of uncovered lines or enforcing a per-file lower bound (e.g. 80%) instead of requiring monotonicity.

Improve the code coverage of Method 1 prior to or at the same time as refactoring Method 2.

Here's the current situation stated in the question.

                    Old World                   New World

           covered  total   %           covered  total   %
 Method 1       10    100  10                10    100  10
 Method 2       15     20  75                 5      5 100
Aggregate       25    120  20.8              15    105  14.3

In the New World, the total number of covered lines must be at least 22 for the metric not to decrease. (The ceiling of 105 * 25 / 120 is 22.)

So, all you need to do is add tests that cover 7 more lines of the file where Method 1 and Method 2 live.
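
The arithmetic, spelled out as a tiny sketch (values taken straight from the table above; the class name is made up):

    public final class CoverageMath {
        public static void main(String[] args) {
            long oldCovered = 25, oldTotal = 120, newCovered = 15, newTotal = 105;
            // Smallest covered-line count in the New World that keeps the old ratio:
            long needed = (oldCovered * newTotal + oldTotal - 1) / oldTotal; // ceil(25 * 105 / 120)
            System.out.println(needed);               // 22
            System.out.println(needed - newCovered);  // 7 additional lines to cover
        }
    }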

This solution does have the drawback of including an unrelated change in the same commit in order to satisfy the code coverage monotonicity rule. It still might be worth doing if this one change to Method 2 is not important enough to justify changing the policy. It also might be worth doing if you want to get this change in first before you change the policy.

Licensed under: CC-BY-SA with attribution