Question

I have a tester who, while testing, will hit an error (fine so far), but then he frequently reports it right away. We (the developers) later find that the tester has not tried to reproduce the issue and, when asked, cannot find a way to make it happen again.

Now, these are still bugs and I don't want to ignore them, but without repro steps I am stuck. Sometimes there is a stack trace (though it is frequently not useful, because this is the Compact Framework and there are no line numbers). When there is one, I can take the stack trace, crack open the code, and start guessing, but that does not lead to testable "fixes".

What do you do in scenarios like this?


Solution

A bug without context is not a bug, it's a fluke. The problem could be your code, it could be a third-party library, it could be the hardware, or it could be solar radiation causing a single bit to flip on its own. If you can't reproduce it with at least some regularity (even if only "it happens once every 10 or 20 times I do X"), it's not much better than your tester telling you "Something somewhere went wrong somehow - fix it".

You may have to explain to your tester that his job is not just to generate input until something breaks. If it were, you could replace him with a random number generator. Part of his job is to identify bugs, which entails identifying how to reproduce them.

Other tips

Ultimately, if neither the developer nor the tester can reproduce the bug, it should be closed, but marked as such.

However, how long it takes you to get to that point is debatable.

Some people would argue that if it's not immediately reproducible then it should be closed forthwith.

I usually strive to get more information from the originator of the problem. There may be something they forgot in the original report. Having a conversation about the required steps can often reveal the missing information.

One final thought: closed as "no repro" doesn't mean fixed. If there is a real problem, it will reveal itself sooner or later, and having all the information you can will help when you can finally reproduce it.

A few more suggestions:

  1. Add logging (and not just a keylogger :}) to your product code. "No repro" bugs may be flukes, but they may be memory or state corruption that only occurs on a dirty system used in unanticipated ways (i.e. like a customer's computer). Logging or tracing information can help you figure out what may have been wrong when the tester found the fluke. (See the logging sketch after this list.)

  2. Scan the rest of the "no repro" bugs in the database (or whatever you use for bug tracking). Often, the flukes clump together in one area of the product. If it looks like one component is at fault, code review the component for possible flakiness, add additional logging to that component - or both.

  3. Take half an hour or so and watch your tester test. Their approach may give you an idea of what went wrong (e.g. "interesting - I didn't know you could get to that dialog that way"). You also may find that they skip a dialog or configuration step unintentionally. It's worth the time investment to get in their head a bit.
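
To make point 1 concrete, here is a minimal sketch of a lightweight tracer in C#. Everything here is illustrative - the class name, the log path, and the message format are assumptions, and on the Compact Framework you only have a subset of the desktop BCL, so keep the logger simple:

    using System;
    using System.IO;

    // Hypothetical lightweight tracer: an append-only text log that records
    // enough context (timestamp, area, message) to reconstruct what was
    // happening when a "no repro" bug shows up.
    public static class Tracer
    {
        private static readonly object Gate = new object();
        private const string LogPath = @"\Program Files\MyApp\trace.log"; // assumed device path

        public static void Write(string area, string message)
        {
            lock (Gate) // serialize writers so concurrent lines don't interleave
            {
                using (StreamWriter w = new StreamWriter(LogPath, true)) // true = append
                {
                    w.WriteLine("{0:yyyy-MM-dd HH:mm:ss} [{1}] {2}",
                                DateTime.Now, area, message);
                }
            }
        }
    }

Sprinkle calls like Tracer.Write("Sync", "upload started") at state transitions and external boundaries; when the tester hits a fluke, the tail of the log stands in for the repro steps you never got.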

I do QA on a large commercial codebase, and this irritating scenario comes up far too often. Usually it is indicative of not having ironclad procedures for building the binary on all the platforms we support. So if the developer builds his own code (which he has to do to debug and fix) and doesn't follow the same build procedure to the letter, there is a chance that system-dependent bugs will appear to magically vanish (or appear). Of course these things usually get closed with "works for me" in the bug database, and if they fail the next time that test is run, the bug can be reopened.

Whenever I suspect a bug may be system-dependent, I try to test it on a variety of platforms and report under which conditions it happens. Often a memory corruption issue only shows up if the corrupted data is of large enough magnitude to cause a crash. Some platforms (hardware and OS combinations) may crash closer to the actual source of the corruption, and this can be very valuable for the poor guy who has to debug it.

The tester needs to add some value beyond just reporting that his system shows a failure. I spend a lot of time screening out false positives - maybe the platform in question was overloaded, or the network had a glitch. And yes, sometimes you get something that is truly affected by random timing events. Hardware bugs are often like this; a prototypical example: if two data requests come back in exactly the same clock cycle, and the hardware logic for handling the potential conflict is faulty, then the bug will only show up intermittently. Likewise with parallel processing: unless by careful design you've constrained the solution to be independent of which processor happened to be faster, you can get bugs that only happen once in a blue moon, and their statistical improbability makes debugging a nightmare.
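
As a software analogue of that kind of timing-dependent failure, here is a minimal C# sketch (the names are illustrative) of a lost-update race. How often it loses increments depends entirely on hardware and scheduling, which is exactly what makes such bugs hard to reproduce on demand:

    using System;
    using System.Threading;

    // Two threads increment a shared counter without synchronization.
    // counter++ is a read-modify-write sequence, so increments can be lost
    // when the threads interleave; whether that happens depends on timing.
    class RaceDemo
    {
        static int counter;

        static void Increment()
        {
            for (int i = 0; i < 100000; i++)
                counter++; // not atomic; Interlocked.Increment(ref counter) would fix it
        }

        static void Main()
        {
            Thread a = new Thread(Increment);
            Thread b = new Thread(Increment);
            a.Start(); b.Start();
            a.Join();  b.Join();

            // Expected 200000; anything smaller means the race fired this run.
            Console.WriteLine("counter = " + counter);
        }
    }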

Also, our code is updated many times a day, so tracking down the exact source code revision at which things went south can be very useful information for the debugging effort. The tester shouldn't be in an adversarial relationship with the debuggers and developers; he is there as part of a team to improve the quality of the product.

There are two sorts of bug that are not reproducible:

1) Those that a tester (or user) has seen once but has either not been able to or not attempted to reproduce.

In these situations you should:

  • Very briefly check the basic course of actions that showed the defect, to confirm that it really isn't reproducible.

  • Speak to the tester / user to see if there is any other information which may help.

  • Cross-reference it with any other defects that might be related, to see if you have enough information to investigate based on multiple instances. This one issue may not give you enough to go on, but coupled with a number of other issues it may suggest that something is not right and is worth investigating.

  • If you still don't have enough to go on then you need to explain to the user / tester that you don't have enough information. Outline to them politely what enough information would look like and why it's needed.

2) Those that cannot be reliably reproduced, but where there is enough evidence (in terms of repeated occurrences) to suggest that the defect does exist. I tend to see these as developer issues: the developer - supported by the tester / user - needs to investigate.

This is likely to be slow and painful. You're probably going to have to walk the code, add more logging, look at the data, and speak to the testers / users in depth, but if there is enough evidence to suggest there is an issue, you need to take ownership of it and do whatever needs to be done to fix it.

It sounds like this happens relatively frequently, which makes me wonder: is it because most of the bugs are genuinely hard to repro, or is there some other reason he's not trying? Do you know why he isn't trying to reproduce the issue? Is it because he doesn't realise how important it is to you? Or is it perhaps that he has other pressures - a test manager who just wants him to get through the allotted tests quickly and throw the bugs over the wall, for example? Or maybe he's just not sure how to go about it?

I'd agree with others that working on better logging is a priority. In the meantime, if you suspect that lack of tester skill/confidence may be an issue, then I really like this article from Danny Faught on bug isolation - you could point him at that for a start.

If the problem turns out to be due to management pressure - you have my sympathies, as that's a tough one to crack, especially if testers & programmers report to different managers and the managers aren't inclined to "help out" another team.

Typically I note that it is not reproducible, but leave it open until that batch of testing or iteration is complete.

If it has not been reproduced by that point it is closed, but can be reopened if it is encountered again.

Stick a keylogger on this tester's workstation!

Well, the first task is to have a reproducible test system. Your tester must have a well-defined process - automated if at all possible.

Have these three conditions:

  • Same binary
  • Same steps
  • Same machine

If the bug sporadically appears with the above 3 conditions, begin to isolate further. Consider each level of the system stack and its configuration.
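
One concrete way to make "same binary, same machine" checkable is to stamp every error report with a build-and-environment fingerprint. A minimal C# sketch - the class name and the exact fields captured are assumptions, and you would extend it with whatever matters on your platform:

    using System;
    using System.Reflection;

    // Capture enough to compare a "works for me" run against the tester's
    // failing run: which binary was running, on which runtime and OS.
    static class Fingerprint
    {
        public static string Capture()
        {
            Assembly asm = Assembly.GetExecutingAssembly();
            return string.Format(
                "build={0}; clr={1}; os={2}",
                asm.GetName().Version,  // version of the binary actually running
                Environment.Version,    // runtime version
                Environment.OSVersion); // platform and OS build
        }
    }

Attach Fingerprint.Capture() to every logged error and every bug report; two reports with different fingerprints were not produced under the same conditions, so they are not the same repro.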

One way to detect memory management errors is to run the program on multiple OSs with multiple compilers. Valgrind can also help.

However, parallel systems are especially liable to induce non-repro bugs: buffer sizes, processing speeds, async I/O, database locks, and variable memory-write interleavings can all generate issues that depend on timing rather than on input.
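
A minimal C# sketch of the async-I/O flavour of this (the scenario and names are invented for illustration): two work items are queued, the code silently assumes the load finishes before the render, and the bug only appears on runs where it doesn't:

    using System;
    using System.Threading;

    // Nothing enforces that the load completes before the render reads it.
    // The load usually wins the race, so most runs look correct; on some
    // runs it loses, and the render shows stale data.
    class OrderingDemo
    {
        static string data = "<stale>";

        static void Main()
        {
            ThreadPool.QueueUserWorkItem(delegate { data = LoadFromDisk(); });
            ThreadPool.QueueUserWorkItem(delegate
            {
                Thread.Sleep(8); // screen preparation, usually slower than the load
                Console.WriteLine("render: " + data);
            });
            Thread.Sleep(500); // crude wait so both items finish (demo only)
        }

        static string LoadFromDisk()
        {
            Thread.Sleep(new Random().Next(0, 10)); // variable I/O latency
            return "<fresh>";
        }
    }

Most runs print "render: <fresh>" and look correct; every so often the load loses the race and the run prints "render: <stale>" - and re-running the same steps gives no guarantee of seeing the failure again.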

First of all, you should have a rigorous testing procedure (though I understand you - in my company, what you describe happens frequently).

Depending on the severity of the bug, you can invest some time in it or (better) ignore it until repro steps are provided.

Licensed under: CC-BY-SA with attribution