Question

Very often you will get or submit bug reports for defects that are 'not reproducible'. They may be reproducible on your computer or software project, but not on a vendor's system. Or the user supplies steps to reproduce, but you can't see the defect locally. There are many variations on this scenario, of course, so to simplify, I guess what I'm trying to learn is:

What is your company's policy towards 'not reproducible' bugs? Shelve them, close them, ignore them? I occasionally see intermittent, non-reproducible bugs in third-party frameworks, and these are pretty much always closed instantly by the vendor... but they are real bugs.

Have you found any techniques that help in fixing these types of bugs? Usually what I do is get a system info report and steps to reproduce from the user, then search on keywords and try to spot any sort of pattern.


Solution

  • Verify the steps used to produce the error

Oftentimes the people reporting the error, or the people reproducing the error, will do something wrong and not end up in the same state, even if they think they are. Try to walk it through with the reporting party. I've had a user INSIST that the admin privileges were not appearing correctly. I tried reproducing the error and was unable to. When we walked it through together, it turned out he was logging in as a regular user in that case.

  • Verify the system/environment used to produce the error

I've found many 'irreproducible' bugs and only later discovered that they ARE reproducible on Mac OS (10.4) running version X of Safari. And this doesn't apply only to browsers and rendering; it can apply to anything: the other applications currently running, whether the user is connected over RDP or working locally, whether they're an admin or a regular user, etc... Make certain you get your environment as close to theirs as possible before calling it irreproducible.

  • Gather Screenshots and Logs

Once you have verified that the user is doing everything correctly and still getting a bug, and that you're doing exactly what they do, and you are NOT getting the bug, then it's time to see what you can actually do about it. Screenshots and logs are critical. You want to know exactly what it looks like, and exactly what was going on at the time.

It is possible that the logs contain information about the exact scenario, and once you can reproduce that scenario on your system, you might be able to coax the error out of hiding. (A small log/environment collection script is sketched after this list.)

Screenshots also help with this, because you might discover that "X piece has loaded correctly, but it shouldn't have because it is dependent on Y", and that might give you a hint. Even if the user can describe what they were doing, a screenshot can help even more.

  • Gather step-by-step description from the user

It's very common to blame the users and not trust anything they say (because they call a 'usercontrol' a 'thingy'), but even though they might not know the names of what they're seeing, they will still be able to describe some of the behaviour they are seeing. This includes minor errors that may have occurred a few minutes BEFORE the real error occurred, or possibly slowness in things that are usually fast. All of these can be clues to help you narrow down which aspect is causing the error on their machine and not yours.

  • Try Alternate Approaches to produce the error

If all else fails, try looking at the section of code that is causing problems, and possibly refactor or use a workaround. If it is possible for you to create a scenario where you start with half the information already there (hopefully in UAT), ask the user to try that approach and see if the error still occurs. Do your best to create alternate but similar approaches that put the error in a different light so that you can examine it better.
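
For the environment-matching and log-gathering steps above, a small self-serve collection script can save a lot of back-and-forth with the user. A minimal sketch in Python (the log location and the fields collected are assumptions; adapt them to your application):

    import json
    import platform
    import sys
    import zipfile
    from pathlib import Path

    # Assumed location of the application's log file -- adjust for your app.
    APP_LOG = Path.home() / ".myapp" / "app.log"

    def collect_bug_report(output="bug_report.zip"):
        """Bundle environment details and logs so the user can attach a single file."""
        env = {
            "os": platform.platform(),
            "machine": platform.machine(),
            "python": sys.version,
            "hostname": platform.node(),
        }
        with zipfile.ZipFile(output, "w") as bundle:
            bundle.writestr("environment.json", json.dumps(env, indent=2))
            if APP_LOG.exists():
                bundle.write(APP_LOG, arcname="app.log")
        return output

    if __name__ == "__main__":
        print("Report written to", collect_bug_report())

The user runs it and emails back one archive, which covers both the 'verify the environment' and 'gather logs' points in a single step.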

OTHER TIPS

Short answer: Conduct a detailed code review on the suspected faulty code, with the aim of fixing any theoretical bugs, and adding code to monitor and log any future faults.

Long answer: To give a real-world example from the embedded systems world: we make industrial equipment containing custom electronics, with embedded software running on it.

A customer reported that a number of devices on a single site were experiencing the same fault at random intervals. Their symptoms were the same in each case, but they couldn't identify an obvious cause.

Obviously our first step was to try and reproduce the fault in the same device in our lab, but we were unable to do this.

So, instead, we circulated the suspected faulty code within the department, to try and get as many ideas and suggestions as possible. We then held a number of code review meetings to discuss these ideas, and determine a theory which: (a) explained the most likely cause of the faults observed in the field; (b) explained why we were unable to reproduce it; and (c) led to improvements we could make to the code to prevent the fault happening in the future.

In addition to the (theoretical) bug fixes, we also added monitoring and logging code, so if the fault were to occur again, we could extract useful data from the device in question.

To the best of my knowledge, this improved software was subsequently deployed on site, and appears to have been successful.
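
The monitoring-and-logging part of that approach looks roughly like this, sketched in Python for brevity rather than the embedded firmware of the original case; suspect_operation and the log location are placeholders:

    import logging
    import time

    # Persistent log that can be pulled off the device if the fault recurs.
    logging.basicConfig(filename="fault_monitor.log", level=logging.INFO)

    def monitored(func):
        """Record timing and any failure of the suspected code path."""
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            except Exception:
                logging.exception("Fault in %s after %.3fs", func.__name__,
                                  time.monotonic() - start)
                raise
        return wrapper

    @monitored
    def suspect_operation():
        ...  # the code path the review identified as the likely cause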

Error-reporting, log files, and stern demands to "Contact me immediately if this happens again."

If it happens in one context, and not in another, we try to enumerate the differences between the two, and eliminate them.

Sometimes this works (e.g. other hardware, dual core vs. hyperthreading, laptop-disk vs. workstation disk, ...).

Sometimes it doesn't. If it's possible, we may start remote-debugging. If that doesn't help, we may try to get our hands on the customer's system.
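
A quick way to enumerate those differences is to capture the same environment fingerprint on both systems and diff them; a rough sketch (the fields collected are only examples):

    import platform

    def fingerprint():
        """Collect a comparable snapshot of one environment."""
        return {
            "os": platform.platform(),
            "machine": platform.machine(),
            "processor": platform.processor(),
            "python": platform.python_version(),
        }

    def diff(ours, theirs):
        """Return only the keys where the two environments disagree."""
        return {key: (ours.get(key), theirs.get(key))
                for key in set(ours) | set(theirs)
                if ours.get(key) != theirs.get(key)}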

But of course, we don't write too many bugs in the first place :)

resolved "sterile" and "spooky"

We have two closed bug categories for this situation.

sterile - cannot reproduce.

spooky - it's acknowledged there is a problem, but it just appears intermittently, isn't quite understandable, and gives everyone a faint case of the creeps.

Well, you try your best to reproduce it, and if you can't, you take a long think and consider how such a problem might arise. If you still have no idea, then there's not much you can do about it.

I add logging to the exception handling code throughout the program. You need a method to collect the logs (users can email them, etc.)
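
A low-effort version of this is a global hook that writes any uncaught exception to a log file the user can send back; a minimal sketch in Python (the log path is an assumption):

    import logging
    import sys

    logging.basicConfig(filename="crash.log", level=logging.ERROR,
                        format="%(asctime)s %(levelname)s %(message)s")

    def log_uncaught(exc_type, exc_value, exc_tb):
        """Record uncaught exceptions before the default handler runs."""
        logging.error("Uncaught exception", exc_info=(exc_type, exc_value, exc_tb))
        sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = log_uncaught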

Preemptive checks for code versions and sane environments are a good thing too. With the ease of software updates these days, the combination of code and environment the user is running has almost certainly not been tested; it didn't exist when you released your code.
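
One cheap preemptive check is to log the exact version and environment at startup, and warn when the combination falls outside what was tested. A rough sketch (the version constants are illustrative):

    import logging
    import platform
    import sys

    APP_VERSION = "2.3.1"    # illustrative build/version string
    MIN_PYTHON = (3, 8)      # illustrative minimum tested runtime

    def startup_checks():
        """Record what we are actually running on, and flag untested setups."""
        logging.info("App %s on %s, Python %s",
                     APP_VERSION, platform.platform(), platform.python_version())
        if sys.version_info < MIN_PYTHON:
            logging.warning("Runtime %s is older than the tested minimum %s",
                            platform.python_version(),
                            ".".join(map(str, MIN_PYTHON)))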

With a web project I'm developing at the moment I'm doing something very similar to your technique. I'm building a page that I can direct users to in order to collect information such as their browser version and operating system. I'll also be collecting the app's registry info so I can have a look at what they've been doing.
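
On the server side, such a collection page can be as simple as logging what the browser reports about itself. A minimal sketch, assuming a Python/Flask backend (the route name and fields are illustrative):

    import logging
    from flask import Flask, request

    app = Flask(__name__)
    logging.basicConfig(filename="client_info.log", level=logging.INFO)

    @app.route("/diagnostics")
    def diagnostics():
        """Record the client's self-reported details for later comparison."""
        logging.info("client=%s ua=%s langs=%s",
                     request.remote_addr,
                     request.headers.get("User-Agent"),
                     request.headers.get("Accept-Language"))
        return "Thanks - your browser details have been recorded."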

This is a very real problem. I can only speak for web development, but I find users are rarely able to give me the basic information I would need to look into the issue. I suspect it's entirely possible to do something similar with other kinds of development. My plan is to keep working on this system to make it more and more useful.

But my policy is never to close a bug simply because I can't reproduce it, no matter how annoying it may be. And then there are the cases where it's not a bug, but the user has simply gotten confused. Which is a different type of bug, I guess, but just as important.

You talk about problems that are reproducible but only on some systems. These are easy to handle:

First step: Using some sort of remote-access software, have the customer show you what they do to reproduce the problem on the system that has it. If this fails, then close it.

Second step: Try to reproduce the problem on another system. If this fails, make an exact copy of the customer's system.

Third step: If it still fails, you have no option but to try to debug it on the customer's system.

Once you can reproduce it, you can fix it. Doesn't matter on what system.

The tricky ones are the truly non-reproducible issues, that is, things that happen only intermittently. For those I'll have to chime in with the reports, logs and stern demands attitude. :)

Sometimes the bug is not reproducible even in a pre-production environment that is the exact duplicate of the production environment. Concurrency issues are notorious for this.

The reason can be as simple as the Heisenberg effect, i.e. observation changes behaviour. Another reason can be that the chances of hitting the combination of events that triggers the bug are very small.

Sometimes you are lucky and you have audit logs that you can playback, greatly increasing the chances of recreating the issue. You can also stress the environment with high volumes of transactions. This effectively compresses time so that if the bug occurs say once a week, you may be able to reliably reproduce it in 1 day if you stress the system to 7 X the production load.
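
A crude way to apply that time compression is a harness that hammers the suspect operation from many threads and checks an invariant after each batch; a sketch where do_transaction and check_invariant are placeholders for your own system:

    from concurrent.futures import ThreadPoolExecutor

    def do_transaction(i):
        """Placeholder for the operation suspected of racing."""
        ...

    def check_invariant():
        """Placeholder: return False if the system is in an inconsistent state."""
        return True

    def stress(rounds=1000, workers=32, per_round=200):
        for r in range(rounds):
            with ThreadPoolExecutor(max_workers=workers) as pool:
                list(pool.map(do_transaction, range(per_round)))
            if not check_invariant():
                raise AssertionError("Invariant broken after round %d" % r)

    if __name__ == "__main__":
        stress()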

The last resort is whitebox testing where you go through the code line by line writing unit tests as you go.

Logging is your friend!

Generally what happens when we discover a bug that we can't reproduce is that we either ask the customer to turn on more logging (if it's available), or we release a version with extra logging added around the area we are interested in. Generally speaking the logging we have is excellent and can be very verbose, so releasing versions with extra logging doesn't happen often.
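
Making verbosity switchable in the field keeps the 'turn on more logging' step cheap; a minimal sketch using an environment variable (the variable name and logger names are assumptions):

    import logging
    import os

    # Support can ask the customer to set MYAPP_LOG_LEVEL=DEBUG instead of
    # waiting for a special build with extra logging compiled in.
    level_name = os.environ.get("MYAPP_LOG_LEVEL", "INFO")
    logging.basicConfig(
        filename="myapp.log",
        level=getattr(logging, level_name.upper(), logging.INFO),
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )

    log = logging.getLogger("myapp.payments")
    log.debug("Verbose detail, only emitted when DEBUG is enabled")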

You should also consider the use of memory dumps (which IMO also falls under the umbrella of logging). Producing a minidump is so quick that it can usually be done on production servers, even under load (as long as the number of dumps being produced is low).

The way I see it: being able to reproduce a problem is nice because it gives you an environment where you can debug, experiment and play around more freely, but reproducing a bug is by no means essential to debugging it! If the bug only happens on someone else's system, you still need to diagnose and debug the problem in the same way; it's just that this time you need to be cleverer about how you do it.

It is important to categorize such bugs (rarely reproducible) and act on them differently than bugs that are frequently reproducible based on specific user actions.

  1. Clear issue description along with steps to reproduce and observed behavior: Unambiguous reporting helps the entire team understand the issue and eliminates incorrect conclusions. For example, a user reporting a blank screen is different from an HMI freeze on a user action. The sequence of steps and the approximate timing of user actions are also important. Did the user select the option immediately after the screen transition, or wait a few minutes? An interesting bug concerning timing is the car that was allergic to vanilla ice cream, which baffled automotive engineers.

  2. System config and startup parameters: Sometimes even the hardware configuration and application software version (including driver and firmware versions) can make the difference. A mismatch of version or configuration can result in issues that are difficult to reproduce in other setups, so these are essential details to capture. Most bug-reporting tools make these details mandatory fields when logging an issue.

  3. Extensive Logging: This depends on the logging facilities available in the project concerned. While working with embedded Linux systems, we provide not only general diagnostic logs but also system-level logs such as dmesg or top output (a small collection sketch follows this list). You may find that the problem is not in the code flow at all but in abnormal memory or CPU usage. Identify the type of the issue and report the relevant logs for investigation.

  4. Code Reviews and Walk-throughs: Dev teams cannot wait forever to reproduce these issues at their end before taking action. The bug report and available logs should be investigated, and various possibilities identified on that basis from the design and code. If required, they should prepare a hotfix for the likely root causes and circulate it among the teams, including the tester who identified the issue, to see if the bug is still reproducible with it.

  5. Don't close these issues based on observation by a single tester/team after a fix is identified and checked in: Perhaps the most important part is the approach followed to close these issues. Once a fix has been checked in, all testing/validation teams at different locations should be informed so they can run intensive tests and identify any regressions. Only when all (or practically most) of them report it as non-reproducible should a closure assessment be made by senior management.
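
For point 3 above, capturing the system-level logs alongside the application log can be scripted so it is collected the same way every time. A rough sketch for a Linux target (the exact commands and options are illustrative):

    import subprocess
    from pathlib import Path

    def capture_system_logs(out_dir="bug_logs"):
        """Snapshot dmesg and top output next to the application logs."""
        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        commands = {
            "dmesg.txt": ["dmesg"],
            "top.txt": ["top", "-b", "-n", "1"],  # one batch-mode snapshot
        }
        for name, cmd in commands.items():
            with open(out / name, "w") as f:
                subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT, check=False)
        return out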

If it is not reproducible, get logs and screenshots of the exact steps to reproduce.

There's a nice new feature in Windows 7 that allows the user to record what they're doing and then send a report - it comes through as a document with screenshots of every stage. Hopefully it'll help in the cases where the user interacts with the application in an order the developer wouldn't think of. I've seen plenty of bugs where it's simply that the developer's logical way of using the app doesn't match how end users actually use it... resulting in lots of subtle errors.

The accepted answer is the best general approach. At a high level, it's worth weighing the importance of fixing the bug against what you could add or enhance as a feature that would benefit the user. Could a 'non-reproducible' bug take two days to fix? Could a feature be added in that time that gives users more benefit than the bug fix? Maybe the users would prefer the feature. As a developer I've at times been fixated on imperfections I can see, and then when users are asked for feedback none of them mention those bugs, but they do point out a feature the software is missing that they really want!

Sometimes, simple persistence in attempting to reproduce the bug whilst debugging can be the most effective approach. For this strategy to work, the bug needs to be 'intermittent' rather than completely 'non-reproducible'. If you can repeat a bug even one time in 10, and you have ideas about the most likely place it's occurring, you can place breakpoints at those points and then doggedly attempt to repeat the bug and see exactly what's going on. I've found this to be more effective than logging in one or two cases (although logging would be my first go-to in general).
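
One way to apply that persistence without doing it by hand is to loop the repro and drop into the debugger the moment it fails; a small sketch using Python's pdb, where reproduce() stands in for your repro steps:

    import pdb
    import sys

    def reproduce():
        """Placeholder: perform the steps that trigger the bug roughly 1 time in 10."""
        ...

    for attempt in range(1000):
        try:
            reproduce()
        except Exception:
            print("Failure on attempt %d, dropping into the debugger" % attempt)
            pdb.post_mortem(sys.exc_info()[2])
            break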

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow