Question

I work for a software product company. We have large enterprise customers who implement our product, and we provide support to them. For example, if there is a defect, we provide patches, etc. In other words, it is a fairly typical setup.

Recently, a ticket was opened and assigned to me regarding an exception a customer found in a log file; it has to do with concurrent database access in a clustered implementation of our product. So this customer's specific configuration may well be critical to the occurrence of this bug. All we got from the customer was their log file.

The approach I proposed to my team was to attempt to reproduce the bug in a configuration similar to the customer's and obtain a comparable log. However, they disagree with my approach, saying that I don't need to reproduce the bug, as doing so is overly time-consuming and would require simulating a server cluster on VMs. My team suggests I simply "follow the code" to see where the thread- and/or transaction-unsafe code is and put in the change working off a simple local development setup, which is not a clustered implementation like the environment where the bug occurred.

To me, working out of an abstract blueprint (program code) rather than a tangible, visible manifestation (runtime reproduction) seems difficult, so I wanted to ask a general question:

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

Or:

If I am a senior developer, should I be able to read multithreaded code and create a mental picture of what it does in all use case scenarios, rather than needing to run the application, test different use case scenarios hands-on, and step through the code line by line? Or am I a poor developer for demanding that kind of work environment?

Is debugging for sissies?

In my opinion, any fix submitted in response to an incident ticket should be tested in an environment simulated to be as close to the original environment as possible. How else can you know that it will really remedy the issue? It is like releasing a new model of a vehicle without crash testing it with a dummy to demonstrate that the air bags indeed work.

Last but not least, if you agree with me:

How should I talk with my team to convince them that my approach is reasonable, conservative and more bulletproof?


Solution

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

You should give it your best effort. I know that sometimes there are conditions and environments that are so complex they can't be reproduced exactly, but you should certainly try if you can.

If you never reproduced the bug and saw it for yourself, how can you be 100% certain that you really fixed it? Maybe your proposed fix introduces some other subtle bug that won't manifest unless you actually try to reproduce the original defect.

If I am a senior developer, should I be able to read (multithreaded) code and create a mental picture of what it does in all use case scenarios, rather than needing to run the application, test different use case scenarios hands-on, and step through the code line by line? Or am I a poor developer for demanding that kind of work environment? Is debugging for sissies?

I would not trust someone who runs the code "in their head" if that's their only approach. It's a good place to start, but reproducing the bug, fixing it, and then demonstrating that the solution prevents it from recurring is where it should end.

How should I talk with my team to convince them that my approach is reasonable, conservative and more bulletproof?

Because if they never reproduced the bug, they can't know for certain that it is fixed. And if the customer comes back and complains that the bug is still there, that is not a Good Thing. After all, they are paying you big $$$ (I assume) to deal with this problem.

If you fail to fix the problem properly, you've broken faith with the customer (to some degree) and if there are competitors in your market, they may not remain your customer.

OTHER TIPS

How do they intend to verify that the bug in question was fixed? Do they want to ship untested code to the user and let them figure it out? Any test setup that was never shown to reproduce the error can't be relied upon to show absence of the error. You certainly don't need to reproduce the entire client environment, but you do need enough to reproduce the error.

I don't think it is unreasonable to attempt to reproduce every bug before fixing it. However, if you attempt to reproduce it and you can't, then it becomes more of a business decision as to whether or not blind patches are a good idea.

Ideally, you want to be able to reproduce each bug so that, at the very least, you can test that it's been fixed.

But... That may not always be feasible or even physically possible. Especially with 'enterprise'-type software where each installation is unique. There's also the cost/benefit evaluation. A couple of hours of looking over code and making a few educated guesses about a non-critical problem may cost far less than having a technical support team spend weeks trying to set up and duplicate a customer's environment exactly in hopes of being able to duplicate the problem. Back when I worked in the 'Enterprise' world, we would often just fly coders out and have them fix bugs on site, because there was no way to duplicate the customer's setup.

So, duplicate when you can, but if you can't, then harness your knowledge of the system, and try to identify the culprit in code.

I don't think you should make reproducing the error a requirement for looking at the bug. There are, as you've mentioned, several ways to debug the issue - and you should use all of them. You should count yourself lucky that they were able to give you a log file! If you or someone at your company is able to reproduce the bug, great! If not, you should still attempt to parse the logs and find the circumstances under which the error occurred. It may be possible, as your colleagues suggested, to read the code, figure out under what conditions the bug could happen, and then attempt to recreate the scenario yourself.

However, don't release the actual fix untested. Any change you make should go through the standard dev, QA testing, and integration testing routine. It may prove difficult to test - you mentioned multithreaded code, which is notoriously hard to debug. This is where I agree with your approach to create a test configuration or environment. If you have found a problem in the code, you should find it much simpler to create the environment, reproduce the issue, and test the fix.

To me, this is less a debugging issue and more of a customer service issue. You've received a bug report from a customer; you have a responsibility to do due diligence to find their issue and fix it.

In my opinion ... as the decision maker, you must be able to justify your position. If the goal of the 3rd line support department is to fix bugs in the shortest time frame and with acceptable effort for the client, then any approach must comply with that goal. Furthermore, if the approach can be proven to give the fastest expected results, then there should be no problem convincing the team.

Having worked in support, I have always reasonably expected the client to be able to give some "script" of the actions they performed to consistently reproduce the bug, and if not consistently, then at least candidate examples that have produced the bug.

If I were new to the system and had no background with the code, my first steps would be to attempt to identify the possible sources of the error. It may be that the logging is insufficient to identify a candidate piece of code. Depending on the client, I might be inclined to give them a debug build so that they can send back log files that give further clues as to the location of the offending code.

If I am able to quickly identify the code block, then visually mapping the flow may be enough to spot the bug. If not, then unit-test-based simulation may be enough. It may also be that setting up an environment replicating the client's takes less time, especially if the problem reproduces readily.

I think you may find that your approach should be a combination of the proposed solutions and that knowing when to quit one and move on to the next is key to getting the job done efficiently.

I am quite sure the team will support the notion that, if there is a chance their approach will find the bug more quickly, giving them a suitable time frame to prove it will not add much to the overall time it takes to fix the bug, whichever route you take.

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

I say yes, with some caveats.

  • I think it's okay to read through the code and try to find places that look like they may be problematic. Create a patch and send it to the client to see if that resolves the problem. If this approach continues to fail, then you may need to investigate other options. Just remember that while you might be addressing a bug, it might not be the bug that was reported.
  • If you can't reproduce it within reason, and you can't find any red flags in the code, then it may require some closer coordination with the customer. I've flown out to customer sites before to do on site debugging. It's not the best dev environment, but sometimes if the problem is environmental, then finding the exact cause is going to be easiest when you can reproduce it consistently.

I've been on the customer side of the table in this scenario. I was working at a US government office that used an incredibly large Oracle database cluster (several terabytes of data and processing millions of records a day).

We ran into a strange problem that was very easy for us to reproduce. We reported the bug to Oracle and went back and forth with them for weeks, sending them logs. They said they weren't able to reproduce the problem, but sent us a few patches they hoped might address it. None of them did.

They eventually flew out a couple of developers to our location to debug the issue on site. And that was when the root cause of the bug was found and a later patch correctly addressed the problem.

If you're not positive about the problem, you can't be positive about the solution. Knowing how to reproduce the problem reliably in at least one test case situation allows you to prove that you know how to cause the error, and therefore also allows you to prove on the flip side that the problem has been solved, due to the subsequent lack of error in the same test case after applying the fix.

That said, race conditions, concurrency issues and other "non-deterministic" bugs are among the hardest for a developer to pin down in this manner, because they occur infrequently, on a system with higher load and more complexity than any one developer's copy of the program, and they disappear when the task is re-run on the same system at a later time.

More often than not, what originally looks like a random bug ends up having a deterministic cause that results in the bug being deterministically reproducible once you know how. The ones that defy this, the true Heisenbugs (seemingly random bugs that disappear when attempting to test for them in a sterile, monitored environment), are 99.9% timing-related, and once you understand that, your way forward becomes more clear; scan for things that could fail if something else were to get a word in edgewise during the code's execution, and when you find such a vulnerability, attempt to exploit it in a test to see if it exhibits the behavior you're trying to reproduce.

A significant amount of in-depth code inspection is typically called for in these situations; you have to look at the code, abandoning any preconceived notions of how the code is supposed to behave, and imagine scenarios in which it could fail in the way your client has observed. For each scenario, try to develop a test that could be run efficiently within your current automated testing environment (that is, without needing a new VM stack just for this one test), that would prove or disprove that the code behaves as you expected (which, depending on what you expected, would prove or disprove that this code is a possible cause of the clients' problems). This is the scientific method for software engineers; observe, hypothesize, test, reflect, repeat.
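For a concurrency suspect, such a test usually means hammering the suspected code path from many threads at once and checking an invariant that the race would break. Here is a minimal sketch, assuming (purely for illustration) that the suspect is an unsynchronized check-then-act around a shared cache; the SessionCache class and its getOrCreate method are hypothetical stand-ins, not anything from the product in question:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SessionCacheRaceTest {

    /** Stand-in for the suspect production code: a classic unsynchronized check-then-act. */
    static class SessionCache {
        private final Map<String, Object> sessions = new HashMap<>();

        Object getOrCreate(String key) {
            Object s = sessions.get(key);   // check
            if (s == null) {                // a second thread can pass this check too...
                s = new Object();
                sessions.put(key, s);       // ...and both end up creating an instance
            }
            return s;
        }
    }

    public static void main(String[] args) throws Exception {
        final int threads = 32;
        final SessionCache cache = new SessionCache();
        final Set<Object> created = ConcurrentHashMap.newKeySet();
        final CountDownLatch startGun = new CountDownLatch(1);
        final ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                startGun.await();                               // line all the threads up...
                created.add(cache.getOrCreate("customer-42"));  // ...then hit the same key at once
                return null;
            });
        }
        startGun.countDown();   // release every thread at the same moment
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        // Correct code always yields 1; anything larger reproduces the race.
        // (It may take several runs to show; that's the nature of the beast.)
        System.out.println("distinct instances created: " + created.size());
    }
}
```

If a test like this fails even occasionally on a developer machine, you have both a reproduction and a regression test for the eventual fix, without needing the full cluster.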

As with everything else in software development, the correct answer is a compromise.

In theory, you should never try to fix a bug if you cannot prove that it exists. Doing so may cause you to make unnecessary changes to your code that don't ultimately solve anything. And proving it means reproducing it first, then creating and applying a fix, then demonstrating that it no longer happens. Your gut here is steering you in the right direction -- if you want to be confident that you've resolved your customer's problem you need to know what caused it in the first place.

In practice, that is not always possible. Perhaps the bug only occurs on large clusters with dozens of users simultaneously accessing your code. Perhaps there is a specific combination of data operations on specific sets of data that triggers the bug, and you have no idea what that is. Perhaps your customer ran the program interactively non-stop for hundreds of hours before the bug manifested.

In any of those cases, there's a strong chance that your department is not going to have the time or money to reproduce the bug before you start work. In many cases, it's far quicker for you, the developer, to spot the flaw in the code that explains the reported situation. Once you've diagnosed the problem, you may be able to go back and reproduce it. It's not ideal, but at the same time, part of your job as a senior developer is to know how to read and interpret code, partly to locate these kinds of buried bugs.

In my opinion, you are focusing on the wrong part of the question. What if you ultimately cannot reproduce the bug in question? Nothing is more frustrating to a customer than to hear "yeah, we know you crashed the program but we can't reproduce it, so it's not a bug." When your customer hears this, they interpret it as "we know our software is buggy, but we can't be bothered to find and fix the bugs, so just cross your fingers." Is it better to close a reported bug as "not reproducible", or to close it as "not reproducible, but we have made some reasonable changes to try to improve stability"?

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

No, it very definitely isn't. That would be a stupid policy.

The problem I see with your question and your proposal is that they fail to make a distinction between

  • bug reports
  • failures (errors)
  • bugs (also sometimes called errors)

A bug report is communication about a bug. It tells you somebody thinks something is wrong. It may or may not be specific about what is supposed to be wrong.

A bug report is evidence of a failure.

A failure is an incident of something going wrong. A specific malfunction, but not necessarily with any clues as to what may have caused it.

A failure may be caused by a bug.

A bug is a cause of failures; something that can (in principle) be changed in order to prevent the failures it causes from occurring in the future.

Sometimes, when a bug is reported, the cause is immediately clear. In such a case, reproducing the bug would be nonsensical. At other times, the cause isn't clear at all: the bug report doesn't describe any particular failure, or it does but the failure is such that it doesn't provide a clue as to what might be the cause. In such cases, I feel your advice is justified - but not always: one doesn't insist on crashing a second $370 million space rocket before accepting to investigate what caused the first one to crash (a particular bug in the control software).

And there are also all sorts of cases in between; for instance, if a bug report does not prove, but only suggests, that a potential problem you were already aware of might play a role, this might be enough incentive for you to take a closer look at it.

So while insisting on reproducibility is wise for the tougher cases, it is unwise to enforce it as a strict policy.

Unless the error is evident, obvious and trivial, with a very specific error message, etc., it's often very difficult to fix a bug if the user or the maintainer is not able to replicate it.

Also, how would you prove to them that the bug is fixed if you cannot replicate the steps?

The problem in your case is that the user doesn't know how the error occurred either, that is, on which screen or while doing what operation. They simply have the log.

I think your point is reasonable. If you had psychic powers, you possibly wouldn't be working for a salary.

I think you should tell your bosses that, without being able to replicate the error, it would take an unknown amount of time to find it, and there's no guarantee at all that you will.

The problem will be when some co-worker of yours finds the bug out of pure luck and fixes it.

Let's take it to the extreme, and assume that you found the bug much earlier: in your code, as you were writing it. Then you wouldn't have any qualms about fixing it right there -- you see a logic flaw in the code you just wrote, it doesn't do what you wanted it to do. You wouldn't feel a need to setup a whole environment to show that it's actually a bug.

Now a bug report comes in. There are several things you can do. One of them is to go back to the code and re-read it. Now suppose that on this second reading, you immediately find the bug in the code -- it simply doesn't do what you intended it to do, and you failed to notice when you wrote it. And, it perfectly explains the bug that just came in! You make the fix. It took you twenty minutes.

Did that fix the bug that caused the bug report? You can't be 100% sure (there may have been two bugs causing this same thing), but it probably did.

Another thing you could do is reproduce the customer's configuration as well as you can (a few days' work) and eventually reproduce the bug. In many cases, timing and concurrency issues mean you can't reproduce it reliably, but you can try a lot of times and sometimes see the same thing happen. Now you start debugging, find the error in the code, put the fix into the environment, and try a lot of times again. You don't see the bug occurring anymore.

Did that fix the bug that caused the bug report? You still can't be 100% sure -- one, you may actually have seen a completely different bug than the customer did; two, maybe you didn't try often enough; and three, maybe the configuration is still slightly different and it's fixed on this system, but not on the customer's.

So certainty is impossible to get in any case. But the first method is way faster (you can give the customer a patch faster too), is way cheaper and, if you find a clear coding bug that explains the symptom, is actually more likely to find the problem too.

So it depends. If it's cheap to setup a testing environment (or better: an automated test that shows the problem), then do that. But if it's expensive and/or the circumstances in which the bug shows are unpredictable, then it's always better to try to find the bug by reading the code first.

Reading the question, I don't see any fundamental opposition between your position and your team's.

  • Yes, you should give your best effort to reproduce the problem occurring in the client setting. But best effort means that you should define some time box for that, and there may not be enough data in the log to actually reproduce the problem.

    If so, it all depends on the relationship with this customer. It can range from you getting nothing more from them, to you being able to send a developer on site with diagnostic tools and the ability to run them on the failing system. Usually we are somewhere in between, and if the initial data is not enough, there are ways to get more.

  • Yes, a senior developer should be able to read the code and is likely to find the reason for the problem by following the log content. Really, it is often possible to write a unit test that exhibits the problem after carefully reading the code (see the sketch below).

    Succeeding in writing such a unit test is nearly as good as reproducing the failing functional environment. Of course, this method is no guarantee either that you will find anything. The exact sequence of events leading to failure in multi-threaded software can be really hard to work out just by reading the code, and the ability to debug live is likely to become critical.

In summary, I would try both approaches simultaneously and ask for either a live system exhibiting the problem (and showing that it is fixed afterwards) or a unit test that breaks on the problem (and also shows that it passes after the fix).
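As an illustration of what such a unit test typically uncovers, here is a common shape for this kind of defect and one conventional fix; the names are hypothetical and this is only a sketch of the pattern, not the product's actual code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ConnectionRegistry {

    // Racy original: an unsynchronized check-then-act on a plain HashMap.
    private final Map<String, Connection> unsafe = new HashMap<>();

    Connection getOrOpenUnsafe(String node) {
        Connection c = unsafe.get(node);   // two threads can both see null here...
        if (c == null) {
            c = Connection.open(node);     // ...and both open a connection,
            unsafe.put(node, c);           // one of which is silently overwritten and leaked
        }
        return c;
    }

    // Fixed version: the map performs the check-then-act atomically.
    private final ConcurrentHashMap<String, Connection> safe = new ConcurrentHashMap<>();

    Connection getOrOpenSafe(String node) {
        return safe.computeIfAbsent(node, Connection::open);
    }

    /** Hypothetical stand-in for whatever shared resource the real code manages. */
    static class Connection {
        static Connection open(String node) {
            return new Connection();
        }
    }
}
```

The same unit test should fail against the first version and pass against the second, which is exactly the "breaks on the problem, passes after the fix" evidence described above.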

Trying to just fix the code and send it into the wild does indeed look very risky. In some similar cases I've seen (where we failed to reproduce the defect internally), I made it clear that if a fix went into the wild and failed to resolve the customer's problem, or had any other unexpected negative consequences, the person who proposed it would have to help the support team find the actual problem, including dealing with the customer if necessary.

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

Since nobody said it in clear terms yet: Absolutely not!

Like everything else in software development, bugfixing means keeping in mind time, risk, and cost. Finding a balance between these is half of the job description of a developer.

Some bugs are not important enough to spend 2 days on, but important enough to spend 10 minutes on fixing them. Other bugs are non-deterministic and you already know a test environment can't prove they've been fixed. If setting up the test environment takes 2 days, you don't do it for these bugs. Instead you spend the time on smarter things, such as finding ways to set up a test environment in 5 minutes instead of 2 days.

And of course there are bugs where if you get them wrong a client will lose $100'000+. And bugs where the client will lose $100'000+ for every hour the bug isn't fixed. You need to look at the bug and make a decision. Blanket statements to treat all bugs the same don't work.

Sounds to me like you need more detailed logging.

While adding more logging cannot guarantee that you won't need to debug (or, in this case, reproduce the situation), it will give you a far better insight into what actually went wrong.

Especially in complicated/threading situations, or anything where you can't use a debugger, falling back on "debug by printf()" might be your only recourse. In which case, log as much as you can (more than you expect to need) and have some good tools for filtering the wheat from the chaff.
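For instance, assuming the product uses an SLF4J-style logging facade (adjust to whatever framework is actually in place), per-thread context such as the cluster node and transaction id can be attached to every log line via the MDC, which makes interleaved logs from a cluster far easier to untangle. The class, method, and property names below are made up for the example:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

class OrderService {

    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    void process(String txId, String orderId) {
        // Everything logged by this thread until MDC.clear() carries these keys.
        MDC.put("txId", txId);
        MDC.put("node", System.getProperty("cluster.node", "local"));
        try {
            log.debug("begin processing order={}", orderId);
            // ... the suspect concurrent/database code would run here ...
            log.debug("commit ok for order={}", orderId);
        } catch (RuntimeException e) {
            log.error("failed processing order={}", orderId, e);
            throw e;
        } finally {
            MDC.clear();   // don't leak context into pooled threads
        }
    }
}
```

The log pattern then needs placeholders such as %X{txId} and %X{node} (in Logback/Log4j terms) for the values to actually appear; the extra detail costs little as long as the log level is tuned appropriately.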

Very good question! My opinion is that if you can't reproduce the problem, then you can't say 100% for sure that the fix you made will:

a) actually fix the issue, and b) not create another bug.

There are times when a bug occurs, I fix it, and I don't bother to test it because I know 100% for sure that it works. But until our QA department says it's working, I still consider it a possibility that the bug is present... or that a new bug was created by the fix.

If you can't reproduce the bug and then install the new version and confirm that it is fixed then you can't, with 100% certainty, say that the bug is gone.

I tried for a few minutes to think of an analogy to help you explain to others but nothing really came to mind. A vasectomy is a funny example but it's not the same situation :-)

[bug related to] concurrent database access, clustered implementation, multithreaded

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

I'd not spend too much time trying to reproduce it. That looks like a synchronization problem, and those are more often found by reasoning (starting from logs like the one you have, to pinpoint the subsystem in which the issue occurs) than by finding a way to reproduce it and attacking it with a debugger. In my experience, reducing the optimization level of the code, or even activating additional instrumentation, can add enough delay, or act as the missing synchronization primitive, to prevent the bug from manifesting itself.
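One cheap way to confirm a timing hypothesis like that locally, without a cluster, is to deliberately widen the suspected race window in a test build. The flag name and the placement below are assumptions made for illustration, not anything from an actual product:

```java
final class RaceWindow {

    // Enabled only for a diagnostic run, e.g. -Ddebug.widenRaceWindow=true
    private static final boolean WIDEN = Boolean.getBoolean("debug.widenRaceWindow");

    /** Call between the suspected "check" and "act" steps to make the bad interleaving near-certain. */
    static void pause() {
        if (WIDEN) {
            try {
                Thread.sleep(50);   // a few tens of milliseconds is usually plenty
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private RaceWindow() {
    }
}

// Illustrative use inside the suspect code path:
//   Object s = sessions.get(key);   // check
//   RaceWindow.pause();             // widened window
//   if (s == null) { ... }          // act
```

If the customer's exception then shows up in your local log, the reasoning has been confirmed without rebuilding their cluster; the hook stays disabled (or is removed) once the fix is in.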

Yes, if you don't have a way to reproduce the bug, you won't be able to be sure that you've fixed it. But if your customer doesn't give you a way to reproduce it, you may also end up chasing something similar, with the same consequence but a different root cause.

Both activities (code review and testing) are necessary; neither is sufficient.

You could spend months constructing experiments trying to repro the bug and never get anywhere if you haven't looked at the code and formed a hypothesis to narrow the search space. You might blow months gazing into your navel trying to visualize a bug in the code; you might even think you've found it once, twice, three times, only to have the increasingly impatient customer say, "No, the bug is still there."

Some developers are relatively better at one activity (code review vs constructing tests) than the other. A perfect manager weighs these strengths when assigning bugs. A team approach may be even more fruitful.

Ultimately, there may not be enough information to repro the bug, and you have to let it marinate for a while, hoping another customer will hit a similar problem and give you more insight into the configuration issue. If the customer who saw the bug really wants it fixed, they will work with you to collect more information. If the problem only ever arose once, it's probably not a high-priority bug, even if the customer is important. Sometimes not working a bug is smarter than blowing man-hours flailing around looking for a really obscure defect with not enough information.

Licensed under: CC-BY-SA with attribution