
This is a "Share the Knowledge" question. I am interested in learning from your successes and/or failures.

Information that might be helpful...


  • Context: Language, Application, Environment, etc.
  • How was the bug identified ?
  • Who or what identified the bug ?
  • How complex was reproducing the bug ?

The Hunting.

  • What was your plan ?
  • What difficulties did you encounter ?
  • How was the offending code finally found ?

The Killing.

  • How complex was the fix ?
  • How did you determine the scope of the fix ?
  • How much code was involved in the fix ?


  • What was the root cause technically ? buffer overrun, etc.
  • What was the root cause from 30,000 ft ?
  • How long did the process ultimately take ?
  • Were there any features adversely affected by the fix ?
  • What methods, tools, motivations did you find particularly helpful ? ...horribly useless ?
  • If you could do it all again ?............

These examples are general, not applicable in every situation and possibly useless. Please season as needed.

It was actually in a 3rd party image viewer sub-component of our application.

We found that 2-3 of the users of our application would frequently have the image viewer component throw an exception and die horribly. However, we had dozens of other users who never saw the issue despite using the application for the same task for most of the work day. There was also one user in particular who got it a lot more frequently than the rest of them.

We tried the usual steps:

(1) Had them switch computers with another user who never had the problem to rule out the computer/configuration. - The problem followed them.

(2) Had them log into the application and work as a user that never saw the problem. - The problem STILL followed them.

(3) Had the user report which image they were viewing and set up a test harness to repeat viewing that image thousands of times in quick succession. The problem did not present itself in the harness.

(4) Had a developer sit with the users and watch them all day. They saw the errors, but didn't notice them doing anything out of the ordinary to cause them.

We struggled with this for weeks trying to figure out what the "Error Users" had in common that the other users didn't. I have no idea how, but the developer in step (4) had a eureka moment on the drive in to work one day worthy of Encyclopedia Brown.

He realized that all the "Error Users" were left handed, and confirmed this fact. Only left-handed users got the errors, never Righties. But how could being left handed cause a bug?

We had him sit down and watch the left-handers again specifically paying attention to anything they might be doing differently, and that's how we found it.

It turned out that the bug only happened if you moved the mouse to the rightmost column of pixels in the image viewer while it was loading a new image (an overflow error, because the vendor had an off-by-one calculation in the mouseover event).

Apparently, while waiting for the next image to load, the left-handed users all naturally moved their hand (and thus the mouse) towards the keyboard, which for them meant moving it to the right.

The one user who happened to get the error most frequently was one of those ADD types that compulsively moved her mouse around a lot impatiently while waiting for the next page to load, thus she was moving the mouse to the right much more quickly and hitting the timing just right so she did it when the load event happened. Until we got a fix from the vendor, we told her just to let go of the mouse after clicking (next document) and not touch it until it loaded.

It was henceforth known in legend on the dev team as "The Left Handed Bug"


This is from a long time ago (the late 1980s).

The company I worked for wrote a CAD package (in FORTRAN) that ran on various Unix workstations (HP, Sun, Silicon Graphics etc.). We used our own file format to store the data, and when the package was started disk space was scarce, so there was a lot of bit shifting used to store multiple flags in entity headers.

The type of the entity (line, arc, text etc) was multiplied by 4096 (I think) when stored. In addition this value was negated to indicate a deleted item. So to get the type we had code that did:

type = record[1] MOD 4096

On every machine except one this gave ±1 (for a line), ±2 (for an arc) etc. and we could then check the sign to see if it was deleted.

On one machine (HP I think) we had a weird problem where the handling of deleted items was screwed up.

This was in the days before IDEs and visual debuggers, so I had to insert trace statements and logging to try and track down the problem.

I eventually discovered that it was because, while every other manufacturer implemented MOD so that -4096 MOD 4096 resulted in -1, HP implemented it mathematically correctly so that -4096 MOD 4096 resulted in -4097.

I ended up having to go through the entire code base saving the sign of the value and making it positive before performing the MOD and then multiplying the result by the sign value.

This took several days.
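
The two MOD conventions and the sign-saving workaround can be sketched in Python. This is only an illustration: the function names are made up, the values are picked for the example rather than taken from the CAD package, and Python's `%` (result takes the sign of the divisor) stands in for the "mathematical" MOD, while the helper mimics a MOD whose result takes the sign of the dividend.

```python
def floored_mod(a: int, n: int) -> int:
    """Remainder that takes the sign of the divisor (Python's native %)."""
    return a % n


def truncated_mod(a: int, n: int) -> int:
    """Remainder that takes the sign of the dividend.

    Built the way the workaround describes: save the sign, make the value
    positive, take the MOD, then multiply the result by the sign.
    """
    sign = -1 if a < 0 else 1
    return sign * (abs(a) % n)


if __name__ == "__main__":
    for value in (4097, -4097):
        print(value, floored_mod(value, 4096), truncated_mod(value, 4096))
```

For positive values the two agree; they only diverge once the stored value has been negated, which is why the bug showed up only in the handling of deleted items.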

My toughest was years back when Turbo Pascal was big, though it might have been one of the early C++ IDEs of that time. As sole developer (and third guy in at this startup) I had written something like a simplified salesperson-friendly CAD program. It was great at the time, but developed a nasty random crash. It was impossible to reproduce, but happened frequently enough that I set off on a bug hunt.

My best strategy was to single-step in the debugger. The bug happened only when the user had entered enough of a drawing and maybe had to be in a certain mode or zoom state, so there was a lot of tedious setting and clearing breakpoints, running normally for a minute to enter a drawing, and then step through a large chunk of code. Especially helpful were breakpoints that would skip some adjustable number of times then break. This whole exercise had to be repeated several times.

Eventually I narrowed it down to a place where a subroutine was being called, being given a 2 but from within it saw some gibberish number. I could have caught this earlier, but had not stepped into this subroutine, assuming that it got what it was given. Blinded by assuming the simplest of things were okay!

It turned out to be stuffing a 16-bit int on the stack, while the subroutine was expecting 32 bits. Or something like that. The compiler did not automatically pad all values to 32 bits, or do sufficient type checking. It was trivial to fix, just part of one line, hardly any thought required. But to get there took three days of hunting and questioning the obvious.
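
A rough way to picture that kind of width mismatch, using Python's `struct` module. The byte order and the neighbouring "leftover" bytes are assumptions made up for the sketch; the real stack layout in the story is not known.

```python
import struct

# Caller side: the argument 2 is written as a 16-bit little-endian integer.
pushed = struct.pack("<h", 2)

# Whatever happened to sit next to it in memory (hypothetical bytes).
leftover = b"\x34\x12"

stack = pushed + leftover

# Callee side: reads a 32-bit integer starting at the same location,
# so the two leftover bytes become the high half of the value.
(seen_by_callee,) = struct.unpack("<i", stack)

# What the callee would have seen had both sides agreed on 16 bits.
(intended,) = struct.unpack("<h", stack[:2])

print(intended, hex(seen_by_callee))
```

The callee's value still contains the 2 in its low half, but the high half is whatever was lying around, which matches the "given a 2, saw gibberish" symptom.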

So I have personal experience with that anecdote about the pricey consultant who comes in, after a while makes one tap somewhere, and charges $2000. The executives demand a breakdown, and it's $1 for the tap, $1999 for knowing where to tap. Except in my case, it was time, not money.

Lessons learned: 1) use the best compilers, where "best" is defined as including checking for as many problems as computer science knows how to check for, and 2) question the simple obvious things, or at least verify their proper functioning.

Since then all difficult bugs have been truly difficult, as I know to check the simple things more thoroughly than seems necessary.

Lesson 2 also applies to the toughest electronics bug I ever fixed, also with a trivial fix, but several smart EEs had been stumped for months. But this isn't an electronics forum, so I'll say no more of that.

The networking data race condition from hell

I was writing a networking client/server (Windows XP/C#) to work with a similar application on a really old (Encore 32/77) workstation written by another developer.

What the application did essentially was share/manipulate certain data on the host to control the host process running the system with our fancy PC based multi-monitor touchscreen UI.

It did this with a 3 layered structure. The communications process read/wrote data to/from the host, did all of the necessary format conversions (endianness, floating point format, etc) and wrote/read the values to/from a database. The database acted as a data intermediary between the comms and touchscreen UIs. The touchscreen UI's app generated touch screen interfaces based on how many monitors were attached to the PC (it automatically detected this).

In the time frame given, a packet of values between the host and our PC could only carry 128 values max across the wire at a time, with a max latency of ~110ms per round trip (UDP was used over a direct crossover Ethernet connection between the computers). So the number of variables allowed, based on the variable number of attached touchscreens, was under strict control. Also, the host (although it had a pretty complex multi-processor architecture with a shared memory bus used for real-time computing) had about 1/100th the processing power of my cell phone, so it was tasked to do as little processing as possible, and its server/client had to be written in assembly to ensure this (the host was running a full real-time simulation that couldn't be affected by our program).

The issue was this: some values, when changed on the touchscreen, wouldn't just take the newly entered value but would cycle randomly between that value and the previous value. Only a few specific values on a few specific pages, with a certain combination of pages, ever exhibited the symptom. We almost missed the issue completely until we started running it through the initial customer acceptance process.

To pin down the issue I picked one of the oscillating values:

  • I checked the Touchscreen app, it was oscillating
  • I checked the database, oscillating
  • I checked the comms app, oscillating

Then I broke out Wireshark and started manually decoding packet captures. Result:

  • Not oscillating but the packets didn't look right, there was too much data.

I stepped through every detail of the comms code a hundred times finding no flaw/error.

Finally I started firing off emails to the other dev asking in detail how his end worked to see if there was something I was missing. Then I found it.

Apparently, when he sent data he didn't clear the data array before transmission. Essentially, he was reusing the last buffer, with the new values overwriting the old ones, but any old values that had not been overwritten were still being transmitted.

So, if a value was at position 80 of the data array and the list of values requested changed to fewer than 80 items but still contained that same value, then both the old and the new copy of the value would exist in the data buffer at any given time.

The value being read from the database depended on the time slice of when the UI was requesting the value.

The fix was painfully simple. Read in the number of items incoming on the data buffer (It was actually contained as part of the packet protocol) and don't read the buffer beyond that number of items.
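
A small Python sketch of the stale-buffer effect and the count-based fix. The function names, the `(name, value)` layout and the sample values are invented for illustration; only the 128-slot limit and the "item count is part of the packet" detail come from the story.

```python
BUFFER_SLOTS = 128  # maximum number of values per packet


def fill_without_clearing(buffer, values):
    """Sender side: overwrite the front of a reused buffer, leave the rest."""
    for i, item in enumerate(values):
        buffer[i] = item
    return len(values)  # item count carried in the packet protocol


def read_everything(buffer):
    """Buggy receiver: treats every non-empty slot as current data."""
    return [item for item in buffer if item is not None]


def read_counted(buffer, count):
    """Fixed receiver: never reads past the advertised item count."""
    return buffer[:count]


if __name__ == "__main__":
    buffer = [None] * BUFFER_SLOTS
    fill_without_clearing(buffer, [("a", 1), ("b", 2), ("setpoint", 10)])
    count = fill_without_clearing(buffer, [("setpoint", 20)])

    print(read_everything(buffer))      # old and new "setpoint" both present
    print(read_counted(buffer, count))  # only the current list
```

With the buggy reader, which copy of the value "wins" depends on which slot was processed last, which fits the oscillation between old and new values.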

Lessons learned:

  • Don't take modern computing power for granted. There was a time when computers didn't support Ethernet and when flushing an array could be considered expensive. If you really want to see how far we've come, imagine a system that has virtually no form of dynamic memory allocation. I.e., the executive process had to pre-allocate all of the memory for all of the programs in order, and no program could grow beyond that boundary; allocating more memory to a program without recompiling the whole system could cause a massive crash. I wonder if people will talk about the pre-garbage-collection days in the same light someday.

  • When doing networking with custom protocols (or handling binary data representation in general), make sure you read the spec until you understand every function of every value being sent across the pipe. I mean, read it until your eyes hurt. People who handle data by manipulating individual bits or bytes have very clever and efficient ways of doing things. Missing the tiniest detail could break the system.

The overall time to fix was 2-3 days, with most of that time spent working on other things when I got too frustrated with this.

SideNote: The host computer in question didn't support Ethernet by default. The card to drive it was custom made and retrofitted, and the protocol stack virtually didn't exist. The developer I was working with was one hell of a programmer: he not only implemented a stripped-down version of UDP and a minimal fake Ethernet stack (the processor wasn't powerful enough to handle a full Ethernet stack) on the system for this project, but he did it in less than a week. He had also been one of the original project team leaders who had designed and programmed the OS in the first place. Let's just say that anything he ever had to share about computers/programming/architecture, no matter how long-winded or how much I already knew, I'd listen to every word. There is nothing more valuable than working with good people who have a genuine passion for what they do.

The Background

  • A mission-critical WCF application driving a website and providing backend transactional processing
  • Large-volume application (hundreds of calls per second)
  • Multiple servers, multiple instances
  • Hundreds of passed unit tests and countless QA attacks

The Bug

  • When moved to production the server would run fine for a random amount of time then begin to rapidly degrade and take the box CPU to 100%.

How I found it

At first I was sure this was a normal performance problem, so I created elaborate logging, checked performance on every call, talked to the database people about utilization, and watched the servers for issues. 1 week

Then I was sure I had a thread contention issue. I checked for deadlocks, attempted to create the situation, and created tools to attempt to create the situation in debug. With growing management frustration I turned to my peers, who suggested things from restarting the project from scratch to limiting the server to one thread. 1.5 weeks

Then I looked at Tess Ferrandez's blog, created a user dump file, and analyzed it with WinDbg the next time the server took a dump. I found that all my threads were stuck in the dictionary.add function.

The long and the short of it: one small dictionary, which just kept track of which log to write a given thread's errors to, was not synchronized.
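
The general shape of the remedy, guarding every access to the shared dictionary with a lock, might look like the following. This is a Python sketch, not the original .NET code; `LogRegistry`, `log_for` and the file-name scheme are hypothetical, and Python's built-in dict would not fail in the same way as the dictionary in the story, so the sketch only shows the locking pattern.

```python
import threading


class LogRegistry:
    """Maps a thread name to the log file its errors should go to."""

    def __init__(self):
        self._logs = {}
        self._lock = threading.Lock()

    def log_for(self, thread_name):
        # Every check-then-add on the shared dictionary happens under the
        # lock, so two threads can never modify it at the same time.
        with self._lock:
            if thread_name not in self._logs:
                self._logs[thread_name] = f"errors-{thread_name}.log"
            return self._logs[thread_name]

    def snapshot(self):
        """Return a copy of the mapping, taken under the lock."""
        with self._lock:
            return dict(self._logs)


def run_workers(registry, count):
    """Start `count` threads that each register themselves, then wait."""
    threads = [
        threading.Thread(target=registry.log_for, args=(f"worker-{i}",))
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return registry.snapshot()
```

Calling `run_workers(LogRegistry(), 50)` registers fifty workers concurrently; the point is simply that no code path touches `_logs` without holding `_lock`.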

We had an application that was talking to a hardware device that, in some cases, would fail to operate correctly if the device was physically unplugged until it had been plugged back in and soft-reset twice.

The problem turned out to be that an application running at startup was occasionally segfaulting when it was trying to read from a filesystem that hadn't yet been mounted (for example, if a user configured it to read from an NFS volume). At start up the application would send some ioctls to the driver to initialize the device, then read configuration settings and send more ioctls to put the device in the correct state.

A bug in the driver was causing an invalid value to be written to the device when the initialization call was made, but the value was overwritten with valid data once the calls were made to put the device in a specific state.

The device itself had a battery and would detect if it lost power from the motherboard, writing a flag into volatile memory to indicate that it had lost power. It would then enter a specific state the next time it was powered on, and a specific instruction needed to be sent to clear the flag.

The problem arose if the power was removed after the ioctls had been sent to initialize the device (which wrote the invalid value to the device) but before valid data could be sent. When the device was powered back on, it would see that the flag had been set and try to read the invalid data that had been sent from the driver due to the incomplete initialization. This would put the device in an invalid state where the powered-off flag had been cleared but the device would not receive further instructions until it had been reinitialized by the driver. The second reset would mean that the device was no longer trying to read the invalid data that had been stored on it, and would receive correct configuration instructions, allowing it to be put into the correct state (assuming the application sending the ioctls didn't segfault).

In the end it took about two weeks to figure out the exact set of circumstances that was causing the problem.

For a university project we were writing a distributed P2P node system for sharing files. It supported multicasting so that nodes could detect each other, multiple rings of nodes, and a nameserver that assigned a node to a client.

It was written in C++, and we used POCO for this as it allows nice IO, socket and thread programming.

Two bugs arose that annoyed us and cost us a lot of time. First, a really logical one:

Randomly, a computer would share its localhost IP instead of its remote IP.

This caused clients to connect to the node on the same PC or nodes to connect with themselves.

How did we identify this? When we improved the output in the nameserver, we discovered later, after rebooting the computers, that our script to determine which IP to give out was wrong. Randomly, the lo device was listed first instead of the eth0 device... Really stupid. So now we have hardcoded it to request the IP from eth0, as this is shared among all university computers...
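
As an alternative to hardcoding eth0 (which is what we actually did), the selection could skip loopback addresses regardless of the order in which devices are listed. A minimal Python sketch, assuming the interface list has already been obtained as `(name, ip)` pairs; `pick_advertised_ip` and the sample addresses are made up for the example.

```python
import ipaddress


def pick_advertised_ip(interfaces):
    """Return the first non-loopback address from (name, ip) pairs, or None."""
    for _name, ip in interfaces:
        if not ipaddress.ip_address(ip).is_loopback:
            return ip
    return None


if __name__ == "__main__":
    # Same machine, two possible listing orders: the result is the same.
    print(pick_advertised_ip([("lo", "127.0.0.1"), ("eth0", "10.0.0.5")]))
    print(pick_advertised_ip([("eth0", "10.0.0.5"), ("lo", "127.0.0.1")]))
```

This only removes the dependence on device order; it does not decide between several non-loopback interfaces.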

And now a more annoying one:

Randomly, the packet flow would pause.
When the next client connects it would continue...

This happened really randomly, and since more than one computer was involved it was even more annoying to debug. The university computers did not allow us to run Wireshark, so we were left guessing whether the problem was on the sending side or the receiving side.

With a lot of output in the code we just took the assumption that sending the commands goes fine,
this left us wondering where the real problem was... It seemed that the way POCO polls is wrong and that we instead should check for available characters on the incoming socket.

We had assumed that this worked because simpler tests in a prototype involving fewer packets didn't cause this issue, so we just assumed that the poll statement was working but... It wasn't. :-(

Lessons learned:

  • Don't make stupid assumptions like the order of the network devices.

  • Frameworks don't always do their job (either implementation or documentation) right.

  • Provide enough output in the code; if that's not allowed, be sure to log extended details to a file.

  • When code hasn't been unit tested (because it's too difficult), don't assume it works.

I'm still on my most difficult bug hunt. It's one of those sometimes-it's-there-and-sometimes-it's-not bugs. That's why I'm here, at 6:10am the next day.


  • Context: Language, Application, Environment, etc.
    • PHP OS Commerce
  • How was the bug identified ?
    • Random orders that work partway and then randomly fail, plus redirect issues
  • Who or what identified the bug ?
    • Client, and the redirect issue was obvious
  • How complex was reproducing the bug ?
    • I haven't been able to reproduce it, but the client has been able to.

The Hunting.

  • What was your plan?
    • Add debug code, fill order, analyze data, repeat
  • What difficulties did you encounter ?
    • Lack of repeatable problems and horrible code
  • How was the offending code finally found ?
    • Lots of offending code was found... just not exactly what I needed.

The Killing.

  • How complex was the fix ?
    • very
  • How did you determine the scope of the fix ?
    • there was no scope... it was everywhere
  • How much code was involved in the fix ?
    • All of it? I don't think there was a file untouched


  • What was the root cause technically ? buffer overrun, etc.
    • bad coding practice
  • What was the root cause from 30,000 ft ?
    • I would rather not say...
  • How long did the process ultimately take ?
    • forever and a day
  • Were there any features adversely affected by the fix ?
    • feature? or is it a bug?
  • What methods, tools, motivations did you find particularly helpful ? ...horribly useless ?
  • If you could do it all again ?............
    • ctrl+a Del

I had to fix some confusing concurrency stuff last semester, but the bug that still stands out the most for me was in a text-based game I was writing in PDP-11 assembly for a homework assignment. It was based on Conway's Game of Life, and for some strange reason a large part of the information next to the grid was constantly being overwritten with information that shouldn't have been there. The logic was also pretty straightforward, so it was very confusing. After going over it a bunch of times, only to rediscover that all the logic was correct, I suddenly noticed what the problem was. This thing: .

In PDP-11 this little dot next to a number makes it base 10 instead of 8. It was next to a number that bounded a loop that was supposed to be limited to the grid, whose size was defined with the same numbers but in base 8.
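
The effect can be mimicked in Python by reading the same digits in two bases. The digits "20" are a made-up stand-in; the actual grid size in the assignment isn't given.

```python
# The same digits, read the way the assembler would read them.
GRID_SIZE = int("20", 8)    # no dot: octal, so the grid has 16 cells
LOOP_BOUND = int("20", 10)  # with the trailing dot: decimal, so 20 iterations

# Indices the loop touches that lie outside the grid.
overrun = [i for i in range(LOOP_BOUND) if i >= GRID_SIZE]

print(GRID_SIZE, LOOP_BOUND, overrun)
```

Every index in `overrun` is a write past the end of the grid, which is how data "next to the grid" can end up being overwritten.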

It still stands out for me because of the amount of damage such a tiny 4-pixel-sized addition caused. So what's the conclusion? Don't code in PDP-11 assembly.

Main-Frame Program Stopped Working For No Reason

I just posted this to another question. See Post Here

It happened because they installed a newer version of the compiler on the Main-Frame.

Update 06/11/13: (Original answer was deleted by OP)

I inherited this main-frame application. One day, out of the clear blue it stopped working. That's it... poof it just stopped.

My job was to get it working as fast as possible. The source code had not been modified for two years, but all of a sudden it just stopped. I tried to compile the code and it broke on line XX. I looked at line XX and I could not tell what would make line XX break. I asked for the detailed specs for this application and there were none. Line XX was not the culprit.

I printed out the code and started reviewing it from the top down. I started to create a flowchart of what was going on. The code was so convoluted I could hardly even make sense of it. I gave up trying to flowchart it. I was afraid to make changes without knowing how that change would affect the rest of the process, especially since I had no details of what the application did.

So, I decided to start at the top of the source code and add whitespace and line breaks to make the code more readable. I noticed, in some cases, there were if conditions that combined ANDs and ORs, and it wasn't clearly distinguishable what data was being ANDed and what data was being ORed. So I started putting parentheses around the AND and OR conditions to make them more readable.

As I slowly moved down cleaning it up, I would periodically save my work. At one point I tried compiling the code and a strange thing happened. The error had jumped past the original line of code and was now further down. So I continued, separating the AND and OR conditions with parens. When I got done cleaning it up, it worked. Go figure.

I then decided to visit the operations shop and ask them if they had recently installed any new components on the main-frame. They said yes, we recently upgraded the compiler. Hmmmm.

It turns out that the old compiler evaluated expressions from left to right regardless. The new version of the compiler also evaluated expressions from left to right, but it could not resolve ambiguous code, meaning unclear combinations of ANDs and ORs.

Lesson I learned from this... ALWAYS, ALWAYS, ALWAYS use parens to separate AND conditions and OR conditions when they are used in conjunction with each other.
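
To show why the grouping matters, here is a small Python sketch. Python's own precedence rules (where `and` binds tighter than `or`) are used purely for illustration; they are not necessarily the rules of the mainframe language, which isn't named here, and the function names are invented.

```python
from itertools import product


def implicit(a, b, c):
    # No parentheses: Python reads this as a or (b and c).
    return a or b and c


def and_first(a, b, c):
    # Grouping spelled out: the AND is evaluated as a unit.
    return a or (b and c)


def left_to_right(a, b, c):
    # A strict left-to-right reading of the same expression.
    return (a or b) and c


if __name__ == "__main__":
    for a, b, c in product([False, True], repeat=3):
        if and_first(a, b, c) != left_to_right(a, b, c):
            print("readings disagree for", a, b, c)
```

With explicit parentheses the result no longer depends on which reading a particular compiler happens to choose.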


  • Context: Web Server (C++) which allows customers to check-in themselves
  • Bug: When requesting the page, it would simply not respond, the whole farm that is, and the processes would be killed (and relaunched) because they took too long (only a few seconds is allowed) to serve the page
  • Some users did complain, but it was extremely sporadic so mostly unnoticed (people just tend to hit "Refresh" when a page is not served). We did notice the core dumps though ;)
  • We actually never managed to reproduce in our local environments, the bug appeared a few times in Test systems but never showed up during Performance Tests ??

The Hunting.

  • Plan: Well, since we had memory dumps and logs, we wanted to analyze them. Since it was affecting the whole farm and we had had some database issues in the past, we suspected the database (single DB for several servers)
  • Difficulties: A full server dump is huge, and so they are cleared quite frequently (not to run out of space), so we had to be quick to grab one when it occurred... We persisted. The dumps showed various stacks (never any DB stuff, so much for that), it failed while preparing the page itself (not in the previous computations), and confirmed what the logs showed, preparing the page would sometimes take a long time, even though it's just a basic template engine with pre-computed data (traditional MVC)
  • Getting to it: After some more samples and some thinking we realized that the time was taken reading data from the HDD (the page template). Since it was concerning the whole farm we first looked for scheduled jobs (crontab, batches) but the timings never matched from one occurrence to another... It finally occurred to me that this always happened a few days before the activation of a new version of the software and I had an "Aha!" moment... it was caused by the distribution of the software! Delivering several hundreds of megabytes (compressed) can put a little dent in the disk performance :/ Of course the distribution is automated and the archive pushed to all servers at once (multicast).

The Killing.

  • Fix Complexity: switching to compiled templates
  • Code Affected: none, a simple change in the build process


  • Root cause: operational issue or lack of forward planning :)
  • Timescale: it took months to track down, a matter of days to fix and test, a few weeks for QA and Performance testing and deployment -- no hurry there, since we knew that deploying the fix would trigger the bug... and nothing else... kind of perverse, really!
  • Adverse side-effects: impossibility to switch templates at runtime now that they are baked in the delivered code, we didn't use the feature much though, since generally switching templates means that you've got more data to pour in. Using css is mostly sufficient for "small" layout changes.
  • Methods, tools: gdb + monitoring! Just took us time to suspect the disk, and then identify the cause of the spikes of activity on the monitoring graph...
  • Next time: treat all IO as adverse!

The hardest one never got killed because it never could be reproduced other than in the full production environment with the factory operating.

The craziest one I did kill:

The drawings are printing gibberish!

I look at the code and I can't see anything. I pull a job out of the printer queue and examine it, it looks fine. (This was in the DOS era, PCL5 with embedded HP-GL/2--actually, very good for plotting drawings and no headaches of building a raster image in limited memory.) I direct it to another printer that should understand it, it prints fine.

Roll back the code, the problem is still there.

Finally I manually make a simple file and send it to the printer--gibberish. It turns out that it wasn't my bug at all but the printer itself. The maintenance company had flashed it to the latest version when they were fixing something else and that latest version had a bug. Getting them to understand they had taken out critical functionality and had to flash it back to an earlier version was harder than finding the bug itself.

One that was even more vexing but since it was only on my box I wouldn't put in first place:

Borland Pascal, DPMI code to deal with some unsupported APIs. Run it, occasionally it worked, usually it went boom trying to deal with an invalid pointer. It never produced a wrong result, though, like you would expect from stomping on a pointer.

Debug--if I single-stepped through the code it would always work correctly, otherwise it was just as unstable as before. Inspection always showed the right values.

The culprit: There were two.

1) Borland's library code had a major bug: Real mode pointers were being stored in pointer variables in protected mode. The problem is that most real mode pointers have invalid segment addresses in protected mode, and when you tried to copy the pointer, the code loaded it into a register pair and then saved it.

2) The debugger would never say anything about such an invalid load in single-step mode. I don't know what it did internally but what was presented to the user looked completely correct. I suspect that it wasn't actually executing the instruction but simulating it instead.

This is just a very simple bug that I somehow turned into a nightmare for myself.

Background: I was working on making my own operating system. Debugging is very difficult (trace statements are all you can have, and sometimes not even that).

Bug: Instead of doing two thread switches at usermode, it would instead general protection fault.

The bug hunt: I spent probably a week or two trying to fix this problem. Inserting trace statements everywhere. Examining generated assembly code (from GCC). Printing out each and every value I could.

The problem: Somewhere early in the bug hunt, I had placed a hlt instruction in the crt0. The crt0 is basically what bootstraps a user program for use in an operating system. This hlt instruction causes a GPF when executed from user mode. I placed it there and basically forgot about it. (originally the problem was something of a buffer overflow or memory allocation error)

The fix: Remove the hlt instruction :) After removing it, everything worked smoothly.

What I learned: When trying to debug a problem, don't lose track of the fixes you try. Do regular diffs against the latest stable source control version and see what you've changed recently when nothing else works.
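
In practice this would be the diff command of whatever version control system is in use. Purely as an illustration of the idea, here is a sketch with Python's `difflib` that lists only the lines that differ between a stable copy and a working copy; the crt0 contents and the `changed_lines` helper are invented for the example.

```python
import difflib


def changed_lines(stable: str, working: str):
    """Return only the added/removed lines between two versions of a file."""
    diff = difflib.unified_diff(
        stable.splitlines(),
        working.splitlines(),
        fromfile="stable",
        tofile="working",
        lineterm="",
        n=0,  # no context lines, just the changes
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


if __name__ == "__main__":
    stable = "_start:\n    call main\n    call exit\n"
    working = "_start:\n    hlt\n    call main\n    call exit\n"
    for line in changed_lines(stable, working):
        print(line)
```

A forgotten debugging aid like the stray `hlt` shows up as a single added line, which is much easier to spot in a diff than by rereading the file.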

