Question

I was recently writing a small piece of code which would indicate in a human-friendly way how old an event is. For instance, it could indicate that the event happened “Three weeks ago” or “A month ago” or “Yesterday.”

The requirements were relatively clear and this was a perfect case for test driven development. I wrote the tests one by one, implementing the code to pass each test, and everything seemed to work perfectly. Until a bug appeared in production.

Here's the relevant piece of code:

now = datetime.datetime.utcnow()
today = now.date()
if event_date.date() == today:
    return "Today"

yesterday = today - datetime.timedelta(1)
if event_date.date() == yesterday:
    return "Yesterday"

delta = (now - event_date).days

if delta < 7:
    return _number_to_text(delta) + " days ago"

if delta < 30:
    weeks = math.floor(delta / 7)
    if weeks == 1:
        return "A week ago"

    return _number_to_text(weeks) + " weeks ago"

if delta < 365:
    ... # Handle months and years in a similar manner.

The tests were checking the case of an event happening today, yesterday, four days ago, two weeks ago, a week ago, and so on, and the code was built accordingly.

What I missed is that an event can happen the day before yesterday while still being only one day ago: for instance an event happening twenty-six hours ago would be one day ago, while not exactly yesterday if now is 1 a.m. More exactly, the difference is one point something days, but since the delta is an integer, it will be just one. In this case, the application displays “One days ago,” which is obviously unexpected and unhandled in the code. It can be fixed by adding:

if delta == 1:
    return "A day ago"

just after computing the delta.
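To make the failure concrete, here is a minimal reproduction of the scenario (dates chosen arbitrarily):

import datetime

# Hypothetical values: "now" is 1 a.m., the event happened 26 hours earlier.
now = datetime.datetime(2017, 1, 2, 1, 0)        # Jan 2, 01:00
event_date = now - datetime.timedelta(hours=26)  # Dec 31, 23:00

today = now.date()
yesterday = today - datetime.timedelta(1)

print(event_date.date() == yesterday)  # False: two calendar days back
print((now - event_date).days)         # 1: yet the integer delta is one day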

While the only negative consequence of the bug is that I wasted half an hour wondering how this case could happen (and believing it had to do with time zones, despite the uniform use of UTC in the code), its presence troubles me. It indicates that:

  • It's very easy to commit a logical mistake even in such simple code.
  • Test driven development didn't help.

Also worrisome is that I can't see how such bugs could be avoided. Aside from thinking more before writing code, the only way I can think of is to add lots of asserts for the cases that I believe would never happen (like I believed that a day ago is necessarily yesterday), and then to loop through every second of the past ten years, checking for any assertion violation, which seems too complex.

How could I avoid creating this bug in the first place?


Solution

These are the kinds of errors you typically find in the refactor step of red/green/refactor. Don't forget that step! Consider a refactor like the following (untested):

def pluralize(num, unit):
    if num == 1:
        return unit
    else:
        return unit + "s"

def convert_to_unit(delta, unit):
    factor = 1
    if unit == "week":
        factor = 7 
    elif unit == "month":
        factor = 30
    elif unit == "year":
        factor = 365
    return delta // factor

def best_unit(delta):
    if delta < 7:
        return "day"
    elif delta < 30:
        return "week"
    elif delta < 365:
        return "month"
    else:
        return "year"

def human_friendly(event_date):
    now = datetime.datetime.utcnow()  # reference time, as in the question
    date = event_date.date()
    today = now.date()
    yesterday = today - datetime.timedelta(1)
    if date == today:
        return "Today"
    elif date == yesterday:
        return "Yesterday"
    else:
        delta = (now - event_date).days
        unit = best_unit(delta)
        converted = convert_to_unit(delta, unit)
        pluralized = pluralize(converted, unit)
        return "{} {} ago".format(converted, pluralized)

Here you've created three functions at a lower level of abstraction which are much more cohesive and easier to test in isolation. If you left out a time span you intended to handle, it would stick out like a sore thumb in the simpler helper functions. Also, by removing duplication, you reduce the potential for error: you would actually have to add code to implement your broken case.

Other more subtle test cases also more readily come to mind when looking at a refactored form like this. For example, what should best_unit do if delta is negative?
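For example, a sketch of isolated tests for best_unit, written as plain pytest-style asserts (the expected value for a negative delta is the open question, decided here one way just to force the discussion):

def test_best_unit():
    assert best_unit(0) == "day"
    assert best_unit(6) == "day"
    assert best_unit(7) == "week"
    assert best_unit(29) == "week"
    assert best_unit(30) == "month"
    assert best_unit(364) == "month"
    assert best_unit(365) == "year"
    assert best_unit(-1) == "day"  # current behavior; is that really what we want?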

In other words, refactoring isn't just for making it pretty. It makes it easier for humans to spot errors the compiler can't.

OTHER TIPS

Test driven development didn't help.

It seems like it did help; it's just that you didn't have a test for the "a day ago" scenario. Presumably you added a test after this case was found; this is still TDD: when bugs are found, you write a unit test to detect the bug, then fix it.

If you forget to write a test for a behavior, TDD can't help you: you forget to write the test, and therefore you never write the implementation either.

an event happening twenty-six hours ago would be one day ago

Tests won't help much if a problem is poorly defined. You're evidently mixing calendar days with days reckoned in hours. If you stick to calendar days, then at 1 a.m., 26 hours ago is not yesterday. And if you stick to hours, then 26 hours ago rounds to 1 day ago regardless of the time.

You can't. TDD is great at protecting you from issues you are aware of. It doesn't help when you run into issues you've never considered. Your best bet is to have someone else test the system; they may find the edge cases you never considered.

Related reading: Is it possible to reach absolute zero bug state for large scale software?

There are two approaches I normally take that I find can help.

First, I look for the edge cases. These are places where the behavior changes. In your case, behavior changes at several points along the sequence of positive integer days. There is an edge case at zero, at one, at seven, etc. I would then write test cases at and around the edge cases. I'd have test cases at -1 days, 0 days, 1 hour, 23 hours, 24 hours, 25 hours, 6 days, 7 days, 8 days, etc.
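As a sketch, those boundary tests could be written with pytest's parametrize. Everything here is assumed rather than taken from the question: a human_friendly(event_date, now) variant that accepts the reference time, the questioner's fix for the one-day case, and a _number_to_text that spells numbers out capitalized.

import datetime
import pytest

# 1 a.m., so the "a day ago but not yesterday" window exists.
NOW = datetime.datetime(2017, 6, 15, 1, 0)

@pytest.mark.parametrize("offset, expected", [
    (datetime.timedelta(hours=1), "Today"),
    (datetime.timedelta(hours=23), "Yesterday"),
    (datetime.timedelta(hours=26), "A day ago"),
    (datetime.timedelta(days=6), "Six days ago"),
    (datetime.timedelta(days=7), "A week ago"),
    (datetime.timedelta(days=8), "A week ago"),
])
def test_boundaries(offset, expected):
    assert human_friendly(NOW - offset, NOW) == expected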

The second thing I'd look for is patterns of behavior. In your logic for weeks, you have special handling for one week. You probably have similar logic in each of your other intervals not shown. This logic is not present for days, though. I would look at that with suspicion until I could either verifiably explain why that case is different, or I add the logic in.

You cannot catch logical errors that are present in your requirements with TDD. But still, TDD helps: you found the error, after all, and added a test case. Fundamentally, though, TDD only ensures that the code conforms to your mental model. If your mental model is flawed, test cases will not catch the flaw.

But keep in mind that while fixing the bug, the test cases you already had made sure no existing, functioning behavior was broken. That is quite important: it is easy to fix one bug but introduce another.

In order to find those errors beforehand, you usually try to use equivalence-class-based test cases. Using that principle, you would choose one case from every equivalence class, and then all edge cases.

You would choose a date from today, yesterday, a few days ago, exactly one week ago, and several weeks ago as the examples from each equivalence class. When testing with dates, you would also make sure that your tests do not use the system date, but a pre-determined date for comparison. This also highlights some edge cases: you would run your tests at some arbitrary time of day, and also directly after midnight, directly before midnight, and exactly at midnight. This means each test would be run against four base times.
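A sketch of that setup, assuming the code under test accepts the comparison time as a parameter instead of reading the system clock:

import datetime

BASE_DAY = datetime.date(2017, 6, 15)
BASE_TIMES = [  # the four base times described above
    datetime.datetime.combine(BASE_DAY, datetime.time(10, 37)),      # arbitrary time of day
    datetime.datetime.combine(BASE_DAY, datetime.time(0, 0, 1)),     # directly after midnight
    datetime.datetime.combine(BASE_DAY, datetime.time(23, 59, 59)),  # directly before midnight
    datetime.datetime.combine(BASE_DAY, datetime.time(0, 0)),        # exactly midnight
]

def test_yesterday_at_all_base_times():
    event = datetime.datetime.combine(BASE_DAY - datetime.timedelta(1),
                                      datetime.time(12, 0))
    for now in BASE_TIMES:
        assert human_friendly(event, now) == "Yesterday"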

Then you would systematically add edge cases to all the other classes. You have the test for today, so add a time just before and just after the behavior should switch. The same for yesterday, the same for one week ago, and so on.

Chances are that by enumerating all edge cases in a systematic manner and writing down test cases for them, you will find out that your specification is lacking some detail, and add it. Note that handling dates is something people often get wrong, because they forget to write their tests so that they can be run with different times.

Note, however, that most of what I have written has little to do with TDD. It's about writing down equivalence classes and making sure your own specifications are detailed enough about them. That is the process by which you minimize logical errors. TDD just makes sure your code conforms to your mental model.

Coming up with test cases is hard. Equivalence-class based testing is not the end of it all, and in some cases it can significantly increase the number of test cases. In the real world, adding all those tests is often not economically viable (even though in theory, it should be done).

The only way I can think of is to add lots of asserts for the cases that I believe would never happen (like I believed that a day ago is necessarily yesterday), and then to loop through every second of the past ten years, checking for any assertion violation, which seems too complex.

Why not? This sounds like a pretty good idea!

Adding contracts (assertions) to code is a pretty solid way of improving its correctness. Generally we add them as preconditions on function entry and postconditions on function return. For example, we could add a postcondition that all returned values are either of form "A [unit] ago" or "[number] [unit]s ago". When done in a disciplined way, this leads to design by contract, and is one of the most common ways of writing high-assurance code.
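As a minimal sketch of that postcondition (the decorator and regular expression here are illustrative, not taken from the question's code):

import functools
import re

# Postcondition: every result must match one of the expected shapes.
VALID_SHAPE = re.compile(r"^(Today|Yesterday|A \w+ ago|\w+ \w+s ago)$")

def ensure_valid_output(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        assert VALID_SHAPE.match(result), "postcondition violated: %r" % result
        return result
    return wrapper

@ensure_valid_output
def human_friendly(event_date, now):
    ...  # the implementation from the question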

Critically, the contracts aren't intended to be tested; they are just as much specifications of your code as your tests are. However, you can test via the contracts: call the code in your test and, if none of the contracts raise errors, the test passes. Looping through every second of the past ten years is a bit much. But we can leverage another testing style called property-based testing.

In PBT, instead of testing for specific outputs of the code, you test that the output obeys some property. For example, one property of a reverse() function is that for any list l, reverse(reverse(l)) = l. The upside of writing tests like this is you can have the PBT engine generate a few hundred arbitrary lists (and a few pathological ones) and check they all have this property. If any don't, the engine "shrinks" the failing case to find a minimal list that breaks your code. It looks like you're writing Python, which has Hypothesis as the main PBT framework.
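Here is a sketch of both: the reverse() property above, plus a property for the date code, assuming Hypothesis is installed and the code under test takes its reference time as a parameter:

import datetime
from hypothesis import given, strategies as st

# The reverse() property from the text.
@given(st.lists(st.integers()))
def test_reverse_roundtrip(l):
    assert list(reversed(list(reversed(l)))) == l

NOW = datetime.datetime(2017, 6, 15, 1, 0)

# Combine PBT with the contract: any past datetime must yield some
# result (a contract violation raises and fails the test).
@given(st.datetimes(max_value=NOW))
def test_human_friendly_is_well_formed(event_date):
    assert human_friendly(event_date, NOW)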

So, if you want a good way to find more tricky edge cases you might not think of, using contracts and property-based testing together will help a lot. This doesn't replace writing unit tests, of course, but it does augment it, which is really the best we can do as engineers.

This is an example where adding a bit of modularity would have been useful. If an error-prone code segment is used multiple times, it's good practice to wrap it in a function if possible.

def time_ago(delta, unit):
    delta_str = _number_to_text(delta) + " " + unit
    if delta == 1:
        return delta_str + " ago"
    else:
        return delta_str + "s ago"

now = datetime.datetime.utcnow()
today = now.date()
if event_date.date() == today:
    return "Today"

yesterday = today - datetime.timedelta(1)
if event_date.date() == yesterday:
    return "Yesterday"

delta = (now - event_date).days

if delta < 7:
    return time_ago(delta, "day")

if delta < 30:
    weeks = math.floor(delta / 7)
    return time_ago(weeks, "week")

if delta < 365:
    months = math.floor(delta / 30)
    return time_ago(months, "month")

Test driven development didn't help.

TDD works best as a technique if the person writing the tests is adversarial. This is difficult if you are not pair-programming, so another way to think about this is:

  • Don't write tests to confirm the function under test works as you made it. Write tests that deliberately break it.

This is a different art, one that applies to writing correct code with or without TDD, and perhaps as complex as (if not more so than) actually writing code. It's something you need to practice, and something for which there is no single, easy answer.

The core technique to writing robust software, is also the core technique to understanding how to write effective tests:

Understand the preconditions of a function: the valid states (i.e., what assumptions you are making about the state of the class the function is a method of) and the valid input parameter ranges. Each data type has a range of possible values, only a subset of which will be handled by your function.

If you do nothing more than explicitly test these assumptions on function entry, ensuring that a violation is logged or thrown and that the function errors out with no further handling, you can quickly learn when your software is failing in production, make it robust and error-tolerant, and develop your adversarial test-writing skills.
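A sketch of such entry checks, assuming an (event_date, now) signature:

import datetime

def human_friendly(event_date, now):
    # Explicit precondition checks: fail loudly rather than return nonsense.
    if not isinstance(event_date, datetime.datetime):
        raise TypeError("event_date must be a datetime, got %r" % type(event_date))
    if event_date > now:
        raise ValueError("event_date is in the future: %r" % event_date)
    ...  # rest of the implementation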


NB. There is a whole literature on pre- and postconditions, invariants, and so on, along with libraries that can apply them using attributes. Personally I am not a fan of going so formal, but it's worth looking into.

This is one of the most important facts about software development: It is absolutely, utterly impossible to write bug-free code.

TDD won't save you from introducing bugs corresponding to test cases you didn't think of. It also won't save you from writing an incorrect test without realizing it, then writing incorrect code that happens to pass the buggy test. And every other software development technique ever created has similar holes. As developers, we are imperfect humans. At the end of the day, there is no way to write 100% bug-free code. It never has and never will happen.

This isn't to say that you should give up hope. While it's impossible to write completely perfect code, it's very possible to write code that has so few bugs that appear in such rare edge cases that the software is extremely practical to use. Software that does not exhibit buggy behavior in practice is very much possible to write.

But writing it requires us to embrace the fact that we will produce buggy software. Almost every modern software development practice is at some level built around either preventing bugs from appearing in the first place or protecting ourselves from the consequences of the bugs we inevitably produce:

  • Gathering thorough requirements allows us to know what incorrect behavior looks like in our code.
  • Writing clean, carefully-architected code makes it easier to avoid introducing bugs in the first place and easier to fix them when we identify them.
  • Writing tests allows us to produce a record of what we believe many of the worst possible bugs in our software would be and prove that we avoid at least those bugs. TDD produces those tests before the code, BDD derives those tests from the requirements, and old-fashioned unit testing produces tests after the code is written, but they all prevent the worst regressions in the future.
  • Peer reviews mean that every time code is changed, at least two pairs of eyes have seen the code, decreasing how frequently bugs slip into master.
  • Using a bug tracker or a user story tracker that treats bugs as user stories means that when bugs appear, they're kept track of and ultimately dealt with, not forgotten about and left to consistently get in users' ways.
  • Using a staging server means that before a major release, any show-stopper bugs have a chance to appear and be dealt with.
  • Using version control means that in the worst-case scenario, where code with major bugs is shipped to customers, you can perform an emergency rollback and get a reliable product back into your customers' hands while you sort things out.

The ultimate solution to the problem you've identified is not to fight the fact that you can't guarantee you'll write bug-free code, but rather to embrace it. Embrace industry best practices in all areas of your development process, and you will consistently deliver code to your users that, while not quite perfect, is more than robust enough for the job.

You simply had not thought of this case before and therefore didn't have a test case for it.

This happens all the time and is just normal. It's always a trade-off how much effort you put into creating test cases; you could spend infinite time considering all possible ones.

For an aircraft autopilot you would spend much more time than for a simple tool.

It often helps to think about the valid ranges of your input variables and test these boundaries.

In addition, if the tester is a different person than the developer, often more significant cases are found.

(and believing it had to do with time zones, despite the uniform use of UTC in the code)

That's another logical mistake in your code for which you don't have a unit test yet :) - your method will return incorrect results for users in non-UTC time zones. You need to convert both "now" and the event's date to the user's local time zone before calculating.

Example: In Australia, an event happens at 9 a.m. local time. At 11 a.m. it will be displayed as "yesterday" because the UTC date has changed.
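A sketch of the conversion with the standard-library zoneinfo module (available since Python 3.9):

import datetime
from zoneinfo import ZoneInfo

user_tz = ZoneInfo("Australia/Sydney")

# The event at 9 a.m. local time; "now" is 11 a.m. the same local day.
event_local = datetime.datetime(2017, 6, 15, 9, 0, tzinfo=user_tz)
now_local = datetime.datetime(2017, 6, 15, 11, 0, tzinfo=user_tz)

event_utc = event_local.astimezone(datetime.timezone.utc)  # Jun 14, 23:00 UTC
now_utc = now_local.astimezone(datetime.timezone.utc)      # Jun 15, 01:00 UTC

print(event_utc.date() == now_utc.date())      # False: UTC says "Yesterday"
print(event_local.date() == now_local.date())  # True: the user expects "Today"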

  • Let somebody else write the tests. This way somebody unfamiliar with your implementation might check for rare situations that you haven't thought of.

  • If possible, inject test cases as collections. This makes adding another test as easy as adding another line like yield return new TestCase(...). This can go in the direction of exploratory testing, automating the creation of test cases: "Let's see what the code returns for all the seconds of one week ago." A sketch of such a generated sweep follows this list.
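In Python, that generated sweep might look like the following sketch, assuming a human_friendly(event_date, now) variant and the question's _number_to_text; collecting the distinct outputs makes an odd string such as "One days ago" easy to spot:

import datetime

NOW = datetime.datetime(2017, 6, 15, 1, 0)

def distinct_outputs():
    # Collect every distinct string produced over a year of hourly offsets;
    # with the buggy code from the question, "One days ago" shows up in the set.
    seen = set()
    for hours_back in range(1, 365 * 24):
        seen.add(human_friendly(NOW - datetime.timedelta(hours=hours_back), NOW))
    return sorted(seen)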

You appear to be under the misconception that if all of your tests pass, you have no bugs. In reality, if all of your tests pass, all the known behaviour is correct. You still don't know if the unknown behaviour is correct or not.

Hopefully, you are using code coverage with your TDD. Add a new test for the unexpected behaviour. Then you can run just the test for the unexpected behaviour to see what path it actually takes through the code. Once you know the current behaviour, you can make a change to correct it, and when all the tests pass again, you'll know you've done it properly.

This still doesn't mean that your code is bug free, just that it is better than before, and once again all the known behaviour is correct!

Using TDD correctly doesn't mean you will write bug-free code; it means you will write fewer bugs. You say:

The requirements were relatively clear

Does this mean that the more-than-one-day-but-not-yesterday behaviour was specified in the requirements? If you missed a written requirement, it's your fault. If you realised the requirements were incomplete as you were coding, good for you! If everybody who worked on the requirements missed that case, you're no worse than the others. Everyone makes mistakes, and the more subtle they are, the easier they are to miss. The big takeaway here is that TDD does not prevent all errors!

It's very easy to commit a logical mistake even in such simple code.

Yes. Test driven development does not change that. You can still create bugs in the actual code, and also in the test code.

Test driven development didn't help.

Oh, but it did! First of all, when you noticed the bug you already had the complete test framework in place, and just had to add the missing test (and fix the actual code). Secondly, you don't know how many more bugs you would have had if you had not done TDD from the beginning.

Also worrisome is that I can't see how such bugs could be avoided.

You can't. Not even NASA has found a way to avoid bugs; we lesser humans certainly won't, either.

Aside from thinking more before writing code,

That is a fallacy. One of the greatest benefits of TDD is that you can code with less thinking, because all those tests at least catch regressions pretty well. Also, even (or especially) with TDD, you are not expected to deliver bug-free code on the first try (or your development speed will simply grind to a halt).

the only way I can think of is to add lots of asserts for the cases that I believe would never happen (like I believed that a day ago is necessarily yesterday), and then to loop through every second of the past ten years, checking for any assertion violation, which seems too complex.

This would clearly conflict with the tenet of only coding what you actually need right now. You thought you did not need that case, so you did not code it. And it was a non-critical piece of code; as you said, there was no damage except your wondering about it for 30 minutes.

For mission-critical code, you actually could do what you said, but not for your everyday standard code.

How could I avoid creating this bug in the first place?

You don't. You trust your tests to find most regressions; you keep to the red-green-refactor cycle, writing tests before or during actual coding, and (important!) you implement the minimum amount necessary to flip from red to green (not more, not less). This ends up giving you great test coverage, at least of the positive cases.

When, not if, you find a bug, you write a test to reproduce that bug, and fix the bug with the least amount of work to make said test go from red to green.
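For the bug from the question, such a regression test might look like this sketch (again assuming the reference time can be passed in):

import datetime

def test_day_ago_but_not_yesterday():
    # Regression test: at 1 a.m., an event 26 hours earlier is one integer
    # day ago but not on yesterday's calendar date.
    now = datetime.datetime(2017, 1, 2, 1, 0)
    event = now - datetime.timedelta(hours=26)
    assert human_friendly(event, now) == "A day ago"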

You just discovered that no matter how hard you try, you'll never be able to catch all possible bugs in your code.

So what this means is that even attempting to catch all bugs is an exercise in futility, and you should use techniques such as TDD only as a way of writing better code, code that has fewer bugs, not zero bugs.

That in turn means you should spend less time using these techniques, and spend that saved time working on alternative ways to find the bugs that slip through the development net.

Alternatives include integration testing, a dedicated test team doing system testing, and logging (and analysing those logs).

If you cannot catch all bugs, then you must have a strategy in place for mitigating the effects of the bugs that slip past you. If you have to do this anyway, then putting more effort into this makes more sense than trying (in vain) to stop them in the first place.

After all, it's pointless to spend a fortune in time writing tests if, on the first day you give your product to a customer, it falls over, particularly if you then have no clue how to find and resolve that bug. Post-mortem and post-delivery bug resolution is so important that it needs more attention than most people spend on writing unit tests. Save the unit testing for the complicated bits and don't try for perfection up front.

Licensed under: CC-BY-SA with attribution