Question

I am dealing with a pretty big codebase and I was given a few months to refactor existing code. The refactoring is needed because soon we will need to add many new features to our product, and as of now we are no longer able to add any feature without breaking something else. In short: the messy, huge, buggy code that many of us have seen in our careers.

During refactoring, from time to time I encounter the class, method or lines of code which have comments like

Time out set to give Module A some time to do stuff. If its not timed like this, it will break.

or

Do not change this. Trust me, you will break things.

or

I know using setTimeout is not a good practice, but in this case I had to use it

My question is: should I refactor the code when I encounter such warnings from the authors (no, I can't get in touch with authors)?

Solution

It seems you are refactoring "just in case", without knowing in detail which parts of the codebase will change when the new feature development takes place. Otherwise, you would know whether there is a real need to refactor the brittle modules, or whether you can leave them as they are.

To say this straight: I think this is a doomed refactoring strategy. You are investing your company's time and money in something where no one knows whether it will really return a benefit, and you are on the verge of making things worse by introducing bugs into working code.

Here is a better strategy: use your time to

  • add automated tests (probably not unit tests, but integration tests) to the modules at risk. The brittle modules you mentioned, especially, will need a full test suite before you change anything in them.

  • refactor only the bits you need to bring the tests in place. Try to minimize any of the necessary changes. The only exception is when your tests reveal bugs - then fix them immediately (and refactor to the degree you need to do so safely).

  • teach your colleagues the "boy scout principle" (AKA "opportunistic refactoring"), so when the team starts to add new features (or to fix bugs), they should improve and refactor exactly the parts of the code base they need to change, not less, not more.

  • get a copy of Feathers' book "Working Effectively with Legacy Code" for the team.

So when the time comes that you know for sure you need to change and refactor the brittle modules (either because of the new feature development, or because the tests you added in step 1 reveal bugs), you and your team will be ready to refactor them, and to more or less safely ignore those warning comments.

As a reply to some comments: to be fair, if one suspects a module in an existing product to be the cause of problems on a regular basis, especially a module which is marked as "don't touch", I agree with all of you. It should be reviewed, debugged and most probably refactored in that process (with support of the tests I mentioned, and not necessarily in that order). Bugs are a strong justification for change, often a stronger one than new features. However, this is a case-by-case decision. One has to check very carefully if it is really worth the hassle to change something in a module which was marked as "don't touch".

OTHER TIPS

Yes, you should refactor the code before you add the other features.

The trouble with comments like these is that they depend on particular circumstances of the environment in which the code base is running. The timeout programmed at a specific point may have been truly necessary when it was written.

But there are any number of things that might change this equation: hardware changes, OS changes, unrelated changes in another system component, changes in typical data volumes, you name it. There is no guarantee that in the present day it is still necessary, or that it is still sufficient (the integrity of the thing it was supposed to protect might have been broken for a long time - without proper regression tests you might never notice). This is why programming a fixed delay in order to allow another component to terminate is almost always incorrect and works only accidentally.
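To illustrate the alternative to a fixed delay, a component can signal its own readiness instead of having the caller guess a duration. This is a minimal sketch; `moduleA` and the 50 ms figure are hypothetical stand-ins:

```javascript
// Hypothetical sketch: replace a fixed delay with an explicit readiness
// signal, so the caller no longer depends on how long startup takes.
const moduleA = {
  ready: false,
  // init() resolves only once the module is actually ready.
  init() {
    return new Promise(resolve => {
      setTimeout(() => { this.ready = true; resolve(); }, 50);
    });
  }
};

// Fragile original pattern (guesses a duration):
//   moduleA.init(); setTimeout(doWork, 500);
// Robust replacement (waits for the signal itself):
async function start() {
  await moduleA.init();
  return moduleA.ready; // true regardless of how long init() took
}
```

The point is not the promise API itself but the inversion of responsibility: the component reports completion, so hardware or load changes cannot silently invalidate a hard-coded number.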

In your case, you don't know the original circumstances, and neither can you ask the original authors. (Presumably you also don't have proper regression/integration testing, or you could simply go ahead and let your tests tell you whether you broke something.)

This might look like an argument for not changing anything out of caution; but you say there will have to be major changes anyway, so the danger of upsetting the delicate balance that used to be achieved in this spot is already there. It is much better to upset the apple cart now, when the only thing you're doing is the refactoring, and be certain that if things break it was the refactoring that caused it, than to wait until you are making additional changes simultaneously and never be sure.

My question is: should I refactor the code when I encounter such warnings from the authors

No, or at least not yet. You imply that the level of automated testing is very low. You need tests before you can refactor with confidence.

for now we are no longer able to add any feature without breaking something else

Right now you need to focus on increasing stability, not refactoring. You might do refactoring as part of stability increase, but it is a tool to achieve your real goal - a stable codebase.

It sounds like this has become a legacy codebase, so you need to treat it a little differently.

Start by adding characterization tests. Don't worry about any spec, just add a test that asserts on the current behavior. This will help prevent new work from breaking existing features.

Every time you fix a bug, add a test case that proves the bug is fixed. This prevents direct regressions.

When you add a new feature, add at least some basic testing that the new feature works as intended.

Perhaps get a copy of "Working Effectively with Legacy Code"?

I was given a few months to refactor existing code.

Start by getting test coverage up. Start with areas that break the most. Start with areas that change the most. Then once you have identified bad designs, replace them one by one.

You almost never want to do one huge refactor, but instead a constant flow of small refactors every week. One big refactor has a huge risk of breaking things, and it is hard to test well.

Remember G. K. Chesterton's fence: do not take down a fence obstructing a road until you understand why it was built.

You can find the author(s) of the code and the comments in question, and consult them to obtain the understanding. You can look at the commit messages, email threads, or docs if they exist. Then you'll either be able to refactor the marked fragment, or will write down your knowledge in the comments so that the next person to maintain this code could make a more informed decision.

Your aim is to reach an understanding of what the code does, why it was marked with a warning back then, and what would happen if the warning were ignored.

Until then, I would not touch code which is marked "do not touch".

Code with comments like the ones you showed would be at the top of my list of things to refactor, if and only if I have a reason to do so. The point is that the code stinks so badly you can smell it even through the comments. It's pointless to try to shoehorn any new functionality into this code; such code needs to die as soon as it needs to change.

Also note that the comments are not helpful in the least: they give only a warning and no reason. Yes, they are warning signs around a tank of sharks. But if you want to do anything in the vicinity, there is little point in trying to swim with the sharks. IMHO, you should get rid of those sharks first.

That said, you absolutely need good test cases before you can dare to work on such code. Once you have those test cases, be sure to understand every tiny little bit you are changing, to ensure that you are really not changing behavior. Make it your top priority to keep all the behavioral peculiarities of the code until you can prove that they are without any effect. Remember, you are dealing with sharks - you must be very careful.

So, in the case of the timeout: leave it in until you understand precisely what the code is waiting for, and then fix the reason the timeout was introduced before you proceed to remove it.

Also, be sure that your boss understands what an endeavor it is that you are embarking on, and why it is needed. And if they say no, don't do it. You absolutely need their support in this.

Probably not a good idea to refactor code with such warnings, if things work OK as they are.

But if you must refactor...

First, write some unit tests and integration tests to test the conditions that the warnings are warning you about. Try to mimic production-like conditions as much as possible. This means mirroring the production database to a test server, running the same other services on the machine, etc... whatever you can do. Then attempt your refactoring (on an isolated branch, of course). Then run your tests on your new code to see if you can get the same results as with the original. If you can, then it is probably OK to implement your refactoring in production. But be ready to roll back the changes if things go wrong.

I say go ahead and change it. Believe it or not, not every coder is a genius and a warning like this means it could be a spot for improvement. If you find that the author was right, you could (gasp) DOCUMENT or EXPLAIN the reason for the warning.

I am expanding my comments into an answer because I think some aspects of the specific problem are either being overlooked, or used to draw the wrong conclusions.

At this point, the question of whether to refactor is premature (even though it will probably be answered by a specific form of 'yes').

The central issue here is that (as noted in some answers) the comments you quote strongly indicate that the code has race conditions or other concurrency/synchronization problems. These are particularly difficult problems, for several reasons.

Firstly, as you have found, seemingly unrelated changes can trigger the problem (other bugs can also have this effect, but concurrency errors almost always do). Secondly, they are very hard to diagnose: the bug often manifests itself in a place that is distant in time or code from the cause, and anything you do to diagnose it may cause it to go away (Heisenbugs). Thirdly, concurrency bugs are very hard to find in testing. That is partly because of the combinatorial explosion: it is bad enough for sequential code, but adding the possible interleavings of concurrent execution blows it up to the point where the sequential problem becomes insignificant in comparison. Moreover, even a good test case may trigger the problem only occasionally - Nancy Leveson calculated that one of the lethal bugs in the Therac-25 occurred in about 1 of 350 runs, but if you do not know what the bug is, or even that there is one, you do not know how many repetitions make an effective test. Finally, only automated testing is feasible at this scale, and it is possible that the test driver imposes subtle timing constraints such that it will never actually trigger the bug (Heisenbugs again).

There are some tools for concurrency testing in some environments, such as Helgrind for code using POSIX pthreads, but we do not know the specifics here. Testing should be supplemented with static analysis (or is that the other way round?), if there are suitable tools for your environment.

To add to the difficulty, compilers (and even processors, at runtime) are often free to reorganize the code in ways that sometimes make reasoning about its thread-safety very counter-intuitive (perhaps the best-known case is the double-checked locking idiom, though some environments (Java, C++...) have been modified to ameliorate it).

This code may have one simple problem that is causing all the symptoms, but it is more likely that you have a systemic problem that could bring your plans to add new features to a halt. I hope I have convinced you that you might have a serious problem on your hands, possibly even an existential threat to your product, and the first thing to do is to find out what is happening. If this does reveal concurrency problems, I strongly advise you to fix them first, before you even ask the question of whether you should do more general refactoring, and before you try to add more features.

I am dealing with a pretty big codebase and I was given a few months to refactor existing code. The refactor process is needed because soon we will need to add many new features to our product [...]

During refactoring, from time to time I encounter the class, method or lines of code which have comments like ["don't touch this!"]

Yes, you should refactor those parts especially. Those warnings were placed there by the previous authors to mean "don't idly tamper with this stuff, it's very intricate and there are a lot of unexpected interactions". As your company plans to further develop the software and add a lot of features, they've specifically tasked you with cleaning up this stuff. So you're not idly tampering with it, you're deliberately charged with the task of cleaning it up.

Find out what those tricky modules are doing and break it down into smaller problems (what the original authors should have done). To get maintainable code out of a mess, the good parts need to be refactored and the bad parts need to be rewritten.

This question is another variation on the debate of when/how to refactor and/or clean up code, with a dash of "how to inherit code". We all have different experiences and work in different organizations with different teams and cultures, so there is no right or wrong answer except "do what you think you need to do, and do it in a way that doesn't get you fired".

I don't think there are many organizations that would take kindly to having a business process fall on its side because the supporting application needed code cleanup or refactoring.

In this specific case, the code comments are raising the flag that changing these sections of code should not be done. So if you proceed, and the business does fall on its side, not only do you not have anything to show to support your actions, there is actually an artifact that works against your decision.

So as is always the case, you should proceed carefully and make changes only after you understand every aspect of what you are about to change, and find ways to test the heck out of it paying very close attention to capacity and performance and timing because of the comments in the code.

But even still, your management needs to understand the risk inherent in what you are doing and agree that what you are doing has a business value that outweighs the risk and that you have done what can be done to mitigate that risk.

Now, let us all get back to our own TODO's and the things we know can be improved in our own code creations if only there were more time.

Yes, absolutely. These are clear indications that the person who wrote this code was unsatisfied with the code and likely poked at it until they got it to happen to work. It's possible they didn't understand what the real issues were or, worse, did understand them and were too lazy to fix them.

However, it's a warning that a lot of effort will be needed to fix them and that such fixes will have risks associated with them.

Ideally, you'll be able to figure out what the issue was and fix it properly. For example:

Time out set to give Module A some time to do stuff. If its not timed like this, it will break.

This strongly suggests that Module A doesn't properly indicate when it's ready to be used or when it has finished processing. Perhaps the person who wrote this didn't want to bother fixing Module A or couldn't for some reason. This looks like a disaster waiting to happen because it suggests a timing dependency being dealt with by luck rather than proper sequencing. If I saw this, I would very much want to fix it.

Do not change this. Trust me, you will break things.

This doesn't tell you much. It would depend on what the code was doing. It could mean that it has what appear to be obvious optimizations that, for one reason or another, will actually break the code. For example, a loop might happen to leave a variable at a particular value which another piece of code depends on. Or a variable might be tested in another thread and changing the order of variable updates might break that other code.
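A contrived sketch of the kind of hidden coupling that can make an "obvious" cleanup break distant code (all names here are invented for illustration):

```javascript
// The original author left `i` in the outer scope — perhaps by accident.
let i;

function contains(arr, target) {
  for (i = 0; i < arr.length; i++) {
    if (arr[i] === target) break;
  }
  return i < arr.length;
}

// Elsewhere, other code silently depends on the loop's leftover value:
function lastSearchPosition() {
  return i; // making `i` local to contains() — the "obvious" fix — breaks this
}
```

Here `contains([10, 20, 30], 20)` returns `true` and leaves `i` at 1, so `lastSearchPosition()` reports where the match was found; a refactoring that scopes `i` correctly changes behavior far away from the line it touches.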

I know using setTimeout is not a good practice, but in this case I had to use it.

This looks like an easy one. You should be able to see what setTimeout is doing and perhaps find a better way to do it.

That said, if these kinds of fixes are outside the scope of your refactor, these are indications that trying to refactor inside this code may significantly increase the scope of your effort.

At a minimum, look closely at the code affected and see if you can at least improve the comment to the point that it more clearly explains what the issue is. That might save the next person from facing the same mystery you face.

The author of the comments most likely did not fully understand the code themselves. If they had actually known what they were doing, they would have written actually helpful comments (or not introduced the race conditions in the first place). A comment like "Trust me, you will break things." indicates to me that the author tried to change something, which caused unexpected errors they didn't fully understand.

The code is probably developed through guessing and trial-and-error without a full understanding of what actually happens.

This means:

  1. It is risky to change the code. It will take time to understand, obviously, and it probably does not follow good design principles and may have obscure effects and dependencies. It is most likely insufficiently tested, and if nobody fully understands what the code does, then it will be difficult to write tests that ensure no errors or changes in behavior are introduced. Race conditions (like the ones the comments allude to) are especially onerous - this is one place where unit tests will not save you.

  2. It is risky not to change the code. The code has a high likelihood of containing obscure bugs and race conditions. If a critical bug is discovered in the code, or a high-priority business requirement forces you to change this code on short notice, you are in deep trouble. Now you have all the problems outlined in 1, but under time pressure! Furthermore, such "dark areas" in code have a tendency to spread and infect the other parts of the code they touch.

A further complication: unit tests will not save you. Usually the recommended approach to fixing such legacy code is to add unit or integration tests first, and then isolate and refactor. But race conditions cannot be reliably captured by automated testing. The only solution is to sit down and think through the code until you understand it, and then rewrite it to avoid the race conditions.
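The lost-update pattern behind such bugs can be sketched even in single-threaded JavaScript, where async interleaving has the same effect (a hypothetical account balance; in real code the interleaving point varies between runs, which is exactly why a test that passes today proves little):

```javascript
let balance = 100;

async function withdraw(amount) {
  const current = balance;    // read
  await Promise.resolve();    // yield — another withdraw() can run here
  balance = current - amount; // write based on a possibly stale read
}

async function demo() {
  // Both withdrawals read balance = 100 before either writes it back.
  await Promise.all([withdraw(60), withdraw(60)]);
  return balance; // 40, not -20: one withdrawal was silently lost
}
```

A test asserting that `withdraw(60)` works in isolation passes forever; only reasoning about what can happen between the read and the write exposes the bug.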

This means the task is much more demanding than routine refactoring. You will have to schedule it as a real development task.

You may be able to encapsulate the affected code as part of regular refactoring, so at least the dangerous code is isolated.

The question I would ask is why someone wrote "DO NOT EDIT" in the first place.

I have worked on a lot of code and some of it is ugly and took a lot of time and effort to get working, within the given constraints at the time.

I bet that is what happened in this case: the person who wrote the comment found that someone changed it, and then had to repeat the whole exercise and put it back the way it was in order to get the thing to work. After this, out of sheer frustration, they wrote... DO NOT EDIT.

In other words, I don't want to have to fix this again as I have better things to do with my life.

Saying DO NOT EDIT is a way of saying that we know everything we are ever going to know right now, and therefore we will never learn anything new in the future.

There should at least be a comment about the reasons for not editing. "Do Not Touch" or "Do Not Enter" alone is not enough. Why not touch the fence? Why not enter? Wouldn't it be better to say "Electric Fence, Do Not Touch!" or "Land Mines! Do Not Enter!"? Then it is obvious why; you still can enter, but at least you know the consequences before you do.

I also bet the system has no tests around this magic code, and therefore no one can confirm that it will work correctly after any change. Placing characterisation tests around the problem code is always a first step. See "Working Effectively with Legacy Code" by Michael Feathers for tips on how to break dependencies and get the code under test before tackling any changes to it.

I think, in the end, it is short-sighted to place restrictions on refactoring instead of letting the product evolve in a natural and organic way.

Licensed under: CC-BY-SA with attribution