Question

Given some legacy module in a ten year old project, where you have seen none of the code and not used the application, how long will it take you to grok 10000 lines of code from that module?

The permutations of possible setbacks and/or further-complicating conditions are endless, I know. Treat the above problem as a fun thought exercise.

How should you deal with this type of situation if you are the project manager?

And as the coder, how much time is it reasonable to spend preening the code?


This question is not about finding a formula where the input is lines-of-code times type-of-problem and the result is hours spent doing whatever.

This question is about exploring feasible strategies for predicting how much time is worth putting into refactoring a code base and for setting limits to such efforts.


This may seem theoretical, but from my perspective, it is well worth it to rewrite deprecated garbage and to put 10-15% of the allotted time into purely cosmetic changes (in practice, demand 10-15% extra for any task). The project lead, on the other hand, could not care less about how ugly the stuff is, but said person may soon retire...


Solution

I'll express a possibly controversial opinion here.

And as the coder, how much time is it reasonable to spend preening the code?

Almost none, if you ask me! I've seen companies hire whole new teams of developers to maintain legacy codebases in ways that went well beyond treating the legacy codebase as a black box -- swimming through its implementation details... and after a decade the new team still wasn't anywhere near as fluent as the original authors of the code. Meanwhile, as they preened and poked and prodded at the implementation details of the legacy code, the software became increasingly buggy in the existing, old features rather than in the new ones, because the developers were modifying old code they couldn't possibly comprehend as well as the original authors did.

This is for non-trivial code, of course, and generally code which wasn't exactly engineered against any reasonable standards.

In my opinion, such legacy code should be treated as an opaque library: to be called, not modified. Old code, provided it isn't working its way towards complete obsolescence, should increasingly become a black box -- a stable package meant to be used, not changed. Interfaces should be identified and extracted, tests for correctness should be created where possible, and implementations should not be touched, let alone fully understood, unless they absolutely require changes (e.g., they're buggy).

At that point, if a section is buggy and frequently needs updates to its implementation details, I'd consider replacing that unreliable section of legacy code with something new (same interface, new implementation) as potentially faster than hoping some new developer will be able to comprehend old code which the original author didn't even write correctly and test properly. Trying to comprehend code which doesn't even work properly, written by someone else ages ago, is often a fruitless endeavor. It's like trying to figure out how a combustion engine works by reverse engineering a broken one whose design was always prone to cause the engine to randomly explode.
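The "opaque library" approach can be sketched roughly as follows. Every name here (`InvoiceCalculator`, `compute_invoice`, the legacy delegate) is a hypothetical stand-in, not from any real codebase: extract an interface, wrap the old code behind it unchanged, and let a fresh implementation slot in later if that section proves unreliable.

```python
from abc import ABC, abstractmethod

# Hypothetical example: the interface is extracted from the legacy
# module; the legacy implementation is wrapped, never modified.

class InvoiceCalculator(ABC):
    @abstractmethod
    def compute_invoice(self, items: list[tuple[str, float]]) -> float:
        """Return the total price for (name, price) line items."""

class LegacyInvoiceCalculator(InvoiceCalculator):
    """Thin adapter: delegates to the old code as a black box."""
    def compute_invoice(self, items: list[tuple[str, float]]) -> float:
        # In a real project this line would call into the legacy
        # module verbatim rather than re-implementing anything.
        return sum(price for _, price in items)

class NewInvoiceCalculator(InvoiceCalculator):
    """Drop-in replacement: same interface, fresh implementation."""
    def compute_invoice(self, items: list[tuple[str, float]]) -> float:
        return float(sum(price for _, price in items))

# Callers depend only on InvoiceCalculator, so swapping the unreliable
# section for a rewrite is a one-line change where objects are wired up.
def total(calc: InvoiceCalculator) -> float:
    return calc.compute_invoice([("widget", 2.50), ("gadget", 7.50)])

print(total(LegacyInvoiceCalculator()), total(NewInvoiceCalculator()))  # 10.0 10.0
```

Because both implementations satisfy the same interface, the correctness tests written against the interface apply to both, which is what makes the replacement safe.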

As for the ability to decipher code, I often see it as a function of the interface. A well-documented interface where each function causes no more than a single side effect will often yield an implementation that's proportionally easy to understand. Meanwhile a counter-intuitive interface which, itself, is difficult to understand and use will often have implementations that are proportionally difficult to understand. What something is supposed to do for all possible input cases is the very first thing to understand before understanding how it does it, and some interfaces are actually so complex with so many disparate side effects that it's difficult to even answer the "what" question without raising additional "what if?" questions for tricky edge cases. In that case, the implementation will typically be as hopeless to decipher as the interface.

Of course, sometimes a well-documented, clear interface can still hide a very complex algorithm in its implementation, but that source code will still tend to be easier to understand than code behind an unclear interface. The first thing I'd want to do before wading into someone else's implementation details is to study the interface required to interact with their code, and ideally construct a test to check my assumptions and to make sure the interface is properly fulfilling its documented requirements. If the interface is simple and I can easily comprehend it, often the code implementing it will be too -- and if not, because the interface is so clear and easy to understand, it'll take little time to come up with a new implementation which fulfills the identical requirements.
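One concrete way to "construct a test to check my assumptions" is a characterization test: pin down what the legacy interface actually does, including surprising edge cases, before relying on it or replacing it. `legacy_normalize` below is a made-up stand-in for whatever legacy function is under scrutiny:

```python
import unittest

def legacy_normalize(s: str) -> str:
    """Stand-in for a legacy function whose implementation we treat
    as a black box; we only test its observable behavior."""
    return s.strip().lower()

class CharacterizationTests(unittest.TestCase):
    def test_documented_behavior(self):
        # The documented contract: trim whitespace, lowercase.
        self.assertEqual(legacy_normalize("  Foo "), "foo")

    def test_recorded_edge_case(self):
        # Record what the code actually does for tricky input,
        # even where the documentation is silent.
        self.assertEqual(legacy_normalize(""), "")

if __name__ == "__main__":
    unittest.main()
```

If a replacement implementation is ever written, these same tests define the requirements it must fulfill.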

As for the ultimate question about the time required, I'd first do all of the things mentioned above. Familiarizing yourself with a legacy codebase demands some upfront time that is itself variable -- but not so variable that you can't use what you learn to produce the next, more informed estimate.

OTHER TIPS

You must distinguish two different questions here.

  1. How much time is necessary to understand and deal with the old code base? That is hard to predict and largely out of your control. You can only make estimates based on your own and others' previous experience. People get somewhat better at this the more experience they have with legacy code, but the estimate will always have a large margin of error to it.

  2. How much time is worthwhile to spend on this? This question has a rather clear-cut answer. You can only afford to spend time dealing with old code as long as it isn't more time than you would need to replace it with new code. To be sure, estimations for new software projects are also uncertain, but they are nowhere near as uncertain as estimations about large unknown old software. Therefore, you should establish your best estimates of the costs of rewriting vs. refurbishing and simply compare the costs.
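The rewrite-vs-refurbish comparison above can be reduced to a back-of-the-envelope calculation. The hours and uncertainty margins below are invented purely for illustration; the point is only that a wider margin of error can make the nominally cheaper option the riskier one:

```python
def pessimistic_cost(base_hours: float, uncertainty: float) -> float:
    """Base effort inflated by its margin of error (1.0 means +/-100%)."""
    return base_hours * (1 + uncertainty)

# Invented numbers: refurbishing looks cheaper on paper, but estimates
# about large unknown old software carry a much larger margin of error.
refurbish = pessimistic_cost(base_hours=400, uncertainty=1.0)  # old code
rewrite   = pessimistic_cost(base_hours=500, uncertainty=0.5)  # new code

print(f"refurbish: {refurbish:.0f}h, rewrite: {rewrite:.0f}h")
# → refurbish: 800h, rewrite: 750h
```

Under these made-up numbers the rewrite wins despite its larger base estimate, which is exactly the comparison the answer recommends making explicit.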

Let's compare this with the original author of the program. The original author (there may have been more than one, but let's assume one for the moment) likely spent hundreds if not thousands of hours producing a well-thought-out, working, 10,000-line program used by many.

Ideally, the purpose of perusing code is to learn from it and become more fluent in the inner workings of said program. What is the maximum you could ever hope to achieve by doing so? Well, the light at the end of the tunnel is being as fluent as the original author. My idea is that the hard part of writing code is first understanding the code you're writing; the actual writing is just a simple typing exercise, no more. So the time it takes you to become as fluent as the original author is the time it took the author to write it (minus the time it took to type it up, which is almost negligible).

So if the original author took 1000 hours to write it, you'll conceivably take 1000 hours to become as fluent in the program as that author. Obviously you won't want to dedicate that much time, and fortunately for you, the early hours of study teach you the most. This means that if you dedicate half of those 1000 hours, you'll know far more than 50% of the program -- nearly 99%. Similarly, if you dedicate 10% of that time (100 hours), you should know close to 95%.

Ultimately, how well you need to know the program determines how long you should dedicate to learning it, but unless the ultimate point is to become an expert, 1% is a good starting point. In the case of 1000 hours, this amounts to 10 hours -- a little over a day's work -- to become at least 50% fluent.

Assuming there are no pressing issues to resolve in the program and you have the time to dedicate, the more you spend learning how it works, the better off you'll be, of course -- but time is money, after all. It's my personal opinion that 1% is a reasonable compromise. If you're 50% fluent, you stand a 50% chance of knowing where to check, or where the problem lies, when a problem arises.

Of course, you can simply tackle every problem as a scientist studying a big black box, but I wouldn't recommend it if you want to be truly effective. As they say, "weeks of coding can save you hours of planning."

Licensed under: CC-BY-SA with attribution