Unravelling Assembly Language Spaghetti Code

https://stackoverflow.com/questions/983574

13-09-2019
|

Question

I've inherited a 10K-line program written in 8051 assembly language that requires some changes. Unfortunately it's written in the finest traditions of spaghetti code. The program--written as a single file--is a maze of CALL and LJMP statements (about 1200 total), with subroutines having multiple entry and/or exit points, if they can be identified as subroutines at all. All variables are global. There are comments; some are correct. There are no existing tests, and no budget for refactoring.

A little background on the application: The code controls a communications hub in a vending application that is currently deployed internationally. It handles two serial streams simultaneously (with the help of a separate communications processor) and can be talking to up to four different physical devices, each from a different vendor. The manufacturer of one of the devices recently made a change ("Yeah, we made a change, but the software's absolutely the same!") which causes some system configurations to no longer work, and is not interested in unchanging it (whatever it was they didn't change).

The program was originally written by another company, transferred to my client, then modified nine years ago by another consultant. Neither the original company, nor the consultant, are available as resources.

Based on analysis of the traffic on one of the serial buses, I've come up with a hack, which appears to work, but it's ugly and doesn't address the root cause. If I had a better understanding of the program, I believe I could address the actual problem. I have about one more week before the code's frozen to support an end-of-the month ship date.

Original question: I need to understand the program well enough to make the changes without breakage. Has anyone developed techniques for working with this sort of mess?

I see some great suggestions here, but am limited by time. However I may have another opportunity in the future to pursue some of the more involved courses of action.

Solution

First, I would try to get in touch with those people who originally developed the code or who at least maintained it before me, hopefully getting enough information to get a basic understanding of the code in general, so that you can start adding useful comments to it.

Maybe you can even get someone to describe the most important APIs (including their signature, return values and purpose) for the code. If global state is modified by a function, this should also be made explicit. Similarly, start to differentiate between functions and procedures, as well as input/output registers.

You should make it very clear to your employer that this information is required, if they don't believe you, have them actually sit down with you in front of this code while you describe what you are supposed to do and how you have to do it (reverse engineering). Having an employer with a background in computing and programming will actually be helpful in that case!

If your employer doesn't have such a technical background, ask him to bring another programmer/colleague to explain your steps to him, doing so will actually show him that you are serious and honest about it, because it's a real issue - not just from your point of view (make sure to have colleagues who know about this 'project').

If available and feasible, I would also make it very clear, that contracting (or at the very least contacting) former developers/maintainers (if they are no longer working for your company, that is) to help document this code would be a pre-requisite to realistically improve the code within a short time span and to ensure that it can be more easily maintained in the future.

Emphasize that this whole situation is due to shortcomings in the previous software development process and that these steps will help improve the code base. So, the code base in its current form is a growing problem and whatever is done now to handle this problem is an investment for the future.

This in itself is also important to help them assess and understand your situation: To do what you are supposed to do now is far from trivial, and they should know about it - if only to set their expectations straight (e.g. regarding deadlines and complexity of the task).

Also, personally I would start adding unit tests for those parts that I understand well enough, so that I can slowly start refactoring/rewriting some code.

In other words, good documentation and source code comments are one thing, but having a comprehensive test suite is another important thing, noone can be realistically expected to modify an unfamiliar code base without any established way of testing key functionality.

Given that the code is 10K, I would also look into factoring out subroutines into separate files to make components more identifiable, preferably using access wrappers instead of global variables and also intuitive file names.

Besides, I would look into steps to further improve the readability of the source code by decreasing the complexity, having sub routines with multiple entry points (and possibly even different parameter signatures?) looks like a sure way to obfuscate the code unnecessarily.

Similarly, huge sub routines could also be refactored into smaller ones to help improve readability.

So, one of the very first things, I'd look into doing would be to determine those things that make it really complicated to grok the code base and then rework those parts, for example by splitting huge sub routines with multiple entry points into distinct sub routines that call each other instead. If this cannot be done due to performance reasons or call overhead, use macros instead.

In addition, if it is a viable option, I would consider incrementally rewriting portions of the code using a more high level language, either by using a subset of C, or at least by making fairly excessive use of assembly macros to help standardize the code base, but also to help localize potential bugs.

If an incremental rewrite in C is a feasible option, one possible way to get started would be to turn all obvious functions into C functions whose bodies are -in the beginning- copied/pasted from the assembly file, so that you end up with C functions with lots of inline assembly.

Personally, I would also try running the code in a simulator/emulator to easily step through the code and hopefully start understanding the most important building blocks (while examining register and stack usage), a good 8051 simulator with a built-in debugger should be made available to you if you really have to do this largely on your own.

This would also help you come up with the initialization sequence and main loop structure as well as a callgraph.

Maybe, you can even find a good open source 80851 simulator that can be easily modified to also provide a full callgraph automatically, just doing a quick search, I found gsim51, but there are obviously several other options, various proprietary ones as well.

If I were in your situation, I would even consider outsourcing the effort of modifying my tools to simplify working with this source code, i.e. many sourceforge projects accept donations and maybe you can talk your employer into sponsoring such a modification.

If not financially, maybe by you providing corresponding patches to it?

If you are already using a proprietary product, you might even be able to talk with the manufacturer of this software and detail your requirements and ask them if they are willing to improve this product that way or if they can at least expose an interface to allow customers to make such customizations (some form of internal API or maybe even simple glue scripts).

If they are not responsive, indicate that your employer has been thinking of using a different product for some time now and that you were the only one insisting on that particular product to be used ... ;-)

If the software expects certain I/O hardware and peripherals, you may even want to look into writing a corresponding hardware simulation loop to run the software in an emulator.

Ultimately, I know for a fact that I would personally much more enjoy the process of customizing other software to help me understand such a spaghetti code monster, than manually stepping through the code and playing emulator myself, no matter how many gallons of coffee I can get.

Getting a usable callgraph out of an open source 8051 emulator should not take much longer than say a weekend (at most), because it mostly means to look for CALL opcodes and record their addresses (position and target), so that everything's dumped to a file for later inspection.

Having access to an emulator's internals would actually also be great a way to further inspect the code, for example in order to find recurring patterns of opcodes (say 20-50+), that may be factored into standalone functions/procedures, this might actually help decrease the size and complexity of the code base even further.

The next step would probably be to examine stack and register usage. And to determine the type/size of function parameters used, as well as their value range - so that you can conceive corresponding unit tests.

Using tools like dot/graphviz to visualize the structure of the initialization sequence and the main loop itself, will be a pure joy compared to doing all this stuff manually.

Also, you'll actually end up with useful data and documents that can serve as the foundation for better documentation in the long run.

OTHER TIPS

I'm afraid there is no magic bullet to this kind of problem. I find the only solution is to print out the ASM file then to go somewhere quiet and to simulate running the program line by line in your mind (while writing the contents of the registers and memory locations on a notepad). After a while you find this doesn't take as long as you would expect. Be prepared to spend many hours doing this and drink gallons of coffee. After a while you will have a understanding of what it is doing and you can consider changes.

Does the 8051 have any unused IO ports ? If it does and you can't work out when certain routines are being called then add code to send these spare ports high or low. Then when the program is running watch these ports with an oscilloscope.

Good luck

I know this sounds crazy....but I am unemployed (I picked wrong time to tell the marjority partner to go to hell) and have some free time. I'd be willing to take a look at it. I used to write assembly for the apple ][ and the original PC. If I could play around with your code on the simulator for a couple hours I could give you an idea if I have a chance of documenting it for you (without runing my unplanned vacation). Since I know nothing about 8051 this might not be possible for someone like me, but the simulator looked promising. I wouldn't want any money to do this. Its enough just to get exposure to 8051 embedded development. I told you this would sound crazy.

Find another job- seriously! Failing that the book "working effectively with legacy code" might help- though I think it is referring to legacy code as code without unit tests.

I have done this sort of thing a couple of times. Some recommendations:

Start by reviewing the schematic, this should help you understand what ports and pins your desired changes impact.
Use grep to find all calls, branches, jumps and returns. This can help understand the flow and identify the chunks of code.
Look at the reset vector and interrupt table to identify the main lines.
Use grep to create a cross reference for all code labels and data references (if your assembler tools cannot do this for you).

Keep in mind Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.

Good luck.

How well do you understand the hardware platform this code is running on?

Is it been put into power down mode (Pcon=2) to save power If so how is it been woken up. (a reset or on hardware interrupt)
Do you have to wait a for the oscillator to stables after a power up before doing serial communications
Is it been put into sleep mode (Pcon=1)

Are there different versions of the hardware out in the field?

Make sure you have all the different hardware variations to test on.

Do not waste your time with a simulator – it is very hard to work with and you have to make a lot of assumptions about the hardware. Get yourself an In Circuit Emulator (ICE) and run on the hardware.

The software was written in assembler for a reason you need to find out why. i.e. - memory constraints - speed constraints

There may be a reason that this code is a mess

Have a look at the link file for:

XDATA SPACE, IDATA SPACE and CODE SPACE:

If there is no free code space or Xdata or Idata?

The original author may have Optimizationed it to fit into the memory space available.

If that is the case you need to talk to the original developer to find out what he did.

You don't need a special budget for refactoring and testing -- they save you money and let you work faster -- get to it. It's the technique you should use to add changes to legacy, inherited code because it's the cheapest way to do it without "without breakage".

Most of the time, I think there's a trade-off where you get more quality in exchange for spending more time, but with legacy code that you are unfamiliar with, I think it's faster to make tests -- you have to run the code before you ship it, right?

This is one of the few times i'm going to recommend you put your soft skills to work, and present your PM/Manager/CXO with your reasoning behind a re-write, and the time/cost savings involved with such an undertaking

Cut it into pieces.

I had some very similar problem with a 8052 software. So the company inherited such a beast, code ROM full (64Kbytes), about 1,5 megs of assembly spaghetti modules plus two 3000 lines PL/M modules composed this coding monstrosity. The original developers of the software were long dead (this does not mean there were nobody, but indeed nobody who would understand it as a whole), the compilers compiling these were from the middle 80s running on an MDS-70 emulator, and several critical modules were at the limits of these compilers. Like add one more global symbol, and the linker would crash. Add one more symbol to an ASM file, and the compiler would crash.

So how one could start cutting this up?

First you will need tools. Notepad++ for example is a very nice thing since it can be used to cross search along several files at once, ideal to find which modules refer a global symbol. This is probably the most crucial element.

If possible, get any papers you can find on the software. The most immediate problem to solve with these beasts is to understand how they are roughly composed, what is their architecture. This is usually not included in the software itself, not even if it is otherwise properly commented.

To get the architecture yourself, first you may try to build a call graph. It is simpler to do than a data flow graph since usually there are less cross-file calls & jumps than global variables. For this call graphs only consider global symbols assuming the source files are supposed to be modules (which is not necessarily true, but usually they should be).

To do this, use your tool for cross file search, create a large list (for example in OpenOffice Calc) where you collect which symbol is defined in which file, and which files refer to this symbol calling it.

Then steal some large (!) sheets from the plotter, and start sketching. If you are very proficient in some graph software, you may use it, but unless it's so, it is more likely to hold you back. So sketch up a call graph showing which file has calls to which other files (not showing the symbols themselves, with 50 or so files, you wouldn't be able to manage it).

Most likely the result of this will be a spaghetti. The goal is to straighten this out to get it a hierarchical tree with a root (which will be the file containing the program entry point) without loops. You may devour several sheets during this process iteratively straightening the beast out. You may also find certain files are so much inter-tangled that they can not be represented without loops. This case it is most likely that a single "module" got somehow separated in two files, or more conceptual modules were tangled up. Go back to your call list, and group the symbols so to cut up the problematic files in smaller independent units (you will need to check the file itself too for local jumps here to see your assumed cut is possible).

To the end unless you are already working somewhere else for your own good, you will get a hierarchical call graph with conceptual modules. From this it is possible to deduct the software's intentional architecture and work further.

The next goal is the architecture. By your previously made map you will need to navigate along the software, figure out it's threads (interrupt and main program tasks), and the rough purposes of each of the modules / source files. How you can do this and what you get here depends more on the application domain.

When these two are done, the "rest" is rather straightforward. By these you should essentially know what each part of the thing is supposed to do, and so you know what you are likely dealing with when you start working on a source file. It is important though that whenever you find something "fishy" in a source, that the program seems to do something irrelevant, to go back to your architecture and call graph, and make corrections if necessary.

To the rest the methods others mentioned apply well. I just outlined these to give some insight on what can be done in really hideous cases. I wish I had just 10K lines of code to deal with back then...

I'd say IanW's answer (just print it out and keep tracing) is probably the best. That said, I have a slightly off the wall idea:

Try running the code (probably the binary) through a dissembler that can reconstruct C code (if you can find one for the 8051). Maybe it will identify a few routines you can't (easily).

Maybe it'll help.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow