Untangling assembly language spaghetti code

https://stackoverflow.com/questions/983574

13-09-2019
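Question

I've inherited a 10K-line program written in 8051 assembly language that requires some changes. Unfortunately, it's written in the finest traditions of spaghetti code. The program, written as a single file, is a maze of CALL and LJMP statements (about 1200 in total), with subroutines that have multiple entry and/or exit points, if they can be identified as subroutines at all. All variables are global. There are comments; some are correct. There are no existing tests, and there is no budget for refactoring.

A little background on the application: the code controls a communications hub in a vending application that is currently deployed internationally. It handles two serial streams simultaneously (with the help of a separate communications processor) and can be talking to up to four different physical devices, each from a different vendor. The manufacturer of one of the devices recently made a change ("Yeah, we made a change, but the software's absolutely the same!") which causes some system configurations to no longer work, and is not interested in undoing it (whatever it was they didn't change).

The program was originally written by another company, transferred to my client, and then modified nine years ago by another consultant. Neither the original company nor the consultant is available as a resource.

Based on analysis of the traffic on one of the serial buses, I've come up with a hack that appears to work, but it's ugly and doesn't address the root cause. If I had a better understanding of the program, I believe I could address the actual problem. I have about one more week before the code is frozen to support an end-of-the-month ship date.

Original question: I need to understand the program well enough to make the changes without breakage. Has anyone developed techniques for working with this sort of mess?

I see some great suggestions here, but I'm limited by time. However, I may have another opportunity in the future to pursue some of the more involved courses of action.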

Solution

First, I would try to get in touch with those people who originally developed the code or who at least maintained it before me, hopefully getting enough information to get a basic understanding of the code in general, so that you can start adding useful comments to it.

Maybe you can even get someone to describe the most important APIs (including their signature, return values and purpose) for the code. If global state is modified by a function, this should also be made explicit. Similarly, start to differentiate between functions and procedures, as well as input/output registers.

You should make it very clear to your employer that this information is required. If they don't believe you, have them actually sit down with you in front of this code while you describe what you are supposed to do and how you have to do it (reverse engineering). Having an employer with a background in computing and programming will actually be helpful in that case!

If your employer doesn't have such a technical background, ask him to bring in another programmer/colleague who can explain your steps to him. Doing so will actually show him that you are serious and honest about it, because it's a real issue, not just from your point of view (make sure to have colleagues who know about this 'project').

If available and feasible, I would also make it very clear, that contracting (or at the very least contacting) former developers/maintainers (if they are no longer working for your company, that is) to help document this code would be a pre-requisite to realistically improve the code within a short time span and to ensure that it can be more easily maintained in the future.

Emphasize that this whole situation is due to shortcomings in the previous software development process and that these steps will help improve the code base. So, the code base in its current form is a growing problem and whatever is done now to handle this problem is an investment for the future.

This in itself is also important to help them assess and understand your situation: To do what you are supposed to do now is far from trivial, and they should know about it - if only to set their expectations straight (e.g. regarding deadlines and complexity of the task).

Also, personally I would start adding unit tests for those parts that I understand well enough, so that I can slowly start refactoring/rewriting some code.

In other words, good documentation and source code comments are one thing, but a comprehensive test suite is just as important: no one can realistically be expected to modify an unfamiliar code base without an established way of testing key functionality.

Given that the code is 10K lines, I would also look into factoring out subroutines into separate files to make components more identifiable, preferably using access wrappers instead of global variables and also intuitive file names.

Besides, I would look into further improving the readability of the source code by decreasing its complexity. Having subroutines with multiple entry points (and possibly even different parameter signatures?) looks like a sure way to obfuscate the code unnecessarily.

Similarly, huge subroutines could also be refactored into smaller ones to help improve readability.

So, one of the very first things I'd look into would be to determine what makes the code base really complicated to grok and then rework those parts, for example by splitting huge subroutines with multiple entry points into distinct subroutines that call each other instead. If this cannot be done for performance reasons or because of call overhead, use macros instead.

In addition, if it is a viable option, I would consider incrementally rewriting portions of the code using a more high level language, either by using a subset of C, or at least by making fairly excessive use of assembly macros to help standardize the code base, but also to help localize potential bugs.

If an incremental rewrite in C is a feasible option, one possible way to get started would be to turn all obvious functions into C functions whose bodies are -in the beginning- copied/pasted from the assembly file, so that you end up with C functions with lots of inline assembly.

Personally, I would also try running the code in a simulator/emulator to easily step through it and hopefully start understanding the most important building blocks (while examining register and stack usage). A good 8051 simulator with a built-in debugger should be made available to you if you really have to do this largely on your own.

This would also help you come up with the initialization sequence and main loop structure as well as a callgraph.

Maybe you can even find a good open source 8051 simulator that can be easily modified to also provide a full callgraph automatically. Just doing a quick search, I found gsim51, but there are obviously several other options, including various proprietary ones.

If I were in your situation, I would even consider outsourcing the effort of modifying my tools to simplify working with this source code. For example, many sourceforge projects accept donations, and maybe you can talk your employer into sponsoring such a modification.

If not financially, maybe by you providing corresponding patches to it?

If you are already using a proprietary product, you might even be able to talk with the manufacturer of this software and detail your requirements and ask them if they are willing to improve this product that way or if they can at least expose an interface to allow customers to make such customizations (some form of internal API or maybe even simple glue scripts).

If they are not responsive, indicate that your employer has been thinking of using a different product for some time now and that you were the only one insisting on that particular product to be used ... ;-)

If the software expects certain I/O hardware and peripherals, you may even want to look into writing a corresponding hardware simulation loop to run the software in an emulator.

Ultimately, I know for a fact that I would personally much more enjoy the process of customizing other software to help me understand such a spaghetti code monster, than manually stepping through the code and playing emulator myself, no matter how many gallons of coffee I can get.

Getting a usable callgraph out of an open source 8051 emulator should not take much longer than, say, a weekend (at most), because it mostly means looking for CALL opcodes and recording their addresses (position and target), so that everything is dumped to a file for later inspection.
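
If patching an emulator is not an option right away, a rough static approximation can be pulled from the source text itself. The following is only a sketch under assumptions: labels are identifiers followed by a colon at the start of a line, `;` starts a comment, and calls use the ACALL/LCALL/CALL mnemonics. The function name and the sample program are made up for illustration. Each call is attributed to the nearest preceding label, which with multi-entry subroutines is not necessarily the "real" caller.

```python
import re
from collections import defaultdict

# Assumed syntax: "NAME:" starts a label, ";" starts a comment.
LABEL_RE = re.compile(r"^\s*([A-Za-z_]\w*):")
CALL_RE = re.compile(r"\b(?:ACALL|LCALL|CALL)\s+([A-Za-z_]\w*)", re.IGNORECASE)

def build_call_graph(source):
    """Map each label to the set of labels called from the code that follows it."""
    graph = defaultdict(set)
    current = "<start>"          # code before the first label
    for raw in source.splitlines():
        line = raw.split(";", 1)[0]      # drop the comment part
        match = LABEL_RE.match(line)
        if match:
            current = match.group(1)
            line = line[match.end():]
        for target in CALL_RE.findall(line):
            graph[current].add(target)
    return dict(graph)

# Hypothetical sample, not taken from the real program.
SAMPLE = "\n".join([
    "MAIN:   LCALL INIT",
    "LOOP:   ACALL POLL      ; poll serial port",
    "        LJMP LOOP",
    "INIT:   MOV A,#0",
    "        RET",
    "POLL:   LCALL SEND",
    "        RET",
    "SEND:   RET",
])

if __name__ == "__main__":
    for caller, callees in sorted(build_call_graph(SAMPLE).items()):
        print(caller, "->", ", ".join(sorted(callees)))
```

Adding LJMP/SJMP/AJMP to the pattern gives a jump graph in the same way, which is probably just as interesting for code like this.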

Having access to an emulator's internals would actually also be a great way to further inspect the code, for example in order to find recurring patterns of opcodes (say 20-50+) that may be factored into standalone functions/procedures. This might actually help decrease the size and complexity of the code base even further.

The next step would probably be to examine stack and register usage, and to determine the type/size of function parameters used, as well as their value range, so that you can conceive corresponding unit tests.
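
The same kind of search can be tried on the source text before touching an emulator. This is a minimal sketch, assuming one instruction per line, `;` comments and `NAME:` labels; `normalise` and `repeated_sequences` are hypothetical helper names. It slides a fixed-size window over the normalised instructions and reports every window that occurs more than once, together with the line numbers where it starts.

```python
from collections import defaultdict

def normalise(line):
    """Strip comment and label, upper-case, and collapse whitespace."""
    code = line.split(";", 1)[0]
    if ":" in code:                      # assumed label syntax "NAME:"
        code = code.split(":", 1)[1]
    return " ".join(code.upper().split()).replace(", ", ",")

def repeated_sequences(source, window=20):
    """Return {instruction tuple: [starting line numbers]} for repeated windows."""
    instrs = [(n, normalise(text)) for n, text in enumerate(source.splitlines(), 1)]
    instrs = [(n, ins) for n, ins in instrs if ins]      # skip blank lines
    seen = defaultdict(list)
    for k in range(len(instrs) - window + 1):
        chunk = tuple(ins for _, ins in instrs[k:k + window])
        seen[chunk].append(instrs[k][0])
    return {chunk: lines for chunk, lines in seen.items() if len(lines) > 1}

# Hypothetical sample with one duplicated fragment.
SAMPLE = "\n".join([
    "A1: MOV A,R0",
    "    ADD A,#1",
    "    MOV R0,A",
    "    RET",
    "A2: mov a,r0   ; same thing again",
    "    add a,#1",
    "    mov r0,a",
    "    RET",
])

if __name__ == "__main__":
    for chunk, lines in repeated_sequences(SAMPLE, window=3).items():
        print(lines, " / ".join(chunk))
```

A long duplicated run shows up as several overlapping windows, so the hits need to be merged by eye. Fragments that differ only in register or label names are missed by this exact-match approach.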

Using tools like dot/graphviz to visualize the structure of the initialization sequence and the main loop itself will be a pure joy compared to doing all this stuff manually.
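
Feeding graphviz is mostly a matter of writing one line per edge. A minimal sketch, assuming you already have the call edges as (caller, callee) pairs from whatever tool produced them; `to_dot` and the sample edges are hypothetical.

```python
def to_dot(edges, name="callgraph"):
    """Render (caller, callee) pairs as a Graphviz DOT digraph."""
    lines = ["digraph %s {" % name]
    for caller, callee in sorted(set(edges)):    # de-duplicate, stable order
        lines.append('    "%s" -> "%s";' % (caller, callee))
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical edges, for illustration only.
    sample = [("MAIN", "INIT"), ("MAIN", "LOOP"), ("LOOP", "POLL")]
    print(to_dot(sample))
```

Save the output as `callgraph.dot` and render it with `dot -Tpng callgraph.dot -o callgraph.png`.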

Also, you'll actually end up with useful data and documents that can serve as the foundation for better documentation in the long run.

Other tips

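I'm afraid there is no magic bullet for this kind of problem. I find the only solution is to print out the ASM file, go somewhere quiet, and simulate running the program line by line in your head (while writing down the contents of registers and memory locations on a notepad). After a while you find this does not take as long as you would expect. Be prepared to spend many hours doing this and to drink gallons of coffee. After a while you will understand what it is doing and can consider changes.

Does the 8051 have any unused IO ports? If it does and you can't work out when certain routines are being called, add code to drive these spare ports high or low. Then, while the program is running, watch these ports with an oscilloscope.

Good luck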

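I know this sounds crazy... but I'm unemployed (I picked the wrong time to tell my majority partner to go to hell). I'd be willing to take a look at it. I used to write assembly for the Apple ][ and the original PC. If I could play with your code in a simulator for a couple of hours, I could give you an idea of whether I have a chance of documenting it for you (without ruining my unplanned vacation). Since I know nothing about the 8051, this might not be possible for someone like me, but the simulator looked promising. I wouldn't want any money to do this. It would be enough just to get exposure to 8051 embedded development. I told you this sounds crazy.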

Find another job, seriously! Failing that, the book "Working Effectively with Legacy Code" might help, though I think it defines legacy code as code without unit tests.

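I've done this sort of thing a few times. Some recommendations:

Start by reviewing the schematic; this should help you understand which ports and pins your desired changes affect.
Use grep to find all calls, branches, jumps and returns. This can help you understand the flow and identify the chunks of code.
Look at the reset vector and interrupt table to identify the main lines.
Use grep to create a cross-reference for all code labels and data references (if your assembler tools can't do this for you).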
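
If grep output gets unwieldy, the cross-reference can also be built with a short script. This is only a sketch under assumptions: code labels look like `NAME:`, data symbols are declared with ASM51-style `NAME EQU/DATA/IDATA/XDATA/BIT ...` lines, `;` starts a comment, and symbol names are matched case-sensitively. `cross_reference` and the sample are hypothetical.

```python
import re
from collections import defaultdict

# Assumed definition syntax: "NAME:" or "NAME EQU|DATA|IDATA|XDATA|BIT value".
SYMBOL_DEF = re.compile(
    r"^\s*([A-Za-z_]\w*)(?::|\s+(?:EQU|DATA|IDATA|XDATA|BIT)\b)", re.IGNORECASE)
WORD = re.compile(r"\b[A-Za-z_]\w*\b")

def cross_reference(source):
    """Return {symbol: (definition line, [lines that reference it])}."""
    lines = [raw.split(";", 1)[0] for raw in source.splitlines()]
    defined = {}
    for number, line in enumerate(lines, 1):
        match = SYMBOL_DEF.match(line)
        if match:
            defined.setdefault(match.group(1), number)   # keep first definition
    refs = defaultdict(list)
    for number, line in enumerate(lines, 1):
        match = SYMBOL_DEF.match(line)
        body = line[match.end():] if match else line     # skip the definition itself
        for word in WORD.findall(body):
            if word in defined:
                refs[word].append(number)
    return {name: (line, refs[name]) for name, line in defined.items()}

# Hypothetical sample, for illustration only.
SAMPLE = "\n".join([
    "FLAGS   DATA 30H",
    "START:  MOV FLAGS,#0",
    "        LCALL SEND",
    "        SJMP START",
    "SEND:   MOV A,FLAGS     ; FLAGS read here",
    "        RET",
])

if __name__ == "__main__":
    for name, (where, used) in sorted(cross_reference(SAMPLE).items()):
        print("%-8s defined at line %d, used at %s" % (name, where, used))
```

Symbols that are never referenced show up with an empty list, which is a cheap way to spot dead labels.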

Keep in mind Hofstadter's Law: it always takes longer than you expect, even when you take into account Hofstadter's Law.

Good luck.

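How well do you understand the hardware platform this code is running on?

Is it put into power-down mode (PCON = 2) to save power? If so, how is it woken up (a reset or a hardware interrupt)?
Do you have to wait for the oscillator to stabilize after power-up before doing serial communication?
Is it put into idle (sleep) mode (PCON = 1)?

Are there different versions of the hardware in the field?

Make sure you have all the different hardware variants to test on.

Don't waste your time with a simulator. It is very hard to work with, and you have to make a lot of assumptions about the hardware. Get yourself an in-circuit emulator (ICE) and run on the real hardware.

The software was written in assembler for a reason, and you need to find out why, e.g. memory constraints or speed constraints.

There may be a reason that this code is a mess.

Have a look at the link file for:

XDATA space, IDATA space and CODE space:

Is there no free code space, XDATA or IDATA left?

The original authors may have optimized the code to fit into the available memory space.

If that is the case, you need to talk to the original developers to find out what they did.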

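You don't need a special budget for refactoring and testing. They save you money and let you work faster, so get to it. It's the technique you should use to make changes to legacy, inherited code, because it is the cheapest way to do it "without breakage".

Most of the time, I think there is a trade-off where you get more quality in exchange for spending more time, but with legacy code that you aren't familiar with, I think writing tests is faster. You have to run the code before you ship it, right?

This is one of the few times I'm going to recommend putting your soft skills to work and presenting your PM/manager/CXO with your reasoning for a rewrite, along with the time and cost savings involved in such an undertaking.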

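Cut it into pieces.

I had a very similar problem with 8052 software. The company had inherited such a beast: a full code ROM (64 Kbytes), about 1.5 MB of assembly spaghetti modules plus two 3000-line PL/M modules made up this coding monstrosity. The original developers of the software were long dead (this doesn't mean there was nobody, but indeed nobody who understood it as a whole). The compilers used to build it were from the mid-80s, running on an MDS-70 emulator, and several critical modules were at the very limits of these compilers. Add one more global symbol and the linker would crash. Add one more symbol to an ASM file and the compiler would crash.

So how could one start cutting this up?

First you will need tools. Notepad++, for example, is very handy since it can search across several files at once, which is ideal for finding which modules refer to a global symbol. This is probably the most crucial element.

If possible, get any paperwork you can find on the software. The most immediate problem to solve with these beasts is to understand roughly how they are composed and what their architecture is. This is typically not captured in the software itself, even if it is otherwise properly commented.

To get the architecture yourself, you may first try to build a call graph. It is simpler to do than a data flow graph, since there are usually fewer cross-file calls and jumps than global variables. For this call graph, consider only global symbols, assuming the source files are supposed to be modules (which is not necessarily true, but usually they should be).

To do this, use your cross-file search tool and create a big list (for example in OpenOffice Calc) in which you collect which symbol is defined in which file, and which files refer to this symbol by calling it.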
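
If the searching gets tedious, the same list can be generated by a script and then pasted into the spreadsheet. A minimal sketch, assuming code labels look like `NAME:`, `;` starts a comment, and every label is treated as potentially global (a real project would rather use the assembler's PUBLIC/EXTRN declarations if it has them). `symbol_table` and the file names are hypothetical.

```python
import re

LABEL_DEF = re.compile(r"^[ \t]*([A-Za-z_]\w*):", re.MULTILINE)
WORD = re.compile(r"\b[A-Za-z_]\w*\b")

def strip_comments(text):
    """Remove everything after ';' on each line (assumed comment syntax)."""
    return "\n".join(line.split(";", 1)[0] for line in text.splitlines())

def symbol_table(files):
    """files maps file name -> source text.

    Returns rows of (symbol, defining file, [other files that mention it]).
    """
    clean = {name: strip_comments(text) for name, text in files.items()}
    words = {name: set(WORD.findall(text)) for name, text in clean.items()}
    defined_in = {}
    for name in sorted(clean):
        for symbol in LABEL_DEF.findall(clean[name]):
            defined_in.setdefault(symbol, name)
    rows = []
    for symbol, home in sorted(defined_in.items()):
        users = sorted(n for n in clean if n != home and symbol in words[n])
        rows.append((symbol, home, users))
    return rows

# Hypothetical two-file project, for illustration only.
FILES = {
    "main.a51": "MAIN: LCALL SER_INIT\n      LCALL TX_BYTE\n",
    "serial.a51": "SER_INIT: RET\nTX_BYTE: RET   ; called from MAIN\n",
}

if __name__ == "__main__":
    for symbol, home, users in symbol_table(FILES):
        print("%s\t%s\t%s" % (symbol, home, ",".join(users)))
```

The tab-separated output can be pasted straight into a spreadsheet. Rows with an empty last column are symbols that are only used locally.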

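Then steal some large(!) sheets from the plotter and start sketching. If you are very proficient with some graphing software, you may use it, but otherwise it is more likely to hold you back. Sketch a call graph showing which file has calls into which other files (without showing the symbols themselves; with 50 or so files, you would not be able to manage it).

Most probably the result will be spaghetti. The goal is to straighten it out into a hierarchical tree without loops, with a root (the file containing the program entry point). You may devour several sheets in the process, iteratively straightening out the beast. You may also find that certain files are so entangled that they can't be drawn without loops. In that case it is most likely that a single "module" somehow got split across two files, or that several conceptual modules got tangled together. Go back to your call list and group the symbols so as to cut the problematic files into smaller independent units (you will also need to check the file itself for local jumps to see whether your assumed cut is possible).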
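
Finding the groups of files that cannot be drawn without loops is the one step here that a script does better than paper. The sketch below uses Tarjan's strongly connected components algorithm, which is my choice of technique rather than something from the procedure above. It assumes the file-level call graph is available as a dictionary mapping each file to the files it calls; the file names are hypothetical.

```python
def strongly_connected(graph):
    """Tarjan's algorithm. graph maps node -> iterable of successor nodes."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(node):
        index[node] = low[node] = counter[0]
        counter[0] += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                low[node] = min(low[node], low[succ])
            elif succ in on_stack:
                low[node] = min(low[node], index[succ])
        if low[node] == index[node]:         # node is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(sorted(component))

    for node in list(graph):
        if node not in index:
            visit(node)
    return components

def tangled_groups(graph):
    """Groups of files that call each other in a loop (two or more files)."""
    return [group for group in strongly_connected(graph) if len(group) > 1]

# Hypothetical file-level call graph, for illustration only.
GRAPH = {
    "main": ["serial", "timer"],
    "serial": ["buffer"],
    "buffer": ["serial"],
    "timer": [],
}

if __name__ == "__main__":
    print(tangled_groups(GRAPH))
```

An empty result means the files already form a loop-free hierarchy. Each reported group is a candidate for the "one module split across files, or several modules tangled together" diagnosis. The recursion is fine for a few dozen files; a much larger graph would need an iterative version.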

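By the end, unless you are already working elsewhere for your own good, you will have a hierarchical call graph with conceptual modules. From this it is possible to deduce the intended architecture of the software and work further.

The next goal is the architecture. Using the map you made earlier, you will need to navigate through the software, figuring out its threads (interrupts and main program tasks) and the rough purpose of each module/source file. How you do this and what you get out of it depends on the application domain.

Once these two are done, the "rest" becomes fairly straightforward. By then you should know essentially what each part is supposed to do, so you know what you are probably dealing with when you start working on a source file. It is important, though, that whenever you find something "fishy" in a source file, where the program seems to do something unrelated, you go back to your architecture and call graph and amend them if necessary.

For the rest, the methods others have mentioned apply well. I only outlined these to give some insight into what can be done in really nasty cases. I wish I had had only 10K lines of code to deal with back then...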

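I'd say IanW's answer (just print it out and keep tracing) is probably the best way. That said, I have a slightly off-the-wall idea:

Try running the code (probably the binary) through a decompiler that can reconstruct C code (if you can find one for the 8051). Maybe it will identify a few routines that you can't (easily).

Maybe it'll help.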

License: CC-BY-SA with attribution

Not affiliated with StackOverflow