Question

I'm thinking of developing a disassembler. Since I know it's very hard to build one, I'd like to know the best (or easiest) language for turning this dream into reality. A tutorial suggestion would be very welcome too ;-)


Solution

I recommend taking a look at using Python to write a disassembler. It has characteristics and capabilities that can be very handy when writing a disassembler.

  • bit-manipulation: the bitwise operators (&, |, ^, ~, <<, >>) work directly on all the bits of an integer value
  • functional programming: using 'map' on the results of bitmask operations can be handy
  • powerful file-reading operations: file I/O is very easy to do in Python
  • nice capabilities for reading structured binary files (like .EXE files), for example the standard struct module
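As a rough illustration of the last point, here is a minimal sketch that uses the standard struct module to pull one field out of the DOS "MZ" header at the start of an .EXE file. The function name read_pe_offset and the synthetic sample bytes are made up for this example; a real tool would read the bytes from a file and validate far more than this.

```python
import struct

def read_pe_offset(data):
    """Return the PE header offset stored in a DOS 'MZ' header.

    Hypothetical helper for illustration; it checks only the magic bytes.
    """
    (magic,) = struct.unpack_from("<2s", data, 0)
    if magic != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew: 32-bit little-endian file offset of the PE header, at 0x3C
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return pe_offset

# Synthetic 64-byte header: "MZ", zero padding up to 0x3C, then the offset.
sample = b"MZ" + bytes(0x3A) + struct.pack("<I", 0x80)
print(hex(read_pe_offset(sample)))
```

The same unpack_from pattern extends to any fixed-layout record, which is most of what an executable's headers are.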

Python has other qualities that make it very useful for a program of any size. An x86 disassembler that supports the instruction set of current microprocessors, rather than just the original 8086 instruction set, is going to be a large program.

Having a language that makes it easy to do bit-masking is very useful when writing a disassembler.
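To make the bit-masking point concrete, here is a small sketch that splits an x86 ModR/M byte into its three fields (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0). The names split_modrm and REG32 are chosen for this example only, and the register table assumes 32-bit operand size.

```python
# 32-bit general-purpose registers in x86 encoding order.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    mod = (byte >> 6) & 0b11    # bits 7-6: addressing mode
    reg = (byte >> 3) & 0b111   # bits 5-3: register or opcode extension
    rm = byte & 0b111           # bits 2-0: register or memory operand
    return mod, reg, rm

# The bytes 01 D8 encode "add eax, ebx"; 0xD8 is the ModR/M byte.
mod, reg, rm = split_modrm(0xD8)
print(mod, REG32[reg], REG32[rm])
```

With mod equal to 3 both operands are registers, so the r/m field names the destination (eax) and the reg field names the source (ebx).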

  • object-oriented: makes code-reuse easier and programs more understandable, less redundant
  • modular: modules and even packages can be used to keep program chunks to manageable size
  • concise and readable: so not much typing or head-scratching
  • interactive: makes it easier to develop/test incrementally
  • built-in symbolic debugger: handy when automated tests do not cut it
  • modern QA support: unittest similar to JUnit, doctest supports functional tests by example
  • built-in help: so you do not have to go flipping a book or launching a browser
  • terrific documentation: reference and tutorial material in PDF and HTML file formats
  • good IDE support: Eclipse, NetBeans, Emacs, etc. all give excellent support for Python
  • good support for serving web pages: includes support for HTML/HTTP and great 3rd party web frameworks out there too
  • great documentation generation: use doc string convention to document modules, classes, and methods and a utility that comes with Python will dynamically generate hyperlinked HTML documentation and serve it up for you to browse from a TCP/IP port

Python gives you the opportunity to have fun with your program as you develop it. There is a pretty big community of Python programmers out there. They are not legion like Java programmers are and C++ programmers used to be but there are tons around.

Python is a popular programming language at Google, Yahoo, and other modern web companies due to its power and flexibility. The Jython python-in-java interpreter grants even more power to both languages as there is a high degree of synergy and decent level of compatibility between them. There is a Jython podcast you can listen to if you do not like to read.

Python was invented at the beginning of the 1990s, making it even older than Java. Having existed this long, with a strong, steady following, it has evolved into a very sturdy, capable language with many examples and a decent community of programmers who use it for work and pleasure.

If you get stuck, the Python community is usually very helpful with ideas for how you can take a stab at a problem you are having using one or two handy Python features.

OTHER TIPS

The New Jersey Machine-Code Toolkit is a toolkit and a language for creating assemblers and disassemblers. I believe it supports C, C++, and Modula-3. The basis of the toolkit is a language for describing instruction sets; a disassembler is then generated automatically using the -dis option. This toolkit has been fairly widely used, but the descriptions of the popular instruction sets don't cover recent revisions.

You may decide it is more fun or more instructive to roll your own, but if you're dealing with a complicated instruction set, you may be hard pressed to match the efficiency of the Toolkit. Not that this matters on today's hardware :-)

Any general purpose language with decent byte and string operations could do this. Use a language you already know well. Learning a new language and learning how to write a disassembler at the same time is probably just going to make it harder for yourself.

You could write it in Assembly. That will really stretch your brain.

Real Raw Code - There is no substitute

It doesn't really matter. I think IDA Pro has a plugin model, and I think a few people have disassemblers that support Python plugins, so you might try that. But I don't think you have an idea of how difficult this will be. Good luck, though.

I'd imagine any modern language would work equally well for this purpose. Consider which libraries you would want to use. For example, there are libraries out there that allow you to deal with different kinds of binaries (one of these is BFD). Think about this and choose the programming language that suits you best.

Disassemblers, that is, programs that convert absolute binary back to assembly language, are actually quite easy to build, albeit VERY tedious.

I did a Z8002 disassembler in FORTRAN 77, back in early 1983. I did a small disassembler in C, for something I don't talk about, in 1991.

You're probably better off doing this in vanilla C, since about all you are going to be doing is reading memory words (or a binary file) and printing lots and lots of canned text strings.
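The "read a word, print a canned string" idea looks much the same in any language. Here is a minimal table-driven sketch, written in Python only to match the other examples in this thread. It handles a handful of one-byte x86 opcodes (nop, ret, int3, push/pop of a 32-bit register) and prints everything else as a raw data byte. The names CANNED and disassemble are invented for this example, and a real disassembler would also need operand decoding and multi-byte instructions.

```python
# Opcodes whose text never changes: the "canned strings".
CANNED = {0x90: "nop", 0xC3: "ret", 0xCC: "int3"}
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def disassemble(code, base=0):
    """Return one listing line per byte of code (one-byte opcodes only)."""
    lines = []
    for offset, opcode in enumerate(code):
        if opcode in CANNED:
            text = CANNED[opcode]
        elif 0x50 <= opcode <= 0x57:      # push r32: register in low 3 bits
            text = "push " + REG32[opcode - 0x50]
        elif 0x58 <= opcode <= 0x5F:      # pop r32: register in low 3 bits
            text = "pop " + REG32[opcode - 0x58]
        else:                             # unknown: emit as a data byte
            text = "db 0x%02x" % opcode
        lines.append("%08x  %02x  %s" % (base + offset, opcode, text))
    return lines

for line in disassemble(bytes([0x55, 0x90, 0x5D, 0xC3])):
    print(line)
```

Most of the tedium mentioned above comes from growing the table and the operand decoders, not from the control flow, which stays about this simple.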

I recently wrote a disassembler in Python. It was for an embedded RISC architecture and Python worked well. I was learning Python as I went, so I ended up reworking almost every function and class I wrote at least once. I found it especially useful to subclass the long type and write member functions that gave me a 4-byte word (or double-word depending on who you ask) expressed in various forms, e.g. returning a list of bits, bytes, nibbles, or half-words for various operand manipulations.
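A sketch of what such a subclass might look like follows. This is not the original poster's code: the class name Word32 and its method names are invented here, and it subclasses int because Python 3 merged the old long type into int.

```python
class Word32(int):
    """A 32-bit word with helper views for operand extraction (illustrative)."""

    def __new__(cls, value):
        # Truncate to 32 bits so every view has a fixed width.
        return super().__new__(cls, value & 0xFFFFFFFF)

    def bits(self):
        """All 32 bits, most significant first."""
        return [(self >> i) & 1 for i in range(31, -1, -1)]

    def byte_list(self):
        """Four bytes, most significant first."""
        return [(self >> s) & 0xFF for s in (24, 16, 8, 0)]

    def nibbles(self):
        """Eight 4-bit nibbles, most significant first."""
        return [(self >> s) & 0xF for s in range(28, -4, -4)]

    def halfwords(self):
        """Two 16-bit half-words, most significant first."""
        return [(self >> 16) & 0xFFFF, self & 0xFFFF]

    def field(self, hi, lo):
        """Bits hi..lo inclusive, as an integer."""
        return (self >> lo) & ((1 << (hi - lo + 1)) - 1)

word = Word32(0x12345678)
print([hex(b) for b in word.byte_list()], word.field(31, 26))
```

On a fixed-width RISC encoding, a method like field(31, 26) pulls out a 6-bit major opcode in one call, which keeps the per-instruction decoders short.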

It would work reasonably well in Haskell. You could use the binary package, and it would be efficient as well. ADTs are quite nice.

I would recommend against writing it in Python. Python is quite slow, and while Haskell is likely a few times slower than C, I imagine that Python would be many times slower than C.

Ultimately, binary formats are low-level enough that I doubt it matters. You could write it in C relatively easily. There's no need for parser combinators or parser generators.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow