Question

I am trying to distinguish plain data from memory addresses while analyzing disassembled code.

Here is an example I am struggling with.

Suppose we have a variable val declared in the .data section:

0x08048054    01 00 00 00

and here is one line of assembly obtained by disassembling the ELF file:

mov    $0x08048054,  %eax

This is probably a reference to the address of val, for example:

    mov    $0x8048054,%eax
    mov    %edx,0x4(%esp)
    mov    %eax,(%esp)
    call printf

Then I would replace $0x8048054 with the symbol val, like this:

    mov    $val,%eax
    mov    %edx,0x4(%esp)
    mov    %eax,(%esp)
    call printf

But there is another situation, where 0x8048054 is just used as a number in a calculation:

    mov    $0x8048054,%eax
    add     0x8(%ebp), %eax

which is probably equivalent to the following (I know this is rare in real code, but it is possible):

    b = 0x8048054 + argc;

In this situation, I should not rewrite $0x8048054 as val.

So my thinking is that if I can figure out the type of the value in the %eax register, I can probably distinguish these two situations:

  • in the first situation, %eax holds a pointer
  • in the second, it holds an integer

Am I on the right track?

Could anyone give me some help?

Thank you!

Solution

One view of "type" is the set of operations which apply to a value.

So, the way to understand the "type" of a value in a register (or in one or more memory locations) is to determine what operations the program applies to it. Each operation applied to the register suggests a set of possible types the value may have, i.e., a "type constraint".

If a register is used in an operation to determine an address, which in turn causes a memory fetch (the x86 LEA instruction "forms an address", but doesn't cause a memory fetch!), then it is some kind of pointer. The kind of memory fetch hints at the type of the pointer: if it is a byte fetch, it might be a "pointer to char"; if it is a fetch of a value into a floating point unit, it may be a "pointer to a double". So, the way in which the register is used establishes some type constraints (e.g., "may be type T").

If the register is added to another, or added to, it may be a pointer (e.g., pointer arithmetic) or a number (integer or natural). If the register is multiplied or divided, it probably isn't a pointer.
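As a rough illustration of these per-instruction rules, here is a minimal Python sketch. The function name constraint_for_use, the two-element type universe, and the "address"/"operand" role labels are all assumptions made for this example; a real tool would need a much richer type lattice and a real instruction decoder.

```python
POINTER, INTEGER = "pointer", "integer"
ANY = frozenset({POINTER, INTEGER})

def constraint_for_use(mnemonic, role):
    """Return the set of types a register value may have, given one use.

    role is "address" when the register is part of a memory operand,
    and "operand" when its value is used directly (hypothetical labels).
    """
    if role == "address":
        # LEA forms an address without fetching memory, so it tells us nothing.
        if mnemonic.startswith("lea"):
            return ANY
        # Any other address use leads to a memory access: some kind of pointer.
        return frozenset({POINTER})
    # Heuristic: multiplying or dividing a pointer is very unlikely.
    if mnemonic.startswith(("mul", "imul", "div", "idiv")):
        return frozenset({INTEGER})
    # add, sub, mov and everything else are compatible with both.
    return ANY
```

Note that the multiply/divide rule is only the "probably" from the paragraph above, not a guarantee.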

But these analyses are limited to what you can determine by direct inspection of the few instructions which use the value of the register (i.e., those instructions that can be "reached" by the specific register value).

However, many machine operations only copy values, often through registers. What you really want to do is a data flow analysis of where the register value came from, and where it goes to. Every operation applied to the value as it flows into, sits in, or flows out of the register should be used to establish type constraints. A better characterization of the type is the intersection of the type constraints of the value that (data)flowed through the register. (You have to worry about whether an invisible coercion has occurred: a pointer to a string can be "invisibly converted" into a pointer to its first character on many architectures, without any specific machine instructions.)

So your type inference process needs to do dataflow analysis on the whole program (and since some of the data flow depends on the type of values, this may be iterative), estimate the intersection of the types of each value, and then consider whether implied conversions may take place. (You may do this inference process in your head, but if you have to do it on a big program you will really need tools to manage the sheer volume of data.)
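To make the "intersection of constraints" idea concrete, here is a deliberately small Python sketch. It only handles straight-line code, not the whole-program iterative analysis described above, and the instruction encoding (the "def", "copy" and "use" tuples) is invented for this example. It also assumes we know from printf's prototype that its first argument is a pointer.

```python
ANY = frozenset({"pointer", "integer"})

def infer_types(instrs):
    """Straight-line sketch: return {value name: set of still-possible types}."""
    holds = {}   # location (register or stack slot) -> name of the value it holds
    types = {}   # value name -> remaining possible types
    for ins in instrs:
        kind = ins[0]
        if kind == "def":        # ("def", loc, name): a new value appears in loc
            _, loc, name = ins
            holds[loc] = name
            types[name] = ANY
        elif kind == "copy":     # ("copy", dst, src): same value, new location
            _, dst, src = ins
            holds[dst] = holds[src]
        elif kind == "use":      # ("use", loc, allowed): an operation constrains it
            _, loc, allowed = ins
            name = holds[loc]
            types[name] = types[name] & frozenset(allowed)
    return types

# First case from the question: the constant is copied to the stack and
# passed to printf as its first argument.
printf_case = [
    ("def", "%eax", "imm"),              # mov $0x8048054,%eax
    ("copy", "(%esp)", "%eax"),          # mov %eax,(%esp)
    ("use", "(%esp)", {"pointer"}),      # call printf
]

# Second case: the constant is only added to something.
add_case = [
    ("def", "%eax", "imm"),                      # mov $0x8048054,%eax
    ("use", "%eax", {"pointer", "integer"}),     # add 0x8(%ebp),%eax
]
```

With these inputs, infer_types(printf_case) narrows "imm" down to a pointer, while infer_types(add_case) leaves both possibilities open, which matches the point that an add alone does not settle the question.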

In general, you can't do this perfectly; type inference can easily be made as hard as the halting problem:

    if Turing(x) then op1(register1) else op2(register1) endif

(So, is register1 always used only in op1, or only in op2?) So you have to take your estimates of the type with a grain of salt.

OTHER TIPS

Looks like you're on the right track - in general, the difference between a pointer and a number that just happens to look like a memory address is that a pointer will be dereferenced somewhere. Obviously you can only observe this when it happens, so you're going to have to analyse the code for the lifetime of that value to see how it's used.

If a value ends up in a register that is then used as the base register for a memory operation, it was a pointer. Anything else is a number-that-looks-like-a-pointer until proven otherwise. There might be shortcuts, like seeing it passed as an argument to a function that you know takes a pointer (if you can assume the code is correct in the first place).
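As a small sketch of the "used as the base register" check, the following Python snippet pulls base registers out of AT&T-syntax memory operands with a regular expression. The names MEM_OPERAND and base_registers are made up for this example, and it assumes plain objdump-style text with no segment overrides or symbolic displacements containing parentheses.

```python
import re

# AT&T memory operand: optional displacement, then (base[,index[,scale]])
MEM_OPERAND = re.compile(
    r"(?P<disp>[^(),\s]*)"
    r"\(\s*(?P<base>%\w+)?\s*"
    r"(?:,\s*(?P<index>%\w+)\s*(?:,\s*(?P<scale>[1248]))?)?\s*\)"
)

def base_registers(instruction):
    """Return the registers used as the base of a memory access."""
    mnemonic = instruction.split()[0]
    # lea only computes an address; it does not touch memory.
    if mnemonic.startswith("lea"):
        return set()
    return {m.group("base")
            for m in MEM_OPERAND.finditer(instruction)
            if m.group("base")}
```

For the instructions in the question, this reports %esp for "mov %eax,(%esp)" and %ebp for "add 0x8(%ebp), %eax", and nothing for "mov $0x8048054,%eax". In the first of those, %eax is stored, not dereferenced, so the stored value still has to be followed to wherever it is eventually used.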

The complication is that the value may be loaded, added to another value, pushed onto the stack, passed around, stashed in another variable, etc., and eventually reloaded and dereferenced by a completely different part of the program.

For more ideas, I'd suggest looking at what the OS program loader does, since that typically needs to detect and fix up pointers, particularly for relocatable code.

Licensed under: CC-BY-SA with attribution