Questions about AT&T x86 Syntax design

https://stackoverflow.com/questions/4193827

11-10-2019
Question

Can anyone explain to me why every constant in AT&T syntax has a '$' in front of it?
Why do all registers have a '%'?
Is this just another attempt to get me to do a lot of lame typing?
Also, am I the only one that finds: 16(%esp) really counterintuitive compared to [esp+16]?
I know it compiles to the same thing but why would anyone want to type a lot of '$' and '%'s without a need to? - Why did GNU choose this syntax as the default?
Another thing, why is every instruction in AT&T syntax suffixed with an 'l'? - I do know it's for the operand sizes, however why not just let the assembler figure that out? (would I ever want to do a movl on operands that are not that size?)
Last thing: why are the mov arguments inverted?

Isn't it more logical that:

eax = 5
mov eax, 5

whereas AT&T is:

mov $5, %eax
5 = eax (? wait what ?)

Note: I'm not trying to troll. I just don't understand the design choices they made and I'm trying to get to know why they did what they did.

Solution

1, 2, 3 and 5: the notation is somewhat redundant, but I find it to be a good thing when developing in assembly. Redundancy helps reading. The point about "let the assembler figure it out" easily turns into "let the programmer who reads the code figure it out", and I do not like it when I am the one doing the reading. Programming is not a write-only task; even the programmer himself must read his own code, and the syntax redundancy helps quite a bit.

Another point is that the '%' and '$' mean that new registers can be added without breaking backward compatibility: no problem in adding, e.g., a register called xmm4, as it will be written out as %xmm4, which cannot be confused with a variable called xmm4 which would be written without a '%'.
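As a rough sketch of that disambiguation (an illustration only, not how gas is actually implemented), the kind of an AT&T operand can be decided from its first character alone. The names operand_kind and classify_operand are hypothetical, and corner cases such as the '*' prefix on indirect jump targets are ignored:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operand kinds, for illustration only. */
enum operand_kind { OPERAND_REGISTER, OPERAND_IMMEDIATE, OPERAND_MEMORY };

/* Classify an AT&T-style operand by looking only at its first character. */
enum operand_kind classify_operand(const char *text)
{
    assert(text != NULL && text[0] != '\0');
    switch (text[0]) {
    case '%': return OPERAND_REGISTER;   /* e.g. %eax, %xmm4 */
    case '$': return OPERAND_IMMEDIATE;  /* e.g. $10 */
    default:  return OPERAND_MEMORY;     /* e.g. 16(%esp), or a bare symbol */
    }
}
```

With this rule, "%xmm4" is always a register and "xmm4" is always a symbol, whatever registers a future CPU adds.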

As for the amount of typing: normally, when programming in assembly, the bottleneck is the brain, not the hand. If the '$' and '%' slow you down, then either you are thinking way faster than what is usually considered doable for a human being, or, more probably, your task at hand is too mechanical and should not be done in assembly; it should be left to an automatic code generator, something colloquially known as a "C compiler".

The 'l' suffix was added to handle some situations where the assembler "cannot" figure it out. For instance, this code:

mov  [esp], 10

is ambiguous, because it does not say whether you want to write a single byte of value 10, or a 32-bit word with the same numerical value. The Intel syntax then calls for:

mov  byte ptr [esp], 10

which is quite ugly, when you think about it. The people at AT&T wanted to make something more rational, so they came up with:

movb   $10, (%esp)

and they preferred to be systematic, and have the 'b' (or 'l' or 'w') suffix everywhere. Note that the suffix is not always required. For instance, you can write:

mov   %al, (%ebx)

and let the GNU assembler "figure out" that since you are talking about '%al', the move is for a single byte. It really works! Yet, I still find it better to specify the size (it really helps the reader, and the programmer himself is the first and foremost reader of his own code).
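To see why the size matters when neither operand is a register, here is a rough C analogue of the two stores. It is only a sketch: store_byte and store_long are hypothetical names standing in for movb and movl, and memcpy stands in for the actual memory write.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rough analogue of `movb $10, (%esp)`: writes exactly one byte at dst. */
void store_byte(unsigned char *dst, uint8_t value)
{
    assert(dst != NULL);
    memcpy(dst, &value, sizeof value);
}

/* Rough analogue of `movl $10, (%esp)`: writes exactly four bytes at dst. */
void store_long(unsigned char *dst, uint32_t value)
{
    assert(dst != NULL);
    memcpy(dst, &value, sizeof value);
}
```

Both calls store "the value 10", but store_byte leaves the three neighbouring bytes alone while store_long overwrites them. That difference is exactly what the assembler cannot guess from an immediate and a memory operand.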

For the "inversion": it is the other way round. The Intel syntax mimics what occurs in C, in which values are computed on the right, then written to what is on the left. Thus, the writing goes right to left, in the "reverse" direction, considering that reading goes left to right. The AT&T syntax reverts to the "normal" direction. At least so they considered; since they had decided to use their own syntax anyway, they thought that they could put the operands in what they regarded as "the right ordering". This is mostly a convention, but not an illogical one. The C convention mimics mathematical notation, except that mathematics is about defining values ("let x be the value 5") and not about assigning values ("we write the value 5 into a slot called 'x'"). The AT&T choice makes sense. It is confusing only when you are converting C code to assembly, a task which should usually be left to a C compiler.

The last part of your question 5 is interesting, from a historical point of view. The GNU tools for x86 followed the AT&T syntax because at that time, they were trying to take hold in the Unix world ("GNU" means "GNU's Not Unix") and competing with the Unix tools; Unix was under the control of AT&T. This was before the days of Linux or even Windows 3.0; PCs were 16-bit systems. Unix used the AT&T syntax, hence GNU used AT&T syntax.

The good question is then: why did AT&T find it smart to invent their own syntax? As described above, they had some reasons, which were not without merit. The cost of using your own syntax, of course, is that it limits interoperability. In those days, a C compiler or assembler made no real sense as a separate tool: in a Unix system, they were meant to be provided by the OS vendor. Also, Intel was not a big player in the Unix world; big systems mostly used VAX or Motorola 680x0 derivatives. Nobody had figured out that the MS-DOS PC would turn into, twenty years later, the dominant architecture in the desktop and server worlds.

OTHER TIPS

1-2, 5: They probably chose to prefix registers and such to make it easier to parse; you know directly at the first character what kind of token it is.

4: No.

6: Again, probably to make it easier for the parser to figure out what instruction to output.

7: Actually this makes more sense grammatically: move what to where. Perhaps the mov instruction should be an ld instruction.

Don't get me wrong, I think AT&T syntax is horrible.

The GNU assembler's AT&T syntax traces its origins to the Unix assembler ¹, which itself took its input syntax mostly from the PDP-11 PAL-11 assembler (ca. 1970).

Can anyone explain to me why every constant in AT&T syntax has a '$' in front of it?

It distinguishes immediate constants from memory addresses. Intel syntax does it the other way around, marking memory references as [foo].
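A loose C analogy, offered as a sketch and not as assembler semantics: with a label foo, the AT&T operand foo behaves like reading the variable, while $foo behaves like taking its address. The global foo and both function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical global standing in for an assembly label `foo`. */
int32_t foo = 42;

/* Analogue of `movl foo, %eax`: the operand is a memory reference. */
int32_t load_contents(void)
{
    return foo;
}

/* Analogue of `movl $foo, %eax`: the operand is the address, as an immediate. */
uintptr_t load_address(void)
{
    return (uintptr_t)&foo;
}
```

In NASM the same pair would be written mov eax, [foo] and mov eax, foo, so the marker sits on the memory reference and not on the constant.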

Incidentally, MASM (the Microsoft Assembler) doesn't need a distinction at the syntax level, since it can tell whether the operand is a symbolic constant or a label. Other assemblers for x86 actively avoid such guesses, since they can be confusing to readers, e.g. TASM in IDEAL mode (it warns on memory references not in brackets), nasm, fasm.

PAL-11 used # for the Immediate addressing mode, where the operand followed the instruction. A constant without # meant Relative addressing mode, where a relative address followed the instruction.

The Unix assembler, as, used the same syntax for addressing modes as the DEC assemblers, but with * instead of @ and $ instead of #, since @ and # were apparently inconvenient to type ².

Why do all registers have a '%'?

In PAL-11, registers were defined as R0=%0, R1=%1, ... with R6 also referred to as SP, and R7 also referred to as PC. The DEC MACRO-11 macro-assembler allowed referring to registers as %x, where x could be an arbitrary expression, e.g. %3+1 referred to %4.

Is this just another attempt to get me to do a lot of lame typing?

Nope.

Also, am I the only one that finds: 16(%esp) really counterintuitive compared to [esp+16]?

This comes from the PDP-11 Index addressing mode, where a memory address is formed by summing the contents of a register and an index word following the instruction.
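Spelled out in C, as a minimal sketch with a hypothetical helper name, 16(%esp) simply means "the memory at address esp + 16":

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Analogue of `movl disp(%reg), %eax`: load 32 bits from address base + disp. */
int32_t load_indexed(const unsigned char *base, ptrdiff_t disp)
{
    int32_t value;
    assert(base != NULL);
    memcpy(&value, base + disp, sizeof value);  /* byte offset, not element index */
    return value;
}
```

Here load_indexed(sp, 16) corresponds to 16(%esp) in AT&T syntax and to [esp+16] in Intel syntax; both notations name the same base-plus-displacement sum.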

I know it compiles to the same thing but why would anyone want to type a lot of '$' and '%'s without a need to? - Why did GNU choose this syntax as the default?

It came from the PDP-11.

Another thing, why is every instruction in AT&T syntax suffixed with an 'l'? - I do know it's for the operand sizes, however why not just let the assembler figure that out? (would I ever want to do a movl on operands that are not that size?)

gas can usually figure it out. Other assemblers also need help in particular cases.

The PDP-11 would use b for byte instructions, e.g. CLR vs. CLRB. Other suffixes appeared in VAX-11: l for long, w for word, f for float, d for double, q for quad-word, ...

Last thing: why are the mov arguments inverted?

Arguably, since the PDP-11 predates Intel microprocessors, it is the other way around.

¹ According to the gas info page, through the BSD 4.2 assembler.
² Unix Assembler Reference Manual, §8.1, Dennis M. Ritchie.

The reason AT&T syntax inverts operand order compared to Intel is most likely because the PDP-11, on which Unix was originally developed, uses the same order of operands.

Intel and DEC simply chose opposite orders.

Licensed under: CC-BY-SA with attribution