Question

When programming in assembly and doing some sort of string manipulation, I use al, ah, and sometimes others to hold characters, because this allows me to keep more data in my registers. I think this is a very handy feature, but Intel's engineers don't seem to agree with me, because they didn't make the two high-order bytes of the registers accessible (or am I wrong?). I don't understand why. I thought about this for a while, and my guesses are:

  1. They would make the CPU too complicated
  2. They would be useless
  3. perhaps both of the above

I came up with number two because I've never seen a compiled program (say with gcc) use al or bh or any of them.

Solution

Although it's a little clumsy, you can just swap the halves of a register with rol reg,16 (or ror reg,16, if you prefer). On the NetBurst CPUs (Pentium 4) that's quite inefficient, but on most newer (or older) CPUs there's a barrel shifter that does it in one clock.
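The effect of that rotate can be sketched in Python, modeling a 32-bit register as an integer (a model of the trick, not real register access):

```python
def rol32(value, count):
    """Rotate a 32-bit value left by `count` bits, like x86 ROL."""
    count %= 32
    if count == 0:
        return value & 0xFFFFFFFF
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF

# After rol eax,16 the former high word sits in AX,
# so its bytes are reachable as AL and AH.
eax = 0xDEADBEEF
swapped = rol32(eax, 16)   # -> 0xBEEFDEAD: old high word 0xDEAD is now low
restored = rol32(swapped, 16)  # a second rotate undoes the swap
```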

As for why they didn't do it, it's pretty simple: they'd have needed to thoroughly redesign the instruction encoding if they really wanted to. In the original design, they used up all the codes that would fit in the field sizes they used to specify a register. In fact, they already use something of a hack where the meaning of an encoding depends on the mode, and there are address-size and operand-size prefixes if you need a different size. For example, to use AX when you're running in 32-bit mode, the instruction carries an operand-size override prefix before the opcode itself. If they'd wanted to badly enough, they could have extended that concept to specify things like "the byte in bits 16-23 of register X", but it would make decoding more complex, and decoding x86 instructions is already relatively painful.

OTHER TIPS

The short answer is that it's because of how the architecture evolved from 16 bits.

Why is there not a register that contains the higher bytes of EAX?

Beyond the instruction encoding issue that Jerry correctly mentions, there are other things at work here as well.

Most non-trivial CPUs are pipelined: this means that in ordinary operation, instructions begin executing before previous instructions have finished execution. This means that the processor must detect any dependencies of an instruction on earlier instructions and prevent the instruction from executing until the data (or condition flags) on which it depends are available[1].

Having names for different parts of a register complicates this dependency tracking. If I write:

mov  ax,  dx
add  eax, ecx

then the core needs to know that ax is part of eax, and that the add should wait until the result of the move is available. This is called a partial register update; although it seems very simple, hardware designers generally dislike them, and try to avoid needing to track them as much as possible (especially in modern out-of-order processors).
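A minimal Python model of why that snippet creates a dependency: a write to ax must merge into the existing eax value, so the following add cannot proceed until both the old high half and the new low half are known (a toy model, not real hardware):

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def write_ax(eax, value):
    """Model `mov ax, value`: only the low 16 bits change;
    the high 16 bits of EAX must be preserved (merged)."""
    return (eax & ~MASK16 & MASK32) | (value & MASK16)

eax, ecx, dx = 0x11112222, 0x00000005, 0x3333
eax = write_ax(eax, dx)      # mov ax, dx  -> 0x11113333 (merge!)
eax = (eax + ecx) & MASK32   # add eax, ecx -> needs the merged value
```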

Having names for the high halves of the registers adds an additional set of partial register names that must be tracked, which adds die area and power usage, but delivers little benefit. At the end of the day, this is how CPU design decisions are made: a tradeoff of die area (and power) vs. benefit.

Partial register updates aren't the only thing that would be complicated by having names for the high parts of the registers, but they are one of the simplest to explain. There are many other small things that would need to become more complicated in a modern x86 CPU to support such names; considered in aggregate, the additional complexity would be substantial.

[1] There are other ways to resolve dependencies, but we ignore them here for simplicity; they introduce similar problems.

To add to what Jerry and Stephen have said so far.

My first thought is that you have to be conservative with your opcodes/instruction encoding. Going in, it started with ax, ah, and al. Is there value added, when going to eax, in providing byte-based access to the upper half (beyond the rotates and shifts that are already there to provide it)? Not really. If you are doing byte operations, why are you using a 32-bit register, and why its upper bytes? Perhaps optimize the code differently, taking advantage of what is available, or tolerating what is available and gaining the advantage in other areas.

I think there is a reason that the majority of the world's instruction sets do not have this four-names-for-the-same-register thing, and I don't think it is patents that are at play. In its day it was probably a cool feature or design; it probably had its roots in transitioning folks from 8-bit processors into this 8/16-bit thing. Anyway, I think the al, ah, ax, eax scheme was bad design, and everyone learned from it. As Stephen mentioned, there are hardware issues at play: if you were to implement this strictly in direct logic, it is a mess, a rat's nest of muxes to wire everything up (bad for speed and bad for power), and then you get into the timing nightmare Stephen was talking about. There is also a history of microcoding for this instruction set, so you are essentially emulating these instructions on some other processor, and in the same way it adds to that nightmare. The wise thing to do would have been to redefine ax to be 32 bits and get rid of ah and al — wise from a design perspective, but unwise for portability (good for engineering; bad for marketing, sales, etc.). I think the reason this tired old instruction set is not confined to history books and museums is (among a few other reasons) backward compatibility.

I highly recommend learning a number of other instruction sets, both new and old: msp430, ARM, thumb, mips, 6502, z80, PIC (the old one that isn't a mips), etc., just to name a few. Seeing the differences and similarities between instruction sets is very educational, IMO, and depending on how deep you go into the details (variable-length vs. fixed-length instructions, etc.), it helps in understanding what choices were available to Intel when making the 16-to-32-bit and, more recently, 32-to-64-bit transitions while trying to retain market share.

I think the solution they chose at the time was the right one: insert a formerly undefined prefix byte in front of what normally decodes as a 16-bit opcode, turning it into a 32-bit opcode (and sometimes the prefix isn't needed, when no immediate values follow whose length the decoder would have to know). It seemed in line with the instruction set at the time. So it comes back to Jerry's answer: the reason is a combination of the design of the 8/16-bit instruction set and the history of, and reasons for, expanding it. Granted, they could just as easily have used similar encodings to provide access to the upper 16 bits in an ax/ah/al fashion, and they could just as easily have multiplied the four base registers A, B, C, D into 8 or 16 or 32 general-purpose registers (A, B, C, D, E, F, G, H, ...) while remaining backward compatible.

In fact, traditional x86 opcodes allow both operand-size selection (sometimes as part of the specific instruction encoding, sometimes via prefix bytes) and register-number selection bits. For register selection, there are always three bits in the instruction encoding. This allows for a total of eight registers.

Originally, there were four, AX/BX/BP/SP for 16bit and AL/AH/BL/BH for 8bit.

Adding two more gave CX/DX plus CL/CH/DL/DH. No more 8bit regs left, but still two unused values in the register selection for 16bit.

Which were provided in another rev of Intel's architecture by the index regs DI/SI.

That done, they had exhausted the 3 register selection bits (and made it impossible to provide 8bit regs for SI/DI/BP/SP).

The way AMD64 64bit mode managed to double the register set is therefore by using prefix bytes (a "use the new regs" prefix), similar to how traditional x86 code chose between 16 and 32bit operations. The same method was used to provide 8bit registers where there had been none "traditionally", i.e. for SP/BP/SI/DI.

To illustrate, see, for example, the following instruction encodings:

0:     00 c0                add    %al,%al
2:     00 c1                add    %al,%cl
4:     00 c2                add    %al,%dl
6:     00 c3                add    %al,%bl
8:     00 c4                add    %al,%ah
a:     00 c5                add    %al,%ch
c:     00 c6                add    %al,%dh
e:     00 c7                add    %al,%bh
10: 40 00 c4                add    %al,%spl
13: 40 00 c5                add    %al,%bpl
16: 40 00 c6                add    %al,%sil
19: 40 00 c7                add    %al,%dil

And, for [ 16bit / 64bit ] / 32bit, side by side since it's so illustrative:

0   : [66/48] 01 c0     add   %?ax,%?ax
2/3 : [66/48] 01 c1     add   %?ax,%?cx
4/6 : [66/48] 01 c2     add   %?ax,%?dx
6/9 : [66/48] 01 c3     add   %?ax,%?bx
8/c : [66/48] 01 c4     add   %?ax,%?sp
a/f : [66/48] 01 c5     add   %?ax,%?bp
c/12: [66/48] 01 c6     add   %?ax,%?si
e/15: [66/48] 01 c7     add   %?ax,%?di

The prefix 0x66 marks a 16bit operation, and 0x48 is one of the prefix bytes for a 64bit op (it'd be a different one if your target and/or source were one of the "new" high-numbered registers).
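The register-number fields in the listings above can be pulled apart mechanically. Here is a small Python sketch of decoding the r/m field for these register-to-register forms; the register names follow Intel's manuals, and the tables cover only the byte registers shown above:

```python
# For register-to-register forms, ModRM = mod(2) | reg(3) | r/m(3),
# with mod == 0b11 meaning "register direct".
REG8_LEGACY = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG8_REX    = ["al", "cl", "dl", "bl", "spl", "bpl", "sil", "dil"]

def decode_rm8(modrm, has_rex=False):
    """Return the r/m byte-register name for a register-direct ModRM byte.
    The presence of any REX prefix (even plain 0x40, with no bits set)
    replaces AH/CH/DH/BH with SPL/BPL/SIL/DIL."""
    assert modrm >> 6 == 0b11, "register-direct form only"
    rm = modrm & 0b111
    return (REG8_REX if has_rex else REG8_LEGACY)[rm]

decode_rm8(0xC4)                # -> "ah"   (00 c4:    add %al,%ah)
decode_rm8(0xC4, has_rex=True)  # -> "spl"  (40 00 c4: add %al,%spl)
```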

To get back to your original question of how to access the high bits: newer CPUs have SSE instructions for the purpose. Every 8/16/32/64bit field of a vector register is separately accessible, via shuffle instructions for example, and in fact a lot of the string-manipulation code provided by Intel / AMD in their optimized libraries these days no longer uses the normal CPU registers but the vector registers instead. If you need symmetry between the upper and lower halves (or other fractions) of some larger value, use the vector registers.
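Even without named sub-registers, each field is only a shift away. A Python sketch of extracting every byte of a 32-bit value — the scalar shift-and-mask equivalent of what a byte shuffle does in one go:

```python
def bytes_of(value32):
    """Split a 32-bit value into its four bytes, low byte first --
    what shift-and-mask code does where no register name exists."""
    return [(value32 >> (8 * i)) & 0xFF for i in range(4)]

bytes_of(0xDEADBEEF)  # -> [0xEF, 0xBE, 0xAD, 0xDE]
```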

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow