The Dark Corners of x86
In 1981, IBM rushed a microcomputer to market. IBM, at the time, had large mainframe and minicomputer divisions, and needed a microcomputer to complement these lines. The PC, unlike their other lines, was built with off-the-shelf components, including the 8088 CPU from Intel, a cut-down version of the 16-bit 8086 chip designed to work with 8-bit peripherals.
The PC became a huge commercial success, in part because the off-the-shelf design made it easy for clone manufacturers to produce compatible devices. Each successive generation has had the ability to run legacy operating systems right back to the original version of DOS, and the CPU architecture has gradually accreted features over the last two and a half decades.
A modern x86 chip feels like the designers all sat around the bong, saying "Hey, wouldn't it be cool if..." and giggling maniacally to themselves as they pictured the bemused expressions of people trying to work with their creations. Of course, this (probably) did not actually happen. Instead, features that looked like a good idea at the time were gradually added until we were left with the kludge known as the modern x86 architecture.
Floating Point-and-Laugh
Most people are familiar with the CISC versus RISC war, but before that there were two other competing design ideologies fighting for dominance: stack-based and register-based instruction sets. In a register-based instruction set, you have a bank of explicitly addressed registers into which values can be loaded, manipulated, and then stored back into memory. Stack machines, in contrast, have a set of one or more stacks onto which instructions and data are pushed or popped.
Consider a simple program to add two numbers from memory locations 1 and 2 and store the result in memory location 3. On a (hypothetical) simple register machine, the program to do this might look something like this:
    LD r1, 1
    LD r2, 2
    ADD r3, r1, r2
    ST 3, r3
The values are explicitly loaded into the first two registers. These are added together and the result stored in the third register, the contents of which is then stored into memory. On a stack machine, the sequence of operations would be something more like this:
    PUSH 1
    PUSH 2
    ADD
    POP 3
Note that in the stack machine, registers are not explicitly addressed; only one can be accessed at a time, namely the one at the top of the stack. Stack machines are very easy to generate code for, but they have a number of problems. The most obvious one is that it is very difficult to extract instruction-level parallelism from a stack-based machine. When register machines started introducing pipelining and superscalar architectures, stack machines were quickly left behind.
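To make the contrast concrete, here is a minimal sketch of both hypothetical machines as toy Python interpreters. The opcode names come from the two listings above; everything else (the tuple encoding of instructions, the function names, using a dictionary as memory) is an assumption made purely for illustration.

```python
# Toy interpreters for the two hypothetical machines described above.
# Instructions are encoded as tuples: (opcode, operand, ...). This encoding
# is an illustrative assumption, not a real instruction format.

def run_register(program, memory):
    """Run a register-machine program; every operand names a register or address."""
    regs = {}
    for op, *args in program:
        if op == "LD":        # LD reg, addr: load memory[addr] into reg
            regs[args[0]] = memory[args[1]]
        elif op == "ADD":     # ADD dst, src1, src2: dst = src1 + src2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "ST":      # ST addr, reg: store reg into memory[addr]
            memory[args[0]] = regs[args[1]]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return memory


def run_stack(program, memory):
    """Run a stack-machine program; ADD has no operands, it uses the stack top."""
    stack = []
    for op, *args in program:
        if op == "PUSH":      # PUSH addr: push memory[addr]
            stack.append(memory[args[0]])
        elif op == "ADD":     # ADD: pop two values, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "POP":     # POP addr: pop the top into memory[addr]
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return memory


REGISTER_PROGRAM = [
    ("LD", "r1", 1),
    ("LD", "r2", 2),
    ("ADD", "r3", "r1", "r2"),
    ("ST", 3, "r3"),
]

STACK_PROGRAM = [
    ("PUSH", 1),
    ("PUSH", 2),
    ("ADD",),
    ("POP", 3),
]
```

Running either program against a memory such as {1: 20, 2: 22, 3: 0} leaves the sum in location 3. Note that the stack version's ADD depends implicitly on the two instructions before it, which hints at why reordering or overlapping stack instructions is awkward.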
All except one, that is. A little chip known as the 8087 was stack-based. This chip was found as a co-processor in some PCs for handling floating point operations. The 80287 and 80387 succeeded it for use with the 80286 and 80386, and with the i486 a descendant was incorporated on-die.
The 8087 had eight registers ST0–ST7, the ST, of course, standing for "stack." In fact, the 8087 isn't a pure stack-based implementation. Values are loaded by pushing them on to the stack, and stored by popping them off, but other operations can manipulate them in place. For example, the FADD instruction takes two operands, a source and a destination, adds them and stores the result in the destination. The source can be either memory or a floating point register, and the destination is a floating point register (when both operands are registers, one of them must be ST0; a memory source is always added into ST0). In a pure stack-based approach, FADD would take no operands and add ST0 and ST1, storing the result in ST1 and then popping the stack so that the old ST1 becomes the new top. To make matters even more confusing, there is also an instruction that does exactly this (FADDP, for anyone who hasn't run away screaming yet).
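The stack-that-is-also-a-register-file behaviour is easier to follow with a small model. The sketch below is a deliberate simplification: the class and method names are hypothetical, Python floats stand in for the 80-bit registers, a simple depth counter replaces the real tag word, and only the register-to-register form of FADD is modelled (where, on real hardware, one of the two operands is ST0).

```python
class X87Stack:
    """Simplified model of the x87 register stack (illustrative only)."""

    def __init__(self):
        self.regs = [0.0] * 8   # eight physical registers
        self.top = 0            # physical index currently named ST0
        self.depth = 0          # stand-in for the real tag word

    def _phys(self, i):
        # ST(i) is always relative to the current top of stack.
        return (self.top + i) % 8

    def st(self, i):
        """Read ST(i)."""
        return self.regs[self._phys(i)]

    def fld(self, value):
        """Push a value: the top pointer moves down and ST0 is the new value."""
        if self.depth == 8:
            raise OverflowError("x87 stack overflow")
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value
        self.depth += 1

    def fstp(self):
        """Pop ST0 and return it: the old ST1 becomes the new ST0."""
        if self.depth == 0:
            raise IndexError("x87 stack underflow")
        value = self.regs[self.top]
        self.top = (self.top + 1) % 8
        self.depth -= 1
        return value

    def fadd(self, dst, src):
        """Register-style add: ST(dst) = ST(dst) + ST(src), nothing is popped."""
        if dst != 0 and src != 0:
            raise ValueError("one operand must be ST0")
        self.regs[self._phys(dst)] = self.st(dst) + self.st(src)

    def faddp(self):
        """Stack-style add: ST1 = ST1 + ST0, then pop."""
        self.fadd(1, 0)
        self.fstp()


fpu = X87Stack()
fpu.fld(1.5)      # ST0 = 1.5
fpu.fld(2.25)     # ST0 = 2.25, ST1 = 1.5
fpu.faddp()       # pure stack behaviour: the sum ends up in the new ST0
```

The same register file serves both styles: fadd treats ST(i) as an addressable register, while fld, fstp, and faddp treat it as a stack whose names shift every time the top pointer moves.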
One rarely used feature of the 8087 was the ability to load and store values in Binary Coded Decimal (BCD) form. BCD is popular in the financial sector, because it allows decimal values to be represented accurately. A floating point value gives you a value with a certain number of digits of precision in binary; however, some values that can be represented in a short decimal string are recurring values in binary. This leads to some variable (and difficult to spot) imprecision, which is bad when you are dealing with money. BCD gives you a fixed number of decimal digits, which is more likely to be what the financial regulators require. The trade-off is that BCD values take up more space than binary values, because each byte stores two decimal digits, giving a range of 0–99, rather than 0–255. The x87 BCD operations only convert on load and store, with the arithmetic itself still done in binary floating point, so they allow developers to combine the imprecision of binary floating point arithmetic with the storage inefficiency of BCD.
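Both halves of that trade-off can be seen in a few lines of Python. This is only a sketch: the helper names are hypothetical, and the packing shown is a generic two-digits-per-byte scheme chosen for readability, not the exact 10-byte layout the x87 uses.

```python
def to_packed_bcd(value, num_bytes):
    """Pack a non-negative integer into bytes, two decimal digits per byte."""
    if value < 0 or value >= 100 ** num_bytes:
        raise ValueError("value does not fit in the requested number of bytes")
    digits = str(value).zfill(2 * num_bytes)
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def from_packed_bcd(data):
    """Decode bytes produced by to_packed_bcd back into an integer."""
    value = 0
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("not a valid BCD byte")
        value = value * 100 + high * 10 + low
    return value


# 0.1 and 0.2 are short decimal strings but recurring fractions in binary,
# so the binary floating point sum is not exactly 0.3.
binary_sum_is_exact = (0.1 + 0.2 == 0.3)

# Working in whole cents avoids the problem, and the BCD form stores each
# decimal digit exactly, at the cost of using only 100 of the 256 values
# available in each byte.
price_in_cents = 1999
encoded = to_packed_bcd(price_in_cents, 2)
decoded = from_packed_bcd(encoded)
```

A hex dump of the encoded bytes reads directly as the decimal digits, which is much of the appeal; two bytes hold at most 9999 this way, whereas the same two bytes hold up to 65535 as a plain binary integer.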
Could it be any more confusing? Actually, yes. The x86 instruction set seems to have taken a cue from quantum mechanics: The act of observing it changes it. These ST0–ST7 registers look exactly like general-purpose floating point registers from some positions and like a stack from others. With the last of the Pentium series, Intel added something new: MMX. Now, ST0–ST7 can also look like 64-bit MMX registers, used for integer-vector operations. This decision, like the rest of x86, looked sensible at the time. Because MMX and x87 registers were the same, existing operating systems could context-switch between applications without needing to store any additional state.
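The context-switch argument can be sketched with a toy register file in which each 64-bit MMX register is just a view onto the low 64 bits of an 80-bit x87 register. The class and method names below are hypothetical, and the save and restore methods are stand-ins for whatever an operating system already did to preserve x87 state; the detail that an MMX write fills the upper 16 bits with ones follows Intel's documented behaviour.

```python
import struct


class AliasedRegisterFile:
    """Toy model: eight 80-bit registers shared by x87 and MMX views."""

    def __init__(self):
        self.regs = [bytearray(10) for _ in range(8)]

    def write_mmx(self, n, value):
        """Write a 64-bit integer vector into MMn (little-endian)."""
        self.regs[n][0:8] = struct.pack("<Q", value)
        # An MMX write sets the remaining 16 bits (the x87 sign and
        # exponent field) to all ones.
        self.regs[n][8:10] = b"\xff\xff"

    def read_mmx(self, n):
        """Read MMn: simply the low 64 bits of physical register n."""
        return struct.unpack("<Q", self.regs[n][0:8])[0]

    def save_state(self):
        """Save the x87 registers; MMX contents come along for free."""
        return b"".join(bytes(r) for r in self.regs)

    def restore_state(self, state):
        """Restore registers from a blob produced by save_state."""
        for n in range(8):
            self.regs[n][:] = state[10 * n:10 * (n + 1)]


task_a = AliasedRegisterFile()
task_a.write_mmx(3, 0x0001000200030004)   # four packed 16-bit integers

saved = task_a.save_state()               # an MMX-unaware OS saves x87 state

resumed = AliasedRegisterFile()
resumed.restore_state(saved)              # MMX data survives the round trip
```

The flip side, which the model also makes visible, is that x87 and MMX code are fighting over the same storage: writing MM3 clobbers whatever floating point value was in that physical register.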