- CPU Wars, Part 1: When the Chips Are Down
- Predicting the Future
- The EPIC Battle Between CISC and RISC
- Scalable Is (Still) the New Fast
- Do We Still Need x86?
- What's Next?
The EPIC Battle Between CISC and RISC
Early CPUs were intended to be programmed in assembly language, and their instruction sets looked rather like a high-level programming language. The most famous example of this design was the VAX, which included things like an "evaluate polynomial" instruction.
Later versions of these chips didn't actually implement all of these instructions in hardware; they had a core instruction set that covered the basic features and implemented the rest in microcode: the publicly visible instructions were translated into sequences of micro-operations (μops) for execution. This approach had the advantage that, since many of the instructions were implemented in software, it was cheap to release bug fixes. The original floppy disk was designed for distributing microcode updates.
As higher-level languages became popular, more code was generated by compilers than by humans. Compiler writers discovered that the complex instructions didn't map directly onto language constructs, so they used the simpler ones instead; the more complex instructions were just wasting space on the die.
Thus the idea of exposing only a simple instruction set to compiler writers was born. Reduced Instruction Set Computing (RISC) wasn't just about a small instruction set; it was about orthogonality. Older CISC designs often had several ways of doing the same thing, but compiler writers would just use the most efficient one, leaving the rest as dead weight on the die.
Early RISC designs had very few instructions. Most omitted even multiply and divide instructions, since these operations could be implemented using combinations of adds and shifts. This turned out not to be such a great idea. The minimum amount of time in which an instruction can complete is one cycle, so chips with dedicated multiply and divide instructions were eventually able to complete those operations in fewer cycles than a chip that had to execute a long sequence of shifts and adds, especially for floating-point values, where extra normalization steps are required and the mantissa and exponent must be handled separately.
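As a rough sketch of the technique (this code is mine, not taken from any particular chip's libraries), an integer multiply built purely from shifts and adds looks like this:

```c
#include <stdint.h>

/* Multiply two 32-bit values using only shifts and adds, the way a chip
 * without a multiply instruction would have to do it: one test, one
 * optional add, and two shifts per bit of the multiplier. */
uint32_t shift_add_multiply(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1)          /* low bit of the multiplier set?        */
            result += a;    /* ...then add the shifted multiplicand  */
        a <<= 1;            /* next bit position of the multiplicand */
        b >>= 1;            /* consume one bit of the multiplier     */
    }
    return result;
}
```

A 32-bit multiply done this way can take a few dozen instructions, each needing at least one cycle, which is exactly why dedicated multiply hardware eventually won.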
Modern RISC CPUs include quite a lot of features, and RISC is almost a misleading name. Typically, they're still load-store architectures (that is, operations involve loading values into registers, operating on them, and then storing the results back to memory), while CISC architectures have instructions that operate directly on values in memory (implemented internally as load-operate-store μops). The instruction sets are still more or less orthogonal; there's only one sensible way of doing any given thing. Of course, in theory you could implement almost any other instruction with just a load, a store, an add, and a conditional jump, but more complex instructions are added only where they're less costly than the equivalent sequence of simpler instructions. An example is the vector unit found on many RISC chips, which executes a single instruction on two to four values at once.
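As a minimal sketch of what such a vector unit does (written with the GCC/Clang vector extension purely for illustration; the article doesn't tie this to any particular instruction set), a single add here operates on four floats at once:

```c
#include <stdio.h>

/* Four 32-bit floats packed into one 128-bit vector value. */
typedef float v4sf __attribute__((vector_size(16)));

int main(void)
{
    v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};
    v4sf b = {10.0f, 20.0f, 30.0f, 40.0f};
    v4sf c = a + b;  /* typically compiles to one vector add on CPUs that have one */

    for (int i = 0; i < 4; i++)
        printf("%f\n", c[i]);
    return 0;
}
```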
Intel's x86 is the last surviving CISC architecture, and a particularly baroque one, including things like string-comparison instructions. All x86 CPUs since the Pentium Pro have contained a more RISC-like core and translated these instructions into sequences of μops that are executed internally. Starting with the Core microarchitecture, Intel has also done this in reverse, fusing certain sequences of μops back together so that they can be executed as a single internal operation.
The only real difference between a RISC and a CISC chip these days is the public instruction set; the internal instruction sets are likely to be similar. RISC and CISC are not the only alternatives, however. RISC came from a desire to simplify the core, and a group at Yale in the early 1980s worked out that you could take this idea even further. A pipelined, superscalar CPU has to do a lot of work at run time to determine which instructions can safely be executed concurrently. Imagine the following sequence of operations:
1. A = B + C
2. D = E + F
3. G = A + D
The first two of these operations are completely independent, but the third can't be executed until both of the others have finished. The CPU has to devote some effort to detecting this situation and making sure that the third operation waits. A CPU that supports out-of-order execution might try to fill the gap by moving some later, independent instructions between operations 2 and 3.
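In rough C terms (the variable names here are mine, chosen to match the list above), the pattern looks like this; the third addition has to wait, but unrelated later work can be pulled into the gap by an out-of-order core or by the compiler's instruction scheduler:

```c
/* A sketch of the dependency chain above, plus one extra independent
 * operation that can be moved into the stall. */
int dependency_example(int b, int c, int e, int f, int x, int y)
{
    int a = b + c;  /* 1: independent                                  */
    int d = e + f;  /* 2: independent of 1                             */
    int g = a + d;  /* 3: needs the results of both 1 and 2            */
    int h = x + y;  /* later work with no dependencies; an out-of-order
                       CPU can execute it while 3 waits                */
    return g + h;
}
```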
A Very Long Instruction Word (VLIW) processor doesn't do this at all. Instead, instructions are grouped by the compiler into blocks that can be executed independently. This technique requires the compiler to do a lot more work: it has to detect these dependencies at compile time, and it also has to know which execution units the CPU has. For example, there's no point in issuing a block of four integer operations to a CPU that has two integer units, a floating-point unit, and two load-store units. This arrangement meant that a lot of long instructions had empty slots in them (wasting a lot of instruction cache space), and that a next-generation CPU with more execution units couldn't run existing code any faster.
Intel took the VLIW concept and tweaked it slightly to produce Explicitly Parallel Instruction Computing (EPIC). Instead of fixed-length blocks of parallel instructions, Intel added a flag to the instruction encoding that marks the start and end of a parallel block. If a long run of instructions can be executed in parallel, a future generation can execute it faster; if a block is very short, it doesn't waste any space.