- CPU Wars, Part 3: Put Your Left ARM In
- Itanium (or, What Happens If We Add Another Few MBs of Cache?)
- The Obligatory x86 Segment
The Obligatory x86 Segment
Sadly, I can’t avoid talking about x86. We all use x86 chips, so a huge amount of money is made selling them. Every year, Intel and AMD ask, "How can we make our chips better?" and get the answer "Let’s try throwing an enormous heap of money at them and see what happens." Typically, this strategy leads to some quite uninspiring—although functional—designs.
Since the Pentium era, which was the first generation when AMD’s own designs were competitive, the two companies have taken it in turns to hold the performance lead. Until 2006, AMD’s K8 microarchitecture (the Athlon 64 and Opteron chips) held a good lead over Intel. Now, the situation is reversed.
Intel’s previous microarchitecture, NetBurst, was widely regarded as a complete disaster. It ran at high clock speeds, executed very few instructions per clock, and generated huge amounts of heat. It was completely inappropriate for low-power applications such as laptops or high-density servers (coincidentally, the two fastest-growing markets for x86 chips).
For laptops, Intel dug out the Pentium III design and got its Israeli team to update that a bit. Since the Pentium III design was quite old, adding extra cache was easy. SSE2 (the then-current version of Intel’s vector extensions) was also added, and the branch predictor from NetBurst (one of the few things it did well) was incorporated. This chip, the Pentium-M, helped Intel hang onto much of its market while working on the real successor to NetBurst.
Intel has a long history of confusing product names, and seems in no hurry to abandon the practice. The last processor released using the Pentium-M microarchitecture was sold as the Core, while the first processor based on the Core microarchitecture was the Core 2.
The Core 2 was the chip that finally gave Intel the edge over AMD—for a while. It includes x86-64 support in the form of the EM64T extensions (similar to AMD64, but not quite identical). One of the most interesting features of the Core microarchitecture is the addition of µop fusion. All x86 chips since the Pentium have had a RISC-like core and have broken x86 instructions into simpler instructions (µops) that run on the real core. The Core 2 is capable of reassembling some sequences of µops into more complex instructions that can still be executed in a single clock cycle.
Both AMD and Intel are now shipping chips with hardware support for virtualization. This support is particularly important for x86, which incorporates a small number of sensitive instructions that don’t trap when executed in unprivileged modes. Virtualizing x86 therefore requires hacks such as binary rewriting, which incur a speed penalty. Most other architectures don’t have this limitation.
On the roadmap for 2007 is SSE4. While updates to vector instruction sets usually aren’t that exciting, SSE4 might be, because it’s rumored to include scatter-gather instructions. Typically, vector units require their data to be stored contiguously in memory, and aligned on a vector-sized boundary. This approach is fine for code developed for the vector unit, and GCC has some extensions that make this very easy to do (see my article "Vector Programming with GCC"), but it’s difficult for existing code. Scatter-gather instructions would allow values to be loaded into a vector from nonadjacent locations in memory, making it much easier for a compiler to make use of vector instructions.
One final parting shot from Intel is its intention to put x86 in the embedded space with a Pentium-M System on Chip (SoC). This will consume 22W at 1.2 GHz, which seems a little high compared with offerings from outside x86-land, such as PA Semi’s part, which dissipates a maximum of 15W with two cores at 1.5 GHz. I suspect that this new product isn’t aimed at mobile devices as much as at things like the Apple TV. In this market, the slightly increased power usage is offset by the fact that it runs exactly the same software that the development computer uses.
AMD is also trying to push x86 into the embedded space, but with a much more conservative design. Its Geode chips are basically 486 designs, clocked up to the 200–300 MHz range. I have a Geode-based system that I use as a router, and it draws about 7W for the whole system.
One feature over which AMD and Intel fans will always argue is the front-side interconnect. Intel uses a classical Front Side Bus (FSB) design, while AMD has moved to HyperTransport (HT). With current FSB speeds, both of these systems translate to "more bandwidth than you need" for the vast majority of users. Things get a bit more interesting when you start to scale up the number of processors.
Having the memory controller on-die also gives AMD an advantage related to virtualization. Each virtual machine needs to have its own virtual memory space, which the operating system then partitions for applications. New AMD chips can handle this extra layer of indirection more efficiently, since they can offload much of it to the on-die memory controller, while Intel chips must rely on support from external chips, and thus get less tight integration in their VM extensions.
An FSB design means that all of the RAM is the same distance from all of the CPUs. This arrangement is quite convenient, since most operating systems like to pretend that they have a flat address space. The problem comes from the fact that the bandwidth to the RAM is shared, so although the RAM is the same distance from every CPU, each CPU’s share of that bandwidth shrinks as you add more CPUs.
HyperTransport is a point-to-point interconnect developed by an industry consortium of practically everyone except Intel. One big advantage of HT is that it isn’t tied to a specific processor. In theory, you can plug any HT CPU into an HT motherboard. This year, some of the RISC vendors are expected to start releasing HT chips (Sun and IBM are both members of the HyperTransport Consortium), which should reduce development costs. The big bonus is likely to come from the ability to use HT to communicate with coprocessors. Since the acquisition of ATi, AMD now has a set of talented GPU designers. It seems logical to expect to see them producing HT-enabled GPUs in the next year. These systems will have a very low CPU-to-GPU communication latency, making them ideal for general-purpose GPU work.
Intel is also likely to begin moving its GPU designs on-die. We may well end up with a situation similar to the SSE vs. 3DNow! divide, in which each manufacturer includes incompatible 3D-targeted instructions. Exactly what form these instructions will take is not yet certain.
Companies such as Xilinx are also producing HT-enabled devices, but in their case the devices are FPGAs. Having an FPGA on the HT bus means that an application (or, more typically, a kernel module) can load a coprocessor design, giving some very fast application-specific logic.
AMD chips currently integrate a memory controller on-die. This arrangement makes memory access for the local CPU very fast, but other chips have one or more extra hops to access remote memory. This solution is good when you have lots of CPUs, since the operating system scheduler can keep processes near their memory, but for small numbers of chips it can be a disadvantage when compared to a front-side bus design, as long as the FSB has enough bandwidth to keep all of the cores fed. Interestingly, HyperTransport is a message-passing interconnect, which is commonly used to provide the illusion of shared memory. Exposing it directly to the programmer as a high-speed message-passing mechanism could yield some very interesting results.
It remains to be seen how AMD will react to Intel’s new microarchitecture. The K9 and K10 chips, intended to replace the K8 microarchitecture used in the Opteron and Athlon 64, have both been canceled. Some of the features expected from them have been moved into the K8L designs. In some ways, AMD seems to be heading in a direction opposite to that of the rest of the industry: adding large reorder buffers while others are abandoning out-of-order execution in favor of more contexts. Quad- and octo-core chips seem inevitable for both AMD and Intel, although AMD’s short-term strategy is likely to rely more heavily on custom coprocessors connected via HT.
Even in x86 land, it appears, some very exciting innovation is going on. I expect 2007 to be a very interesting year for new CPUs.