Itanium (or, What Happens If We Add Another Few MBs of Cache?)
Although it’s largely responsible for the death of the Alpha, I quite like the Itanium. The design takes the RISC philosophy to its logical conclusion: the compiler does pretty much everything it possibly can, leaving the CPU to simply execute.
Intel has always had a love/hate relationship with the x86 instruction set. Intel acts like parents whose dropout hippie son has just become a billionaire: pleased with the success, but embarrassed that it had to be that particular son who achieved it. There have been a few attempts to replace it. The first that I remember was the i860. It used a Very Long Instruction Word (VLIW) design (the predecessor to Itanium’s EPIC) and had an enormous (for the time) peak throughput of around 66 MFLOPS. Unfortunately, most compilers could only produce code that achieved 10 MFLOPS, making the chip slower than the i486 in real-world use. It was still fast for some workloads, however, and so it found its way into a few graphics coprocessors.
Itanium is Intel’s second major attempt to ditch x86, and so far it has met with very limited success. The first version sold only a few thousand units and was reclassified by Intel as a "research project." Itanium 2 addresses some of those shortcomings, although much of its performance comes from the fact that it has an insane amount of cache. Like the i860, it requires a lot of clever work from the compiler to get close to peak throughput. Even so, it is popular in scientific computing environments, where floating-point performance matters more than ease of development.
The latest Itanium design is the Montecito Itanium 2. As you might expect, it’s a dual-core design, with two contexts per core. That it has only two contexts is somewhat surprising; I would expect the Itanium architecture to benefit hugely from more. Itanium instructions are fetched in bundles of three, and the compiler groups them into blocks that can be executed in parallel. Having two pools of instructions from which to issue should make it much easier to keep the execution units fed. One advantage of EPIC is that doubling the number of execution units will (for some workloads) double the throughput. Adding more contexts means that these extra units can be used by other threads most of the time, or by one thread that can use them all, giving a good balance between overall throughput and single-thread performance. This goal is impossible to reach with current Itanium 2 processors, however, since they employ coarse-grained multithreading and switch between the two contexts only on a cache miss.
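To make the compiler’s role concrete, here’s a minimal C sketch (the function and data are hypothetical, not from any real codebase) of the kind of independent operations an EPIC compiler can mark as a single explicitly parallel group. Because the three multiplies don’t depend on one another, the compiler can state that fact in the instruction encoding instead of leaving out-of-order hardware to rediscover it at run time.

/* Hypothetical example: three independent multiplies that an EPIC
   compiler could place in one explicitly parallel group, letting all
   three issue to separate floating-point units in the same cycle. */
void scale3(double *a, double *b, double *c, double k)
{
    double x = a[0] * k;   /* independent of the two lines below */
    double y = b[0] * k;   /* independent of x and z */
    double z = c[0] * k;   /* independent of x and y */

    a[0] = x;
    b[0] = y;
    c[0] = z;
}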
The latest Itanium processors include a set of virtualization instructions similar to those found in recent x86 chips; these are called VT-i for Itanium and VT-x for x86. This support is likely to become more important over the next few years, as virtualization becomes a standard method of deployment.
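As a rough illustration of the x86 side only (this is a sketch, and says nothing about how VT-i is detected on Itanium), VT-x shows up as the VMX feature flag, bit 5 of ECX returned by CPUID leaf 1, which GCC and Clang expose through the __get_cpuid helper. Note that firmware can still disable the feature even when the bit is set.

#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang wrapper for the x86 CPUID instruction */

/* Sketch: report whether the CPU advertises VT-x (the VMX feature,
   bit 5 of ECX in CPUID leaf 1). x86-only; not applicable to VT-i. */
int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 1 not supported\n");
        return 1;
    }

    printf("VT-x (VMX): %s\n", (ecx & (1u << 5)) ? "advertised" : "not advertised");
    return 0;
}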
As mentioned earlier, Itanium chips have enormous caches. The latest versions have up to 24MB of on-die level 3 cache, plus 1MB of instruction cache and 256KB of data cache at level 2. Of the 1.72 billion transistors on a top-of-the-line Itanium, 1.67 billion are cache. Note the large size of the instruction caches; this is a consequence of Itanium machine code being much larger than that of other architectures. The larger code also has knock-on effects on the amount of external bandwidth required.
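The code-size point follows from simple arithmetic: each IA-64 bundle is 128 bits and holds three instructions, or roughly 5.3 bytes per instruction. The quick comparison below assumes a ballpark average of 3.5 bytes per x86 instruction (an assumption for illustration, not a measurement), and it understates the real gap, since Itanium binaries also carry no-ops for bundle slots the compiler can’t fill.

#include <stdio.h>

/* Back-of-the-envelope code-density comparison. The IA-64 figure follows
   from the documented encoding (128-bit bundles of three instructions);
   the x86 average is an assumed value used only for illustration. */
int main(void)
{
    const double ia64_bytes = 16.0 / 3.0;  /* 128-bit bundle / 3 instructions */
    const double x86_bytes  = 3.5;         /* assumed x86 average */

    printf("IA-64: %.2f bytes per instruction\n", ia64_bytes);
    printf("x86 (assumed): %.2f bytes per instruction\n", x86_bytes);
    printf("IA-64 encoding is roughly %.0f%% larger per instruction\n",
           (ia64_bytes / x86_bytes - 1.0) * 100.0);
    return 0;
}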
What can we expect from the Itanium family in the rest of 2007? More of the same. The design is unlikely to change much, although faster clock speeds and front-side buses are likely. As with the i860, Itanium performance is highly dependent on the compiler, so at this point improvements to the compiler are more likely to give a performance boost than improvements to the hardware.