Hi, SPARC-y
Sun is one of my favorite technology companies, releasing Free Software back before Open Source existed, and long before it was cool. Sun has given a huge amount to the community (OpenSolaris, OpenOffice, Java, NFS, etc.), and yet continually creates the impression that it couldn’t market itself out of a wet paper bag.
Back when it was the fashion, Sun started designing its own CPUs. The SPARC chips had some very nice features and a clean RISC design. Of the surviving RISC chips, they’re probably the most elegant. (Although I’ll admit to being a pathetic Alpha fanboy at heart.) One of the nicest things about SPARC is that it implements a feature from Berkeley RISC II: register windows. On most CPUs, you have a set of registers that are all available at one time. On RISC II, the registers were split into blocks, with only one window onto them visible at a time. Every function call or return would move the window up or down, with some overlap. The overlapping registers were used for passing parameters and returning values, while the non-overlapping segment was used for local variables. Rather than having a linear block of registers, SPARC specifies a circle, so running out of windows just means some additional loads and stores to spill the oldest window to memory, rather than a serious problem. The really nice thing about this architecture is that it allows function-return addresses to be stored in a register. While the return address stays in a register rather than on the stack, it cannot be overwritten by buffer overflows on the stack, meaning that a stack overflow on SPARC gives data corruption, but rarely arbitrary code execution.
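The overlapping-window scheme is easier to see in a toy model. The sketch below is an illustration only, not a description of any real SPARC implementation: the class and method names are made up for this example, the window slides "forward" on a call (real hardware is free to move the pointer the other way), and the spill to memory that happens when the circle is full is left out entirely. It does use SPARC's split of each window into eight "in", eight "local", and eight "out" registers.

```python
class RegisterWindows:
    """Toy model of overlapping register windows in a circular register file."""

    REGS_PER_WINDOW = 16  # 8 locals + 8 outs; the 8 ins are shared with the caller

    def __init__(self, nwindows=8):
        self.nwindows = nwindows
        self.regs = [0] * (self.REGS_PER_WINDOW * nwindows)
        self.cwp = 0  # current window pointer (hypothetical name for this sketch)

    def _index(self, kind, n):
        if not 0 <= n < 8:
            raise IndexError("each group holds 8 registers")
        offset = {"in": 0, "local": 8, "out": 16}[kind] + n
        # The modulo makes the file circular: the last window's outs
        # are the first window's ins.
        return (self.cwp * self.REGS_PER_WINDOW + offset) % len(self.regs)

    def get(self, kind, n):
        return self.regs[self._index(kind, n)]

    def set(self, kind, n, value):
        self.regs[self._index(kind, n)] = value

    def save(self):
        """Function call: slide to the next window."""
        self.cwp = (self.cwp + 1) % self.nwindows

    def restore(self):
        """Function return: slide back to the caller's window."""
        self.cwp = (self.cwp - 1) % self.nwindows


rw = RegisterWindows(nwindows=4)
rw.set("out", 0, 42)         # caller passes an argument in its first "out" register
rw.set("local", 0, 7)        # caller's own local
rw.save()                    # call
arg_seen = rw.get("in", 0)   # callee reads the argument from its first "in" register
rw.set("local", 0, 99)       # callee's local lives in a different physical register
rw.set("in", 0, 43)          # callee leaves a return value in the shared register
rw.restore()                 # return
result = rw.get("out", 0)
caller_local = rw.get("local", 0)
```

After the call and return, the caller finds the return value in the same register it used to pass the argument, and its own local is untouched, all without a single load or store.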
The SPARC specification allows from 3 to 32 register windows. Simpler designs used for embedded systems use three, whereas chips designed for larger workloads will use the full 32. Note the word specification. In 1989, the SPARC trademark was assigned to SPARC International, Inc., which is responsible for promoting the SPARC architecture. Unlike most other ISAs, SPARC is a fully open specification. The two latest versions, eight and nine, define 32-bit and 64-bit chips, respectively. Back in 1997, Jiri Gaisler developed LEON, an implementation of the SPARCv8 architecture, for the European Space Agency, and released the VHDL code under the LGPL. This license allowed anyone who wanted a 32-bit SPARC CPU to build one (as an ASIC or on an FPGA) with no license fees.
The close association between SPARC and Open Source was reinforced in 2006, when Sun released the HDL (Verilog this time, not VHDL) code for its flagship T1 CPU. Since the T1 is a relatively complicated processor, it’s not ideal for study. Simply RISC attempted to address this lack by creating the S1 design, using a single core from the T1 architecture and adding a Wishbone bridge. Wishbone is the standard intra-chip interconnect used by the OpenCores project, so the S1 can be used easily in designs incorporating other components from the project.
Of course, getting custom chips made is still beyond the capabilities of most hobbyists. All you need to build Free Software is a compiler; "compiling" an HDL design, by contrast, can mean several things. Simulating the design is possible, but LEON only achieves about 10 MIPS simulated on a 1 GHz x86 chip. That’s appropriate for testing, but not for real use. An FPGA implementation can get reasonable performance, but is far less power-efficient than dedicated hardware, although some of the unused gates can be used for application-specific logic, which closes much of this gap.
Low-volume fab runs are starting to become affordable, if you’re willing to put up with a process technology that’s one or two generations behind the cutting edge. Any run of less than about 100 tends to be prohibitively expensive; but when you start getting into the thousands or tens of thousands, the price comes down quickly. While not within the grasp of the individual tinkerer, it’s potentially feasible for a relatively large open hardware project to produce an annual release from a collaborative design. I suspect that Sun is hoping that the Open Source design process will help it to produce better chips, using an expensive process for its own servers and letting anyone else who is willing to make a similar investment produce chips of their own.
Sun spent a few years in a "no man’s land" between UltraSPARC generations. Sun’s aging designs weren’t competitive, and the new generation was taking a long time to arrive, so Sun was forced to plug the gap in its product line with AMD Opteron chips. The Opteron (and now Xeon) systems are still sold, but the T1 started to take back some of Sun’s market.
The T1 is the embodiment of the "lots of simple cores" architecture discussed in part 1 of this series. Each core is quite simple; it doesn’t support out-of-order execution, or even floating-point operations. The chip has a single floating-point unit, shared between the eight cores. Floating-point operations thus are very expensive, since they require some extra locking and register copying. The T2, due out in 2007, will have a floating-point unit on each core—although most other chips have more than one, so this is still comparatively limited.
The other main feature of the T1 is that each core has four hardware thread contexts. This helps to compensate for the lack of out-of-order execution; instructions are dispatched from each thread in turn, increasing the delay between instructions from a single thread so that pipeline stalls from dependencies (the problem out-of-order was intended to fix) are uncommon. When a stall does occur, the stalled thread can just be skipped the next time it would issue an instruction. The same mechanism helps guard against stalls caused by cache misses; the other threads get a bit more run time while the stalled thread waits for the data to be fetched from memory.
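A rough way to see why interleaving threads hides dependency stalls is to count issue slots. The sketch below is a deliberately crude model, not the T1's actual pipeline: the function name and parameters are invented for this example, every instruction is assumed to depend on the previous one in its own thread, and `latency` (an assumed figure) is the number of cycles before a thread may issue again. The core issues at most one instruction per cycle, taking threads in turn and skipping any that are not ready.

```python
def cycles_to_finish(num_threads, instrs_per_thread, latency):
    """Cycles needed to issue every instruction from every thread."""
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads  # first cycle at which each thread may issue again
    next_thread = 0
    cycle = 0
    while any(remaining):
        # Scan the contexts in round-robin order, skipping any that are
        # finished or still waiting on their previous instruction.
        for i in range(num_threads):
            t = (next_thread + i) % num_threads
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                next_thread = (t + 1) % num_threads
                break
        cycle += 1  # one issue slot per cycle, whether used or wasted
    return cycle


one_thread = cycles_to_finish(1, 100, 4)    # 100 instructions in 397 cycles
four_threads = cycles_to_finish(4, 100, 4)  # 400 instructions in 400 cycles
```

With a single context, three issue slots out of every four are wasted waiting on the previous result. With four contexts, by the time the core comes back around to a thread its previous instruction has completed, so every slot is used and the core does roughly four times the work in the same time.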
The T2 is expected to be an incremental upgrade to the T1, with a floating-point unit on each core and twice as many contexts. This family is fairly specialized; it performs very well for serving things like web applications, but isn’t a perfect general-purpose chip. The Rock, also due out in 2007, is intended to put SPARC back on the general-purpose map.
Like the T1, Rock can execute 32 threads at once. While the T1 has eight cores with four contexts, the Rock has sixteen cores with two contexts each. These are more general than the specialized T1 cores, with a floating-point/vector unit on each chip. The thing that marks the Rock cores as particularly unusual is the amount of cache each has.
While everyone else is adding 2MB or more of cache (way more, in Itanium’s case) to every core, Sun is heading in the opposite direction. The Rock is due to debut with 2MB of level-2 cache for the entire chip, in four banks, shared between sixteen cores. Even the amount of level-1 cache is reduced: only 32KB of instruction cache and 32KB of data cache, shared between each group of four cores.
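To put those figures in per-core terms, here is the arithmetic spelled out. The variable names are just for this example, and dividing a shared cache evenly is only a rough yardstick, since a shared cache isn't actually partitioned per core.

```python
KB = 1024
MB = 1024 * KB

cores = 16
l2_total = 2 * MB        # level-2 cache shared by the whole chip
l1_per_group = 32 * KB   # size of each level-1 cache (instruction or data)
cores_per_group = 4

l2_per_core = l2_total // cores                # 128 KB of level-2 per core
l1_per_core = l1_per_group // cores_per_group  # 8 KB of each level-1 cache per core
ratio = (2 * MB) // l2_per_core                # 16x less than a chip with 2MB per core
```

Each Rock core's notional share of level-2 cache is a sixteenth of what a 2MB-per-core design provides.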
The question is, have they gone completely mad? Most modern chips have a lot of cache, because cache misses cause a serious reduction in performance. Either this situation isn’t a problem for Rock, or it is a problem and performance will be atrocious. Multiple contexts get you a little bit of relief from cache misses, but only a bit. The performance of the Rock depends on a feature called Hardware Scout. The idea is that it will use spare execution units to scan ahead in the instruction stream, without actually executing the instructions, to ensure that what they need is in the cache by the time each instruction is actually executed. If this approach works, it will be a very impressive piece of technology. My guess would be that it will have serious problems on code that has a lot of computed jumps, making it unsuitable for certain classes of scientific computing problems and many rendering algorithms. Other applications could be blisteringly fast. Cache accounts for the majority of the die size of most modern processors. Sun has broken with tradition by devoting the majority of the die to execution units, which will have an enormous payoff if they can be kept fed with data.
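The bet behind a scout-style design can be illustrated with a toy cost model. Everything here is an assumption made for the sake of the example rather than a description of how Rock works: the function name, the 100-cycle miss penalty, the idea that the scout runs only while the main thread is stalled on a miss, the unbounded cache, and the `predictable` flag, which stands in for "the scout can work out this address without the data that hasn't arrived yet".

```python
def run_cycles(accesses, miss_penalty=100, scout_depth=0):
    """Total cycles for a list of (address, predictable) memory accesses."""
    cache = set()
    cycles = 0
    for i, (addr, _) in enumerate(accesses):
        cycles += 1  # one cycle to execute the instruction itself
        if addr not in cache:
            cycles += miss_penalty
            cache.add(addr)
            # While the main thread is stalled, the scout looks ahead and
            # prefetches every upcoming address it is able to work out.
            for ahead_addr, predictable in accesses[i + 1 : i + 1 + scout_depth]:
                if predictable:
                    cache.add(ahead_addr)
    return cycles


streaming = [(a, True) for a in range(64)]   # addresses known in advance
chasing = [(a, False) for a in range(64)]    # each address depends on loaded data

no_scout = run_cycles(streaming)                        # 6464: every access misses
with_scout = run_cycles(streaming, scout_depth=16)      # 464: only 4 misses remain
scout_defeated = run_cycles(chasing, scout_depth=16)    # 6464: scout cannot help
```

In this model the scout turns 64 misses into 4 on the predictable workload, and does nothing at all when every address depends on data still in flight. That is the same split the guess above anticipates between code that will fly and code that will crawl.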