- 1.1 Background of the Cell Processor
- 1.2 The Cell Architecture: An Overview
- 1.3 The Cell Broadband Engine Software Development Kit (SDK)
- 1.4 Conclusion
1.2 The Cell Architecture: An Overview
In Randall Hyde’s fine series of books, Write Great Code, one of his fundamental lessons is that, for optimal performance, you need to know how your code runs on the target processor. Nowhere is this truer than when programming the Cell. It isn’t enough to learn the C/C++ commands for the different cores; you need to understand how the elements communicate with memory and one another. This way, you’ll have a bubble-free instruction pipeline, an increased probability of cache hits, and an orderly, nonintersecting communication flow between processing elements. What more could anyone ask?
Figure 1.1 shows the primary building blocks of the Cell: the Memory Interface Controller (MIC), the PowerPC Processor Element (PPE), the eight Synergistic Processor Elements (SPEs), the Element Interconnect Bus (EIB), and the Input/Output Interface (IOIF). Each of these is explored in greater depth throughout this book, but for now, it’s a good idea to see how they function individually and interact as a whole.
Figure 1.1 The top-level anatomy of the Cell processor
The Memory Interface Controller (MIC)
The MIC connects the Cell’s system memory to the rest of the chip. It provides two channels to system memory, but because you can’t control its operation through code, the discussion of the MIC is limited to this brief treatment. However, you should know that, like the PlayStation 2’s Emotion Engine, the first-generation Cell supports connections only to Rambus memory.
This memory, called eXtreme Data Rate Dynamic Random Access Memory, or XDR DRAM, differs from conventional DRAM in that it makes eight data transfers per clock cycle rather than the usual two or four. This way, the memory can provide high data bandwidth without needing very high clock frequencies. The XDR interface can support different memory sizes, and the PlayStation 3 uses 256MB of XDR DRAM as its system memory.
The PowerPC Processor Element (PPE)
The PPE is the Cell’s control center. It runs the operating system, responds to interrupts, and contains and manages the 512KB L2 cache. It also distributes the processing workload among the SPEs and coordinates their operation. If the Cell is an eight-horse coach, the PPE is the coachman: it feeds the horses and keeps them pulling in line.
As shown in Figure 1.2, the PPE consists of two operational blocks. The first is the PowerPC Processor Unit, or PPU. This processor’s instruction set is based on the 64-bit PowerPC 970 architecture, used most prominently as the CPU of Apple Computer’s Power Mac G5. The PPU executes PPC 970 instructions in addition to other Cell-specific commands, and is the only general-purpose processing unit in the Cell. This is why Linux is installed to run on the PPU and not on the other processing units.
Figure 1.2 Structure of the PPE
But the PPU can do more than just housekeeping. It contains IBM’s VMX engine for Single Instruction, Multiple Data (SIMD) processing. This means the PPU can operate on groups of numbers (e.g., multiply two sets of four floating-point values) with a single instruction. The PPU’s SIMD instructions are the same as those used in Apple’s image-processing applications, and are collectively referred to as the AltiVec instruction set. Chapter 8, “SIMD Programming on the PPU, Part 1: Vector Libraries and Functions,” is dedicated to AltiVec programming on the PPU.
Another important aspect of the PPU is its capacity for simultaneous multithreading (SMT). The PPU allows two threads of execution to run at the same time, and although each receives a copy of most of the PPU’s registers, they have to share the basic on-chip execution blocks. This doesn’t provide the same performance gain as running the threads on separate processors, but it lets you maximize usage of the PPU’s resources. For example, if one thread is waiting on the PPU’s memory management unit (MMU) to complete a memory write, the other can perform mathematical operations in the vector execution unit (VXU).
The second block in the PPE is the PowerPC Processor Storage Subsystem, or PPSS. This contains the L2 cache along with registers and queues for reading and writing data. The cache plays a very important role in the Cell’s operation: not only does it perform the regular functions of an L2 cache, it’s also the only shared memory bank in the device. Therefore, it’s important to know how it works and maintains coherence. Chapter 6, “Introducing the PowerPC Processor Unit (PPU),” covers this topic in greater depth.
The Synergistic Processor Element (SPE)
The PPU is a powerful processor, but it’s the Synergistic Processor Unit (SPU) in each SPE that makes the Cell such a groundbreaking device. These processors are designed for one purpose only: high-speed SIMD operations. Each SPU contains two parallel pipelines that execute instructions at 3.2GHz. In only a handful of cycles, one pipeline can multiply and accumulate 128-bit vectors while the other loads more vectors from memory.
SPUs weren’t designed for general-purpose processing and aren’t well suited to run operating systems. Instead, they receive instructions from the PPU, which also starts and stops their execution. The SPU’s instructions, like its data, are stored in a unified 256KB local store (LS), shown in Figure 1.3. The LS is not a cache; it’s the SPU’s own private memory for instructions and data. This, along with the SPU’s large register file (128 128-bit registers), is the only memory the SPU can directly access, so it’s important to have a deep understanding of how the LS works and how to transfer its contents to other elements.
Figure 1.3 Structure of the SPE
The Cell provides hardware security (or digital rights management, if you prefer) by allowing users to isolate individual SPUs from the rest of the device. While an SPU is isolated, other processing elements can’t access its LS or registers, but it can continue running its program normally. The isolated processor will remain secure even if an intruder acquires root privileges on the PPU. The Cell’s advanced security measures are discussed in Chapter 14, “Advanced SPU Topics: Overlays, Software Caching, and SPU Isolation.”
Figure 1.3 shows the Memory Flow Controller (MFC) contained in each SPE. This manages communication to and from an SPU, and by doing so, frees the SPU for crunching numbers. More specifically, it provides a number of different mechanisms for interelement communication, such as mailboxes and channels. These topics are discussed in Chapters 12, “SPU Communication, Part 1: Direct Memory Access (DMA),” and 13, “SPU Communication, Part 2: Events, Signals, and Mailboxes.”
The MFC’s most important function is to enable direct memory access (DMA). When the PPU wants to transfer data to an SPU, it gives the MFC an address in system memory and an address in the LS, and tells the MFC to start moving bytes. Similarly, when an SPU needs to transfer data into its LS, it can not only initiate DMA transfers, but also create lists of transfers. This way, an SPU can access noncontiguous sections of memory efficiently, without burdening the central bus or significantly disturbing its processing.
The Element Interconnect Bus (EIB)
The EIB serves as the infrastructure underlying the DMA requests and interelement communication. Functionally, it consists of four rings, two that carry data in the clockwise direction (PPE → SPE1 → SPE3 → SPE5 → SPE7 → IOIF1 → IOIF0 → SPE6 → SPE4 → SPE2 → SPE0 → MIC) and two that transfer data in the counterclockwise direction. Each ring is 16 bytes wide and can support three data transfers simultaneously.
A DMA transfer can carry a payload of 1, 2, 4, 8, or 16 bytes, or a multiple of 16 bytes up to a maximum of 16KB. No matter how small the payload, each transfer occupies a full 128-byte bus transaction (eight 16-byte bus transfers), so small transfers waste most of the bandwidth they consume. As Chapter 12 explains, this is why DMA becomes more efficient as the data transfers increase in size.
The Input/Output Interface (IOIF)
As the name implies, the IOIF connects the Cell to external peripherals. Like the memory interface, it is based on a Rambus technology: FlexIO. The FlexIO connections can be configured for data rates from 400MHz to 8GHz, and with the high number of connections on the Cell, its maximum I/O bandwidth approaches 76.8GB/s. In the PlayStation 3, the IOIF connects to Nvidia’s RSX graphics processor. The IOIF can be accessed only by privileged applications, and for this reason, interfacing with the IOIF lies beyond the scope of this book.