Hardware IPC
One problem with all of these execution units is communicating among them. Moving data between the SIMD unit and the scalar part of a modern CPU is relatively expensive; moving data between the CPU and the GPU, even more so. Communicating between cores typically involves going via a shared cache, or via main memory if the cores don’t share a common cache.
The Transputer, produced in the 1980s, faced a similar problem. A Transputer system was built from a large number of cheap, relatively independent processors. Each one had four serial links, allowing it to talk very quickly to other processing units in close proximity. AMD's HyperTransport is similar in spirit, although it's generally used to implement shared memory rather than as a message-passing interface.
The closest descendant of the Transputer these days is the Cell. This design has a set of synergistic processing units (SPUs). Apart from having the highest buzzword density of any processor to date, these are interesting in the way that they process data. Most CPUs have a very fine-grained load-and-store mechanism: they load a word (typically 64 bits of data these days) from memory, process it, and write it out. This is a simplification; in practice, they'll typically interact with a layer of cache, which will look to a lower layer if it can't provide the data required. The Cell is different. Rather than providing a transparent cache, each SPU has a small amount of local memory. It loads a large chunk of data into this space in a single DMA transfer from main memory and then processes it. On the plus side, this means that you never get a cache miss, because all of your data is in "cache." The only difficulty is that you have to work out a way of partitioning your problem so that it can be solved one small block at a time.
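In code, the pattern looks something like the following C sketch. The dma_get()/dma_put() helpers here are hypothetical stand-ins for the Cell SDK's memory-flow-controller transfers (mfc_get()/mfc_put()), and the block size and kernel are illustrative assumptions, not real SPU code:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE (16 * 1024)  /* hypothetical chunk size; a real SPU has 256KB of local store */

    static uint8_t local_store[BLOCK_SIZE];

    /* Stand-ins for the Cell SDK's mfc_get()/mfc_put() DMA operations,
     * modelled here as plain memcpy() so the sketch compiles anywhere. */
    static void dma_get(void *ls, const uint8_t *main_mem, size_t n) { memcpy(ls, main_mem, n); }
    static void dma_put(uint8_t *main_mem, const void *ls, size_t n) { memcpy(main_mem, ls, n); }

    /* Hypothetical kernel: every access here touches local memory,
     * so nothing inside this loop can miss a cache. */
    static void process_block(uint8_t *block, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            block[i] ^= 0xFF;
    }

    /* Process a large buffer one local-store-sized block at a time. */
    void process_buffer(uint8_t *main_mem, size_t total)
    {
        for (size_t off = 0; off < total; off += BLOCK_SIZE) {
            size_t n = total - off < BLOCK_SIZE ? total - off : (size_t)BLOCK_SIZE;
            dma_get(local_store, main_mem + off, n);   /* one bulk transfer in     */
            process_block(local_store, n);             /* compute entirely locally */
            dma_put(main_mem + off, local_store, n);   /* one bulk transfer out    */
        }
    }

In practice, SPU code usually double-buffers this loop: while block n is being processed, the DMA for block n+1 is already in flight, hiding the transfer latency behind the computation.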
Once an SPU has completed processing a block of data, it might send it back to main memory. Another option is to pass it on to another SPU. This approach is potentially very interesting, but it creates some significant layout problems when you try to scale it to large numbers of cores. Each core will both consume and produce data, and most will then pass their output on to another core for further processing. The problem comes from the fact that the number of potential recipients is the number of cores. While it's easy to send a message to the nearest, say, four cores, sending it any further away is more difficult. This is even more complex in a system with heterogeneous cores, because some processes will need to run on specific areas of the chip.
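To make that cost concrete, here is a minimal C sketch, assuming a square mesh in which each core links only to its four nearest neighbours (the Transputer-style topology described above); the 8x8 size and the Manhattan-distance routing metric are illustrative assumptions, not a description of any shipping chip:

    #include <stdio.h>
    #include <stdlib.h>

    /* A hypothetical 2D mesh of cores, each linked to its four nearest
     * neighbours. A message to a non-adjacent core must be relayed hop
     * by hop, so its cost grows with the Manhattan distance between
     * sender and receiver. */
    typedef struct { int x, y; } core_id;

    static int hops(core_id from, core_id to)
    {
        return abs(from.x - to.x) + abs(from.y - to.y);
    }

    int main(void)
    {
        core_id producer   = { 0, 0 };
        core_id neighbour  = { 0, 1 };  /* adjacent: delivered in one hop */
        core_id far_corner = { 7, 7 };  /* opposite corner of an 8x8 mesh */

        printf("to neighbour:  %d hop(s)\n", hops(producer, neighbour));
        printf("to far corner: %d hop(s)\n", hops(producer, far_corner));
        return 0;
    }

Every extra hop consumes link bandwidth on the intermediate cores it passes through, which is why a pipeline whose stages end up placed far apart on the chip performs far worse than one whose stages are neighbours.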