A Taxonomy: Naming the Beasts
Not that long ago, everything inside the computer came to a halt while a line of text was printed on the operator's console. The government's need for speed funded some very creative work in the early 1960s, and the key to faster processing was overlap: doing more than one thing at a time. Experimental prototypes with multiple central processing units (CPUs), partitioned physical memories, and other innovative hardware and software developments brought about some new and exciting architectures.
The Illiac computer (built at the University of Illinois at Urbana-Champaign) was a vector processor with a novel architecture. Each of a string of worker computers was given an instruction to execute on its own portion of the data. One important benefit was that two lists of 10 numbers, for example, could be multiplied by 10 processors in a single instruction cycle! Mathematicians refer to a list of numbers as a vector, so these types of computers are commonly called vector processors. This was overlap in the extreme, provided that the software could exploit it.
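To make that concrete, here is a minimal C sketch, my own illustration rather than any Illiac code, of those ten multiplications. A conventional machine steps through the loop one element per trip; a vector processor issues a single multiply instruction and lets each of its processing elements handle one element of the vectors in the same cycle.

    #include <stdio.h>

    #define N 10

    int main(void)
    {
        double a[N] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        double b[N] = {10, 9, 8, 7, 6, 5, 4, 3, 2, 1};
        double c[N];

        /* A scalar machine executes this loop one element at a time.
           A vector processor applies one multiply instruction to every
           element at once, one processing element per element. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] * b[i];

        for (int i = 0; i < N; i++)
            printf("%g ", c[i]);
        printf("\n");
        return 0;
    }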
Michael J. Flynn (in his paper titled "Very High-Speed Computing Systems," in Proceedings of the IEEE, December 1966) gave us a way to categorize and name these new machines. His taxonomy focused on the number of instruction and data pathways that the architecture provided.
A uniprocessor was distinct from a multiprocessor (MP) because it could execute only a single instruction at a time: SI stood for single instruction. Similarly, if only a single item of data could be extracted from memory at a time, it was designated SD, for single data. The typical desktop computer is a uniprocessor, designated SISD. An MP with two or more CPUs and a single shared memory could simultaneously execute as many instructions as it had CPUs. Each of those instructions could access memory data simultaneously, so these kinds of machines were designated MIMD, for multiple instruction, multiple data.
From here it gets a little trickier. Remember the Illiac? Technically, it executed only one instruction at a time. (Although many CPUs executed that instruction, only one instruction was taken from program memory, so it counts as only one.) Because each slave computer had its own local memory, multiple pieces of data could be accessed at a time. This is called SIMD, for single instruction, multiple data. The last possible combination of instruction and data pathways is MISD. Many (including Flynn) say there is no such architecture. Some speculate on how such a machine might operate, but the classification is dubious, at best.
Two popular types of multiprocessor emerged in the 1970s: loosely coupled and tightly coupled. Loose meant that two or more computers shared a job (input) queue and a print (output) queue; tight meant that two or more CPUs shared the same physical memory. A loosely coupled multiprocessor could improve a commercial data center's throughput in terms of jobs per day. A tightly coupled MP ran parallel programs that let scientists predict tomorrow's weather before tomorrow came, and let soldiers complete a battlefield simulation before the battle ended! Parallel processing became an important asset to the scientific and military communities, but it was very expensive, costing millions of dollars.
In the 1980s, a doctoral student at Yale named Kai Li proposed a network of workstations that could emulate a tightly coupled MP. (Search http://www.yale.edu for the IVY system. Yale is rightly still buzzing about his groundbreaking work today!) Each PC maintained a portion of the virtual memory space, and software interrupts handled page faults by sending the requested pages of virtual memory over the network. Parallel programs ran more slowly, but at a fraction of the cost. An immediate benefit was being able to debug parallel algorithms without tying up the real parallel processor's time.
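To get a feel for the mechanism, the following C sketch mimics the local half of the trick on a modern Linux box. It is my own illustration, not IVY code, and it assumes 4KB pages: a region is mapped with no access permissions, the resulting page fault lands in a signal handler, and the handler makes the page usable again. In a real distributed shared memory system the handler would request the page's contents from the node that owns it over the network; here that hypothetical fetch is reduced to zero-filling the page.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE 4096    /* assumed page size */

    /* Fault handler: a real DSM system would ask the page's current
       owner for its contents over the network.  This sketch just makes
       the page accessible and zero-fills it. */
    static void fetch_page(int sig, siginfo_t *info, void *ctx)
    {
        char *page = (char *)((uintptr_t)info->si_addr & ~(uintptr_t)(PAGE - 1));
        (void)sig; (void)ctx;
        mprotect(page, PAGE, PROT_READ | PROT_WRITE);
        memset(page, 0, PAGE);   /* stand-in for the data received remotely */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = fetch_page;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* Pages marked PROT_NONE play the role of pages living on a
           remote node.  Error checking omitted for brevity. */
        char *shared = mmap(NULL, 4 * PAGE, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        shared[0] = 42;          /* faults; the handler "fetches" the page */
        printf("first byte of the shared region: %d\n", shared[0]);
        return 0;
    }

Every one of those "fetches" costs a trip through the kernel plus, in the real system, a network round trip, which is why the parallel programs ran more slowly than they would on genuine shared memory.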
This new distributed computing architecture became a modern version of the loosely coupled MP. Computer science research was flooded with extensions to this model for more than a decade. IVY (which doesn't stand for anything; Yale just uses that word a lot) even had a major influence on the early Beowulf project at NASA, discussed later in this article. And Beowulf, some say, put Linux on the map. (The January 1998 issue of Linux Journal has half a dozen articles about the first Beowulf systems. The August 2002 issue has a recap and an update on Beowulf II projects.)
The distributed MP became so popular with computer architects that it warranted its own taxonomy. Modern computer architecture texts refer to tightly coupled multiprocessors as UMA, which stands for uniform memory access. This is because each CPU experiences the same amount of delay time when it accesses data from anywhere in physical memory space.
High-speed communications systems and other stream-data processors are often constructed from multiple computer boards, each of which is plugged into a backplane in a common chassis. Specialized operating systems let you organize the memory on each CPU board into a single memory space. The first board might represent the first 16MB of physical memory, the second board's memory can be mapped right behind that, and so on. When an instruction executed on a board's CPU accesses data in its local memory, access time is short. When it accesses data from some remote board's local memory, hardware retrieves the data from the remote board's memory and sends it over the high-speed backplane's data bus, placing it in the requesting CPU's memory data register, just as if it came from its own local memory. But memory access time is much longer from a remote board, due to all that hardware intervention. Textbooks call these systems NUMA, for their nonuniform memory access time properties.
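Thinking of the backplane's single physical address space as a row of equally sized windows, one per board, makes the routing decision plain integer arithmetic. The sketch below uses the 16MB-per-board figure from the example above and an invented four-board chassis; the numbers are illustrative, not taken from any particular product.

    #include <stdio.h>

    #define BOARD_MEM  (16UL * 1024 * 1024)  /* assume 16MB of local memory per board */
    #define NUM_BOARDS 4                     /* assume a four-board chassis           */

    /* Which board's local memory holds this physical address?  The
       backplane hardware makes the same decision, then satisfies the
       access locally (fast) or forwards it to the owning board (slow). */
    static long owner_board(unsigned long phys_addr)
    {
        unsigned long board = phys_addr / BOARD_MEM;
        return board < NUM_BOARDS ? (long)board : -1;  /* -1: not mapped */
    }

    int main(void)
    {
        unsigned long addr = 0x2400000UL;   /* 36MB into the shared space  */
        long local_board = 1;               /* the board issuing the access */

        long owner = owner_board(addr);
        printf("address 0x%lx lives on board %ld: %s access\n",
               addr, owner,
               owner == local_board ? "local (fast)" : "remote (slow)");
        return 0;
    }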
The IVY system from Yale has a much longer memory access time for data on a remote machine. Remote data access is handled through software communications over the local area network, which is considerably slower than a backplane's data bus. Some texts call these systems NUMA, and others distinguish IVY's nonhardware approach as NORMA, for no remote memory access. We might all agree that there is certainly support for remote memory access, but the no here refers to the lack of hardware support. This is computer architecture, after all, and software architecture doesn't quite get the distinction it deserves, in my opinion. But hopefully, that's all about to change.
In the 1990s, a project team at NASA was tasked with developing a (cheap) solution to the problem of data access. NASA's databases were huge, and way too much time was being spent moving the same data to the various workstations for individual processing, sometimes repeatedly to the same workstation. The team was asked to look into a way to reduce the data transmission time by designing a "Gigaflops Scientific Workstation". (Gigaflops is shorthand for a billion floating-point operations per second, a common scientific measure of computer performance.) The team accomplished its task by developing a cluster system, a supercomputer called Beowulf.
A Beowulf system, according to Donald Becker, comprises only commercial, off-the-shelf components; costs less than $50,000; and uses freely available software and tools, with the express goal of returning the design and improvements back to the community. (See his paper titled "Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs" at http://www.beowulf.org for further details.) Later versions of Beowulf relaxed some of the off-the-shelf requirements, but their scientific computing benchmark results were nothing less than stunning: a cluster of Pentium Pros achieved 2.1 billion floating-point operations per second, or 2.1Gflops, on a commonly available and long-trusted benchmark, and its cost-performance ratio won Becker's team the prestigious Gordon Bell award in 1996.
Cluster systems are not without their problems. Whether such a system is configured as a supercomputer or a high-volume server, you should weigh two quantities carefully: the time it takes to transfer a message from one node in the cluster to another, and the time the receiving node spends processing the data in that message. The message might be a query packet in a cluster server. In a supercomputer, it might be a block of data to be processed, or a block of program instructions to be executed by the receiving node. How much time does it take to process a typical query or block of data, and how much time does it take to execute a typical block of instructions? Compare these to the time it takes to get the message from one node to another, including the communications software's processing time on both the sending and the receiving nodes.
If the processing time is dramatically shorter than the transmission time, as it most often is, the master node in a cluster server might be able to process all incoming queries faster by itself than by sending them to available worker nodes. And a supercomputer will not benefit from any inherent parallelism if sending the data to a remote processor takes longer than the processing itself.
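A back-of-the-envelope model makes the trade-off concrete. The figures below are invented for illustration: a fixed per-message software overhead on each end and a 100Mbit/s-class transfer rate. Measure your own cluster before trusting any of them; the point is only that if the work finishes before the message could even arrive, farming it out is a loss.

    #include <stdio.h>

    /* Illustrative figures only; measure your own cluster. */
    #define LATENCY_S     100e-6     /* software overhead per message end, seconds */
    #define BANDWIDTH_BPS 12.5e6     /* ~100Mbit/s payload rate, bytes per second  */

    /* Time to move one message across the network, counting the
       communications software on both the sender and the receiver. */
    static double transmit_time(double msg_bytes)
    {
        return 2 * LATENCY_S + msg_bytes / BANDWIDTH_BPS;
    }

    int main(void)
    {
        double msg_bytes    = 64 * 1024;   /* one block of data to hand out */
        double process_time = 500e-6;      /* time to chew on that block    */

        double t_net = transmit_time(msg_bytes);
        printf("transmit %.1f KB: %.2f ms, process it: %.2f ms -> %s\n",
               msg_bytes / 1024, t_net * 1e3, process_time * 1e3,
               process_time > t_net ? "worth sending to a worker"
                                    : "cheaper to do it locally");
        return 0;
    }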
While we're on the subject of parallelism, weigh Amdahl's Law before you commit to a supercomputer. The law says that the overall speedup parallel hardware can deliver is limited by the portion of the program that must still run serially. If your application doesn't have enough inherent parallelism to warrant the use of a parallel machine, you probably shouldn't bother acquiring one.
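For the record, the usual statement of the law is speedup(N) = 1 / ((1 - p) + p/N), where p is the fraction of the run time that can be parallelized and N is the number of processors. The short program below tabulates it; the 90% figure is only an example.

    #include <stdio.h>

    /* Amdahl's Law: the best possible speedup on n processors when a
       fraction p of the work can run in parallel. */
    static double amdahl(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        double p = 0.90;                   /* example: 90% parallelizable */

        for (int n = 1; n <= 1024; n *= 4)
            printf("%4d processors: %5.2fx speedup\n", n, amdahl(p, n));

        /* Even with unlimited processors, the serial 10% caps us at 10x. */
        printf("limit as n grows: %.1fx\n", 1.0 / (1.0 - p));
        return 0;
    }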