The MMIX Supplement to The Art of Computer Programming: Programming Techniques
1. Index Variables
Many algorithms traverse information structures that are sequentially allocated in memory. Let us assume that a sequence of n data items a0, a1, . . . , an−1 is stored sequentially. Further assume that each data item occupies 8 bytes, and the first element a0 is stored at address A; the address of ai is then A + 8i. To load ai with 0 ≤ i < n from memory into register ai, we need a suitable base address and so we assume that we have A = LOC(a0) in register a. Then we can write ‘8ADDU t,i,a; LDO ai,t,0’ or alternatively ‘SL t,i,3; LDO ai,a,t’. If this operation is necessary for all i, it is more efficient to maintain a register i containing 8i as follows:
    SET  i,0      i ← 0.
    LDO  ai,a,i   Load ai.
    ADD  i,i,8    Advance to next element: i ← i + 1.
    · · ·
Note how register i advances by 8 when the index i advances by 1.
The branch instructions of MMIX, like those of most computer architectures, directly support a test against zero; therefore a loop becomes more efficient if the index variable runs toward 0 instead of toward n. The loop may then take the form:
    SL   i,n,3   i ← n.
0H  SUB  i,i,8   Advance to next element: i ← i − 1.
    LDO  ai,a,i  Load ai.
    · · ·
    PBP  i,0B    Continue while i > 0.
In the above form, the items are traversed in decreasing order. If the algorithm requires traversal in ascending order, it is more efficient to keep A + 8n, the address of an, as a new base address in a register an, and to run the index register i from −8n toward −8 as in the following code:
    8ADDU  an,n,a   an ← A + 8n.
    SUBU   i,a,an   i ← 0 (or i ← −8n).
0H  LDO    ai,an,i  ai ← ai.
    · · ·
    ADD    i,i,8    Advance to next element: i ← i + 1.
    PBN    i,0B     Continue while i < n.
If a is used only to compute A + 8n, it is possible to write ‘8ADDU a,n,a’ and reuse register a to hold A + 8n. Loading ai then again takes the nice form ‘LDO ai,a,i’, without any need for an. For an example, see Program 4.3.1S on page 63.
When computer scientists enumerate n elements, they say “a0, a1, a2, . . . ”, starting with index zero. When mathematicians (and most other people) enumerate n elements, they say “a1, a2, a3, . . . ” and start with index 1. Nevertheless when such a sequence of elements is passed as a parameter to a subroutine, it is customary to pass the address of its first element LOC(a1). If this address is in register a, the address of ai is now a + 8(i − 1). To load ai efficiently into register ai, we have two choices: Either we adjust register a, saying ‘SUBU a,a,8’ for a ← LOC(a0), or we maintain in register i the value of 8(i − 1), saying for example ‘SET i,0’ for i ← 1. In both cases, we can write ‘LDO ai,a,i’ to load ai ← ai.
Many variations of these techniques are possible; a nice and important example is Program 5.2.1S on page 76.
2. Fields
Let us assume that the data elements ai, just considered, are further structured by having three fields: two WYDEs, LEFT and RIGHT, followed by one TETRA, KEY.
It is then convenient to define offsets for the fields reusing the field names as follows:
LEFT   IS  0   Offset for field LEFT
RIGHT  IS  2   Offset for field RIGHT
KEY    IS  4   Offset for field KEY
There is very little information in these lines, so these definitions are usually suppressed in a program’s display.
Computing the address of, say, the KEY field of ai requires two additions, A + 8i + KEY, of which only one must be done inside a loop over i. The quantity A + KEY can be precomputed and kept in a register named key. This simplifies loading of KEY(ai) as follows:
    ADDU  key,a,KEY  key ← A + KEY.
    · · ·            Loop on i with i = 8i.
    LDT   k,key,i    k ← KEY(ai).
3. Relative Addresses
In a more general setting, this technique can be applied to relative addresses. Assume that one of the data items ai is given by its relative address P = LOC(ai) − BASE relative to some base address BASE.
Then again KEY(ai) can be loaded by a single instruction ‘LDT k,key,p’, if P is in register p, and BASE + KEY is in register key.
While an absolute address always requires eight bytes in MMIX’s memory, relative addresses can be stored in four bytes, two bytes, or even one byte, which allows tighter packing of information structures and reduces the memory footprint of applications that handle large numbers of links. With this technique, relative addresses can be as efficient to use as absolute addresses.
4. Using the Low Order Bits of Pointers (“Bit Stuffing”)
Modern computers impose alignment restrictions on the possible addresses of primitive data types. In the case of MMIX, an OCTA may start only at an address that is a multiple of 8, a TETRA requires a multiple of 4, and a WYDE needs an even address. As a result, data structures are typically octabyte-aligned, because they contain one or more OCTA-fields—for example, to hold an absolute address in a link field. Those link fields, in turn, are multiples of eight as well. Put differently, their three low-order bits are all zero. Such precious bits can be put to use as tag bits, marking the pointer to indicate that either the pointer itself or the data item it points to has some special property. MMIX further simplifies the use of these bits as tags by ignoring the low-order bits of an address in load and store instructions. Not all CPU architectures follow this convention; on such machines the bits are still usable as tags, but they must be masked to zero before a link field is used as an address.
Three different uses need to be distinguished. First, a tag bit in a link may contain some additional information about the data item it links to. Second, it may tell about the data item that contains the link. Third, it may disclose information about the link itself.
An example of the first type of use is the implementation of two-dimensional sparse arrays in Section 2.2.6. There, the nonzero elements of each row (or column) form a circular linked list anchored in a special list head node. It would have been possible to mark each head node using one of the bits in one of its link fields, but it is more convenient to put this information into the links pointing to a head node. Once the link to the next node in the row is known, a single instruction is sufficient to test for a head node, as for example in the implementation of Program 2.2.6S on page 132:
S3  LDOU  q0,q0,UP  S3. Find new row. Q0 ← UP(Q0).
    BOD   q0,9F     Exit if Q0 is odd.
If a head node were instead marked by a tag bit in its own UP link, the code would require an extra load instruction:
S3  LDOU  q0,q0,UP  S3. Find new row. Q0 ← UP(Q0).
    LDOU  t,q0,UP   t ← UP(Q0).
    BOD   t,9F      Exit if TAG(Q0) = 1.
The great disadvantage of this method, so it seems, is the need to maintain all the tag bits in all of the links that point to a head node while the program is running. A closer look at the operations a program like Algorithm 2.2.6S performs reveals, however, that it inserts and deletes matrix elements but never creates or deletes head nodes. Inserting or deleting matrix elements just copies existing link values; hence no special coding is required to maintain the tag bits in the links to head nodes.
The second, more common, type of use of a tag field is illustrated by the solution to exercise 2.3.5–4 on page 139. The least significant bit of the ALINK field is used to mark accessible nodes, and the least significant bit of the BLINK field is used to distinguish between atomic and non-atomic nodes. The following snippet taken from this code is typical for testing and setting of these tag bits:
E2  LDOU  x,p,ALINK  E2. Mark P.
    OR    x,x,1
    STOU  x,p,ALINK  MARK(P) ← 1.
E3  LDOU  x,p,BLINK  E3. Atom?
    PBEV  x,E4       Jump if ATOM(P) = 0.
An interesting variation of this use of a tag bit can be seen in exercise 2.2.3–26 on page 23. There, the data structure asks for a variable-length list of links allocated sequentially in memory. Instead of encoding the length of the list somewhere as part of the data structure, the last link of the structure is marked by setting a tag bit. This arrangement leads to very simple code for the traversal of the list.
As a final example, consider the use of tag bits in the implementation of threaded binary trees in Section 2.3.1. There, the RIGHT and LEFT fields of a node might contain “down” links to a left or right subtree, or they might contain “thread” or “up” links to a parent node (see, for example, 2.3.1–(10), page 324). Within a tree, there are typically both “up” and “down” links for the same node. Hence, the tag is clearly a property of the link, not the node. Searching down the left branch of a threaded binary tree, as required by step S2 of Algorithm 2.3.1S, which reads “If LTAG(Q) = 0, set Q ← LLINK(Q) and repeat this step,” may take the following simple form:
0H  SET   q,p        Set Q ← LLINK(Q) and repeat step S2.
S2  LDOU  p,q,LLINK  S2. Search to left. P ← LLINK(Q).
    PBEV  p,0B       Jump if LTAG(Q) = 0.
5. Loop Unrolling
The loop shown at the end of the last section has a SET operation that has no computational value; it just reorganizes the data when the code advances from one iteration to the next. A small loop may benefit significantly from eliminating such code by unrolling it or, in the simplest case, doubling it. Doubling the loop adds a second copy of the loop where the registers p and q exchange roles. This leads to
S2  LDOU  p,q,LLINK  S2. Search to left. P ← LLINK(Q).
    BOD   p,1F       If LTAG(Q) ≠ 0, exit the loop.
    LDOU  q,p,LLINK  S2. Search to left. Q ← LLINK(P).
    PBEV  q,S2       If LTAG(P) = 0, repeat step S2.
    SET   q,p        At this point p and q have exchanged roles.
1H  IS    @
The new loop requires 2υ per iteration instead of 3υ. For another example, see the solution to exercise 5.2.1–33 on page 167. Further, Program 6.1Q′ on page 98 illustrates how loop unrolling can benefit loops maintaining a counter variable, and the solution to exercise 6.2.1–10 on page 184 shows how to completely unroll a loop with a small, fixed number of iterations.
6. Subroutines
The code of a subroutine usually starts with the definition of its stack frame, the storage area containing parameters and local variables. Using the MMIX register stack, it is sufficient for most subroutines to list and name the appropriate local registers. Once the stack frame is defined, the instructions that make up the body of the subroutine follow. The first instruction is labeled with the name of the subroutine—typically preceded by a colon to make it global; the last instruction is a POP. For a simple example see the solution to exercise 2.2.3–2 on page 124 or the solution to exercise 5–7 on page 162.
Subroutine Invocation. Calling a subroutine requires three steps: passing of parameters, transfer of control, and handling of return values. In the simplest case, with no parameters and no return values, the transfer of control is accomplished with a single ‘PUSHJ $X,YZ’ instruction and a matching POP instruction. The problem remains choosing a register $X such that the subroutine call will preserve the values of registers belonging to the caller’s stack frame. For this purpose, the subroutines in this book will define a local register, named t, such that all other named local registers have register numbers smaller than t. Aside from its role in calling subroutines, t is used as temporary variable. The typical form of a subroutine call is then ‘PUSHJ t,YZ’.
If the subroutine has n > 0 parameters, the registers for the parameter values can be referenced as t+1, t+2, . . . , t+ n. A simple example is Program 2.3.1T, where the two functions Inorder and Visit are called like this:
T3  LDOU    t+1,p,LLINK  T3. Stack ⇐ P.
    SET     t+2,visit
    PUSHJ   t,:Inorder   Call Inorder(LLINK(P),Visit).
T5  SET     t+1,p        T5. Visit P.
    PUSHGO  t,visit,0    Call Visit(P).
After the subroutine has transferred control back to the caller, it may use the return values. If the subroutine has no return values, register t (and all registers with higher register numbers) will be marginal and a reference to it will yield zero; otherwise, t will hold the principal return value and further return values will be in registers t+1, t+2, . . . . The function FindTag in the solution to exercise 2.5–27 on page 143 is an example of a function with three return values.
Nested Calls. If the return value of one function serves as a parameter for the next function, the scheme just described needs some modification. It is better to place the return value of the first function not in register t but directly in the parameter register for the second function; therefore we have to adjust the first function call. For example, the Mul function in Section 2.3.2, page 42, needs to compute Q1 ← Mult(Q1,Copy(P2)), and that is done like this:
    SET    t+1,q1     t+1 ← Q1.
    SET    t+3,p2
    PUSHJ  t+2,:Copy  t+2 ← Copy(P2).
    PUSHJ  t,:Mult
    SET    q1,t       Q1 ← Mult(Q1,Copy(P2)).
The Div function of exercise 2.3.2–15, which computes the slightly more complex formula
Q ← Tree2(Mult(Copy(P1),Q),Tree2(Copy(P2),Allocate(),“↑”),“/”),
contains more examples of nested function calls (see also the Pwr function of exercise 2.3.2–16).
Nested Subroutines. If one subroutine calls another subroutine, we have a situation known as nested subroutines. The most common error when programming MMIX is failing to save and restore the rJ register. At the start of a subroutine, the special register rJ contains the return address for the POP instruction. It will be rewritten by the next PUSHJ instruction and therefore must be saved if the next PUSHJ occurs before the POP.
There are two preferred places to save and restore rJ: Either start the subroutine with a GET instruction, saving rJ in a local register, and end the subroutine with a PUT instruction, restoring rJ, immediately before the terminating POP instruction; or, if the subroutine contains only a single PUSHJ instruction, save rJ immediately before the PUSHJ and restore it immediately after the PUSHJ. An example of the first method is the Mult function in Section 2.3.2; the second method is illustrated by the Tree2 function in the same section. If subroutines use the PREFIX instruction to create local namespaces, the local copy of ‘:rJ’ can simply be called ‘rJ’; that is the naming convention used in this book.
Tail Call Optimization. The Mult function of Section 2.3.2 is an interesting example for another reason: It uses an optimization called “tail call optimization.” If a subroutine ends with a subroutine call in such a way that the return values of the inner subroutine are already the return values of the outer subroutine, the stack frame of the outer subroutine can be reused for the inner subroutine because it is no longer needed after the call to the inner routine. Technically, this is achieved by moving the parameters into the right place inside the existing stack frame and then using a jump or branch instruction to transfer control to the inner subroutine. The POP instruction of the inner subroutine will then return directly to the caller of the outer subroutine. So, when the function Mult(u,v) wants to return Tree2(u,v,“×”), u and v are already in place and ‘GETA v+1,:Mul’ initializes the third parameter; then ‘BNZ t,:Tree2’ transfers control to the Tree2 function, which will return its result directly to the caller of Mult.
A special case of this optimization is the “tail recursion optimization.” Here, the last call of the subroutine is a recursive call to the subroutine itself. Applying the optimization will remove the overhead associated with recursion, turning a recursive call into a simple loop. For an example, see Program 5.2.2Q on page 82, which uses PUSHJ as well as JMP to call the recursive part Q2.
7. Reporting Errors
There is no good program without good error handling. The standard situation is the discovery of an error while executing a subroutine. If the error is serious enough, it might be best to issue an error message and terminate the program immediately. In most cases, however, the error should be reported to the calling program for further processing.
The most common form of error reporting is the specification of special return values. Most UNIX system calls, for example, return negative values on error and nonnegative values on success. This scheme has the advantage that the test for a negative value can be accomplished with a single instruction, not only by MMIX but by most CPUs. Another popular error return value, which can be tested equally well, is zero. For example, functions that return addresses often use zero as an error return, because addresses are usually considered unsigned and the valid addresses span the entire range of possible return values. In most circumstances, it is, furthermore, simple to arrange things in a way that excludes zero from the range of valid addresses.
MMIX offers two ways to return zero from a subroutine: The two instructions ‘SET $0,0; POP 1,0’ will do the job, but just ‘POP 0,0’ is sufficient as well. The second form will turn the register that is expected to contain the return value into a marginal register, and reading a marginal register yields zero (see the solution to exercise 2.2.3–4 on page 125 for an example).
The POP instruction of MMIX makes another form of error reporting very attractive: the use of separate subroutine exits for regular return and for error return (see exercise 2.2.3–3 and its solution on page 125 for an example). The subroutine will end with ‘POP 0,0’ in case of error and with ‘POP 1,1’ in case of success, returning control to the instruction immediately following the PUSHJ in case of error and to the second instruction after the PUSHJ otherwise. The calling sequence must then insert a jump to the error handler just after the PUSHJ while the normal control flow continues with the instruction after the jump instruction. The advantages of this method are twofold. First, the execution of the normal control path is faster because it no longer contains a branch instruction to test the return value. Second, this programming style forces the calling program to provide explicit error handling; simply skipping the test for an error return will no longer work.