3.4 Software Conventions
An application binary interface (ABI) defines a system interface for compiled programs, allowing compilers, linkers, debuggers, executables, libraries, other object files, and the operating system to work with each other. In a simplistic sense, an ABI is a low-level, "binary" API. A program conforming to an API should be compilable from source on different systems supporting that API, whereas a binary executable conforming to an ABI should operate on different systems supporting that ABI. [51]
An ABI usually includes a set of rules specifying how hardware and software resources are to be used for a given architecture. Besides interoperability, the conventions laid down by an ABI may have performance-related goals too, such as minimizing average subroutine-call overhead, branch latencies, and memory accesses. The scope of an ABI could be extensive, covering a wide variety of areas such as the following:
- Byte ordering (endianness)
- Alignment and padding
- Register usage
- Stack usage
- Subroutine parameter passing and value returning
- Subroutine prologues and epilogues
- System calls
- Object files
- Dynamic code generation
- Program loading and dynamic linking
The PowerPC version of Mac OS X uses the Darwin PowerPC ABI in its 32-bit and 64-bit versions, whereas the 32-bit x86 version uses the System V IA-32 ABI. The Darwin PowerPC ABI is similar to—but not the same as—the popular IBM AIX ABI for the PowerPC. In this section, we look at some aspects of the Darwin PowerPC ABI without analyzing its differences from the AIX ABI.
3.4.1 Byte Ordering
The PowerPC architecture natively supports 8-bit (byte), 16-bit (half word), 32-bit (word), and 64-bit (double word) data types. It uses a flat-address-space model with byte-addressable storage. Although the PowerPC architecture provides an optional little-endian facility, the 970FX does not implement it; it implements only the big-endian addressing mode. Big-endian refers to storing the "big" end of a multibyte value at the lowest memory address. In the PowerPC architecture, the leftmost bit (bit 0) is defined to be the most significant bit, whereas the rightmost bit is the least significant bit. For example, if a 64-bit register is being used as a 32-bit register in 32-bit computation mode, then bits 32 through 63 of the 64-bit register represent the 32-bit register, and bits 0 through 31 are to be ignored. Likewise, the leftmost byte (byte 0) is the most significant byte, and so on.
3.4.2 Register Usage
The Darwin ABI defines a register to be dedicated, volatile, or nonvolatile. A dedicated register has a predefined or standard purpose; it should not be arbitrarily modified by the compiler. A volatile register is available for use at all times, but its contents may change if the context changes—for example, because of calling a subroutine. Since the caller must save volatile registers in such cases, such registers are also called caller-save registers. A nonvolatile register is available for use in a local context, but the user of such registers must save their original contents before use and must restore the contents before returning to the calling context. Therefore, it is the callee—and not the caller—who must save nonvolatile registers. Correspondingly, such registers are also called callee-save registers.
Table 3–12 lists common PowerPC registers along with their usage conventions as defined by the 32-bit Darwin ABI.
Table 3–12. Register Conventions in the 32-bit Darwin PowerPC ABI
| Register(s) | Volatility | Purpose/Comments |
|---|---|---|
| GPR0 | Volatile | Cannot be used as a base register (a base-register field of 0 is interpreted as the value 0, not as GPR0). |
| GPR1 | Dedicated | Used as the stack pointer to allow access to parameters and other temporary data. |
| GPR2 | Volatile | Available on Darwin as a local register but used as the Table of Contents (TOC) pointer in the AIX ABI. Darwin does not use the TOC. |
| GPR3 | Volatile | Contains the first argument word when calling a subroutine; contains the first word of a subroutine's return value. Objective-C uses GPR3 to pass a pointer to the object being messaged (i.e., "self") as an implicit parameter. |
| GPR4 | Volatile | Contains the second argument word when calling a subroutine; contains the second word of a subroutine's return value. Objective-C uses GPR4 to pass the method selector as an implicit parameter. |
| GPR5–GPR10 | Volatile | GPRn contains the (n – 2)th argument word when calling a subroutine. |
| GPR11 | Varies | In the case of a nested function, used by the caller to pass its stack frame to the nested function; the register is nonvolatile. In the case of a leaf function, the register is available and is volatile. |
| GPR12 | Volatile | Used in an optimization for dynamic code generation, wherein a routine that branches indirectly to another routine must store the target of the call in GPR12. No special purpose for a routine that has been called directly. |
| GPR13–GPR29 | Nonvolatile | Available for general use. Note that GPR13 is reserved for thread-specific storage in the 64-bit Darwin PowerPC ABI. |
| GPR30 | Nonvolatile | Used as the frame pointer register, that is, as the base register for access to a subroutine's local variables. |
| GPR31 | Nonvolatile | Used as the PIC-offset table register. |
| FPR0 | Volatile | Scratch register. |
| FPR1–FPR4 | Volatile | FPRn contains the nth floating-point argument when calling a subroutine. FPR1 contains the subroutine's single- or double-precision floating-point return value; a 16-byte long double value is returned in FPR1 and FPR2. |
| FPR5–FPR13 | Volatile | FPRn contains the nth floating-point argument when calling a subroutine. |
| FPR14–FPR31 | Nonvolatile | Available for general use. |
| CR0 | Volatile | Used for holding condition codes during arithmetic operations. |
| CR1 | Volatile | Used for holding condition codes during floating-point operations. |
| CR2–CR4 | Nonvolatile | Various condition codes. |
| CR5 | Volatile | Various condition codes. |
| CR6 | Volatile | Various condition codes; can be used by AltiVec. |
| CR7 | Volatile | Various condition codes. |
| CTR | Volatile | Contains a branch target address (for the bcctr instruction); contains the counter value for a loop. |
| FPSCR | Volatile | Floating-Point Status and Control Register. |
| LR | Volatile | Contains a branch target address (for the bclr instruction); contains the subroutine return address. |
| XER | Volatile | Fixed-Point Exception Register. |
| VR0, VR1 | Volatile | Scratch registers. |
| VR2 | Volatile | Contains the first vector argument when calling a subroutine; contains the vector returned by a subroutine. |
| VR3–VR13 | Volatile | VRn contains the (n – 1)th vector argument when calling a subroutine. |
| VR14–VR19 | Volatile | Not used for argument passing; available as scratch registers. |
| VR20–VR31 | Nonvolatile | Available for general use. |
| VRSAVE | Nonvolatile | If bit n of VRSAVE is set, then VRn must be saved during any kind of context switch. |
| VSCR | Volatile | Vector Status and Control Register. |
3.4.2.1 Indirect Calls
We noted in Table 3–12 that a function that branches indirectly to another function stores the target of the call in GPR12. Indirect calls are, in fact, the default scenario for dynamically linked Mac OS X user-level code, which is what GCC generates unless told otherwise. Since the target address would need to be stored in a register in any case, using a standardized register allows for potential optimizations. Consider the code fragment shown in Figure 3–18.
Figure 3–18. A simple C function that calls another function
void f2(void);
void f1(void) { f2(); }
By default, the assembly code generated by GCC on Mac OS X for the function shown in Figure 3–18 will be similar to that shown in Figure 3–19, which has been annotated and trimmed down to relevant parts. In particular, note the use of GPR12, which is referred to as r12 in the GNU assembler syntax.
Figure 3–19. Assembly code depicting an indirect function call
...
_f1:
    mflr r0              ; prologue
    stmw r30,-8(r1)      ; prologue
    stw r0,8(r1)         ; prologue
    stwu r1,-80(r1)      ; prologue
    mr r30,r1            ; prologue
    bl L_f2$stub         ; indirect call
    lwz r1,0(r1)         ; epilogue
    lwz r0,8(r1)         ; epilogue
    mtlr r0              ; epilogue
    lmw r30,-8(r1)       ; epilogue
    blr                  ; epilogue
...
L_f2$stub:
    .indirect_symbol _f2
    mflr r0
    bcl 20,31,L0$_f2
L0$_f2:
    mflr r11
    ; lazy pointer contains our desired branch target
    ; copy that value to r12 (the 'addis' and the 'lwzu')
    addis r11,r11,ha16(L_f2$lazy_ptr-L0$_f2)
    mtlr r0
    lwzu r12,lo16(L_f2$lazy_ptr-L0$_f2)(r11)
    ; copy branch target to CTR
    mtctr r12
    ; branch through CTR
    bctr
    .data
    .lazy_symbol_pointer
L_f2$lazy_ptr:
    .indirect_symbol _f2
    .long dyld_stub_binding_helper
3.4.2.2 Direct Calls
If GCC is instructed to statically compile the code in Figure 3–18, we can verify in the resultant assembly that there is a direct call to f2 from f1, with no use of GPR12. This case is shown in Figure 3–20.
Figure 3–20. Assembly code depicting a direct function call
    .machine ppc
    .text
    .align 2
    .globl _f1
_f1:
    mflr r0
    stmw r30,-8(r1)
    stw r0,8(r1)
    stwu r1,-80(r1)
    mr r30,r1
    bl _f2
    lwz r1,0(r1)
    lwz r0,8(r1)
    mtlr r0
    lmw r30,-8(r1)
    blr
3.4.3 Stack Usage
On most processor architectures, a stack is used to hold automatic variables, temporary variables, and return information for each invocation of a subroutine. The PowerPC architecture does not explicitly define a stack for local storage: there is no dedicated stack pointer, and there are no push or pop instructions. However, it is conventional for operating systems running on the PowerPC, including Mac OS X, to designate (per the ABI) an area of memory as the stack and grow it from a high memory address toward lower addresses. GPR1, which is used as the stack pointer, points to the top of the stack.
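Both the stack and the registers play important roles in the working of subroutines. As listed in Table 3–12, registers are used to hold subroutine arguments up to a limit: at most eight argument words in GPR3 through GPR10 and at most 13 floating-point arguments in FPR1 through FPR13.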
If a function f1 calls another function f2, which calls yet another function f3, and so on in a program, the program's stack grows per the ABI's conventions. Each function in the call chain owns part of the stack. A representative runtime stack for the 32-bit Darwin ABI is shown in Figure 3–21.
Figure 3–21 Darwin 32-bit ABI runtime stack
In Figure 3–21, f1 calls f2, which calls f3. f1's stack frame contains a parameter area and a linkage area.
The parameter area must be large enough to hold the largest parameter list of all functions that f1 calls. f1 typically will pass arguments in registers as long as there are registers available. Once registers are exhausted, f1 will place arguments in its parameter area, from where f2 will pick them up. However, f1 must reserve space for all arguments of f2 in any case—even if it is able to pass all arguments in registers. f2 is free to use f1's parameter area for storing arguments if it wants to free up the corresponding registers for other use. Thus, in a subroutine call, the caller sets up a parameter area in its own stack portion, and the callee can access the caller's parameter area for loading or storing arguments.
The linkage area begins after the parameter area and is at the top of the stack, adjacent to the stack pointer. The adjacency to the stack pointer is important: the linkage area has a fixed size, and therefore the callee can find the caller's parameter area deterministically. The callee can save the CR and the LR in the caller's linkage area if it needs to. The first word of each linkage area holds a saved stack pointer that serves as a back chain to the previous (calling) frame; the stwu instruction in a function's prologue stores it there while allocating the new frame.
In Figure 3–21, f2's portion of the stack shows space for saving nonvolatile registers that f2 changes. These must be restored by f2 before it returns to its caller.
Space for each function's local variables is reserved by growing the stack appropriately. This space lies below the parameter area and above the saved registers.
The fact that a called function is responsible for allocating its own stack frame does not mean the programmer has to write code to do so. When you compile a function, the compiler inserts code fragments called the prologue and the epilogue before and after the function body, respectively. The prologue sets up the stack frame for the function. The epilogue undoes the prologue's work: it restores any saved registers (including CR and LR), restores the stack pointer to its previous value (which the prologue saved in the linkage area of the new frame), and finally returns to the caller.
Consider the trivial function shown in Figure 3–22, along with the corresponding annotated assembly code.
Figure 3–22. Assembly listing for a C function with no arguments and an empty body
$ cat function.c
void function(void) { }
$ gcc -S function.c
$ cat function.s
...
_function:
    stmw r30,-8(r1)     ; Prologue: save r30 and r31
    stwu r1,-48(r1)     ; Prologue: grow the stack 48 bytes
    mr r30,r1           ; Prologue: copy stack pointer to r30
    lwz r1,0(r1)        ; Epilogue: pop the stack (restore frame)
    lmw r30,-8(r1)      ; Epilogue: restore r30 and r31
    blr                 ; Epilogue: return to caller (through LR)
3.4.3.1 Stack Usage Examples
Figures 3–23 and 3–24 show examples of how the compiler sets up a function's stack depending on the number of local variables a function has, the number of parameters it has, the number of arguments it passes to a function it calls, and so on.
Figure 3–23 Examples of stack usage in functions
Figure 3–24 Examples of stack usage in functions (continued from Figure 3–23)
f1 is identical to the "null" function that we encountered in Figure 3–22, where we saw that the compiler reserves 48 bytes for the function's stack. The portions shown as shaded in the stacks are present either for alignment padding or for some current or future purpose not necessarily exposed through the ABI. Note that in these examples GPR30 and GPR31 are always saved, GPR30 being the designated frame pointer.
f2 uses a single 32-bit local variable. Its stack is 64 bytes.
f3 calls a function that takes no arguments. Nevertheless, this introduces a parameter area on f3's stack. A parameter area is at least eight words (32 bytes) in size. f3's stack is 80 bytes.
f4 takes eight arguments, has no local variables, and calls no functions. Its stack area is the same size as that of the null function because space for its arguments is reserved in the parameter area of its caller.
f5 takes no arguments, has eight word-size local variables, and calls no functions. Its stack is 64 bytes.
3.4.3.2 Printing Stack Frames
GCC provides built-in functions that may be used by a function to retrieve information about its callers. The current function's return address can be retrieved by calling the __builtin_return_address() function, which takes a single argument: the level, an integer specifying the number of stack frames to walk. A level of 0 results in the return address of the current function. Similarly, the __builtin_frame_address() function may be used to retrieve the frame address of a function in the call stack. Both functions return a NULL pointer when the top of the stack has been reached. [53] Because the level must be a constant integer, a program cannot simply call these functions in a loop with an increasing level; the program in Figure 3–25 therefore uses them only for the first frames and then follows the chain of saved stack pointers itself. The program also uses the dladdr() function in the dyld API to find the symbol name and image corresponding to each return address in the call stack.
Figure 3–25. Printing a function call stack trace [54]
// stacktrace.c
#include <stdio.h>
#include <dlfcn.h>

void
printframeinfo(unsigned int level, void *fp, void *ra)
{
    int     ret;
    Dl_info info;

    // Find the image and symbol containing the given address
    ret = dladdr(ra, &info);
    printf("#%u %s%s in %s, fp = %p, pc = %p\n",
           level,
           (ret && info.dli_sname) ? info.dli_sname : "?", // symbol name
           (ret) ? "()" : "",                              // show as a function
           (ret && info.dli_fname) ? info.dli_fname : "?", // shared object name
           fp, ra);
}

void
stacktrace(void)
{
    unsigned int level = 0;
    void  *saved_ra = __builtin_return_address(0);
    void **fp = (void **)__builtin_frame_address(0);
    void  *saved_fp = __builtin_frame_address(1);

    printframeinfo(level, saved_fp, saved_ra);
    level++;
    fp = saved_fp;
    while (fp) {
        saved_fp = *fp;                 // follow the back chain
        fp = saved_fp;
        if (fp == NULL || *fp == NULL)  // reached the outermost frame
            break;
        saved_ra = *(fp + 2);           // saved LR is at offset 8 of the linkage area
        printframeinfo(level, saved_fp, saved_ra);
        level++;
    }
}

void f4(void) { stacktrace(); }
void f3(void) { f4(); }
void f2(void) { f3(); }
void f1(void) { f2(); }

int
main(void)
{
    f1();
    return 0;
}

$ gcc -Wall -o stacktrace stacktrace.c
$ ./stacktrace
#0 f4() in /private/tmp/./stacktrace, fp = 0xbffff850, pc = 0x2a3c
#1 f3() in /private/tmp/./stacktrace, fp = 0xbffff8a0, pc = 0x2a68
#2 f2() in /private/tmp/./stacktrace, fp = 0xbffff8f0, pc = 0x2a94
#3 f1() in /private/tmp/./stacktrace, fp = 0xbffff940, pc = 0x2ac0
#4 main() in /private/tmp/./stacktrace, fp = 0xbffff990, pc = 0x2aec
#5 tart() in /private/tmp/./stacktrace, fp = 0xbffff9e0, pc = 0x20c8
#6 tart() in /private/tmp/./stacktrace, fp = 0xbffffa40, pc = 0x1f6c
3.4.4 Function Parameters and Return Values
We saw earlier that when a function calls another with arguments, the parameter area in the caller's stack frame is large enough to hold all parameters passed to the called function, regardless of the number of parameters actually passed in registers. Doing so has benefits such as the following.
- The called function might want to call further functions that take arguments or might want to use registers containing its arguments for other purposes. Having a dedicated parameter area allows the callee to store an argument from a register to the argument's "home location" on the stack, thus freeing up a register.
- It may be useful to have all arguments in the parameter area for debugging purposes.
- If a function has a variable-length parameter list, it will typically access its arguments from memory.
3.4.4.1 Passing Parameters
Parameter-passing rules may depend on the type of programming language used—for example, procedural or object-oriented. Let us look at parameter-passing rules for C and C-like languages. Even for such languages, the rules further depend on whether a function has a fixed-length or a variable-length parameter list. The rules for fixed-length parameter lists are as follows.
- The first eight parameter words (i.e., the first 32 bytes, not necessarily the first eight arguments) are passed in GPR3 through GPR10, unless a floating-point parameter appears.
- Floating-point parameters are passed in FPR1 through FPR13.
- If a floating-point parameter appears, but GPRs are still available, then the parameter is placed in an FPR, as expected. However, the next available GPRs that together sum up to the floating-point parameter's size are skipped and not considered for allocation. Therefore, a single-precision floating-point parameter (4 bytes) causes the next available GPR (4 bytes) to be skipped. A double-precision floating-point parameter (8 bytes) causes the next two available GPRs (8 bytes total) to be skipped.
- If not all parameters can fit within the available registers in accordance with the skipping rules, the caller passes the excess parameters by storing them in the parameter area of its stack frame.
- Vector parameters are passed in VR2 through VR13.
- Unlike floating-point parameters, vector parameters do not cause GPRs—or FPRs, for that matter—to be skipped.
- Unless there are more vector parameters than can fit in available vector registers, no space is allocated for vector parameters in the caller's stack frame. Only when the registers are exhausted does the caller reserve any vector parameter space.
Let us look at the case of functions with variable-length parameter lists. Note that a function may have some number of required parameters preceding a variable number of parameters.
- Floating-point parameters in the variable portion of the parameter list are passed in both FPRs and GPRs. Consequently, such parameters are always shadowed in GPRs instead of causing GPRs to be skipped.
- If there are vector parameters in the fixed portion of the parameter list, 16-byte-aligned space is reserved for such parameters in the caller's parameter area, even if there are available vector registers.
- If there are vector parameters in the variable portion of the parameter list, such parameters are also shadowed in GPRs.
- The called routine accesses arguments from the fixed portion of the parameter list similarly to the fixed-length parameter list case.
- The called routine accesses arguments from the variable portion of the parameter list by copying GPRs to the callee's parameter area and accessing values from there.
3.4.4.2 Returning Values
Functions return values according to the following rules.
- Values less than one word (32 bits) in size are returned in the least significant byte(s) of GPR3, with the remaining byte(s) being undefined.
- Values exactly one word in size are returned in GPR3.
- 64-bit fixed-point values are returned in GPR3 (the 4 high-order bytes) and GPR4 (the 4 low-order bytes).
- Structures up to a word in size are returned in GPR3.
- Single-precision floating-point values are returned in FPR1.
- Double-precision floating-point values are returned in FPR1.
- A 16-byte long double value is returned in FPR1 (the high-order 8 bytes) and FPR2 (the low-order 8 bytes).
- A composite value (such as an array, a structure, or a union) that is more than one word in size is returned via an implicit pointer: the caller passes, as an "invisible" argument in GPR3, a pointer to a memory location large enough to hold the return value. Actual user-visible arguments, if any, are passed in GPR4 onward.