A Light Introduction to ARM Assembly
- Calling Assembly Code
- Stack Management for ARM
- Conditional Execution
Assembly or C?
These days, most developers can get away without writing any assembly code. There are only really two cases where it's ever applicable:
- Where you have some performance-critical code and your C compiler doesn't do a good enough job.
- Where your high(er)-level language isn't sufficiently expressive.
The first case is increasingly rare. Compilers keep getting better, and CPUs keep getting faster. WordPerfect for DOS was originally written entirely in 8086 assembly to achieve the speed it required, and it didn't even do WYSIWYG editing. Today, Google Docs manages to do more, yet is written in JavaScript. No one cares that it doesn't run on a 4.77MHz 16-bit processor with 128KB of RAM.
The latter is slightly more common. Some things are simply impossible to do in C. A simple example is a trampoline that calls another function with the same arguments. This is not possible to implement in C in the general case.
I recently had to implement a trampoline like this for the GNUstep Objective-C runtime's version of imp_implementationForBlock(). To understand how this works, you need to understand how Objective-C methods and blocks are implemented. There's a full explanation in my Objective-C Phrasebook, but I'll try to give a summary here:
An Objective-C method has two hidden parameters, named self and _cmd. The first is a pointer to the receiver; the second is the selector that was used to look up the method. There's a more detailed explanation of why in my book, if you're interested in how Objective-C really works. A block has one hidden argument (the block structure).
The imp_implementationForBlock() function is used to turn the latter into the former. It must, at runtime, create a new function that, when called, will move the receiver from the first argument to the second and load the block into the first.
In C, it would be something vaguely like this:
    id block_trampoline(id self, SEL _cmd)
    {
        /* block stands in for the block pointer that the real trampoline loads */
        return block->invoke(block, self);
    }
Of course, this is a very simplified version that ignores where the block comes from (in the real version, it's loaded from the word before the start of the function) and ignores what happens if the block takes any arguments other than self. This is only three ARM instructions, but it's impossible to implement in the general case in C. The assembly version just moves the object pointer from register 0 to register 1, loads the block pointer into register 0, and then loads the block's invoke field into the program counter register (which achieves a jump).
This particular example is interesting because the same trampoline can be copied repeatedly. To simplify things, I stored the block and its invoke pointer immediately before the trampoline. This is one fairly common use of assembly: generating a snippet of code that you can then duplicate at runtime. Every time you call imp_implementationForBlock(), the runtime will copy this little trampoline into a new bit of memory and return a function for you to use. This is similar to how GCC implements nested functions, but can be done entirely at runtime.
As a rule, if you can do something without writing assembly, and it's fast enough, then you should. For example, a lot of older projects used assembly code for vector or atomic operations. These are now typically provided as compiler intrinsics, which can be a lot more portable: the instructions behind an atomic add are different on every architecture, but the compiler intrinsic can be the same everywhere.
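As a rough illustration of the intrinsic route, here is a minimal sketch using the C11 <stdatomic.h> interface. The counter_add() name is hypothetical, and the sketch assumes a C11 compiler; it is one portable way to get an atomic add, not the only one.

```c
#include <stdatomic.h>

/* Atomically adds delta to *counter and returns the new value.
 * atomic_fetch_add() returns the value held before the addition,
 * so delta is added again to report the updated value. */
static int counter_add(atomic_int *counter, int delta)
{
    return atomic_fetch_add(counter, delta) + delta;
}
```

The same source compiles for ARM, x86, or anything else the compiler supports, and the compiler picks the appropriate instructions for each target.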
Calling Assembly Code
There are several ways of inserting assembly code into a project. The simplest is to create a separate compilation unit for the assembly code. You can then call functions written in it as if they were C functions.
This requires you to understand the calling conventions of the architecture. Fortunately, for ARM these are quite simple. ARM has 16 registers and uses the first four to pass integer (and pointer) arguments and the first two to return integer results. Arguments after these are passed on the stack.
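To make the register assignment concrete, here is a small sketch of a C function annotated with where each argument would travel. The sum5() name is hypothetical, and the comments assume the standard 32-bit ARM procedure call convention described above.

```c
/* Register assignments in the comments assume the 32-bit ARM calling
 * convention described in the text; the function itself is plain C. */
long long sum5(int a,   /* r0 */
               int b,   /* r1 */
               int c,   /* r2 */
               int d,   /* r3 */
               int e)   /* fifth argument: passed on the stack */
{
    /* A 64-bit integer result comes back in the r0/r1 pair. */
    return (long long)a + b + c + d + e;
}
```

An assembly routine that wants to be callable with this prototype would read its first four arguments from those registers and fetch the fifth from the stack.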
Passing floating-point arguments is a lot more complex because it depends on the ABI. In hard-float mode, they are passed in floating-point registers. This is problematic if the CPU doesn't have a floating-point unit (FPU) because it means that every instruction copying the value to or from the register causes a trap and must be emulated in the kernel. In soft-float mode, they are passed in integer registers. This is problematic if you do have an FPU, because every move between the FPU and integer units causes a pipeline stall (and typically a 15-cycle penalty). One simple hack to work around this is to pass pointers to floating-point values that are stored on the stack. This is actually used in quite a lot of C code on ARM.
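The pointer-passing workaround might look something like the following sketch. The scale_add() function is hypothetical; the point is only that nothing but pointers crosses the call boundary, and pointers are passed in integer registers under either ABI.

```c
/* Takes its floating-point operands by address. Only the three pointers
 * are passed as arguments; the float values themselves stay in memory
 * until the callee loads them. */
static void scale_add(const float *x, const float *factor, float *result)
{
    *result += *x * *factor;
}
```

A caller would keep the values in local variables and pass their addresses, for example scale_add(&x, &k, &acc).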
There are two ways of calling into assembly code from C. The simplest is to make the compiler do all of the complex work for you. You can do this using inline assembly, like this:
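    int atomic_add_or_fail(int *addr, int a)
    {
        int ret;
        asm volatile (
            "    ldrex  r0, [%1]      \n"  /* load *addr and mark it for exclusive access */
            "    add    r0, r0, %2    \n"  /* add a to the loaded value */
            "    strex  %0, r0, [%1]  \n"  /* try to store; %0 becomes 0 on success, 1 on failure */
            : "=&r" (ret)                  /* '&' keeps ret out of the input registers, as strex requires */
            : "r" (addr), "r" (a)
            : "r0", "memory");             /* "memory": *addr changes behind the compiler's back */
        return !ret;
    }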
This is a simple function that attempts to atomically add the value passed in the second argument to the value in the address passed as the first. It returns 0 if the operation failed and 1 if it succeeded.
This uses two instructions that are relatively recent additions to the ARM architecture: load and store exclusive, with the ldrex and strex mnemonics, respectively. The ldrex instruction loads the value into a register and sets the address in the exclusive monitor. The corresponding store exclusive will fail if the memory address has been written to since the load.
The instruction in the middle is an add instruction, similar to the one that you will find on any RISC architecture. This, like all ARM arithmetic operations, takes three operands. The first is the destination and the next two are the source operands.
Typically you would simply retry this if it failed, but you may want some version with bounded running time: for example, trying this 100 times and giving up if it can't successfully update the memory address.
This uses GNU-syntax inline assembly. The first three lines in the asm block contain assembly code. The remainder tells the compiler how to insert it into the code. The first of these is the list of output operands; here the "=" marks an operand that is written, the "r" asks for it to be placed in a register, and the (ret) means that the ret variable should be used. The next line contains the input operands. The final line contains the clobber list (i.e., the registers and other state that are not preserved across this assembly snippet).
The other thing to notice about this example is that I'm using %0 and so on in place of some register names. This example only uses one explicit register: register 0. The other registers will be allocated by the compiler. The advantage of this approach is that the compiler's register allocator can pick which ones to use. Remember that, although this looks like a stand-alone function, it may be inlined subsequently, and then some of the registers may be in use.
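The bounded variant mentioned earlier (retry up to a limit, then give up) could be sketched as follows. The atomic_add_bounded() name and the max_tries parameter are hypothetical. So that the sketch builds on any target, atomic_add_or_fail() is replaced here by a stand-in built on the GCC/Clang __atomic builtins; on ARM you would call the inline-assembly version instead.

```c
#include <stdbool.h>

/* Stand-in for the inline-assembly routine (an assumption, so that the
 * sketch is not ARM-only). A weak compare-and-swap is allowed to fail
 * spuriously, much as a store-exclusive can. Returns 1 on success. */
static int atomic_add_or_fail(int *addr, int a)
{
    int old = __atomic_load_n(addr, __ATOMIC_RELAXED);
    return __atomic_compare_exchange_n(addr, &old, old + a, true,
                                       __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
}

/* Tries the add at most max_tries times.
 * Returns 1 if the update went through, 0 if every attempt failed. */
static int atomic_add_bounded(int *addr, int a, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (atomic_add_or_fail(addr, a))
            return 1;
    }
    return 0;
}
```

Calling atomic_add_bounded(&value, 1, 100) matches the 100-attempt example from the text; the caller decides what giving up should mean.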