Automatic Reference Counting in Objective-C, Part 2: The Details
In Part 1, we looked at the basic ideas behind ARC and how it works in pure Objective-C code. In this article, we'll go a bit deeper and see how it works with the C memory model and with Objective-C++, and what the compiler is doing to make it fast.
How It Works
When you compile code using ARC, the compiler inserts calls to functions like objc_retain() and objc_release(). These are inserted, rather than -retain and -release message sends, for several reasons: they are smaller and faster than message sends, easier for the optimizer to recognize, and easier to inline.
The optimizer is very important. The implementation of ARC is split into three components. The compiler front end generates calls to these functions. The optimizer removes as many as it can, and the runtime library implements the required functions.
The fact that the optimization is all done on the LLVM intermediate representation, rather than in the front end, is very helpful because it means that it's usable by other compilers. I've written a Smalltalk compiler that generates code that is binary-compatible with LLVM. I can now make it use the ARC functions, rather than its own retain/release code, and it will use the same LLVM optimizations.
The optimizer uses local balancing: It assumes that every function is safe with respect to maintaining valid pointers, and will attempt to remove redundant retain and release operations on this assumption. The optimizer is still a bit of a work-in-progress, but it already works quite well.
Some of its transforms are quite simple. As well as the primitive runtime calls, the ABI also defines some combined versions, such as objc_retainAutorelease(), which is a combination of objc_retain() followed by objc_autorelease(), and objc_storeStrong(), which retains the new value and releases the old one. The optimizer will try to use the combined versions where possible. In the worst case they'll have similar performance, but it's also possible to implement them more efficiently in the runtime. Even if they're slightly slower, you still save by making one call instead of two, and by needing less code at the call site.
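To make the idea of a combined call a little more concrete, here is a toy model in plain C. Everything in it (ToyObject, toy_retain(), toy_release(), toy_store_strong()) is a hypothetical stand-in that only mimics the shape of a combined store that retains the new value and releases the old one. It is not the runtime's actual implementation.

```c
#include <stddef.h>

/* Hypothetical stand-in for an object header: just a reference count. */
typedef struct ToyObject {
    int refcount;
} ToyObject;

/* Toy retain: bump the count and return the object; NULL is ignored. */
static ToyObject *toy_retain(ToyObject *obj) {
    if (obj != NULL) obj->refcount++;
    return obj;
}

/* Toy release: drop the count. A real runtime would deallocate at zero. */
static void toy_release(ToyObject *obj) {
    if (obj != NULL) obj->refcount--;
}

/* Combined operation: one call instead of a separate retain and release.
 * Retaining the new value before releasing the old one keeps the case
 * where the slot already holds that value safe. */
static void toy_store_strong(ToyObject **slot, ToyObject *value) {
    ToyObject *old = *slot;
    toy_retain(value);
    *slot = value;
    toy_release(old);
}
```

The caller makes one call per assignment, and the ordering inside the helper is decided once rather than at every call site.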
The next set of optimizations is more interesting. It tries to remove redundant calls. For example, consider something like:
objc_retain(x);
[x someMessage];
objc_release(x);
objc_retain(x);
[x someOtherMessage];
objc_release(x);
This seems pretty stupid, but the front end can emit it because part of the design of ARC is to avoid the need for the front end to do complex flow analysis. The optimizer can simplify it to:
objc_retain(x);
[x someMessage];
[x someOtherMessage];
objc_release(x);
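The simplification above can be sketched as a tiny peephole pass. This is only a toy over a hypothetical array of operations (ToyOp and toy_remove_redundant_pairs() are names invented for this illustration). The real optimizer works on LLVM IR and reasons about control flow, which this sketch does not attempt.

```c
#include <stddef.h>

/* Hypothetical miniature instruction set for the illustration. */
typedef enum { TOY_RETAIN, TOY_RELEASE, TOY_MESSAGE } ToyOpKind;

typedef struct {
    ToyOpKind kind;
    int var;   /* which variable the operation applies to */
} ToyOp;

/* Single pass: drops every release(v) that is immediately followed by
 * retain(v), compacting the array in place. Returns the new length.
 * Pairs that only become adjacent after a removal are left alone. */
static size_t toy_remove_redundant_pairs(ToyOp *ops, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n &&
            ops[i].kind == TOY_RELEASE &&
            ops[i + 1].kind == TOY_RETAIN &&
            ops[i].var == ops[i + 1].var) {
            i++;          /* skip both halves of the redundant pair */
            continue;
        }
        ops[out++] = ops[i];
    }
    return out;
}
```

Fed the six operations from the first listing, this leaves the four operations of the second one: the outer retain and release stay, and the pair in the middle disappears.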
This is even more important with things like loops. If there's a retain-release pair on a loop invariant, then they can be hoisted to either side of the loop. If you look at the source code for the optimizer, you'll see a lot of planned optimizations that aren't finished yet. In particular, it's possible to transform autorelease into release in a lot of cases.
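To give a feel for what hoisting a retain-release pair out of a loop buys you, here is a small sketch in plain C. The names (ToyObject, toy_runtime_calls, toy_loop_naive(), toy_loop_hoisted()) are all hypothetical, and the counter exists only so the difference can be measured. It is not how the optimizer itself is written.

```c
/* Hypothetical object: a reference count plus a message counter. */
typedef struct ToyObject {
    int refcount;
    int messages;
} ToyObject;

/* Counts how many toy runtime calls were made. */
static int toy_runtime_calls = 0;

static void toy_retain(ToyObject *o)  { o->refcount++; toy_runtime_calls++; }
static void toy_release(ToyObject *o) { o->refcount--; toy_runtime_calls++; }
static void toy_send(ToyObject *o)    { o->messages++; }

/* What a naive front end might emit: a pair on every iteration. */
static void toy_loop_naive(ToyObject *o, int n) {
    for (int i = 0; i < n; i++) {
        toy_retain(o);
        toy_send(o);
        toy_release(o);
    }
}

/* After hoisting: o is loop-invariant, so one pair brackets the loop. */
static void toy_loop_hoisted(ToyObject *o, int n) {
    toy_retain(o);
    for (int i = 0; i < n; i++) {
        toy_send(o);
    }
    toy_release(o);
}
```

For ten iterations, the naive version makes twenty runtime calls and the hoisted one makes two, while the object receives the same ten messages either way.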
Earlier, I talked about using a __autoreleasing id for a return value to catch an autoreleased return value. This reduces the number of retain/release calls, so you have something like:
id a = foo(); // a is autoreleased, so will be deallocated later
Instead of:
id a = objc_retainAutoreleasedReturnValue(foo());
...
objc_release(a);
The first form may seem better, but often isn't. The autoreleased object may not be collected for some time. It will take up some space in the autorelease pool and will be wasting RAM. The optimizer will (in the future) try to convert the first form into the second. This increases the code size slightly, but means that short-lived objects really are short-lived. They won't sit around using RAM until the end of the run loop.
The objc_retainAutoreleasedReturnValue() call is particularly clever. If the value has just been autoreleased and returned, then rather than retaining it, this un-autoreleases it. Conceptually, it removes the object from the autorelease pool, but in fact it removes it from some thread-local storage, where it was inserted instead of being put in the autorelease pool. This allows functions to return non-owning references without the need to fill up the autorelease pool and artificially extend the lives of temporary objects.
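The hand-off described above can be modeled in a few lines of plain C. This is a sketch under stated assumptions: the slot, ToyObject, and both toy_ functions are hypothetical names, the slot holds a single entry, and the sketch ignores what a real runtime must do when nobody claims the parked object.

```c
#include <stddef.h>

/* Hypothetical stand-in for an object header. */
typedef struct ToyObject {
    int refcount;
} ToyObject;

/* One-entry, per-thread hand-off slot (C11 thread-local storage). */
static _Thread_local ToyObject *toy_returned_slot = NULL;

/* Callee side: park the object in the slot instead of adding it to an
 * autorelease pool. The callee's reference travels with it. */
static ToyObject *toy_autorelease_return_value(ToyObject *obj) {
    toy_returned_slot = obj;
    return obj;
}

/* Caller side: if this is the object that was just parked, take over
 * the callee's reference without touching the count. Otherwise fall
 * back to an ordinary retain. */
static ToyObject *toy_retain_autoreleased_return_value(ToyObject *obj) {
    if (obj != NULL && toy_returned_slot == obj) {
        toy_returned_slot = NULL;   /* "un-autorelease" the object */
        return obj;
    }
    if (obj != NULL) obj->refcount++;
    return obj;
}
```

When the two calls meet, the reference count never changes and nothing is added to a pool. When they don't, the caller simply pays for a normal retain.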
A more interesting form of this optimization is also planned. Autorelease pools are now created and destroyed in a way that is easy for the optimizer to modify. For example, if you do something like:
objc_autoreleasePoolPush();
...
objc_autorelease(obj);
objc_autoreleasePoolPop();
The optimizer can see where obj will be deallocated and turn the autorelease into a release. This looks like a weird thing to do, but the ARC optimizers can run after inlining, and it's entirely possible that a function that returns an autoreleased value will be inlined into a function that creates an autorelease pool.
More interestingly, the optimizer can insert the autorelease pool push and pop functions itself, if it determines that a lot of short-lived objects are going to be created. It may do this around loops, for example, but it's most likely that this will run as a profile-driven optimization at some point in the future. This is possible because the optimizer can now easily extend the life of temporary objects that need to outlive the inserted autorelease pool scope: it calls objc_retain() on them before objc_autoreleasePoolPop() and then objc_autorelease() after.
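The retain-before-pop, autorelease-after trick is easier to see with a toy pool. The sketch below is plain C with hypothetical names throughout (toy_pool, toy_pool_push(), toy_pool_pop(), toy_autorelease(), toy_pop_keeping()). It models a pool as a simple stack and assumes the capacity is never exceeded.

```c
#include <stddef.h>

/* Hypothetical stand-in for an object header. */
typedef struct ToyObject {
    int refcount;
} ToyObject;

/* Toy pool: a fixed-size stack of pending releases. */
#define TOY_POOL_CAPACITY 64
static ToyObject *toy_pool[TOY_POOL_CAPACITY];
static size_t toy_pool_top = 0;

/* Opening a scope just remembers the current stack height. */
static size_t toy_pool_push(void) {
    return toy_pool_top;
}

/* Queue a release for when the enclosing scope is popped. */
static ToyObject *toy_autorelease(ToyObject *obj) {
    if (obj != NULL && toy_pool_top < TOY_POOL_CAPACITY)
        toy_pool[toy_pool_top++] = obj;
    return obj;
}

/* Closing a scope releases everything queued since the matching push. */
static void toy_pool_pop(size_t token) {
    while (toy_pool_top > token)
        toy_pool[--toy_pool_top]->refcount--;
}

/* Close an inserted scope while keeping one object alive past it. */
static void toy_pop_keeping(size_t token, ToyObject *survivor) {
    if (survivor != NULL) survivor->refcount++;  /* retain before the pop */
    toy_pool_pop(token);
    toy_autorelease(survivor);   /* now queued in the enclosing scope */
}
```

Temporaries queued inside the inner scope are released as soon as it closes, while the survivor is handed to the enclosing scope and lives exactly as long as it would have without the inserted pool.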