Debugging C-Family Languages
Recently I've come across some interesting bugs in C-family programs. In this article, I'll share a few of these experiences. As with most debugging problems, the second time you see the issue it's obvious, so hopefully this will save at least one reader from spending a few days chasing down a problem.
The C family is generally regarded as being low-level, and therefore exposes a lot of detail about the implementation to the programmer. If you've never worked on a C compiler, a lot of these details might be slightly surprising to you.
Schrödinger's Variable
For some reason, I've come across this problem a lot recently. I hadn't paid particular attention to it, probably because it had gone away after doing a clean build. This particular instance was on someone else's system, however, and remote debugging meant that I wanted to fully understand the issue.
The code in this case was Objective-C, although you can see something similar in most C-like languages. A simplified version of the problem looks something like this:
    static MapTable *table = NULL;

    void doInitialisation()
    {
        ...
        MapInsert(table, key, value);
    }
    ...
    @implementation AClass
    + (void) initialize
    {
        table = createMapTable();
        ...
        doInitialisation();
    }
    ...
    @end
In Objective-C, the +initialize method has special significance—it's automatically called (in a thread-safe way) before the first message is sent to a class. This setup allows lazy initialization of variables referenced in a compilation unit, and is one of the main reasons why Objective-C programs start a lot faster than their C++ counterparts.
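For readers who don't use Objective-C, the idea of running an initializer lazily, just before first use, can be sketched in plain C. This is only a rough analogue under stated assumptions: map_table, class_initialize() and send_message() are hypothetical names of mine, and unlike the real +initialize this sketch makes no attempt at thread safety.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the article's MapTable. */
struct map_table {
    int count;
};

static struct map_table storage;
static struct map_table *table = NULL;  /* stays NULL until first use */
static bool initialized = false;
static int init_calls = 0;              /* counts initializer runs */

/* Plays the role of +initialize: sets up the file-scope state. */
static void class_initialize(void)
{
    table = &storage;                   /* stands in for createMapTable() */
    init_calls++;
}

/* Plays the role of sending a message to the class. The initializer
 * runs before the first message is handled, and never again.
 * NOTE: not thread-safe, unlike the real Objective-C runtime. */
int send_message(void)
{
    if (!initialized) {
        initialized = true;
        class_initialize();
    }
    table->count++;
    return table->count;
}
```

Nothing is set up until send_message() is first called, so a program that never touches this module never pays for its initialization.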
The issue in this case was that doInitialisation() was crashing because table was NULL. The stack trace made it clear that the function had been called from +initialize. Looking further up the stack trace, into the +initialize frame, you could see that table was a pointer to something in memory. Stepping back down into doInitialisation(), it suddenly became NULL again.
What was the problem? It turned out to be the first word in the snippet I've shown here: static. What does static mean in a C program? To the programmer, it means that the variable is visible only in this file. To the compiler, it means almost the same thing. To the linker, however, it means something different. The linker (on UNIX-like systems, at least) has no notion of visibility. To the linker, a variable that's static to the programmer is merely a candidate for renaming: rather than restricting access to it, the linker simply renames the variable if there's a conflict.
Linker renaming was exactly the problem in this case. Running ldd showed that the program was linked against two copies of the library containing this module. In most cases, this conflict would just cause one copy of the variable to be invisible. In this case, however, the renaming caused a problem due to the interference between two loaders.
In Objective-C, there are two ways of resolving a symbol to a function. The first option is via the C linker. At load time, the linker resolves all function names to pointers to functions. The second approach is via the Objective-C runtime library. The runtime library resolves method names to methods when they are called. Due to implementation differences, one reference to the variable used the first mapping it came across, while the other used the second mapping. Again, normally this mismatch wouldn't be a problem. In this case, however, the duplication meant that the +initialize method was seeing the doInitialisation() function (which wasn't static) in the other copy of the module. Since the variable was static, it was renamed when the second version of the module was loaded. This meant that the method was setting its local copy of the variable and then calling the function, which saw its own copy of the variable, which was still NULL.
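The mismatch is easier to follow as a toy model. The sketch below is a simulation in plain C, not real linker or runtime behavior: the module_copy struct, the copy_a and copy_b instances, and the function_owner pointer are all hypothetical names I've invented to stand in for "two loaded copies of the module" and "the copy the C linker happened to bind the function name to".

```c
#include <stddef.h>

/* Hypothetical model: each loaded copy of the module has its own
 * private `table` (the static variable). */
struct module_copy {
    int storage;    /* stands in for the real map table */
    int *table;     /* this copy's private "static" variable */
};

struct module_copy copy_a = { 0, NULL };
struct module_copy copy_b = { 0, NULL };

/* Assumption for this sketch: the C linker bound the non-static
 * function name to copy A, for every caller in the process. */
struct module_copy *function_owner = &copy_a;

/* Models doInitialisation(). Returns -1 where the real code crashed,
 * i.e. when the table it can see is still NULL; 0 otherwise. */
int do_initialisation(void)
{
    if (function_owner->table == NULL)
        return -1;
    *function_owner->table += 1;    /* stands in for MapInsert() */
    return 0;
}

/* Models +initialize as resolved by the runtime in copy `self`: it
 * sets that copy's table, then calls the function through the single
 * global binding. */
int initialize_in(struct module_copy *self)
{
    self->table = &self->storage;   /* stands in for createMapTable() */
    return do_initialisation();
}
```

Calling initialize_in(&copy_b) returns -1: copy B's table has been set, but the function belongs to copy A, whose table is still NULL. That is the same picture the stack trace showed, with a valid pointer in one frame and NULL in the next.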
So, what's the correct solution? There are several things wrong with this particular code. The first is that doInitialisation() was not written defensively: it did not check that table was non-NULL before doing something that would fail if it were NULL. Such a test could even have initialized the variable itself. (Although in this case that would have actually made things worse, since there would have been two valid but independent versions of the variable.) This omission is understandable, however, since the function was called from only one place, so the variable could never be NULL.
Or could it? The second thing that's wrong with this code is that doInitialisation(), which was meant to be called only inside this module, wasn't static. Because it wasn't static, it could have been called from any other code in the program. Since it was in a library, it could have been called accidentally from anywhere if other programmers had picked the same name for one of their functions. (This is a huge problem with many current languages, and largely ignored. Modeling it is a subject for some of my current research.) If the function had been static, +initialize would have called its private copy, and the function would have seen the same copy of the variable as the method did.
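Both fixes discussed above fit in a few lines of plain C. This is a minimal sketch rather than the original code: map_table, storage and module_initialize() are hypothetical stand-ins of mine for the map table type, its allocation, and the +initialize method.

```c
#include <stddef.h>

/* Hypothetical stand-in for the article's MapTable. */
struct map_table {
    int count;
};

static struct map_table storage;
static struct map_table *table = NULL;

/* static gives the helper internal linkage: calls to it are bound
 * inside this translation unit, so they always reach the copy that
 * shares `table` with the caller. */
static int do_initialisation(void)
{
    if (table == NULL)
        return -1;      /* defensive check: report instead of crashing */
    table->count++;     /* stands in for MapInsert(table, key, value) */
    return 0;
}

/* Stands in for +initialize; the only exported name in this sketch. */
int module_initialize(void)
{
    table = &storage;   /* stands in for createMapTable() */
    return do_initialisation();
}
```

With the helper declared static, a second loaded copy of this file would carry its own helper alongside its own table, so each copy stays internally consistent. The NULL check is then a second line of defense rather than the only one.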
Of course, the real bug is in the loader, which should have thrown an error as soon as two modules containing different symbols with the same name in the same scope were loaded. This problem is well known, but no one seems keen to fix it.