Competition Among Open Source Compilers

Date: Feb 1, 2008

Article is provided courtesy of Prentice Hall Professional.

David Chisnall takes a look at two of the recent competitors to the venerable GNU compiler collection and tries to see where they will fit in an evolving Free Software landscape.

Since its creation, the C language has been tightly tied to UNIX. C was designed as a portable assembly language for reimplementing UNIX, to make it easier to port to different platforms.

In 1984, Richard Stallman began the GNU (GNU’s Not UNIX) Project to provide a clone of the UNIX operating system using entirely Free Software. Because a C compiler is a core component of any UNIX-like operating system, he wrote one: the GNU C Compiler (GCC).

Over the years, GCC was rewritten a few times, and support for various languages was added. When it began to support more languages than just C, the name was changed to the GNU Compiler Collection, keeping the same GCC abbreviation. As with other parts of the GNU Project, GCC uses the GNU General Public License, although it has a special exception for any parts of the compiler that are embedded in the compiled output.

The BSD Issue

Although the GNU Project has GCC, and most proprietary UNIX systems have their own compilers, the BSD projects typically have no compiler of their own. GCC is the largest piece of GPL'd code in any BSD base system. Over the years, it has periodically looked as though TenDRA might replace it. TenDRA, a BSD-licensed C compiler originally started by the Defence Evaluation and Research Agency (the institution that used to be the UK's equivalent of DARPA), focuses on correctness and would be a good match for systems like OpenBSD, but progress has been slow.

Recently, another option appeared from an unexpected direction. Back in the 1970s, Stephen Johnson of Bell Labs wrote the Portable C Compiler (PCC). Unlike many earlier compilers, it had a clean separation between the parser and code-generation stages, allowing it to be ported to new architectures easily—a feature present in most newer compilers. This compiler was included with a lot of UNIX variants, including 4.3BSD-Reno.

PCC never underwent the same degree of growth as GCC, and remains a very small project. The source code is under 1 megabyte compressed, and compiling it on a 1 GHz machine takes only a few seconds, whereas compiling GCC on the same machine takes most of an afternoon. PCC doesn't optimize as aggressively as GCC, but its small size makes it much easier to verify that its output is correct.

In September 2007, PCC was imported into the OpenBSD source tree, with the aim of evaluating it as a GCC replacement for future releases. A simple compiler is very attractive to the BSD communities—particularly one that’s portable. GCC has a habit of changing the interface to the back end, orphaning architectures. This behavior has resulted in some ports of NetBSD, for example, having to stick with old versions of GCC, since no one with the expertise to maintain the compiler port is willing to do so.

It’s difficult to overstate the relative complexity of GCC versus PCC. The codebase for GCC is almost 100 times the size of the PCC codebase. The OpenBSD team hopes that this difference in scale will make getting involved with PCC development a lot less daunting than getting involved with GCC.

PCC began life on the VAX, and didn’t support x86 until very recently. The port took one person less than a week, which makes supporting other architectures seem quite plausible; the total amount of x86-specific code comes to less than 4,000 lines.

An Advanced Architecture

The GCC design has gradually evolved since the project's creation, adopting some more modern design principles along the way. Static single assignment (SSA) was one of the more recent large changes, added back in 2005. The principle of SSA is that each variable is assigned a value exactly once. A language like Erlang enforces this rule at the source level; a C compiler enforces it in the intermediate form. Consider some C code of the following form:

a = b;
a += c;

This code doesn’t conform to the SSA principle, so the compiler would replace it with something like this:

a1 = b;
a2 = a1 + c;

All future uses of a would be replaced with references to a2, until the next assignment to a. This form allows a number of optimizations to be accomplished quite easily. Describing it as "new" is somewhat misleading, however; the original paper proposing it was published in 1985. Most modern compilers, including the newer versions of PCC, use some kind of SSA form.
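
For example, constant propagation becomes almost trivial in SSA form: because each name is defined exactly once, the compiler never has to check whether a value was overwritten between its definition and a later use. Consider a hypothetical fragment:

a1 = 4;
a2 = a1 + 1;    /* a1 is always 4, so a2 must be 5 */
b1 = a2 * 2;    /* ...and b1 can be constant-folded to 10 */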

While GCC has accreted design components, another compiler was written from scratch based on the latest ideas in compiler research. The Low Level Virtual Machine (LLVM) is designed to focus heavily on optimization. Like Java and .NET, it uses a virtual machine to define an intermediate form; however (as the name suggests), this machine is quite low-level, and not tied to any particular language.

Originally, LLVM used code taken from GCC to handle parsing. This approach changed slightly with release 2.1 in September 2007. A new, Apple-developed front end was introduced, with support for C, C++, and Objective-C, under the name "clang."

Part of the motivation for developing clang came from a criticism often leveled at GCC: that it's difficult to separate out the front-end and back-end code. In an environment like Microsoft's Visual Studio, the same parser that the compiler uses for code generation can also generate syntax-highlighting information as you edit. The same is true of most LISP and Smalltalk environments. It isn't true of Apple's Xcode, however, which has to implement its own parser for syntax highlighting and code completion. This setup is less than ideal for developers, because parsing errors reported in the IDE don't necessarily correspond to code that won't compile, and vice versa.

Extracting the front end from GCC for this purpose is difficult for two reasons. First, GCC is released under the GNU General Public License, which means that any other program built using it is required to be under the same license. Second, GCC intentionally ties the front and back ends into the rest of the code quite closely, to avoid "semi-proprietary" forks. In contrast, LLVM is BSD-licensed and has comparatively clean separation between the various layers.

A Low-Level Virtual Machine?

The concept of something that is both "low-level" and a "virtual machine" (VM) is slightly strange. Most virtual machines are either very abstract or consist of only very thin abstraction layers. Perhaps the lowest-level virtual machine is Xen, which has the same non-privileged instructions as x86, but different privileged ones.

At the opposite extreme is something like the Java virtual machine. Java bytecode runs on a stack-based virtual machine (real stack-based machines, with the exception of the x87 floating-point unit, died out a long time ago). This is very different from any real machine, although it has some passing similarities to old LISP machines. The Java VM also makes a number of assumptions about the code running on it, which make it difficult to run anything that isn't a strongly typed language with a Java-esque type system.

The abstract machine presented by LLVM is somewhere between the two. It's close to an unlimited register machine (URM), which is a fairly common theoretical model of computing. A few things mark it as different from a real computer: it has an effectively unlimited supply of registers, every register is typed, and each one is assigned a value exactly once, keeping the code in SSA form.
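
To make this concrete, here is what a couple of instructions look like in LLVM's human-readable bytecode ("assembly") form. This is only a sketch, based on the 2.x-era textual syntax, which varies between releases; the registers %b and %c are assumed to hold 32-bit integers:

%a1 = add i32 %b, %c     ; each %name is a typed virtual register...
%a2 = mul i32 %a1, %a1   ; ...and is assigned exactly once, as SSA requires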

Compiling with LLVM is a two-stage process:

  1. First, the program is compiled from source code into bytecode for the abstract machine. For some languages, this may itself take two steps, going via a language-specific bytecode.
  2. Then the bytecode is executed, and here there are a few options. It can be interpreted directly, which is nice for debugging, since the entire state of the machine can be inspected and none of the type information has yet been lost. It can also be compiled to native code, either as a single-step operation or as part of a profile-directed optimization pass; this arrangement allows the optimizer to examine the runtime behavior of the code and compile it on the fly. Alternatively, of course, the bytecode can be distributed in hardware-agnostic form, then compiled at install time or JIT-compiled as it runs, taking advantage of the additional profiling information. (A sketch of this pipeline follows.)
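
The sketch below assumes the llvm-gcc front end and the lli and llc tools from the LLVM 2.x releases; hello.c is just a placeholder file name:

llvm-gcc -emit-llvm -c hello.c -o hello.bc    # compile C source to LLVM bytecode
lli hello.bc                                  # interpret (or JIT) the bytecode directly
llc hello.bc -o hello.s                       # ...or generate native assembly from it
gcc hello.s -o hello                          # then assemble and link as usual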

The End of GCC?

Before predicting the end of GCC, it’s important to understand the role it fills. One big reason for using GCC is that it’s portable. You can use the same compiler on a wide collection of operating systems and architectures. GCC isn’t bad at optimization, either, although less-portable compilers tend to do better on the specific platforms they target. Having to support only a single compiler can reduce development costs, however.

The main reason that GCC exists is to be Free Software. The GNU Project requires a compiler that’s capable of building the GNU system. Both of the compilers I’ve discussed in this article are Free Software, but they’re available under the permissive terms of BSD-style licenses. This fact makes them less attractive to the Free Software Foundation, because these licenses permit making proprietary forks.

While the Free Software Foundation is nominally in charge of GCC, a lot of development is funded by the likes of Apple and Red Hat. Making a closed-source fork of the compiler is not in the best interest of either of these companies, since neither sells compilers; they give them away, to encourage people to write software for their platforms. (Red Hat also gives the rest of the platform away, to encourage people to pay for Red Hat support.) Apple, on the other hand, would very much like to be able to combine components of the compiler with other, proprietary components. This desire is what attracted Apple to LLVM as a compiler for OpenGL shaders (code which is now making its way into the open source OpenGL clone Mesa), and led to the development of the clang front end for LLVM.

Another strength of GCC is the languages it supports. With GCC, you can compile C, C++, Objective-C, Fortran, Java, and Ada. It also has some support for Pascal, Mercury, COBOL, Modula-2, Modula-3, VHDL, PL/1, and UPC. In contrast, PCC supports only C, with experimental support for Fortran. LLVM supports C and C++, with experimental support for Objective-C and Stacker (a stack-based language).

From my perspective, one of the most interesting things about GCC at the moment is that it's the only Objective-C compiler that works with either Apple's Cocoa or GNUstep. This situation is likely to change soon, as the clang front end matures, and the resulting competition between the projects should benefit all Objective-C developers.

Not having to rely on GCC is likely to benefit the BSD family quite a lot. GCC is designed fairly closely with GNU libc in mind, so the project is hesitant to accept improvements that don’t work with glibc (and, by extension, with most Linux systems).

Competition is usually a good thing; in open source projects, where ideas (and, ideally, code) can flow freely between projects, everybody wins in the end. The more advanced architecture of LLVM makes it a good long-term bet for future systems, while the simple codebase of PCC makes it a good systems compiler. Both are likely to carve out a significant share of the market in the next few years, giving developers more choice in how they compile their code.
