Building Secure Software: Race Conditions
- What Is a Race Condition?
- Time-of-Check, Time-of-Use
- Secure File Access
- Temporary Files
- File Locking
- Other Race Conditions
- Conclusion
"O let not Time deceive you,
You cannot conquer Time.
In the burrows of the Nightmare
Where Justice naked is,
Time watches from the shadow
And coughs when you would kiss."
W. H. AUDEN
"As I Walked Out One Evening"
Race conditions are among the most common classes of bugs found in deployed software. They are only possible in environments in which there are multiple threads or processes occurring at once that may potentially interact (or some other form of asynchronous processing, such as with UNIX signals). People who have experience with multithread programming have almost certainly had to deal with race conditions, regardless of whether they know the term. Race conditions are a horrible problem because a program that seems to work fine may still harbor them. They are very hard todetect, especially if you're not looking for them. They are often difficult to fix, even when you are aware of their existence. Race conditions are one of the few places where a seemingly deterministic program can behave in a seriously nondeterministic way. In a world where multithreading, multiprocessing, and distributed computing are becoming more and more prevalent, race conditions will continue to become a bigger and bigger problem.
Most of the time, race conditions present robustness problems. However, there are plenty of times when race conditions have security implications. In this chapter we explore race conditions and their security ramifications. It turns out that file system accesses are subject to security-related race conditions far more often than people tend to suspect.
What Is a Race Condition?
Let's say that Alice and Bob work at the same company. Through e-mail, they decide to meet for lunch, agreeing to meet in the lobby at noon. However, they do not agree on whether they mean the lobby for their office or the building lobby several floors below. At 12:15, Alice is standing in the company lobby by the elevators, waiting for Bob. Then it occurs to her that Bob may be waiting for her in the building lobby, on the first floor. Her strategy for finding Bob is to take the elevators down to the first floor, and check to see if Bob is there.
If Bob is there, all is well. If he isn't, can Alice conclude that Bob is either late or has stood her up? No. Bob could have been sitting in the lobby, waiting for Alice. At some point, it could have occurred to him that Alice may be waiting upstairs, at which point he took an elevator up to check. If Alice and Bob were both on an elevator at the same time, unless it is the same elevator, they will pass each other during their ride.
When Bob and Alice each assume that the other one is in the other place and is staying put and both take the elevator, they have been bitten by a race condition. A race condition occurs when an assumption needs to hold true for a period of time, but actually may not. Whether it does is a matter of exact timing. In every race condition there is a window of vulnerability. That is, there is a period of time when violating the assumption leads to incorrect behavior. In the case of Alice and Bob, the window of vulnerability is approximately twice the length of an elevator ride. Alice can step on the elevator up until the point where Bob's elevator is about to arrive and still miss him. Bob can step on to the elevator up until the point that Alice's elevator is about to arrive. We could imagine the door to Alice's elevator opening just as Bob's door shuts. When the assumption is broken, leading to unexpected behavior, then the race condition has been exploited.
When it comes to computer programs, windows of vulnerability can be large, but often they are small. For example, consider the following Java servlet:
import java.io.*; import java.servlet.*; import java.servlet.http.*; public class Counter extends HttpServlet{ int count = 0; public void doGet(HttpServletRequest in, HttpServletResponse out) throws ServletException, IOException { out.setContentType("text/plain"); Printwriter p = out.getWriter(); count++; p.println(count + " hits so far!"); } }
This tiny piece of code may look straightforward and correct to most people, but it has a race condition in it, because Java servlets are multithreaded. The programmer has implicitly assumed that the variable count is the same when printed as it is after the previous line of code sets its value. This isn't necessarily the case. Let's say that Alice and Bob both hit this servlet at nearly the same time. Alice is first; the variable count becomes 1. Bob causes count to be changed to 2, before println in Alice's thread runs. The result is that Alice and Bob both see 2, when Alice should have seen 1. In this example, the window of vulnerability isn't very large. It is, at most, a fraction of a second.
Even if we move the increment of the counter into the expression in which we print, there is no guarantee that it solves our problem. That is, the following change isn't going to fix the problem:
p.println(++count + " hits so far!");
The reason is that the call to println takes time, as does the evaluation of the argument. The amount of time may seem really small, maybe a few dozen instructions. However, this isn't always the case. In a multithread system, threads usually run for a fraction of a second, then wait for a short time while other threads get the chance to run. It could be the case that a thread increments the counter, and then must wait to evaluate the argument and run println. While that thread waits, some other thread may also increment the counter.
It is true that the window of vulnerability is very small. In practice, this means the bug may show up infrequently, if ever. If our servlet isn't receiving several hits per second, then it is likely never to be a problem. This alludes to one of the reasons why race conditions can be so frustrating: When they manifest themselves, reproducing the problem can be almost impossible. Race conditions tend not to show up in highly controlled test environments. If you don't have any clue where to begin looking for a problem, you may never find it. The same sorts of issues hold true even when the window of opportunity is bigger.
In real-world examples, an attacker with control over machine resources can increase the odds of exploiting a race condition by slowing down the machine. Another factor is that race conditions with security implications generally only need to be exploited once. That is, an attacker can automate code that repeatedly tries to exploit the race condition, and just wait for it to succeed. If the odds are one in a million that the attacker will be able to exploit the race condition, then it may not take too long to do so with an automated tool.
In general, the way to fix a race condition is to reduce the window of vulnerability to zero by making sure that all assumptions hold for however long they need to hold. The main strategy for doing this is to make the relevant code atomic with respect to relevant data. By atomic, we mean that all the relevant code executes as if the operation is a single unit, when nothing can occur while the operation is executing. What's happening with race conditions is that a programmer assumes (usually implicitly) that certain operations happen atomically, when in reality they do not. When we must make that assumption, then we need to find a way to make the operation atomic. When we don't have to make the assumption, we can code the algorithm differently.
To make an operation atomic, we usually use locking primitives, especially in multithread applications. For example, one way to fix our Java servlet would be to use the object lock on the servlet by using the synchronized keyword. The synchronized keyword prevents multiple threads from running code in the same object that is governed by the synchronized keyword. For example, if we have ten synchronized methods in a Java class, only one thread can be running any of those methods at any given time. The JVM implementation is responsible for enforcing the semantics of the synchronized keyword.
Here's a fixed version of the counter servlet:
import java.io.*; import java.servlet.*; import java.servlet.http.*; public class Counter extends HttpServlet{ int count = 0; public synchronized void doGet(HttpServletRequest in, HttpServletResponse out) throws ServletException, IOException { out.setContentType("text/plain"); Printwriter p = out.getWriter(); count++; p.println(count + " hits so far!"); } }
The problem with this solution is that it can have a significant impact on efficiency. In this particular case, we have made it so that only one thread can run our servlet at a time, because doGet is the entry point. If the servlet is incredibly popular, or if the servlet takes a long time to run, this solution won't work very well. People will have to wait to get their chance inside the servlet, potentially for a long time. The solution is to keep the code we need to be atomic (often called a critical section) as small as possible [Silberschatz, 1999]. In Java, we can apply the synchronized keyword to blocks of code. For example, the following is a much better solution to our servlet problem:
import java.io.*; import java.servlet.*; import java.servlet.http.*; public class Counter extends HttpServlet { int count = 0; public void doGet(HttpServletRequest in, HttpServletResponse out) throws ServletException, IOException { int my_count; out.setContentType("text/plain"); Printwriter p = out.getWriter(); synchronized(this) { my_count = ++count; } p.println(my_count + " hits so far!"); } }
We could just put the call to println inside the synchronized block, and avoid the use of a temporary variable. However, println is a method call, which is somewhat expensive in and of itself. There's no need for it to be in the block, so we may as well remove it, to make our critical section finish as quickly as possible.
As we have seen, race conditions may be possible whenever two or more operations occur and one of the latter operations depends on the first. In the interval between events, an attacker may be able to force something to happen, changing the behavior of the system in ways not anticipated by the developer. Making this all work as an attacker requires a security-critical context, and explicit attention to timing and knowledge of the assumptions a developer may have made.
The term race condition implies a race going on between the attacker and the developer. In fact, the attacker must "race" to invalidate assumptions about the system that the programmer may have made in the interval between operations. A successful attack involves a quick-and-dirty change to the situation in a way that has not been anticipated.