1.3 Program Analysis
An attack against a program typically will go through two stages: an analysis stage that gathers information about the program, and a transformation stage that makes modifications to the program based on the information that was collected. There are two basic ways of analyzing the program: You can just look at the code itself (this is called static analysis), or you can collect information by looking at the execution of the code (dynamic analysis).
Static analyses takes only one input, the program P itself:
There are a huge number of different kinds of static analysis that have been developed over the years. The chief designers have been software engineering researchers who want to analyze programs for defects and compiler researchers who want to analyze programs to optimize them, but there are also crackers who want to analyze programs to remove protection codes. Static analysis gathers information that we call conservative, that is, it may be imprecise but it will always err on the conservative side. So for example, if a static analysis tells you that “on line 45, variable x is always 42,” you can be sure that this is the case. Sometimes conservative analyses will fail to gather a piece of information about the code that is in fact true, but at the very least, it will never lie and say that something is true when it isn’t.
Dynamic analyses collect information about a program by executing it on a sample input data set:
The accuracy of the generated information depends on the completeness of the input data. Because of this, dynamic analysis can only make predictions such as, “On line 45, variable x is always 42, well, OK, at least for the set of inputs I’ve tried.” Code transformations that make use of information only from static analyses are safe in that they won’t turn a working program into a buggy one (assuming, of course, that the transformation itself is semantics-preserving, i.e., it doesn’t change the meaning of the program). Transformations that use dynamic analysis results, on the other hand, will typically not be safe: They can fail if they are based on information gathered from an insufficient input data set.
Depending on what an attacker is trying to accomplish, he will choose different types of analyses and transformations. If all he wants to do is to disable a license check, the simplest of static and dynamic analyses may be all he needs: He can just run the program under a debugger until the “license expired” alert comes up, find the approximate location in the code, disassemble the code at that location, read until he finds something that looks like if today's date > license date then ..., fire up a binary editor on the code, and edit out the offending lines. If, on the other hand, he wants to extract a complex algorithm from a huge program, being able to decompile it all the way to source code would be very helpful.
1.3.1 A Simple Reverse Engineering Example
To make this a little more concrete, let’s look at an example. Assume that your boss has given you the following string of bytes and asked you to reverse engineer it:
06 3b 03 3c 1b 07 a2 00 15 1a 06 a3 00 0b 06 3b 84 01 01 a7 ff f1 06 3b a7 ff ec b1
He tells you that these bytes correspond to the bytecode for a competitor’s Java program that contains a very secret and important algorithm. The code actually corresponds to the following program, but of course as a reverse engineer you don’t know this yet:
Since your goal is to find and understand the super-duper-secret algorithm, it would be great if you could turn this bytecode mess into some really clean and easy-to-read Java source code. If the owner of the code inserted some obfuscations or other trickery to confuse us, then it would be great if we were able to remove that too, of course.
As your first step, you’ll want to disassemble the bytes. This means to convert the raw bytes into a symbolic, assembly code form. For Java bytecode this is essentially trivial, and there are many tools (such as jasmin or javap) that will do the job:
We’ve shaded the source code and instructions to make it easy for you to identify which part of the source code corresponds to which bytecode instructions. We’ve put the codebytes themselves in brackets.
Java bytecode was designed to make disassembly easy. The bytecode contains enough information to allow for the recovery of types and control flow. This is not true of other machine codes, such as those for x86 and other processors. For these binary codes, it is easy to insert code that will confuse disassembly. We will talk more about this in Chapter 3.
Now that you have the Java bytecode in an assembly code format, your next step is to perform control flow analysis, which will recover the order in which the code can be executed. The result of this analysis is a control flow graph (CFG). A node of this graph consists of straight-line code, except that the last statement can be a jump. There is an edge from one node to another if it is possible for us take this path through the code during execution:
The nodes in the CFG are called basic blocks. CFGs are the central data structure around which many compilers and reverse engineering tools are built. We’ve again used shading to make it easy to see which basic blocks correspond to which bytecode instruction sequences.
Next, you will want to perform a variety of analysis tasks to gather as much information as you can about the code. This information may allow you to perform transformations that will make the code simpler and easier to understand, or even to remove some of the obfuscations that may have been added. One family of analyses common in optimizing compilers and reverse engineering tools is called data flow analysis. You’ll learn more about this in Section 3.1.2▸127. In our example, an analysis called Constant Propagation can be used to track the value of variable x in order to see at which points in the code it will always have the same (constant) value:
In the leftmost flow graph, we’ve indicated that at first we know nothing about the value of x at the entry and exit of each basic block. In the second step, we’ve considered each basic block once, and have gathered some information. After a basic block has executed the statement x = 3, for example, it is clear that x must have the value 3. Also, if a basic block doesn’t change the value of x, then x must have the same value after the block executes as it did before. When control can flow into a basic block from two different directions and we’ve computed that x has the same value on both paths, then we can safely assume that it will always have that value at the entry to the block. After considering all the basic blocks one more time, you’re left with the annotated control flow graph to the left:
Given this CFG, you can now start to perform transformations. First, wherever x is used and you’ve determined that its value is constant, you can go ahead and replace x with the computed value. So, for example, x<=3 can be replaced by true, since x=3 at the entrance to this basic block. Given this transformation, you can now perform a Dead Code Elimination, getting rid of any basic block that can never be executed, and you can also get rid of any redundant statements. The result is the last CFG discussed earlier.
Now, eyeballing this code, it’s clear that further transformations are possible. In fact, the loop reduces to a single statement, i=4, and since the procedure returns nothing and has no side effects, it can be removed altogether! In this particular case, this is easy to see, but in general, the result will depend on the power of the analyses and the complexity of the programs.
You can’t be sure whether the “extra” code that you’ve been able to eliminate was inserted by a code obfuscator with the purpose of sowing confusion or was just the result of a really broken compiler. All you know is that you’ve been able to turn the raw sequence of bytes your boss gave you into a much simpler structured form and to get rid of some irrelevant code in the process.
The final step of your reverse engineering effort should be to take the transformed control flow graph and turn it into source code. The graph is so simple that this decompilation step is trivial. You get:
public static void P() { int i = 0; while (i<4) i++; }
Obviously, other source forms are possible, for example, using a for loop.
Now, turning raw bytes into readable source code is very helpful for a reverse engineer, but the process typically won’t stop when source code is generated. The final stage, extracting a deep understanding of the algorithms employed or modifying the code to bypass license checks, and so on, will often have to be done by hand.
In Chapter 2 (Methods of Attack and Defense), we’ll show you the general strategies that reverse engineers go through when they attack code in order to reveal its secrets or to make it perform tasks it wasn’t designed to do. In Chapter 3 (Program Analysis), we’ll delve into even more detail and discuss the many kinds of tools and analysis techniques available to your adversary. We will not simply present the off-the-shelf tools that happen to be available now, but will discuss those that could be built given currently known techniques. It’s much too easy to say, “Our software protection algorithm is secure because we’ve implemented it for dynamically linked x86 executables, and current decompilers only handle statically linked code.” A much more interesting statement would be, “This algorithm employs protection techniques that can cause code explosion in any decompiler, current and future.”