1.4 Code Obfuscation
The first protection technique we’re going to look at is code obfuscation. What’s particularly interesting about obfuscation is that it’s a double-edged sword: Bad guys use it to protect their malware from discovery (you will see this in the next section), good guys use it to protect their programs from reverse engineering, and bad guys can also use it to destroy secrets (such as watermarks) stored in the good guys’ programs.
In the most general sense, to obfuscate a program means to transform it into a form that is more difficult for an adversary to understand or change than the original code. We are deliberately vague about defining “difficult,” but we typically take it to mean that the obfuscated program requires more human time, more money, or more computing power to analyze than the original program. Under this definition, to distribute a program in a compiled form rather than as source code is a form of obfuscation, since analyzing binary machine code is more demanding than reading source. Similarly, we would consider a program that has been optimized to be more obfuscated than one that has not, since many code optimizations make analysis (both by humans and tools such as disassemblers and decompilers) more onerous.
However, the tools and techniques we present in this book go further than compilation and optimization in order to make a program hard to understand. In contrast to an optimizer that rearranges code for the purposes of efficiency, a code obfuscator transforms code for the sole purpose of making it difficult to analyze. A negative by-product of obfuscating transformations is that the resulting code often becomes larger, slower, or both. The author of the code has to decide whether the protection that the transformations afford is worth this overhead.
Obfuscation is often confused with security through obscurity, a term (used contemptuously) for the “branch” of cryptography or security where the algorithms used are expected to remain secret. This is in contrast to mainstream research that teaches that you must assume that all algorithms are public, and the only secrets you may keep are the cryptographic keys, and so on, that are the inputs to the algorithms. The idea is that many eyes examining the same algorithm or piece of code will likely be able to find flaws, and the more eyes that have failed to find a flaw, the more confident you can be that the algorithm is, in fact, secure. This principle is frequently violated, and you’ll often see unscrupulous web-sites advertise “military-strength, proprietary” cryptographic algorithms, arguing that “since no one knows what our algorithm does, this will make it that much harder to break.” The same argument is sometimes made in reverse by software vendors like Microsoft: “Open-source software is inherently more vulnerable to attacks than our closed-source code since you can easily read the source and find bugs to exploit.” We know from experience that both claims are false. Hackers have no problem finding holes in closed-source software, and once a proprietary algorithm is leaked (which, inevitably, happens) it is often found to have serious and exploitable flaws.
As we define it in this book, obfuscation isn’t security through obscurity. As with research in cryptography, we generally expect that the obfuscating code transformation algorithms are known to the attacker and that the only thing the defender can assume is kept secret are the seeds that determine how and where these algorithms are applied.
Listing 1.1. Obfuscation example. The original unobfuscated version of the code can be found on the book’s Web site.
public class C { static Object get0(Object[] I) { Integer I7, I6, I4, I3; int t9, t8; I7=new Integer(9); for (;;) { if (((Integer)I[0]).intValue()%((Integer)I[1]).intValue()==0) {t9=1; t8=0;} else {t9=0; t8=0;} I4=new Integer(t8); I6=new Integer(t9); if ((I4.intValue()^I6.intValue())!=0) return new Integer(((Integer)I[1]).intValue()); else { if ((((I7.intValue()+ I7.intValue()*I7.intValue())%2!=0)?0:1)!=1) return new Integer(0); I3=new Integer(((Integer)I[0]).intValue()%((Integer)I[1]).intValue()); I[0]=new Integer(((Integer)I[1]).intValue()); I[1]=new Integer(I3.intValue()); } } }
Before we look at some applications of code obfuscation, let’s have a look at what obfuscated code might actually look like. Check out Listing 1.1 for a very simple example generated by the SandMark Java code obfuscator. Without peeking (the answer is in footnote 2), time yourself to see how long it takes you to analyze this 20-line program and figure out what it does. Now imagine that rather than being 20 lines long, it’s of a “normal” size for a program today: hundreds of thousands of lines to a few million lines. Then how long would it take you? What does your intuition tell you? Does the time to understanding grow linearly with the size of the code and the number of obfuscations applied? Do you think some obfuscations would add more confusion than others? Might some obfuscations be harder to undo than others? If so, how much harder? Are some impossible to undo? Unfortunately, the answers to these questions are largely unknown. As of now, we don’t have any models that can tell us how much longer it would take to reverse engineer a program that’s been obfuscated by a particular transformation or sequence of transformations, nor do we know what overhead these transformations will entail (although this is certainly easier to measure). Much current obfuscation research tries to devise such models [289], but we don’t yet have any that are developed enough to be used by practitioners.
1.4.1 Applications of Code Obfuscation
Now let’s look at a few scenarios where you can use code obfuscation to protect your code.
1.4.1.1 Malicious Reverse Engineering
In the first scenario, malicious reverse engineering, Doris builds a program that contains a valuable trade secret (a clever algorithm or design), which Axel, a rival developer, extracts and incorporates into his own program and sells to his customer, Carol:
This scenario is what most people have in mind when they think of code obfuscation. As we’ll soon see, it’s far from the only one. The assumption (although there’s no formal proof of this proposition) is that given enough time and resources, Axel will be able to reverse engineer any program. In other words, no secret hidden in a program will remain a secret forever. Doris’ goal, instead, has to be to use obfuscation to slow Axel down as much as possible, while at the same time adding as little overhead as possible. Ideally, the code is convoluted enough that Axel gives up trying to understand it and says “OK, fine, then! I’ll just reinvent this darned algorithm myself from scratch.” Ideally, Doris is able to choose just the right set of obfuscating transformations and apply them in just the right places to not make her program so slow and bloated that her customers will no longer buy it.
1.4.1.2 Digital Rights Management
In a digital rights management scenario, Doris is in the business of building a software media player. The player will only play music, images, or video that is distributed encrypted in a special file format known as a cryptolope. The player contains cryptographic keys that are necessary to unlock and play the media:
Since you want to be able to enjoy the encrypted media that you’ve bought in an “untethered environment,” say, watching a movie on your laptop on a plane where there’s no network connectivity, Doris is forced to store the decryption keys somewhere on your computer, probably inside your player’s code. Along with the keys, of course, the code will also contain the decryption algorithm and a decoder that turns the decrypted media into analog signals that you can hear or watch. In a typical player, the decoding chain looks like this:
You should notice that there are three targets for Axel to attack here. He could steal the keys (and if they’re universal keys he can now decode any media designed for this player, and, if they’re not tied specifically to him, he can sell them on the Web), he could steal the digital media, or he could steal the less desirable analog output. The possible weak points of such a system are many. First of all, it’s probably unreasonable to believe that the cryptographic algorithm used by the system will not be well known to an attacker. So unless the decryptor is obfuscated, a simple pattern-matching attack may be all that is necessary in order to locate the decryptor and the keys it uses. Dynamic attacks are also possible. For example, cryptographic algorithms have very specific execution patterns (think tight loops with lots of xors) and if they’re not heavily obfuscated, they’d be easy to find using a dynamic trace of the program. The keys themselves are a weak point. They’re long strings of bits with a high degree of randomness, and as such, unusual beasts in most programs. So Axel could simply scan through the player code looking for, say, a 512-bit long string that’s more random than expected. Any code that uses this string is likely to be the decryptor. Once Axel has found the location of the decryptor, he should have little problem finding where the decrypted media is generated and sent to the decoder. He can then simply add some code that writes the decrypted content to a file, and he’s done. What we learn from this is that Doris needs to obfuscate her code so that a simple pattern-match against it won’t reveal the location of the decryptor or decoder, or the interfaces between them. She needs to tamperproof the code so that Axel can’t insert new code, she needs to obfuscate not only the static code but also the dynamic behavior of the player, and she needs to obfuscate static data (the keys) in the code as well. And, still, she has to assume that these defense measures are only temporary. Given enough time, Axel will bypass them all, and so she needs to have a plan for what to do when the system is broken.
1.4.1.3 Mobile Agent Computing
In our next scenario, Doris sends out a mobile shopping agent, which visits online stores in order to find the best deal on a particular CD. The agent traverses the Web and asks every store it encounters if they have the CD and how much it costs, records the best price so far, and eventually, returns to Doris with the site where she can get the best deal. Of course, if evil Axel runs a store there’s no reason why he wouldn’t cheat. First of all, he can just erase the information that the agent has collected so far and substitute his own price:
This strategy will only help him if the agent returns directly to Doris when it’s done with Axel’s site. Much better (for Axel) would be to manipulate the code so that regardless of which stores it visits after his, it will still record his (higher) price as the best one.
One defense that has been proposed (there are many others) is for Doris to obfuscate the agent [165], thereby slowing down an attack. Ideally, this way Axel won’t have enough resources (he’s servicing many simultaneous requests, after all) to reverse engineer and modify the agent. Also, Doris might be able to detect that the agent spends a suspicious amount of time at Axel’s site. She can further complicate Axel’s attack by differently obfuscating every agent she sends out. This way, he won’t be able to speed up his analyses over time as he gathers more information about the agents and their defenses.
1.4.1.4 Grid Computing
In the grid computing scenario, Doris wants to run her program P but lacks the computational resources to do so. So she buys cycles from Axel to run P on his supercomputer, sends Axel P and the input data, and receives the output data in return. The problem arises when one or more of P, the inputs, or the outputs are confidential:
Doris must worry not only about Axel snooping on her algorithms or her inputs and outputs but also about his tampering with her program. If she can’t trust that P maintains its integrity on Axel’s site, she can’t trust the validity of the output data that Axel returns to her.
One way to defend the confidentiality of the inputs and outputs is to encrypt them and transform P into a program that operates directly on encrypted inputs and produces encrypted results. There is considerable research on such homomorphic encryption schemes, but the ones invented so far are inefficient and not applicable to real programs.
Alternatively, Doris can obfuscate P to help maintain confidentiality of its algorithms or tamperproof it to help maintain its integrity by using the techniques in this book. To preserve the confidentiality of the data, something similar to a DRM scheme can be used, where obfuscation and tamperproofing are used to hide and protect the encryption code.
Grid computing is a harder scenario to protect than many others. The reason is that you care about the confidentiality of algorithms and data, integrity of code, and on top of that, performance. The reason that Doris sent her code to Axel, after all, was so that it would execute faster on his superior hardware! She would be very unhappy, indeed, if the protection techniques she applied negated the performance boost she was paying for.
1.4.1.5 Artificial Diversity
Code obfuscation techniques have also been applied to operating systems to protect them against attacks by malware such as viruses and worms [74,75]. The idea is for Doris to randomize her code so that a malicious agent will not be able to locate or take advantage of a known vulnerability. Just like in the mobile agent scenario, we can take advantage of multi-versioning: If every distributed version of Doris’ code is obfuscated differently, Axel’s virus will need to be very clever to infect all of them:
This is known as artificial diversity. Of course, viruses themselves make use of obfuscation techniques to avoid detection by virus scanners, and with spectacular success. We will talk about this in the next section.
1.4.2 Obfuscating Transformations
It’s of course possible to take your program with its precious cargo and manually transform it into a mess that’s hard for your adversary to understand and manipulate. In practice, though, that’s too tedious and error-prone. A better idea is to build an obfuscation tool that translates your well-designed, easy-to-comprehend, easy-to-modify program into an incomprehensible mess of spaghetti code that’s near-impossible to alter. Such an obfuscator is similar to a compiler, except that instead of generating efficient and compact code, it generates code that’s hard for your adversary to comprehend.
Conceptually, an obfuscation tool takes four inputs: the program P you want to transform, the amount of obfuscation you would like to add, the amount of overhead you can accept, and a list of the code locations that are particularly precious to you that you would like to protect the most:
Internally, the obfuscator has a set of obfuscating code transformations, a set of program analyses needed to implement those transformations, and a loop that iteratively applies the transformations to P. The analyses will be similar to those used by compilers and reverse engineering tools. The process continues until the amount of obfuscation you desire has been reached or the maximum amount of overhead you can accept has been exceeded. The output is a program P that behaves the same as P but whose internal structure is very different. Practical code obfuscators may have a simpler structure than this. It’s common, for example, to have just a small number of transformations and to apply them in a fixed order.
There are four broad classes of obfuscating code transformations. Abstraction transformations break down the overall structure of the program, i.e., they obfuscate the way the programmer has organized the program into classes, modules, and functions. Data transformations replace the data structures the programmer has selected with others that reveal less information. Control transformations modify the control structures (if- and while-statements) in the program to hide the paths it may take at runtime. Dynamic transformations, finally, insert a transformer T into the program so that, at runtime, T causes the program to continuously transform itself. At runtime, the program therefore looks like this:
We’ve spread our discussion of code obfuscation over three chapters. In Chapter 4 (Code Obfuscation), you will see many control, data, and abstraction transformations. We’ll discuss the amount of confusion they add, how hard they are to defeat, and the amount of overhead they incur. In Chapter 6 (Dynamic Obfuscation), we’ll do the same for dynamic obfuscation. In Chapter 5 (Obfuscation Theory), we will look at the theoretical underpinnings of obfuscation. In particular, we’ll be interested in finding out what can be obfuscated, and what can’t.
To give you some idea of what obfuscating code transformations do, let’s go through a really trivial example. We’ll start with a little C program in its original form and show how it changes as you apply, first, an abstraction transformation, then a data transformation, next a control transformation, and finally, a dynamic transformation. Here’s the original program:
int main() { int y = 6; y = foo(y); bar(y,42); } |
int foo(int x) { return x*7; } |
void bar(int x, int z) { if (x==z) printf("%i\n",x); } |
The first thing we’re going to do is to hide the fact that the program consists of two functions. The programmer had something in mind when he decided to break the program into three parts, main, foo, and bar; presumably, this matched the mental model he had of his program. So let’s break this abstraction by merging foo and bar into one function, foobar. This new function takes three parameters. Two of them, x and z, are necessary to accommodate bar’s arguments, and the third, s, we’ll use to distinguish calls that should execute foo’s and bar’s bodies. Here’s foobar and the transformed version of main:
int main() { int y = 6; y = foobar(y,99,1); foobar(y,42,2); } |
int foobar(int x, int z, int s) { if (s==1) return x*7; else if (s==2) if (x==z) printf("%i\n",x); } |
Notice how it appears as if main calls the same function twice when, in fact, it’s really calling two different functions.
Now, in many programs the precious thing that you want to protect is data rather than code or design. This is, for example, the case in a digital rights management system where you want to prevent the adversary from getting their hands on the cleartext media. Ideally, in a system like that, the data is never in cleartext. Rather, it is always encoded in some incomprehensible (to the attacker) format and always operated on in this format. Let’s assume that, in our little example program, we want to protect all the integer values from the prying eyes of an attacker, who, for example, might be examining the program by running it under a debugger.
As it turns out, we’re lucky. The program only performs three operations on the data, namely, assignment, multiplication, and comparison for equality. Why is this lucky? Well, there’s a very simple encoding on integers that supports exactly these operations, namely, RSA encryption! We’ll leave the details of this encoding to later in the book. For right now, you’ll just have to take our word that setting
- p =3
- q =17
- N = pq = 51
- E(x)= x3 mod 51
- D(x)= x11 mod 51
leads to a program where no integer values are ever in cleartext:
int main() { // E(6) = 12 int y = 12; y = foobar3(y,99,1); // E(42) = 36 foobar3(y,36,2); } |
int foobar(int x, int z, int s) { if (s==1) return (x*37)%51; // E(7) = 37 else if (s==2) if (x==z) { // x11 = D(x) int x2=x*x % 51, x3=x2*x % 51; int x4=x2*x2 % 51, x8=x4*x4 % 51; int x11=x8*x3 % 51; printf("%i\n",x11); } } |
In particular, you can see how 6 is encoded as 12, 42 as 36, and 7 as 37! Not until the program absolutely has to have a value in cleartext (when it needs to pass it to printf to print it out) is it finally decoded. Note also that the multiplication x*7 takes place in the encoded domain; again, no values are in cleartext until necessary.
Structured programming dictates that you organize your functions by properly nesting conditional and loop statements. This makes the code easy to understand and modify. One popular kind of obfuscating control transformation, control flow flattening, rewrites functions to turn structured statements into spaghetti code. Here’s what the last version of the foobar function looks like after control structures have been replaced by plain old goto statements:
int foobar(int x, int z, int s) { char* next = &&cell0; int retVal = 0; cell0: next = (s==1)?&&cell1:&&cell2; goto *next; cell1: retVal=(x*37)%51; goto end; cell2: next = (s==2)?&&cell3:&&end; goto *next; cell3: next = (x==z)?&&cell4:&&end; goto *next; cell4: { int x2=x*x % 51, x3=x2*x % 51; int x4=x2*x2 % 51, x8=x4*x4 % 51; int x11=x8*x3 % 51; printf("%i\n",x11); goto end; } end: return retVal; }
Have a look at Listing 1.2▸25. Here, we’ve broken the body of foobar into two functions, A and B. This, by itself, isn’t a very effective transformation, but what’s interesting here is what happens to A and B at runtime. Every time foobar is called, it makes A and B trade places:
From an attacker’s point of view this code is hard to understand in two different ways. If he looks at our program statically, i.e., without running it, the abstraction transformation will have removed the way we chose to organize the program, the data transformation the way we chose our data structures, and the control transformations the way we structured our flow of control. If, instead, he decides to learn about the program by executing it, he will find that the dynamic obfuscation has violated a very basic assumption about programs, namely, that every time control reaches a particular point, the same code is executed.
Listing 1.2. The result of a dynamic obfuscating transformation. The functions A and B will continuously trade places at runtime. The swap function is in Listing 6.3▸377.
int start = 0; typedef int (*printfT) (char const *str,...); typedef int (*FuncPtr)(int,int,int,uint32,int,printfT,void * funcs); typedef FuncPtr FuncPtrArr[]; static FuncPtrArr funcs ={&A,&B}; int A(int x, int z, int s, uint32 begin, int start, printfT printf,void * funcs) { int next = 0; int retVal = 0; char* cells[]={&&cell0-(uint32)&A,&&cell1-(uint32)&A, &&cell2-(uint32)&A,&&cell3-(uint32)&A, &&cell4-(uint32)&A,&&end-(uint32)&A}; goto *(cells[next]+begin); cell0: next = (s==1)?1:2; goto *(cells[next]+begin); cell1: retVal=(x*37)%51; next=5; goto *(cells[next]+begin); cell2: next = (s==2)?3:5; goto *(cells[next]+begin); cell3: next = (x==z)?4:5; goto *(cells[next]+begin); cell4: FuncPtr f = ((FuncPtr*) funcs)[(1+start)%2]; f(x,z,s,(uint32)f,start,printf,funcs); next=5; goto *(cells[next]+begin); end: return retVal; } int B(int x, int z, int s, uint32 begin, int start,printfT printf,void * funcs) { int x2=x*x % 51; int x3=x2*x % 51; int x4=x2*x2 % 51; int x8=x4*x4 % 51; int x11=x8*x3 % 51; printf("%i\n",x11); } int foobar(int x, int z, int s) { int retVal = funcs[0+start](x,z,s,(uint32)funcs[0+start], start,printf,funcs); swap((caddr-t)funcs[0],(caddr-t)funcs[1],1024); start = (start+1) % 2; return retVal; }
1.4.3 Black Hat Code Obfuscation
For many years obfuscation was considered nothing more than a curiosity, something that no serious researcher would touch. The International Obfuscated C Code Contest (IOCCC) [177,227], for example, is an annual event that (humorously) tries to write the worst programs possible, C being a particularly suitable language for this task. It was generally assumed that there could be no real value to any code obfuscation technique and that anyone using one was just foolish for not using “real” security algorithms. Only fairly recently has it been acknowledged that there are legitimate applications where obfuscation and related techniques are the only available methods of protection.
Unfortunately, however, it’s turned out that the applications where obfuscation has had its most spectacular success are in what we like to call black hat scenarios. This should probably not come as much of a surprise. It’s not uncommon for the bad guys to adopt techniques first designed by the good guys. Cryptography, for example, can be used to protect the communication between criminals as well as between law enforcement agents. Steganography can be used by freedom fighters to avoid detection by an oppressive regime as well as by terrorists to avoid detection by national security forces. TV set-top box hackers have been known to first break through the smartcard-based defenses of the box and then turn around and use smartcards themselves to protect these hacks from counterattacks by the set-top box manufacturers!
One black hat scenario is when Axel uses obfuscation to protect his virus V from detection by Doris:
The virus comes in two parts, the payload, which is the code that’s designed to cause harm, and the obfuscator, which the virus uses to protect itself from discovery. In the first step, Axel infects a program P with V and sends it out into the wild. If Doris installs the infected P′ on her site, the virus may be able to infect another program, Q. Before infection, however, the virus uses the obfuscator to generate a different version of itself. The idea is that if every version of the virus is different, it will be difficult for Doris’ virus scanner to detect it. This is similar to the artificial diversity scenario you saw earlier, only this time the good guy and the bad guy have traded places!
1.4.3.1 Misdirection—Stealthy Obfuscation
If you look at a few of the programs submitted to the IOCCC, it should be clear that the code looks far from natural. While machine-generated code, obfuscated code, or optimized code can often look this bad, code written by humans typically doesn’t. For example, you can tell by looking at the obfuscator-generated code in Listing 1.1▸15 that it wasn’t written by a typical human programmer. So if Axel was looking for a secret in Doris’ program, a section that looked like this would likely raise his suspicion—could the secret be hidden behind this obviously obfuscated mess? You’ll see many cases in this book where the stealthiness of a protection technique is important; the attacker mustn’t be given a clue as to where in the code the technique was applied or the order in which a sequence of techniques were applied.
Misdirection is a particularly nasty black hat obfuscation technique that is based on the extreme stealthiness of an inserted bug. Look at Listing 1.3▸28, which shows a program to collect and tally the votes for American Idol. The program reads votes from standard input, and after the contest displays a summary of the votes cast. Here is a sample run of the program:
> cat votes-cast.txt alice alice bob alice dmitri bob zebra > java Voting < votes-cast.txt Total: 7 Invalid: 1 alice: 3 bob: 2 charles: 0 dmitri: 1
Listing 1.3. Obfuscated voting code.
public class Voting { final int INVALID-VOTE = -1; int invalidVotes, totalVotes = 0; String[] candidates = {"alice", "bob", "charles", "dmitri"}; int[] tally = new int [candidates.length]; BufferedReader in = null; BufferedWriter log = null; public Voting () { in = new BufferedReader (new InputStreamReader (System.in)); } public String readVote () { try {return in.readLine();} catch (Exception e) {return null;} } public boolean isValidTime (Date today) { SimpleDateFormat time = new SimpleDateFormat ("HH"); int hour24 = Integer.decode (time.format(today)).intValue(); return !(hour24 < 9 {{ hour24 > 21); } public int decodeVote (String input) { for (int i=0; i < candidates.length; i++) if (candidates[i].equals (input)) return i; return INVALID-VOTE; } public void logVote (Date date, int vote) throws Exception { if (log == null) log = new BufferedWriter (new FileWriter ("log.txt")); log.write ("TIME: " + date + " VOTE: " + vote); } public void printSummary () { System.out.println ("Total:"+totalVotes + "\nInvalid:"+invalidVotes); for (int i=0; i < candidates.length; i++) System.out.println ("Votes for " + candidates[i] +": "+tally[i]); } public void go () { while (true) { String input = readVote(); int vote = 0; if (input == null)break; try { Date today = new Date(); if (isValidTime (today)) vote = decodeVote (input); else vote = INVALID-VOTE; logVote (today, vote); } catch (Exception e) {} totalVotes++; if (vote == INVALID-VOTE) invalidVotes++; else tally[vote]++; } printSummary(); } public static void main (String args[]) { Voting voting = new Voting (); voting.go(); } }
Can you tell whether the program produces a fair count or not, or has it, in fact, been deliberately manipulated to favor a particular candidate? Before reading the answer in footnote 3, time yourself to see how long it took to analyze this 58-line program. Now, how long would it take for you to find a potential problem in a real-world voting system comprising hundreds of thousands of lines of code? What if the techniques used in Listing 1.3 were combined with those in Listing 1.1▸15? Might it be possible to provide enough confusion so that by the time the obfuscation has been penetrated and any irregularities identified, the next American Idol has already been selected (or, in the case of the United States’ presidential election, the Electoral College has convened and cast their votes)?
1.4.3.2 Obfuscating Viruses
As you will see in this book, there are important real-world problems that, to date, only obfuscation is able to tackle. Unfortunately, however, it’s in black hat code scenarios where obfuscation has had its most spectacular successes: Obfuscation is used by malware writers to protect their viruses, worms, trojans, and rootkits from detection. It is entertaining to examine techniques that hackers have invented and successfully used to foil security researchers, and to see whether the good guys can make use of the same techniques. Virus writers and virus scanner writers are engaged in a cat-and-mouse game: When a new virus detection technique is invented, the virus writers counter with a more sophisticated code obfuscation technique, which is then countered by a more powerful detection technique, and so on. So far, the hackers seem to be winning. Recent reports claim that 25% of the world’s computers have been penetrated and taken over by bot-nets [368]. Obviously, this can be attributed not only to the success of obfuscated malware but also to the fact that many people run unpatched operating systems, outdated virus scanners, and so on.
Virus writers typically beat virus scanners by making their virus code “invisible” to the scanners. Scanners have only a limited amount of resources and cannot fully analyze a file to decide whether or not it’s malicious. Even if they could, there are theoretical results [74] that state that it may still fail. The scanner therefore identifies a virus by its signature, some aspect of it that is easy enough to extract and that does not change from one infection to the next. This is similar to birthmarks, which you will see in Chapter 10 (Software Similarity Analysis).
What sort of tricks does a virus writer use to make the viruses stay “invisible” to virus scanners? Look at the self-reproducing viral Java program in Listing 1.4▸31. Real viruses are rarely, if ever, written in Java, but this simplified example will help you understand how a virus can use obfuscation to protect itself.
There are several things that are noteworthy in this Java virus. The first thing to note is that the program seems to have its entire source code encoded as a string within itself. In order to manage this in a finite space, a program takes advantage of the duplicate structure of the program. This trick is used in a common geek amusement to devise quines—programs that when run produce their own source code as their only output [1].
The source code is built in the variable self. It’s eventually written to a file and compiled using Sun’s internal Java compiler. The recent Slapper worm [320] and its variants also used a compiler to make the virus less platform-dependent. Using a compiler in this way has another side effect that is advantageous for a virus writer. Different compilers can produce slightly different (though semantically equivalent) programs from the same source code. This makes it a bit harder to find good signatures that scanners can use to reliably detect a virus.
In Listing 1.4▸31, the example goes one step further. Before the new copy of the virus is written to a file, it is passed to the function morph. The morph function adds a i++ statement between every line of the main program. This instruction has no effect on the output or behavior of the program. However, this trivial obfuscation ensures that every new version of the program will be different from its predecessor! This means that a virus scanner that uses a trivial checksum-based signature detector will not be powerful enough to identify the virus.
Listing 1.4. A metamorphic virus in Java.
import java.io.*; public class m { private static int i = 0; private static com.sun.tools.javac.Main javac= new com.sun.tools.javac.Main(); public static String morph (String text) { return text.substring(0,360) + text.substring(360).replaceAll ( new String (new char[] { ';' }), new String (new char[] { ';','i','+','+',';' })); } public static void main (String[] a) throws IOException { String self="import java.io.*; public class m { private static int i = 0; private static com.sun.tools.javac.Main javac=new com.sun.tools.javac.Main();public static String morph (String text) { return text.substring(0,360) + text.substring(360).replaceAll (new String (new char[] { ';' }),new String (new char[] { ';','i','+','+',';' }) );} public static void main (String[] a) throws IOException { String self=@;char q=34;char t=64;String text= self.replaceAll(String.valueOf(t),q+morph(self)+q);String filename = new String (new char[] { 'm','.','j','a','v', 'a' });File file=new File(filename); file.deleteOnExit(); PrintWriter out=new PrintWriter(new FileOutputStream(file )); out.println(text);out.close(); javac.compile(new String[]{filename});}}"; char q=34; char t=64; String text=self.replaceAll(String.valueOf((char)64), q+morph(self)+q); text=text.replaceAll(String.valueOf((char)35), String.valueOf((char)34)); String filename = new String (new char[] { 'm','.','j','a','v','a' }); File file = new File(new String (filename)); file.deleteOnExit(); PrintWriter out = new PrintWriter(new FileOutputStream(file)); out.println(text); out.close(); javac.compile(new String[] { filename }); } }
Imagine if, instead of our very trivial morph function, a virus writer included and used one of the more sophisticated obfuscations we discussed in the last section. What sort of analysis would a scanner need to perform to detect such a virus?
Real viruses use obfuscation to counter detection by ensuring that as many properties as possible change between infections. Metamorphic viruses use obfuscating transformations to change the entire body of the program from one generation to the next. Polymorphic viruses are a simpler variation of metamorphic viruses. Instead of morphing the entire program, each generation of the virus encrypts itself with a different key. Of course, for the virus to still be functional it must also contain a decryption and execution routine. Polymorphic viruses only morph this decryption and execution routine. In this way, morphing protects the decryption and execution routine from the scanner, while encryption protects the rest of the program.