1.7 Software Similarity
There’s a class of software protection problems that are not amenable to algorithms based on code transformations, and we lump them together under the term software similarity analysis. They have in common that, conceptually, they rely on your being able to determine if two programs are very similar or if one program is (partially) contained in another. We capture this in the two functions similarity and containment:
1.7.1 Plagiarism
We think of plagiarism as chiefly occurring in academic settings where students copy each other’s assignments or researchers copy each other’s work, but it’s really found anywhere that some humans create and others try making a shortcut to profit by “borrowing” these creations. Ideas from works of art are copied, as are pieces of music, furniture, or fashion designs and so on. Over the years, many famous authors, artists, and musicians have found themselves in court for incorporating too much of a colleague’s work into their own. Exactly what “too much” means is ultimately left up to the courts to define. Famous cases include John Fogerty, who was sued for plagiarizing himself (known as self-plagiarism) when his new songs sounded too much like his old ones that were under copyright to a previous publisher.
In this book, we’re interested in plagiarism of code. This occurs frequently in computer science classes where students find it easier and faster to borrow their classmates’ assignments than to write them from scratch themselves. Here, Axel has copied a piece of Doris’ program Q, inserted it into his own program P, and submitted it as his own:
With large classes, it becomes impossible for computer science professors to manually look for code copying in every pair of submitted programming assignments. Instead (being programmers and used to automating everything), they build tools that perform the comparisons automatically. For the example above, the output might look something like this, listing all the pairs of programs in order, from most likely to least likely to have been enhanced by copying:
Automatic methods are best used to weed out from consideration programs highly unlikely to be the result of plagiarism, leaving a few serious suspects for the professor to examine by hand.
Students who are aware that instructors are using tools to look for copying will naturally try to fool them. Simple code transformations such as renaming identifiers and reordering functions are common, and it’s therefore important that plagiarism detectors are immune to such transforms.
1.7.2 Software Forensics
Software forensics answers the question, “Who, out of a group of suspected programmers, wrote program S?” To answer the question, you need to start out with code samples from all the programmers you think might have written S:
From these samples, you extract features that you believe are likely to uniquely identify each programmer and then compare them to the same features extracted from S. From the example above, the output might look something like this, indicating that Axel is much more likely to have written S than either Doris or Carol:
Here, f is the function that extracts a feature set from a program. Exactly which set of features will best indicate authorship is hotly debated, but the feature sets often include measures related to line length, function size, or the placement of curly braces.
Most work on software forensics has been done on source code, but binary code forensics is just as interesting. For example, once the FBI catches a suspected malware author, they could compare his programming style to those of all known viruses. Being able to bring a large collection of malware to court as evidence against him might strengthen the government’s case.
1.7.3 Birthmarking
You’ve already encountered the idea of code lifting, a competitor copying a module M from your program P into his own program Q:
Both obfuscation and watermarking can make this attack harder to mount successfully. By obfuscating P you make it more difficult to find M, or more difficult to extract it cleanly from P . For example, you could mix M with code from other modules so that along with M’s code any automatic extraction tool would produce a whole lot of irrelevant code.
You could also embed a watermark or fingerprint in M. Say, for example, that M is a graphics module produced by a third party (you) that Doris licenses to use in her own game. If Doris’ fingerprint shows up in a program sold by Axel, you could use that as evidence that he’s lifted it, and even evidence that he’s lifted it from Doris’ program. You could take legal action against Axel for code theft or against Doris if she’s not lived up to her license agreement to properly protect the module from theft.
For a variety of reasons, you may choose not to use obfuscation and water-marking. For example, both come with a performance penalty, and obfuscation can make debugging and quality assurance difficult. Also, you may want to detect theft of legacy code that was never protected against intellectual property attacks. Instead of using either one, you could simply search for your module M inside the attacker’s program Q:
This works fine, unless the adversary has anticipated this and applied some code transformations, such as obfuscation, to M (or possibly all of Q) to make it hard to find M:
Depending on the severity of the code transformations, this could make a straightforward search for M difficult:
This is where the concept of birthmark comes in. The idea is to extract “signals” from Q and from M, and then look for M’s signal within Q’s signal rather than looking for M directly within Q:
Here, f is a function that extracts the signal, which we call a birthmark, from a program or module. The idea is to select the birthmark so that it’s invariant under common code transformations such as obfuscation and optimization.
We know of at least one case where birthmarking was successfully used to argue code theft. In a court case in the early 1980s [128], IBM sued a rival for theft of their PC-AT ROM. They argued that the defendant’s programmers pushed and popped registers in the same order as in the original code, which was essentially a birthmark. They also argued that it would be highly unlikely for two programs to both say push R1; push R2; add when push R2; push R1; add is semantically equivalent.
1.7.4 A Birthmarking Example
To be effective, a birthmarking algorithm must extract the mark from a language feature that is hard for an attacker to alter. One idea that has been reinvented several times and that we’ll explore further in Chapter 10 (Software Similarity Analysis) is to compute the birthmark from the calls the program makes to standard library functions or system calls. Some of these functions are difficult for the adversary to replace with his own functions. For example, the only way to write to a file on Unix is to issue the write system call. A birthmark extracted from the way the program uses write system calls should therefore be reasonably robust against semantics-preserving transformations.
Consider this C function that reads two strings from a file, converts them to integers, and returns their product:
int x () { char str [100]; FILE *fp = fopen(" myfile ", "rb"); fscanf(fp ,"%s",str); int v1 = atoi(str); fscanf(fp ,"%s",str); int v2 = atoi(str); fclose(fp); return v1*v2; }
Several birthmark-extracting functions are possible. You could, for example, make the birthmark be the sequence of calls to standard library functions:
- bm1(x) = fopen, fscanf, atoi, fscanf, atoi, fclose
Or you could ignore the actual sequence, since some calls are independent and could easily be reordered. The resulting birthmark is now a set of the calls the function makes:
- bm2(x) = {atoi, fclose, fopen, fscanf}
Or you could take into account the number of times the function calls each library function. The birthmark becomes a set of system calls and their frequency:
- bm3(x) = {atoi → 2, fclose → 1, fopen → 1, fscanf → 2}
An attacker would get a copy of x, include it in his own program, P, and perform a variety of transformations to confuse our birthmark extractor. Here, he’s renamed the function, added calls to gettimeofday and getpagesize (functions he knows have no dangerous side effects), reordered the calls to fclose and atoi, and added further bogus calls to fopen and fclose:
void y () {...} int f () { FILE *fp = fopen (" myfile ", "rb"); char str [100]; struct timeval tv; gettimeofday (&tv , NULL ); fscanf (fp ,"%s",str ); int v1 = atoi ( str ); fscanf (fp ,"%s",str ); fclose (fp ); int v2 = atoi ( str ); int p = getpagesize () * getpagesize (); fp = fopen (" myfile ", "rb"); fclose (fp ); return v1*v2; } void z () {...}
Bogus calls are shaded in dark gray. Assuming that the rest of P (functions y and z) don’t contain any standard library calls, you get these possible birthmarks for P:
bm1
(P) = < fopen, gettimeofday, fscanf, atoi, fscanf, fclose, atoi, getpagesize, getpagesize> bm2
(P) = {atoi, fclose, fopen, fscanf, getpagesize, gettimeofday} bm3
(P) = {atoi ↦ 2, fclose ↦ 2, fopen ↦ 2, fscanf ↦ 2, getpagesize ↦ 2, gettimeofday ↦ 1}
To determine whether the attacker has included your function x in his program P, you compute containment(bmi (x), bmi (P)), where containment returns a value between 0.0 and 1.0 representing the fraction of x that’s contained in P. In this case, it would seem like the attacker has done a pretty good job of covering his tracks and altering the sequence of calls, the set of functions being called, and the frequency of calls to the different functions.