Software [In]security: Comparing Apples, Oranges, and Aardvarks (or, All Static Analysis Tools Are Not Created Equal)
Since the earliest days of ITS4 in the year 2000, static analysis tools that scan source code looking for security bugs have seen great progress, both from a market perspective and from a technical perspective. Today, because two of the leading tools have been acquired by gigantic technical juggernaut companies (Fortify by HP and Ounce Labs by IBM), we are on the cusp of even broader adoption. This is great news for software security.
As the market for these highly technical tools grows, so too does the desire to compare and contrast their performance. At first blush, comparing tools that basically "do the same thing" should be easy, right? As we describe in this article, unfortunately not.
As the Market Grows
For the last several years we have tracked the growth of the software security space in a series of articles. The latest article, describing the software security market of 2009 (topping out at $554.4 million) had this to say about static analysis tools:
The high end of the market continues to be ever more interested in solving software security issues and not simply diagnosing them. Mitigation is a central concern, and white box analysis is essential. But the low end of the market is beginning to expand rapidly and appears to be starting out with Web app firewalls and black box testing tools. Such basic products are easy to purchase with standard IT funding and they can be put in place without much process change.
You see, source code scanning tools get to the heart of the matter by identifying actual lines in a program that lead to a security vulnerability (something that makes them more difficult to use and more difficult to compare). Mitigating a security problem in an application can be much easier when you know where in the code to look. By contrast, black box tools such as web application testing tools take an outside→in approach. Though they certainly find vulnerabilities in severely broken web apps, results can be sufficiently vague about exactly where the vulnerability lies that fixing the app remains a major challenge best left to others.
Because many organizations treat black box testing tools in many ways like a "fire and forget" technology, they allow inexperienced non-programmers to wield them. As long as they are treated like badness-ometers and not security meters, that is OK — well, it's OK until the time comes to fix the problems that they find!
We expect the middle market to adopt black box web app testing tools before it moves on to static analysis tools (note that we described this in our 2009 article). But when static analysis tools begin to come of age in the middle market (something we expect to see within the next 2-3 years), we expect rapid adoption to follow. That's because early adopters today feel they've reached the height of what they can accomplish wielding black box testing tools and are busy moving their bug-finding endeavors earlier in the development lifecycle. Static analysis tools promise to help them meet this goal.
Compared to What
So, the good news is that interest is rising in static analysis tools. Better yet, security managers are moving beyond piloting these tools on a handful of high-risk applications and are now deploying them organization-wide so that developers receive feedback on their code's security more quickly. The bad news is that comparing static analysis tools proves difficult at best. Further, lessons learned in comparing and deploying penetration testing tools don't necessarily translate well to static tools. Watch out: the process of selecting and deploying a static analysis tool faces entirely different challenges than selecting and deploying a black box testing tool.
Many firms among the early adopters of code-scanning tools wasted a ton of money selecting, piloting, and implementing static analysis tools. In certain circumstances, as Cigital consultants, we have watched, mouths agape, as certain tool vendors underestimated the cost of deployment in a large enterprise by $2.5M during the first year (this real-world number is an average over six large-scale adoption efforts). This awe-inspiring waste (both absolutely and as a percentage of tool license cost) validates the potential benefit of a project to select and deploy a tool more carefully.
So what can we do to help firms that want to use static analysis select the right tool, adopt it in as painless a manner as possible, and scale its use effectively?
Pitfalls, Puddles, and Pentagrams
There are a number of pitfalls to any tool comparison approach that are worth bearing in mind as we think about this problem. If and when you set up an experiment to compare such tools (or even when you discuss your own informal comparisons) keep these five issues front and center.
Pitfall 1: Tool A will perform dramatically differently than Tool B on the same code base, despite the fact that both A and B claim similar/identical language/rule support in a particular context. Most tools today tout the ability to detect buffer overflows in out-of-range char[] access in C++, for example. Though this is a very specific vulnerability type, different tools yield very different results. The reason for this behavior is that specifics of tool performance rely on esoteric detail such as:
- Engine and rule implementation (often discernible through only proprietary knowledge of a particular tool).
- Structure and style of code (only discernible through a wealth of code reviewing experience) including esoterica such as: calling convention; order of operations; depth of call chain and related block/scope issues; and, use of inheritance, polymorphism, pointers and other indirection.
Pitfall 2: Tool A will perform very differently than (the very same) Tool A over two test code bases because of the same technical criteria listed above.
Pitfall 3: The particular configuration of a tool may dramatically change performance. Configuration settings including settings for memory usage, scan parameters, and other tool-specific settings often directly impact findings.
Pitfall 4: Operator use of a given tool has overwhelming impact on that tool's performance. Sadly we have witnessed tool vendors' own staff misconfigure their tools for a scan (more than once). Because this is a slippery slope, different operators can (and do) produce result sets in which a majority of findings are added, removed, or changed by comparison to another operator's scan. This even happens to the veteran tool operators at Cigital.
Pitfall 5: Tool A's minor release N+1 may well perform dramatically differently than release N under a fixed set of criteria.
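One way to make Pitfalls 2 through 5 visible in your own experiments is to diff the result sets of two scans (two operators, two configurations, or two releases) and measure how much changed. The following is a minimal sketch, not any vendor's export format: the Finding fields, the matching key of (file, line, rule), and the churn measure are all assumptions chosen for illustration.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    """One reported issue. Fields are illustrative, not a real tool's schema."""
    file: str
    line: int
    rule: str
    severity: str


def diff_scans(baseline, candidate):
    """Classify findings as added, removed, or changed between two scans.

    Findings are matched on (file, line, rule). A matched pair whose
    severity differs is counted as changed.
    """
    base = {(f.file, f.line, f.rule): f for f in baseline}
    cand = {(f.file, f.line, f.rule): f for f in candidate}
    added = sorted(cand[k] for k in cand.keys() - base.keys())
    removed = sorted(base[k] for k in base.keys() - cand.keys())
    changed = sorted(
        (base[k], cand[k])
        for k in base.keys() & cand.keys()
        if base[k].severity != cand[k].severity
    )
    return added, removed, changed


def churn(baseline, candidate):
    """Fraction of all distinct findings that differ between the two scans."""
    added, removed, changed = diff_scans(baseline, candidate)
    keys = {(f.file, f.line, f.rule) for f in list(baseline) + list(candidate)}
    if not keys:
        return 0.0
    return (len(added) + len(removed) + len(changed)) / len(keys)
```

A churn above 0.5 between two operators scanning the same code corresponds to the "majority of findings" situation described in Pitfall 4. Matching on exact line numbers is deliberately naive; real result sets usually need fuzzier matching, which is itself a source of disagreement.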
Fruit versus Aardvarks
Pitfall 1 means that high-quality comparison is extremely difficult to achieve. Pitfalls 2 and 3 mean that high-quality test set-up (even for a single tool) is extremely difficult to achieve. Pitfall 4 indicates (to us anyway) that finding qualified people to compare tools objectively and meaningfully is difficult at best. Pitfall 5 warns us that even published results may be very misleading. And, though certainly not as reliable, the always-seductive "oral tradition" among test maintainers misleads as well.
The five pitfalls collectively also imply that just about any public test criteria developed is likely both to produce very inconsistent results if implemented by different firms and to produce very inconsistent results sensitive to when in the release cycle particular static analysis tools are considered. The upshot? Use your own code instead of a pre-fab evaluation suite. You probably have the makings of a good set of tests within your own organization's application base, especially if you take into account recent historical penetration testing results.
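As a rough illustration of using your own history as the yardstick, the sketch below scores a tool's output against vulnerabilities already confirmed by earlier penetration tests. It assumes findings can be reduced to (application, category) pairs, since black box results rarely pinpoint a line of code; the function name and the category labels are hypothetical.

```python
def score_tool(known_vulns, reported):
    """Compare a tool's reported issues with previously confirmed ones.

    known_vulns: set of (application, category) pairs confirmed by past testing.
    reported:    set of (application, category) pairs the tool flagged.
    """
    found = known_vulns & reported
    missed = known_vulns - reported
    # Not previously confirmed: could be real new bugs or noise, so triage them.
    unconfirmed = reported - known_vulns
    recall = len(found) / len(known_vulns) if known_vulns else 0.0
    return {
        "found": found,
        "missed": missed,
        "unconfirmed": unconfirmed,
        "recall": recall,
    }
```

Recall against known issues is only one lens. The unconfirmed set still needs human review, and the pitfalls above mean the numbers should be re-run whenever the tool version, configuration, or operator changes.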
We shouldn't really have to mention the problem of vendors "coding to the test," but we will. Vendors do, in fact, design their tools to pass particular tests with flying colors. If you did any tool bake-offs in the 2004-2006 time frame (we did a bunch), OWASP code and certain open-source packages provided particularly interesting/amusing test cases.
The Bottom Line
Seek out experience. We can toot our own Cigital horn a bit here. After all, we invented ITS4 and built the technology that became the Fortify tool once we licensed it to Kleiner-Perkins. Many of us have built static analysis in various capacities and collectively we have succeeded in helping a large number of clients select, implement, and scale their static tool initiatives. Heck, our consultants even include those who built direct tool comparisons for NIST.
Do not compare fruit and aardvarks. When selecting a tool, try to determine whether a given tool will help raise your code quality and how. Use test suites consisting of code representative of what you'll be scanning in business-as-usual implementation to answer that question.
Keep in mind that analysis capabilities that a given tool possesses represent less than half the battle. Think carefully about implementing the tools you are comparing in a production code assessment environment. "What steps must I take to find instances of Vulnerability X reliably using Tool A in my 3,000 applications?" is a much better question than "Does Tool A find problem 47?"
Take into account customization. In our experience, organizations obtain the bulk of the benefit in static analysis implementations when they mature towards customization. For instance, imagine using your static analysis tool to remind developers to use your secure-by-default web portal APIs and follow your secure coding standards as part of their nightly build feedback. (Unfortunately, the bulk of the industry's experience remains centered around implementing the base tool.) Though organizations that have reached maturity always indicate they spend more on customization and maintenance, these aspects of tool comparison almost never register in the initial selection process. Consider what expertise and effort a potential tool choice will require in years 3-5 most carefully.
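To give a feel for what a custom rule of this kind involves, here is a toy sketch of a check that steers developers toward in-house APIs. It is not how any commercial tool expresses rules: it inspects Python source with the standard ast module, performs no data-flow analysis, and the replacement names (corp_crypto.digest, corp_serial.safe_loads) are invented placeholders for an organization's secure-by-default wrappers.

```python
import ast

# Hypothetical policy: discouraged call -> in-house secure-by-default replacement.
PREFERRED_APIS = {
    "hashlib.md5": "corp_crypto.digest",
    "pickle.loads": "corp_serial.safe_loads",
}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def check_source(source, filename="<string>"):
    """Yield (filename, line, message) for each call to a discouraged API."""
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in PREFERRED_APIS:
                message = f"{name} is discouraged; use {PREFERRED_APIS[name]}"
                yield (filename, node.lineno, message)
```

Even a rule this small has to be written, tested against real code, and kept current as the in-house APIs evolve, which is the kind of ongoing customization and maintenance cost that rarely shows up in an initial tool selection.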
We think that the notion of comparing static analysis tools is an important one. Indeed, it is something we do every day. However, proper care must be taken to get meaningful results.