Being Methodical About Test Design
Google started like many companies, with software developers writing and testing their own code. We developed and tested Search. We developed and tested Ads. The number of products and projects grew along with our campus, our employees, and our market share.
As Google grew, the company hired contractors to perform manual testing. A line was drawn in the sand. On one side of that line, developers performed unit testing, managed the build process, and performed basic integration testing on the software we created. On the other side, the contractors -- often in obscene numbers -- performed the manual testing. The image in my mind is one of bees buzzing around a hive.
As you might imagine, the growth of Google created a veritable army of contract testers. Testing became an expensive, unpredictable process that depended on individual talent and the sheer weight of contractor labor.
Something had to give.
What gave was a return to traditional Google development values: The developers, not testers, own quality. This means that the quality buck stops in development. We went back to the days of Search and Ads where everyone was the jack of the development and testing trade. Specifically, developers cannot hand off mostly-complete code to testers who test, debug, and write the fit and finish code. Developers own software quality, from appetizer through dessert. If the product fails, it's the fault of those who built it.
This is an intuitively pleasing idea. Who better to own quality than Development? Imagine the care involved in writing code that you yourself must test. Imagine its testability. Imagine its simplicity.
That was the plan and it took us... only so far. Because there is a hole in this reasoning through which you can fly a very large bug.
Do you see it?
Simply put, the problem is scope. It takes many developers to write a single application. If each developer is responsible for the quality of his component, who is responsible for the quality of the whole?
Users don't recognize module boundaries. Few real software failures fall neatly within those boundaries. Instead, failures are the result of several modules and the data they manipulate colluding to corrupt memory, hose a computation, or cause some specific capability to go pear-shaped. Users care little about the who, what, why, or how behind the reason they can't get work done. They blame it on the product or the company that produced the product. And this is precisely the tester’s role; it is our job to surface these higher-level concerns. It is our job to assess the collective whole, not an individual developer's work. It is within that collective whole that the worst failures occur. It is the collective whole that must be reliable. And it is the collective whole that is the domain of the software tester.
However, simply looking at the collective whole does not find a lot of problems. At least it doesn’t find a lot of important problems.
Test driving a car is looking at a car holistically. But how often does a test drive find a real bug? A test drive is about look and feel; it's too high level to be a good bug finding exercise. To find a bug, you hire a mechanic to look at the car, component by component and subsystem by subsystem. Mechanics don't test by driving the car on a date or by taking a Sunday drive. They test by monitoring specific subsystems and by looking for specific types of problems that eventually work their way out as a bug on a Sunday drive. A mechanic finds flaws quickly in the fuel system, the exhaust system, the electrical system. Sunday drivers must be patient for such flaws to work their way into their line of sight.
Proper software testing requires a combination of Sunday driving and a mechanic's analysis. It is about looking at the big picture and analyzing individual components and capabilities and how they contribute to the collective whole. The way we now do exploratory testing at Google treats it as such.
Rethinking Manual Testing
Manual testing generally takes one of two forms: script-based testing or exploratory testing.
Scripting is a front-loaded technique. A great deal of effort is expended in the planning phase to create test scripts that can be executed without a great deal of variation or even thought. Scripts are written based on expected usage, so-called “user stories” that describe how a real user trying to get work done will use the application in question. Once completed, these scripts can be handed to just about anyone and applied to the software as test cases.
Scripting’s benefit is that the scripts can be written while the software is still under development, allowing a parallelization of test planning and development. By the time the software is ready to test, testers have any number of test scripts ready to go. Likewise, well-written scripts can be easily outsourced for inexpensive execution.
With exploratory testing, the problem is often that it treats the system under test as a black box. Apply some inputs, observe some results, try to be clever about applying more inputs based on those results, and hope you find bugs. Software and the problems it solves are too important to be treated so casually. Real engineering practice is called for. Software deserves it.
Neither script-based testing nor exploratory testing is enough. Scripts are too rigid. Exploration is too touchy-feely. At Google, we are looking at ways to build more engineering practice around manual testing. Interestingly enough, what we stumbled upon is not only a technique for guiding manual testing but has also turned into a technique for designing test automation.
What we are really doing when we apply inputs and observe results is slicing a single piece of functionality, often across the domains of multiple developers, from the software. Does this slice work as we expect? If yes, we come up with another slice; if no, we investigate a potential bug. The challenge is to find a slicing technique that represents real-world usage, finds bugs, and achieves good coverage. It should be repeatable, teachable, and ultimately part of the testing culture at Google.
Sunday drives aren't slices so much as meanderings. A mechanic's diagnosis is more like taking chunks, not slices. What we need is something in between.
The Test Plan as a Map
What goes into a test plan? This is an open-ended question and not one that is fully answered here. However, the main ingredient of a good test plan is a summary of all the testable functionality within an application. A good test plan should help a tester choose which slices of an application to select for a given test case. At the very least, it should identify the possibilities.
At Google, we're beginning to think about test plans as maps. Tourists use maps to decide which parts of a city or destination to visit. They use maps to plan their routes. They may slice up a city into sections based on the map. A test plan should do the same thing for testers, but most fall short of such lofty goals.
Tourists may drag their maps along with them. Indeed, many wouldn't leave their hotel without one. The maps become wrinkled, folded, and ragged, and if a map is lost, a new copy is sought out immediately. If only testers found their plans as useful.
We've been experimenting with a number of processes at Google to help make test plans act more as maps. The one we found the most useful is a process we call Component, Feature, Capability Analysis, for lack of a better term. It involves first identifying the major components of the application under test, then refining the components into features, and finally refining the features into capabilities. This is like looking at a city as a collection of districts (components), attractions (features), and activities (capabilities). Just as a tourist may select a district, then choose an attraction and experience some set of activities for the attraction (visit Florida, go to Disney, ride Space Mountain), a tester does the same when she decides which component to test, what features to exercise, and which capabilities will actually get executed as a result.
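One way to picture the three levels is as a small nested data structure. The sketch below is only an illustration of the idea, not the tooling we use: the class names, the helper iter_capabilities, and the "Browser" entries are all invented for this example, while the Florida entry simply restates the tourist analogy.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    capabilities: list[str] = field(default_factory=list)


@dataclass
class Component:
    name: str
    features: list[Feature] = field(default_factory=list)


def iter_capabilities(components):
    """Yield (component, feature, capability) triples, one per capability."""
    for component in components:
        for feature in component.features:
            for capability in feature.capabilities:
                yield component.name, feature.name, capability


# The tourist analogy from the text, expressed in the same structure.
trip = [
    Component("Florida", [Feature("Disney", ["ride Space Mountain"])]),
]

# Hypothetical application entries, invented purely for illustration.
app = [
    Component("Browser", [
        Feature("Tabs", ["open a new tab", "close a tab"]),
        Feature("Bookmarks", ["add a bookmark"]),
    ]),
]

# Each printed line is one candidate slice: district > attraction > activity.
for triple in iter_capabilities(trip + app):
    print(" > ".join(triple))
```

Walking the structure from the top down mirrors the tester's choice: pick a component, pick a feature, then pick the capabilities to exercise.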
Here's an example (simplified for illustration) for our new operating system, Chrome OS. We list the components and features as a matrix where the components are derived largely from the structure of the development team. Asking a developer, “What component do you work on?” and repeating this for all the developers is a great way to build the list of components. These represent the row labels in Figure 1.
Figure 1. A component feature matrix for a subset of Chrome OS functionality.
The column labels are the features. These are determined by asking the question “What does this component do?” Ask developers to demo their work (show me how cool your feature is!), and you'll end up with a fairly exhaustive list of features. Google developers are always happy to demonstrate their work. I can’t imagine a much different phenomenon at other companies.
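The interviewing described above can be sketched in a few lines of code. This is a rough illustration under stated assumptions: the functions build_matrix and render are hypothetical, and the component and feature names are placeholders that are not taken from Figure 1.

```python
def build_matrix(answers):
    """Build a component-by-feature matrix.

    answers: iterable of (component, feature) pairs collected by asking
    developers what they work on and what it does.
    Returns (rows, columns, cells), where cells[row][column] is True
    when that component provides that feature.
    """
    pairs = set(answers)
    rows = sorted({component for component, _ in pairs})
    columns = sorted({feature for _, feature in pairs})
    cells = {r: {c: (r, c) in pairs for c in columns} for r in rows}
    return rows, columns, cells


def render(rows, columns, cells):
    """Return the matrix as plain text, marking each filled cell with an x."""
    width = max(len(r) for r in rows)
    lines = [" " * width + " | " + " | ".join(columns)]
    for r in rows:
        marks = [("x" if cells[r][c] else "").center(len(c)) for c in columns]
        lines.append(r.ljust(width) + " | " + " | ".join(marks))
    return "\n".join(lines)


# Placeholder interview answers, invented for illustration only.
answers = [
    ("Browser", "Tabs"),
    ("Browser", "Bookmarks"),
    ("Networking", "Wi-Fi setup"),
]

rows, columns, cells = build_matrix(answers)
print(render(rows, columns, cells))
```

Rows come from the "What component do you work on?" answers and columns from the "What does this component do?" answers, so the matrix grows as more developers are interviewed.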
But this is still not a low enough level of detail to test by. As a map, it identifies more than it guides. Imagine a street map of a tourist destination that only lists business names without specifying the type of business or what goods and services are on offer. That's too high a level of detail to be useful to a tourist, and the same goes for testing.
The next refinement we do at Google is our attempt to document what we call the testing surface. That is the sum total of all the capabilities that we can and should exercise in one or more test cases. Figure 2 shows a few of the capabilities for Chrome OS.
Figure 2. Adding capabilities to the matrix gives testers more concrete guidance about testable functionality.
When we do this in practice, we fill out a spreadsheet. For a complicated application, that spreadsheet can grow to numerous pages in length.
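For readers who would rather generate such a spreadsheet than type it by hand, here is a minimal sketch that writes one row per capability in CSV form. It assumes the testing surface has already been collected as (component, feature, capability) triples; the function write_testing_surface and the sample rows are hypothetical and do not reproduce Figure 2.

```python
import csv
import io


def write_testing_surface(triples, out):
    """Write one spreadsheet row per (component, feature, capability)."""
    writer = csv.writer(out)
    writer.writerow(["Component", "Feature", "Capability"])
    # Sorting keeps each component's features and capabilities together.
    for row in sorted(triples):
        writer.writerow(row)


# Placeholder rows, invented for illustration only.
surface = [
    ("Browser", "Tabs", "open a new tab"),
    ("Browser", "Bookmarks", "add a bookmark"),
    ("Networking", "Wi-Fi setup", "join a saved network"),
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_testing_surface(surface, buffer)
print(buffer.getvalue())
```

Passing a file opened with open(path, "w", newline="") instead of the buffer would produce a file that a spreadsheet program can open directly.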