An Interview with Watts Humphrey, Part 26: Catastrophic Software Failures and the Limits of Testing
This interview was provided courtesy of the Computer History Museum.
Initial PSP Use
Humphrey: So a lot of people
were very helpful with my programming problems. But when it came to the PSP idea itself and how to do it, by and large just about everybody I was working with at SEI and elsewhere thought I was kind of a nut. I mean, "What in the world is the point of doing that?" was sort of the reaction I was getting. And remember, I'd gone through and written the programs, and
then I gave a talk down in
There was a group --
a Siemens Research lab in
And
so I went home all encouraged that I'll get these guys to actually use it.
I'd given them the process, I'd described it to them. And
so I kept calling, like every week: "Where do you stand? How are you doing?" "Oh, we're doing a prototype here; we don't have time to do it yet." And they never did. They never got to it. They were in too big a hurry
to do other stuff. What was interesting was as part of that, I'd actually taken
time to sit down with each of the engineers on the team. There were five
engineers, and they had a process guy there. He was really a technology guy, but he was acting as the process guy, I guess. But in any event, he and I would sit down and interview each of the engineers on the team individually. What I wanted to find out was, what process are you using personally? And they didn't know how to answer that question, so I'd kind of lead them through
it.
I said, okay, when
you get an assignment to write a module of code, say 2,000 lines of code, what
do you get? So they'd tell me, and then say, "Here's how I'd start." I'd say, "Well, what do you do next?" So they'd start to describe that. They would describe what they did, and I'd lead them through it step by step, and I wrote down on the board what they were saying. And every so often they'd tell me what they did next, and I'd say, "Well, don't you have to do this too?" And they'd say, "Oh yeah, well, I forgot that." And so we
built the process for each of these five people. I don't know if it's in my
notebooks or not. But I wrote it on the board. I'd write it where they could
see it. But in any event, they were all -- the five of them on one team --
totally different. It was amazing. I mean, one young guy came right out of
school, seemed like a real sharp guy, but he basically started coding in the
middle and he didn't do any design work at all. He just sort of had this idea
of what it was and how it was going to work, and so he started coding in the
middle of this big thing, and he'd sort of build it.
And at the other
extreme, there was an engineer -- I think he was actually from
And so the individual
process is a critical piece of this, and that's why I went all the way to the
PSP. And so, as I say, I never could find anybody who would use it and so I was
really very frustrated. I think I may have mentioned I was at a conference in
Booch: And at the time you were there, that's when the wall had just gone down, so the dynamic was amazing.
Humphrey: Oh yes, the wall had
come down. I walked over there and around it and was looking at the whole area.
Booch: And you saw the piles of rubble
in
Defective Software
Humphrey: Oh exactly, exactly.
It was a thrilling experience. But in any event, I was looking at that, and it's amazing, watching how symphonies work, what kind of feeling you get from the dynamics of doing beautiful work. And that's the software issue. We really need
symphonic teams. The hacker business is so sad because, just like in a
symphony, any individual instrument can destroy the whole effect, and that's
exactly true of software. Any individual piece of code can destroy the whole
thing. And that's the problem. I'm trying to remember the name of the medical instrument -- remember, the one that killed a bunch of people, the Therac-25 machine. And that was a trivial error in an error recovery program. I mean, it wasn't something that normally gets used. It was off to the side somewhere. I remember an interesting sideline -- earlier at IBM, on OS/360, we had performance
problems.
Booch: The Therac-25,
that was the machine.
Humphrey: That's what it was,
the Therac-25. And that was an error recovery program that had a defect in it -- it missed getting a reset -- and I believe it killed half a dozen people. And so we get those problems. And it was exactly the same thing with the
V-22 Osprey. Do you know the story about the V-22?
Booch: I do but the people listening in
might not so why don't you relay.
Humphrey: Well, on the V-22
Osprey, I actually went out and talked to the executives of the company that built that system, and I was talking about the quality of their software. They bridled; they said, "We have very high-quality software." I said the fact that it had killed 13 Marines was a good measure of quality. They didn't buy that. But in any event, the V-22 Osprey is
this tilt-wing aircraft that you can fly as a regular airplane and then as
you're coming in to land you tilt the wings up and so the propellers are
pointed up instead of forward and it lands as a helicopter. I mean it's an
enormous technical achievement to build that thing.
And, of course, one
of the issues that they had was what happens when the hydraulics fail while
you're tilting the wing? You've got a whole hydraulic system that does that,
and so they put in a whole electronic backup system in case you have a hydraulic failure -- a system that will fly the airplane
electronically. It turned out in this particular case, they were in a test
flight with a bunch of marines in the aircraft, they were coming in to land,
and as they were tilting the wings to bring it in, the hydraulics failed.
And so the backup system
took over, the software that controlled the electronics system took over, and
the software had an error in it that essentially crossed the controls. And of
course a pilot can figure that out if he's got a little time, but when you're
coming in to land that's a little hard, and they
crashed and killed everybody. And the point is -- I run into this in all kinds of things -- the number of possibilities that have to be tested. This is what the executive was telling me: "Oh, we tested it exhaustively." And I agreed they tested exhaustively, but exhaustive testing won't find all the defects. People don't know that. They don't understand that. And let me branch off onto a little piece here. When you think about a big program, a big complex system program -- 2 million lines of code, something like that --
and you run exhaustive tests, what percentage of all the possibilities do you
think you've tested? Any idea?
Booch: Oh it's going to be an
embarrassingly small number, probably less than 20 or 30%, would be my guess. What would you say?
Humphrey: You're way off. Way
off. I typically ask people and I get back numbers like 50%, 30%, that kind of thing.
I asked the people at Microsoft, the Windows people, what they thought. And
then we chatted about it a bit and they said about 1%.
Booch: Oh my goodness.
Humphrey: And my reaction is
they're high by several orders of magnitude. And let me explain the reason why: it's the conditions that actually affect testing. I mean, testing will only test a specific set of conditions, and the conditions that affect a test include, for instance, how many job streams are running, what the configuration of the system is at that time, what the operator's doing, what the hardware conditions are -- all kinds of things. So you can have a hardware failure, you could have data errors, you can have operator errors -- you can have just an enormous range of things. And you make a list of all the variations, and then, by the way, you need different data values too. So if you go through and actually see what are the
conditions -- I did this on a program with 57 lines of code that I'd
written for the PSP. I went through and analyzed exactly how many test cases
I'd have to run to exhaustively test it. I didn't worry about different data values yet; I assumed I would classify those and could come up with that, and I never did go back and do it. But it was something like 268 test cases for 57 lines of code. I mean, it's extraordinary. And that's true. So people can talk about automated testing and that sort of thing, but the number of possibilities is so extraordinary that you literally couldn't do a comprehensive test today in the lifetime of the universe.
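To make the counting concrete: the conditions and numbers below are hypothetical, not Humphrey's 57-line analysis, just a minimal sketch of how a handful of independent test conditions multiply into hundreds of combinations.

```python
from itertools import product

# Hypothetical, illustrative test dimensions -- not from any real system.
# Each dimension varies independently, so the number of distinct test
# situations is the product of the option counts.
conditions = {
    "job_streams_running": [1, 2, 8],                        # system load
    "configuration": ["minimal", "typical", "maximal"],       # system setup
    "operator_action": ["normal", "mistyped", "cancel"],      # operator errors
    "hardware_state": ["ok", "disk_error", "memory_fault"],   # hardware failures
    "input_data": ["valid", "boundary", "malformed"],         # data errors
}

total = 1
for options in conditions.values():
    total *= len(options)
print(f"distinct combinations: {total}")  # 3 * 3 * 3 * 3 * 3 = 243

# Enumerating them is feasible at this toy size, hopeless at scale.
all_cases = list(product(*conditions.values()))
assert len(all_cases) == total
```

Each additional three-way condition triples the count again, which is why even a 57-line routine can demand hundreds of cases and a multimillion-line system can never be tested exhaustively.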
Booch: So in effect there's a
combinatorial explosion due to the number of possible states.
Humphrey: Exactly. And you
look at the number of possible ways things can connect -- I mean, it's extraordinary. And so when people have this enormous faith in testing, it's
vastly misplaced. And so the quality problem is really severe. And that's the
issue that I was getting at. My sense was that if you didn't deal with quality exhaustively at the very beginning, at the smallest unit level of the program, you would never solve the problem anywhere else. And that you can do. And so what I was after finding out with the PSP, prior to my quality study, was: could I produce defect-free programming? And my contention was that a program was defect-free if I had a design, went through it, produced a comprehensive set of tests, and then wrote the program, compiled it without error, and ran all the tests without error -- then I figured I probably had a pretty good program. Now, that was without error the first time I ran the tests. So I'm now treating testing not as a way to find defects but as a way to verify the quality of the work I've done.
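A minimal sketch of the pass/fail criterion Humphrey describes here -- the record fields and function names are illustrative assumptions, not the actual PSP scripts or forms:

```python
from dataclasses import dataclass

@dataclass
class ModuleRecord:
    """Illustrative record for one small program, not an official PSP form."""
    had_design: bool               # a design existed before coding
    design_reviewed: bool          # the design was personally gone through
    tests_prepared: bool           # a comprehensive test set was written up front
    compile_errors: int            # errors found on compile
    first_run_test_failures: int   # failures on the very first test run

def passes_quality_gate(m: ModuleRecord) -> bool:
    """Testing as verification: the program 'passes' only if it reaches the
    first test run already clean, rather than using tests to hunt for defects."""
    return (m.had_design and m.design_reviewed and m.tests_prepared
            and m.compile_errors == 0
            and m.first_run_test_failures == 0)

# Example: a module that compiled clean and passed every test the first time.
print(passes_quality_gate(ModuleRecord(True, True, True, 0, 0)))  # True
```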