An Interview with Watts Humphrey, Part 25: SEI Strategy and the Trouble with Trivial Errors
This interview was provided courtesy of the Computer History Museum.
Nico Habermann and the SEI Strategy
Humphrey: Fairly early on, you
remember now, there were a bunch of techies there at the SEI. I mean good
people, but they just weren’t on my wavelength. And Nico
Habermann had been the computer science chair or dean,
and I think was the guy who had actually conceived the whole idea for CMU
proposing that the SEI be there. He was a marvelous guy, a Dutchman. Highly revered. Unfortunately, he died prematurely. He had a
heart attack running on the beach, I think, in 1994. I remember because we were
in
In any event, Larry Druffel called a strategy meeting shortly after he became
director. It was a little bit after, because we had gotten far enough along to
put together the ideas for a maturity model. I mentioned that I had sort of
dreamed up these five levels and worked off that
“quality is free” idea and that sort of thing. We put together the five levels
with these questions with the people from MITRE. So we arrived at this strategy
meeting with all of the techie people, and Nico was
invited. During the meeting, we broke into about three subgroups -- working
groups -- to put together what we thought the SEI strategy ought to be and we
were going to have these three presentations at the end. Well, by great good
fortune I happened to be on Nico’s group. And we were
going through a discussion, and they came to me and I got up and I described
this five level model and how you use it for improving software development
work and that sort of thing. And the guys there were panning it, right? Well,
what’s that for? Why do I want to do that? And Nico
interrupted and said, “This is why we formed the SEI.” He said that’s it,
“That’s marvelous.” And I’ll tell you without his backing I don't think I would
have been able to get it done at the SEI. I mean, he could see it. It was
exactly what he was after. So I thought that this man had vision and he really
did, he had marvelous vision, a wonderful guy and I was really sad that we lost
him so early. But in any event, so he played a very powerful hand in forming
the SEI and in putting us on the path that we’ve been on ever since. Because
without that we would never have had a process program that went anywhere at
all. Okay.
Booch: Very well.
Humphrey: Well, now, let’s
shift gears, because I think I’ve pretty much caught all the way up to where I
was made a fellow and Barbara had decided that she wanted to live in
And so Barbara rather
wisely suggested, why don’t you put together a monthly report? And so I did. So
I started writing a Fellow report, and I wrote one every month, and I've done
it ever since. And I've got them. That's part of what I'll send to the museum,
my monthly reports. So I wrote about the meetings I had and the trips I made
and what I'd done. And I was really quite surprised when I started looking at the
end of the month that I had a whole bunch of items I'd actually accomplished. I'd
been off and done this and I'd been asked to come give the keynote speech at
some place and I had some conference calls with people. And so I was actually
accomplishing a lot more than I realized. And so all of a sudden I got a
great feeling of self-worth out of it. It's very helpful to do that. And
that's one of the reasons I kept it up. I was sending copies to a distribution
list at SEI, but since my cancer, I basically stopped. My activities are so
limited and I'm writing it but not widely distributing it anymore.
Booch: Well, let's consider this
interview here one of your extended reports.
Measuring Myself
Humphrey: Right, okay. So I
was writing the PSP programs and a whole bunch of things came up as I was doing
it. One of the questions was “exactly what do you want to measure?” And I decided
I wanted to get really basic measures. People get all confused about what
measures are, and I wanted things that were auditable. And think about it this
way: think about a measurement system that's scalable. Can you think of one
that really is scalable from the smallest to the biggest business or operation?
Booch: Well, source
lines of code comes to mind.
Humphrey: Well, yes, it is an
auditable measurement system, and source lines of code is
a measure of size. But a widely-used measure that is scalable from the smallest
to the biggest business is cost accounting. I took cost accounting in my MBA
program, which was very helpful. Everybody uses it, it scales up to the largest
corporation, and the reason is they're working from
auditable real data. And you can count on the data. You can actually build it
and define it. It's firm, it's clear. They say that
the data don't lie. I mean liars can figure but the figures don't lie. You know
what I mean? And so that's what I tried to put together: an auditable
foundation of data. And what I've discovered was that to do that I had to have
really well defined data.
And then I realized
that the pretty basic data items that I concluded I needed were size, time and
defects. And that's the size of everything you build, the time you spend doing
every action, whatever you're doing, every task, and the defects you find at
every step. Now to do that, to define that, however, that means you've got to
define the steps of your job. So you must have a defined process, you must
connect the tasks that you're doing to your process, and then you connect the
data and the product so the tasks have to connect to the products and to the
process.
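The interlocking system Humphrey describes -- size, time, and defects, each tied to the defined steps of a process -- might be sketched as follows. The class names, fields, and step list are illustrative assumptions for this sketch, not the actual PSP forms or standards:

```python
from dataclasses import dataclass, field

# Assumed, simplified process steps; a real defined process would name its own.
PROCESS_STEPS = ["design", "code", "compile", "test"]

@dataclass
class Defect:
    step_found: str      # which defined process step surfaced it
    step_injected: str   # where it was introduced
    fix_minutes: float   # time spent fixing it

@dataclass
class TaskRecord:
    step: str            # connects the task to a defined process step
    minutes: float       # time spent on this task
    size_loc: int = 0    # size of what was produced, in lines of code
    defects: list = field(default_factory=list)

def summarize(records):
    """Roll task-level data up into one report. Because every number is
    built from auditable task records, the identical report works for a
    100-line program or a multimillion-line system."""
    total_loc = sum(r.size_loc for r in records)
    total_min = sum(r.minutes for r in records)
    total_defects = sum(len(r.defects) for r in records)
    return {"size_loc": total_loc,
            "minutes": total_min,
            "defects": total_defects,
            "defects_per_kloc": 1000 * total_defects / total_loc if total_loc else 0.0}

records = [
    TaskRecord("code", minutes=90, size_loc=100,
               defects=[Defect("code", "design", fix_minutes=5)]),
    TaskRecord("test", minutes=45,
               defects=[Defect("test", "code", fix_minutes=20)]),
]
print(summarize(records))
# -> {'size_loc': 100, 'minutes': 135, 'defects': 2, 'defects_per_kloc': 20.0}
```

Because each summary number traces back to individual task records, you can "bore back down" from the report to the underlying data, which is the auditability Humphrey is after.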
And so it's a big
interconnected system. But if it's auditable and if you have real sound data
under it, what we have discovered is that I can put together a report on a
program I write of 100 lines of code, and I can look at it, and we can do that
identical report with that information in it and scale it up to a system
of several million lines of code. The identical numbers.
Now as a matter of fact, you can then look at it and you can say, well what's this mean? And you can bore back down because it's
now constructed from an auditable base. And you can go find all the data for
all the parts to justify it: the defects, the time spent, what happened, how
big this was, the changes, all the activity-- I mean it's amazing what you can
get out of this stuff.
Booch: So the first two things you
mentioned are fairly quantifiable but defects seem to need to be something that
there's a spectrum of -- different kinds of defects. So in your early
incarnation of this, how did you characterize what a defect was and did you
have a sense for different classes of defects and if that would weigh things?
Humphrey: Yes, yes. Well that
was, of course, the first question: “what is a defect?” Lots of debates with
that, and then you try to get an agreement with the lawyers as to what a defect
is, but don't bother because you'll never get there because they are all
looking for blame. But the issue I found was if you focus on it in terms of
actual actions, what you have to do, the programmers have no trouble
identifying defects. And I didn't either. It's something I have to fix, period.
It was very straightforward. Now I had to define them, and you have got to be very
careful defining defects, because people want to confuse the cause with the
actual nature of the defect.
And so a defect, you
know, may cause a buffer overflow, and there may be things that it does and the
results and the problems they cause, but it is in fact an error in a loop. Or
it is, you know, some particular error that you've got in the code. And they
could be trivial errors. I remember there was one defect. Some guys were
teaching the PSP at DEC, and one of the guys called me from there. They'd been
to PSP training, they never went very far with it, which was too bad because as
you know DEC got bought, and things got moved, and management changed, and it
was very hard to maintain this improvement in the face of that dynamic.
Booch: And their artifacts became
eventually the
Humphrey: But in any event,
DEC had this non-stop computing system where they could put two or more processors
together, and if one processor had a problem, a second one or a third one would
pick up. That was a marvelous system, and it was actually guaranteed to run
without failure. The problem was it was failing. And so the guys called me
about it. They told me it was really quite an amazing story because they had
struggled with this thing. It didn't happen very often but when it hit, all the
connected machines locked up, bam, and that was it. And they had to go and shut
the whole thing down and restart. And I'll tell you, that was a disaster. And
so this guy called me -- I think his name was Goldman. Matter
of fact, interesting guy. He worked with Howie
Dow, who's the guy that I had known at SEI that taught them the PSP. And he
called me, he said what had happened was that they were just about -- I'm going
to back up.
This problem finally
became so severe that this manager decided he had to take his two best
engineers and assign them full time to find and fix this problem. And so they
did. And they worked for two to three months before they finally figured out
what it was. And before they fixed it, the manager said, "Oh, hold on now,
because where that defect is, is code that they're
just about to inspect for a revision of the system. So let's just
participate in the inspection and see if they find it." So he did. And he
called me to tell me the results.
The guys had gone
through the inspection, took them several hours. Remember now, the two best
programmers spent three months on this, the inspection team spent a couple of
hours going through the code, and then they were going through the defects, and
one of them pointed out this defect. And so the manager asked them, he said,
"What kind of trouble do you think that would cause?" And the guys
thought it would be kind of an annoyance but they'd probably find it in test
and this sort of thing. And he asked how hard they thought it would be to find.
You know, it'd probably take a little while in testing. And then he told them
the story.
Well, it turned out
it was a trivial little error, and they could fix it in a few minutes, and they
found it real quick with an inspection when they really had a team focused on
it. And that's what we're telling engineers they need to do. They need to do
personal reviews; they need to do team inspections. Don't count on testing
because some of these trivial little problems have enormous consequences, and
that's what this was. It was a very simple little problem. It was basically the
trigger problem where one computer would actually end up needing feedback from
another one, and the other one would end up getting in a loop where it needed
feedback, and the two of them would both essentially wait for feedback from the
other one. So they’d basically hang. And it was a simple defect in terms of
making sure the hang wouldn't occur. And it was a trivial one but as you know,
those trivial errors can cause enormous trouble.
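The failure Humphrey describes is a circular wait: each machine holds something the other needs and waits for feedback that can never come. A minimal sketch of the pattern and one standard fix, consistent lock ordering, assuming nothing about DEC's actual implementation:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

# The failure in the story: machine 1 holds A and waits for B while
# machine 2 holds B and waits for A -- both hang forever. One standard
# fix is to make every party acquire shared resources in the same global
# order, which makes a circular wait impossible.

def worker(name, results):
    # Both workers take lock_a first, then lock_b: consistent ordering.
    with lock_a:
        with lock_b:
            results.append(name)

def run():
    results = []
    t1 = threading.Thread(target=worker, args=("machine-1", results))
    t2 = threading.Thread(target=worker, args=("machine-2", results))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

if __name__ == "__main__":
    print(run())  # both workers complete; no circular wait
```

The fix itself is a few lines, exactly the kind of trivial-looking change Humphrey means: cheap to make, and far cheaper to find in a review or inspection than after the deployed system hangs.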
So the issue here is
to separate out what the defect is. In this case it could be a bit that's set
wrong or it could be an overlooked loop closure or it could be an off-by-one
error. It could be, you know, you name it. The consequences are another thing
and the causes are another thing. So the defect classifications that we have
for the PSP simply relate to what the defect itself is: what you've got to fix.
And if people want to go and analyze causes and all that, the defect data are
very helpful and can be used. But when the engineers are just looking at what
it is and what's got to be changed, they have no trouble identifying fixes, and
we've got data now on something over 30,000 programs written for the PSP. And
for every PSP program the guys record every defect. And
they don't have any trouble. No one argues about it; you just say here's all you
do and everybody does it.
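The distinction Humphrey draws -- classify a defect by what it is and what must be fixed, keeping cause and consequence as separate analysis data -- might be captured in a record like this. The type names and fields are illustrative assumptions, not the PSP's actual classification standard:

```python
# Illustrative type list only -- an assumption, not the official PSP standard.
DEFECT_TYPES = ["assignment", "interface", "checking", "data",
                "function", "syntax", "documentation"]

def record_defect(defect_type, fix_description, cause=None, consequence=None):
    """The type says what the defect *is* (what must be fixed).
    Cause and consequence are optional, kept separate for later analysis."""
    if defect_type not in DEFECT_TYPES:
        raise ValueError(f"unknown defect type: {defect_type}")
    return {"type": defect_type,
            "fix": fix_description,
            "cause": cause,              # e.g. "requirements misread"
            "consequence": consequence}  # e.g. "system hang"

# The DEC story in these terms: the fix is a small checking error,
# even though the consequence was a multi-machine lockup.
d = record_defect("checking", "add missing loop-exit condition",
                  consequence="two processors hang waiting on each other")
print(d["type"])  # -> checking
```

Keeping the three fields apart is what lets engineers log defects without debate: nobody has to agree on blame or impact to record what had to be fixed.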
And so while people
may say that you can't actually count the defects, we have no trouble. And I
have thousands of engineers’ defect data. Now we're doing it with no trouble,
no one debates it, no one argues about it, so it's a lot of smoke. And the
reason I say it's smoke is because I’m talking to
people who haven't actually tried to do it themselves. If you sat down and
tracked all the defects that you find in your programs -- when you review
them, when you compile them, when you test them, when you build them, whatever
you do -- and for every one looked at it and figured out what it took to fix
it, you'd find it's no big deal. So, does that answer your question?
Booch: Absolutely. It does very much,
thank you.
Humphrey: Okay. And that's why
I wanted to sit down and write the PSP programs because I was getting all this
smoke about stuff you couldn't do and why, and this and that, and so I decided
to just try to do it. Here's what I thought we ought to do, so I was trying to
act like a CMM level 5 organization of one person. I was trying to do
everything I thought was needed and put the whole thing together and I did. And
it was extraordinary. So that's what the PSP was all about.
Booch: And while you were doing this,
who were you primarily collaborating with? Or was this springing largely from
just your mind and work?
Humphrey: It was me. I just
did it myself. I was working on statistics because I
discovered in my analysis work that I needed simple statistics, and so I
worked through a statistics book, a graduate text on statistics. I remember it
was kind of funny. Barbara and I went on a trip to
And I took along my
statistics textbook and
spent a fair amount of the trip just going through that. Although we did see Cappadocia and lots of
other wonderful places. But so I remember going through that and
learning statistics. I had some people in the statistics department at CMU that
were very helpful -- John Lahosky and others. I had
the people at the SEI help me with programming problems. I hadn't programmed in
years. I didn't know the modern systems and so I'd call folks -- Jim Over and
others -- and say, "Well, how do you do this?" And so I had all kinds
of people help.