An Interview with Watts Humphrey, Part 33: The Boeing B2 System and the "Last Liar Problem"
This interview was provided courtesy of the Computer History Museum.
The Boeing B2 System
Humphrey:
Let me move now to large systems. I mentioned one of the problems we've had is
that software has been hard to manage since the very beginning, and I ran into
that very early on. I'd go out and talk to software teams. They didn't know
where they were, they couldn't tell me where they were. I couldn't tell by
looking at them what was going on. It was just sort of this fog of “What's
happening?” and you sort of hope something will come out some day. And that's
basically what we were going through. And I was able at IBM to deal with that
when we put in place some structured steps and a process, so I had some idea
where they were. But it wasn't at the level we had to reach with the PSP and
TSP, to really understand status.
The problem you have
on large programs, and it’s the point that Fred Brooks made -- it was a
wonderful point, it’s in his “Mythical
Man-Month” book -- and if anybody hasn't read it, they need to read it.
That's a marvelous book. But he says
schedules slip a day at a time, and that's absolutely true. And so the question
is, when the schedule is slipping, how long does it take to know? If you're
really slipping the schedule, when can you take action to recover? Today, the
software community basically knows we're in real trouble when they're not into
test when they should be. That's usually a year or more after they start. They're
way down the road. It's unrecoverable.
The issue is, how can you detect a one-day slip that same day? What's
interesting is that, with the TSP, we can do it. So that's what we do. That's
why they meet schedules, that's why they can come in
on time. We had one team, this was a Boeing team years
ago. There were disasters with them. Half of these projects have disasters,
because of management changes or something else -- all kinds of stuff that have
nothing to do with what we're doing with the process. So we run into this and
it's one of my frustrations, the dynamics of the management system. Maintaining
continuity in an organization in any kind of improvement effort is not totally
impossible, but almost, unless you really are dealing with the top of the
organization. And we were not doing so at Boeing.
This story started in
about September one year. Boeing had bid to do a major update for the weapons
delivery system for the B2 bomber. They were under a subcontract from Northrop,
and the Air Force had told them that they wanted a lot of features but that
only a few were truly critical. Those critical features had to be completed by
December in the following year for a B2 flight test. So Boeing looked over the
feature list and gave the Air Force a bid to do them all.
I heard that people
had decided to use the PSP on the project, so I called them and told them they
should use the TSP because the PSP by itself wasn’t likely to be successful.
They agreed and asked me to be the team coach for the first launch. They also
trained two of their people to be local coaches.
Well, when I got
there and we did the launch, we did not realize that having people watch the
launch could be a real problem, so I agreed that they could have people sit in.
The team had 18 engineers and we got through the first day-and-a-half of the
launch and had made the first preliminary estimates for the development effort.
During the launch, some of the observers went to see the program manager and
told him that the development estimates were much bigger than their estimates,
so the schedule would not meet their December deadline. The program manager
then killed the launch during the lunch break. Unfortunately, we did not have
enough data then to prove that the test times would be much less than the
original plan so they could still meet the schedule, so the
team did not officially follow the TSP.
What then happened
surprised me, because the Boeing managers didn’t know about the importance of
the team having a plan to follow, so they just went along following the plan
that had been developed during the proposal. The team,
however, had already produced enough of a plan to guide their work, and since
they had no other detailed plan for the work, they just followed the partial plan
that came out of the partial launch we had done. They also had two coaches who
helped them use the PSP.
In any event, this
team got going, and they could see -- they had a schedule -- they had to be
done in 15 months, and it was a hard stop. They had to be done. After three
weeks with the TSP, they could see they were in trouble. They were able to lay
it out -- they were going to be about a month, month-and-a-half late. And so
the team actually sat down. Remember I talked about task time earlier? The team sat
down and figured out, “What can we do?” They concluded they had to go on
overtime for a couple of weeks. They decided not to do it right away, because
it's going into Thanksgiving and Christmas, but right after Christmas, they
went on overtime for several weeks till they got back on schedule.
And they each
established goals for personal task time and they tracked them. And they got
their hours up and they met the schedule. They actually came in a little bit
early. If they hadn't done that, they would have had no way of knowing, and
that's the case on all these projects. If you're going to maintain schedules on
these projects, you've got to manage every day. That means you've got to know
where you are every day, and that takes precise schedules, data, and tracking,
all kinds of stuff. And that can only be done by the developers themselves. That's
why knowledge workers have to manage themselves.
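To make the idea of knowing where you are every day concrete, here is a minimal sketch of an earned-value style status check in Python. It is an illustration only, not the TSP's actual tooling: the `Task` fields, the function names, and the five-task plan are hypothetical, and the finish projection is a plain linear extrapolation.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    planned_hours: float  # estimated task hours (hypothetical data)
    planned_week: int     # week in which the plan says the task completes
    done: bool = False    # credit is earned only when the task is finished


def planned_value(tasks, week):
    """Percent of planned work that should be complete by the end of `week`."""
    total = sum(t.planned_hours for t in tasks)
    due = sum(t.planned_hours for t in tasks if t.planned_week <= week)
    return 100.0 * due / total


def earned_value(tasks):
    """Percent of planned work actually completed so far."""
    total = sum(t.planned_hours for t in tasks)
    earned = sum(t.planned_hours for t in tasks if t.done)
    return 100.0 * earned / total


def projected_weeks(tasks, week):
    """Linear extrapolation: weeks needed at the rate earned so far."""
    ev = earned_value(tasks)
    if ev == 0:
        return None  # nothing finished yet, so no rate to extrapolate from
    return week * 100.0 / ev


# Hypothetical five-week plan, checked at the end of week 3.
plan = [
    Task("design A", 20, 1, done=True),
    Task("design B", 20, 2, done=True),
    Task("code A", 20, 3),
    Task("code B", 20, 4),
    Task("test", 20, 5),
]

print(planned_value(plan, 3))    # 60.0
print(earned_value(plan))        # 40.0
print(projected_weeks(plan, 3))  # 7.5
```

With this made-up plan, the week-3 check shows 60% planned against 40% earned, and the extrapolation puts completion at 7.5 weeks instead of 5. A gap like that is visible as soon as a task misses its planned completion, long before the project is due in test.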
The manager's job is
to give them support, coach them and protect them and all that sort of stuff,
but not to actually manage what they do. That's a key point, so think of it
this way, on a big program. Take a program that's got thousands of people or
hundreds. As I say, I had the OS 360 with thousands of people, so I have had a
lot of experience with great big programs. I'll tell you, I would have given my
eye teeth for a process like the TSP. It just would have been a gift, because
we really would have had the ability to manage what we were doing.
Booch:
Watts, measure for me what a really, really large program is, versus a small
one. Bjarne Stroustrup once pointed out to me that if I can't say what a thing is not, I haven't really identified what the thing is. So
I'd like to understand your metrics for when you reach that threshold, that it
becomes that size.
Humphrey: Before I do that,
let me finish the Boeing story, OK?
Booch: Ok, then we’ll come back to my
question.
Humphrey: Well, the Boeing
team ended up producing really high quality designs and code. The overall code
ended up being much bigger than planned – as I recall over twice as big – and
the effort was much more than expected because they were making a fundamental
change in the system. The requirement was to take a hard coded weapon-attachment
design and convert it to a table-driven design. That is a bit like replacing
all the wiring and plumbing in a house without affecting the wallpaper. It’s
real tough.
So the team kept
following the TSP, and they were about five months late finishing the design.
Everybody was worried. Because the design was such high quality, however,
coding was pretty fast and they got into test only about two months late. Then
the testing was a snap – they actually were ready for the December Air Force tests
a month ahead of schedule.
You would think that
this would have convinced management to keep using the TSP, but it didn’t. The
problem was that the contract called for a design review in December that first
year and the development team didn’t have the design review in their plan so
they missed the date and Boeing forfeited an award fee as a result. In any
event, because they missed this fee, management was severely criticized by
senior management and the program was always viewed as a failure, even though
it was delivered on time.
Several things
happened that we could have avoided had we known what we know now. Actually,
however, we learned a lot of these lessons at Boeing. The first was that, if
management had trusted the team and explained the critical need for a design
review in December, the team could have held one. They knew a lot about the
design in December but, with the TSP, they were now spending a lot more time in
design and a lot less in coding. So they could have given a high level design
review in December even though the detailed designs weren’t done.
Some of our early PSP
data shows what I mean. During PSP training, the developers write a
bunch of programs. In the first courses it was 10 programs, and the data on
program 1 reflects their old process, the way they had always worked without the
PSP. The data we have on several hundred programmers
shows that they spent less than 20% of their time in design at the beginning
and over 40% at the end of the course. Design was taking twice as long. While
programmers don’t now generally use compilers, they use code generators which
also try to generate code from whatever defective stuff the programmers submit,
which is essentially the same problem we were running into with compilers.
In any event, what is
surprising about this is that even though the total design and code time
percentages went up, the total time actually went down. That is because the time spent in compile and test dropped from about 45% to
only 25%. The compile time at the beginning was more than the design time and
at the end it was much less. While these were all important, the key was that
program quality was so much better that system test time dropped from a planned
six months to about one month, so even though they got into their internal
testing late they were able to deliver to the Air Force for flight testing
ahead of schedule. In a later presentation on this project, a Boeing executive
said that this cut testing time by 94%.
I also went out to
visit the team in about early March and looked at some of their early data.
Their plan showed that the program would have about 7,500 new and changed lines
of code, and they had estimates for how much this would be in each of the
system’s existing modules. When I got there, they had only completed coding for
a few dozen modules but, with the TSP, they had the data for the new and
changed code for each module. I then assumed that the ratio for these few
modules would be about the same for the rest of the program and estimated that
total program size would be about 18,000 lines of code. My estimate later
turned out to be pretty accurate. The TSP data is really valuable for this kind
of thing, and people don’t seem to see how useful it can be in measuring job
status or estimating job completion.
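The ratio projection described here can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name and the per-module figures are hypothetical, chosen only so that the result matches the totals mentioned in the interview (7,500 planned lines, roughly 18,000 projected). The interview does not give the actual module data.

```python
def project_total_size(planned_total, completed_modules):
    """Scale the planned total by the actual/planned ratio seen so far.

    completed_modules: (planned_loc, actual_loc) pairs for finished modules.
    Assumes the remaining modules will grow by the same ratio.
    """
    planned_done = sum(planned for planned, _ in completed_modules)
    actual_done = sum(actual for _, actual in completed_modules)
    if planned_done <= 0:
        raise ValueError("need at least one completed module with a planned size")
    return planned_total * actual_done / planned_done


# Hypothetical (planned, actual) new-and-changed LOC for three finished modules.
finished = [(100, 250), (200, 460), (150, 370)]

print(project_total_size(7500, finished))  # 18000.0
```

In this made-up sample, the finished modules came in at 1,080 lines against 450 planned, a ratio of 2.4, and applying that ratio to the 7,500-line plan gives 18,000. The projection is only as good as the assumption that the remaining modules grow at the same rate.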
We also learned that
we had to bar observers from all but the opening and closing meetings, and we
had to make sure that the managers better understood what this was all about.
This experience led to a one-day course for executives. Actually, executives
don’t go to courses so we call it a seminar. We also have a three-day manager’s
course.
OK, that’s the Boeing
story, now, back to your question about measurement, OK?
Booch: OK. I was asking about your
metrics for what is a really large program.
The Large-System Problem
Humphrey:
I'm talking about a program, typically it's three or four hundred or more
people, typically involves multiple organizations, not necessarily different
companies, but certainly different laboratories, different teams, different
groups. Typically, they're remote and they're building a fairly big product,
usually in multiple releases, not always. But that's what I'm talking about. As
I say, it's typically a fairly large product. I won't put it in terms of
software lines of code, because many of these are not software systems. They
have software in them, but they're other systems. You work on these big nuclear power plant things. We've had some
involvement in those. I think we have a team that's actually using the TSP to
design nuclear reactor power plants. But there aren't any software people on
the team at all. We have requirements people doing it too.
So you can have a big
team, which is just a large collection of groups that all have to interact and
they all have to synchronize their work. That's the key. And so the issue now
-- I ran into it at IBM, and it was one of the most serious problems we had,
and we had to keep the managers right on the ball -- because we ran into what
we call the “last liar problem.” Fundamentally, if everybody's in trouble on a
project, no one wants to admit it, because no one really knows what's going on.
They're all sort of in trouble, they can see intuitively, “We're late,” and
they wait for somebody to have problems that are so visible, they can no longer
be concealed. And then everybody else can relax, because the first ones to
admit to problems are the ones that did it. But everybody's in trouble.
And so the real issue
with these great big systems is that no one really knows precisely where they
are. No one really has a way of tracking a day at a time where they are. The
interactions and the connections among the groups are very hard to manage,
because people really can't predict exactly when they'll be done. And so all of these great big systems are enormous interconnected
things. You have this big network of commitments and that sort of thing.
In these great big systems, there are several things that are serious problems,
and one is, no one really knows where any of the pieces stand. That means that
everybody, when they're talking about their status, is defensive. They're sort
of guessing.
So the team leader
goes to the individual engineer/developer and says, “Where are you?” He says,
“I'm almost done, Chief.” Well, the team leader knows he isn't almost done and
he tries to poke at it. So they have kind of a guess at when you're going to get
into test, but no one knows. So the team leader’s kind of uneasy about it. He
doesn't have this feeling of confidence, and he gets that from all his team. And
when he talks to his management, he then gives kind of a fluffy answer and no
one is convinced he's right. He knows he's sort of soft and he hasn't met
schedules before, so why would they believe him this time? And as you begin to
build that up, a layer at a time, you get all the way up in the organization
and everybody is being defensive. No one is admitting exactly where they are. No
one is sitting down and saying, “Here are the facts. Let's fix it. Let's get in
and resolve the problems.”
And so these great
big systems, you get this kind of defensive structure, all the way up. From the
management of these big companies, it goes to the Department of Defense. And
from the Department of Defense it goes to Congress. So no one knows what
they're talking about at any level, and the reason is that they don't start
with a solid foundation at the very beginning. The engineers, the individual
developers, don't know exactly where they are and if they don't, no one above
them can, and the entire system is guessing.
My point is that on
these great big systems, they're so sophisticated and so interconnected, that
you're literally taking all kinds of risks when you do this. If you count on
good luck, you're not going to get it, and that's exactly the case with these
systems. That's why, by and large, these great big systems are enormously late
and way over budget. The way you can deal with it is to start all the way at
the bottom with real precise control, put self directed teams in place, begin
to use data to track and manage it. And we know exactly how to do that. That's
what's so frustrating to me, because I can't get anybody interested.
So that's the issue
we're struggling with. That's the ultra large system problem. How do we move to
that stuff, how do we handle this? I'm hoping we'll get there someday, but it's
going to be a challenge. And I don't know if next year I'll be around to
do it. But that's what we're struggling with.