Power On
The first set of boards came back from assembly and performed admirably on the bench top. The hardware team did their job measuring thermal characteristics, calculating power draw, and capturing bus transaction traces on the logic analyzer. Programmers sifted through register dumps and debugged their code. Knowing there was a possible exposure to crosstalk problems, Anne probed around the board while the other team members were out for a caffeine break. She found low levels of quiet line noise—for example, 200 mV on a 2.5 V signal. No cause for alarm. After two weeks of intense work, the development team declared the board ready for pre-production testing, and everyone left work early to celebrate.
The first sign of trouble arose in the final days of pre-production testing. A set of boards was in the thermal chamber undergoing stress testing at temperature and voltage corners when one of them crashed during a boot cycle, leaving little useful information in the registers. They pulled the board out of the chamber and attempted to reproduce the failure on the bench top. No luck. The experienced engineers on the team collectively cringed, remembering painful 14-hour days on another project with a similar irreproducible error. Back to the chamber and still no luck. How worried should they be about this phantom error? It had occurred only once, while stressing voltage beyond design limits in an attempt to emulate the spectrum of silicon processing variations they expected to see in manufacturing. One could make a strong argument that silicon and voltage stresses produce different failure mechanisms. Are the test results truly representative of actual operating conditions?
Then the phantom reappeared, this time on the bench top while someone happened to be watching the console. The processor had just finished loading boot code and was passing control to the bridge chip to initialize the PCI Express interface. This serial number failed more than once, although not on every boot. The next morning during the daily meeting, the team discussed their options. Hearing the words "PCI Express initialization" reminded Anne of the design review and the I2C clock. She convinced the team to set up two parallel tests probing the I2C clock and its PCI Express neighbor using a glitch detect trigger. They repeatedly booted each board watching for the telltale crosstalk signature.
Their persistence and patience finally paid off when the gremlin surfaced again several days later during the middle of the graveyard shift around 3:00 am. Since I2C and PCI Express are asynchronous to one another, the board only crashed when timing conditions were exactly right for the PCI Express crosstalk to superimpose on the I2C clock while it was passing through the threshold region. The sharp, low-amplitude crosstalk caused a brief slope reversal in the slower I2C clock edge—just enough to clock the input latch twice and upset the state machine.
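The mechanism can be illustrated with a toy model. The sketch below is not the team's analysis; it treats the I2C clock edge as a linear ramp and the crosstalk as a narrow triangular dip that arrives just after the ramp crosses the input threshold. The 200 mV amplitude on a 2.5 V signal comes from the bench measurement; the mid-rail threshold, the rise times, the 1 ns pulse width, the pulse timing, and the absence of input hysteresis are all assumptions chosen for illustration, and the function name is hypothetical.

```python
def rising_crossings(rise_time_ns, glitch_v=0.2, glitch_half_width_ns=0.5,
                     glitch_offset_ns=0.6, vdd=2.5, threshold_v=1.25,
                     dt_ns=0.01):
    """Count upward threshold crossings of a ramp with a crosstalk dip.

    All default values except the 0.2 V / 2.5 V levels are assumed.
    """
    # Time at which the clean ramp reaches the threshold.
    t_threshold = rise_time_ns * threshold_v / vdd
    # Assumed timing: the dip is centered shortly after that crossing.
    glitch_center = t_threshold + glitch_offset_ns

    steps = int(round((rise_time_ns + 5.0) / dt_ns))
    crossings = 0
    previous = 0.0
    for i in range(steps + 1):
        t = i * dt_ns
        # Clean clock edge: linear ramp that settles at vdd.
        ramp = vdd * min(t / rise_time_ns, 1.0)
        # Triangular pulse shape, 1.0 at its center and 0.0 outside.
        shape = max(0.0, 1.0 - abs(t - glitch_center) / glitch_half_width_ns)
        v = ramp - glitch_v * shape
        # An input with no hysteresis sees a clock edge at each upward crossing.
        if previous < threshold_v <= v:
            crossings += 1
        previous = v
    return crossings


if __name__ == "__main__":
    print("slow edge, 100 ns:", rising_crossings(100.0))
    print("fast edge,   2 ns:", rising_crossings(2.0))
```

With these assumed values, the 100 ns edge registers two rising crossings because the dip falls faster than the ramp rises, while a 2 ns edge (included only for comparison) stays monotonic and registers one. Shifting `glitch_offset_ns` so the dip lands well away from the threshold also brings the slow edge back to a single crossing, which is consistent with the fault appearing only when the timing lined up.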
For a while, the engineers on the development team were elated at having found clear evidence pointing to the origin of the fault. Then reality set in: There was little to be done about it. It is not possible to slow down the edge rate on a PCI Express net. Someone would need to notify management, and that unpleasant job fell to Anne as the signal integrity engineer. The managers did not receive the news well but were at least decent enough to realize that they had set the stage by proceeding with an extremely aggressive project without gathering input from those responsible for making it happen. Management asked development engineering for a recovery plan.
The following day, the team isolated themselves in a conference room with a whiteboard, pizza, caffeinated beverages, and mandatory cell phone silence. Adding layers to the board and rerouting did not appeal to anyone. Just when they had run out of ideas, Bob the veteran popped his head in the door; word of their dilemma had reached him, and he was curious about how things were proceeding. Not very well, they explained, and invited him to listen in for a few minutes. Bob had earned a great deal of respect among his colleagues by observing intently, listening thoughtfully, and not opening his mouth until he had formed a well-founded opinion. When he finally spoke, people's ears perked up.
"What kind of package does your I2C buffer use?"
"TSSOP."
"Good. I was hoping you wouldn't say BGA. Why don't you lift the lead of the clock output pin and insert a snappy little high-bandwidth FET? This will sharpen up the I2C clock so the crosstalk will simply roll off the edge a bit rather than cause a slope reversal."
It may look good on the balance sheet to hand out pink slips to experienced engineers like Bob from time to time, but when they walk out the door, their experience goes with them. After implementing Bob's fix, the gremlin disappeared for good. Project Coyote was out of the ditch—for the time being.