- Introduction
- Impact of Scaling on Reliability
- Defects, Faults, Errors, and Reliability
- Reliability and Quality Testing and Measurement
- Reliability Characterization
- Reliability Prediction Procedures
- Reliability Simulation Tools
- Mechanisms for Permanent Device Failure
- Safeguarding Against Failures
- Concluding Remarks
1.3 Defects, Faults, Errors, and Reliability
The reliability of a system as a function of time t is the probability that the system functions correctly until time t, where t is measured from zero at the time of fabrication.
Physical defects can occur at various stages of chip fabrication, including silicon crystal formation, oxidation, diffusion, optical lithography, metallization, and packaging. Today's deep-submicron process technologies behave statistically and produce a mix of good, bad, and weak devices in proportions that depend on process maturity. Bad devices are rejected immediately during either normal or stress testing at manufacture. Weak devices may pass normal manufacturing tests (for functional faults), and some of them may barely pass stress tests (such as burn-in). Such devices may nevertheless be near the threshold of failure and may fail in the field because the stresses of constant operation lead to large leakage currents.
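The definition of reliability and the mix of good and weak devices described above can be illustrated with a small numerical sketch. It assumes that each shipped subpopulation has an exponentially distributed lifetime; that model, the function name `reliability`, and every number in `SHIPPED` are illustrative assumptions and do not come from the text.

```python
import math


def reliability(t, populations):
    """Probability that a shipped device still functions correctly at time t.

    populations: list of (fraction, failure_rate) pairs whose fractions sum to 1.
    Each subpopulation is ASSUMED to have an exponential lifetime; this is an
    illustrative choice, not a model taken from the text.
    """
    if t < 0:
        raise ValueError("t is measured from fabrication and cannot be negative")
    total = sum(fraction for fraction, _ in populations)
    if not math.isclose(total, 1.0):
        raise ValueError("population fractions must sum to 1")
    # Mixture of survival functions: sum_i fraction_i * exp(-rate_i * t)
    return sum(fraction * math.exp(-rate * t) for fraction, rate in populations)


# Hypothetical shipped mix (bad devices were already rejected during test):
# 98% good devices and 2% weak devices, with made-up failure rates per hour.
SHIPPED = [(0.98, 1e-7), (0.02, 1e-4)]

for hours in (0, 1_000, 10_000, 100_000):
    print(hours, round(reliability(hours, SHIPPED), 4))
```

Under these assumptions the weak subpopulation dominates the early drop in reliability, which is the qualitative effect that screening such as burn-in is meant to reduce.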
Only a subset of manufacturing or field defects occur in the active circuit area and affect the operation of the memory. When a defect in the active circuit area of a memory device causes incorrect data to be retrieved from the memory, a fault is said to have occurred. When a fault causes a system malfunction, an error is said to have occurred. Not all faults cause errors, and not all defects cause observable faults. For example, a system equipped with redundancy and online error correction may tolerate some faults and continue to operate error-free. A physical defect that results in, say, a large leakage current may not produce an observable fault, but it may degrade performance and lead to reliability problems later in the operational life of the system. Such faults are called parametric faults, and at some future time they may cause logical errors. These fault types are discussed in detail in our earlier book [281].
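The distinction between defects, faults, and errors drawn above can be summarized as a small decision procedure. The following is a minimal sketch; the class names, field names, and the four outcome labels are hypothetical and chosen only to mirror the wording of this section.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    NO_OBSERVABLE_FAULT = "defect does not affect the operation of the memory"
    PARAMETRIC_FAULT = "degraded parameter (e.g. large leakage), data still correct"
    TOLERATED_FAULT = "incorrect data retrieved, but masked by redundancy or correction"
    ERROR = "incorrect data retrieved and the system malfunctions"


@dataclass
class Defect:
    in_active_area: bool              # is the defect inside the active circuit area?
    corrupts_data: bool               # does it cause incorrect data to be retrieved?
    degrades_parameters: bool = False  # e.g. a large leakage current


def classify(defect, system_can_correct):
    """Map a defect to its observable consequence, following the text's definitions."""
    if not defect.in_active_area:
        return Outcome.NO_OBSERVABLE_FAULT
    if defect.corrupts_data:
        # A fault has occurred; it becomes an error only if the system malfunctions.
        return Outcome.TOLERATED_FAULT if system_can_correct else Outcome.ERROR
    if defect.degrades_parameters:
        return Outcome.PARAMETRIC_FAULT
    return Outcome.NO_OBSERVABLE_FAULT


leaky_cell = Defect(in_active_area=True, corrupts_data=False, degrades_parameters=True)
stuck_cell = Defect(in_active_area=True, corrupts_data=True)

print(classify(leaky_cell, system_can_correct=False).name)
print(classify(stuck_cell, system_can_correct=True).name)
print(classify(stuck_cell, system_can_correct=False).name)
```

The sketch treats classification as a snapshot in time: a parametric fault reported here may still turn into a logical error later in the operational life of the system.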
Apart from hard faults, there may be transient failures, such as those due to alpha-particle hits, cosmic rays, and electrostatic discharges, and intermittent failures, such as those caused by resistance and capacitance variations, pattern-sensitive faults, and coupling faults. These faults affect the system for a very short time and are best addressed with error-correcting codes or system maintenance strategies such as error scrubbing, discussed in Chapter 4, or with mitigation techniques, described in Chapter 3.
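To make the idea of error correction combined with scrubbing concrete, the sketch below uses the textbook Hamming(7,4) single-error-correcting code. The code was chosen only because it is small; it is not necessarily the scheme covered in Chapter 4, and the function names and the toy `memory` list are assumptions made for illustration.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def correct(word):
    """Return (corrected_word, position); position 0 means no error was detected."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome gives the 1-based position of the flipped bit
    if position:
        w[position - 1] ^= 1
    return w, position


def extract(word):
    """Recover the 4 data bits from a (corrected) codeword."""
    return [word[2], word[4], word[5], word[6]]


def scrub(memory):
    """Read every word, correct single-bit upsets in place, and count the repairs."""
    repairs = 0
    for address, word in enumerate(memory):
        fixed, position = correct(word)
        if position:
            memory[address] = fixed
            repairs += 1
    return repairs


# Toy memory of two words; flip one bit to mimic a transient upset.
memory = [encode([1, 0, 1, 1]), encode([0, 1, 1, 0])]
memory[0][5] ^= 1
print("repairs:", scrub(memory))
print("word 0 data:", extract(memory[0]))
```

Scrubbing periodically matters because this code corrects only one flipped bit per word: repairing upsets soon after they occur keeps a second upset from accumulating in the same word, which the code would then miscorrect.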