- Introduction
- Impact of Scaling on Reliability
- Defects, Faults, Errors, and Reliability
- Reliability and Quality Testing and Measurement
- Reliability Characterization
- Reliability Prediction Procedures
- Reliability Simulation Tools
- Mechanisms for Permanent Device Failure
- Safeguarding Against Failures
- Concluding Remarks
1.9 Concluding Remarks
A variety of processing and circuit techniques are used to safeguard against the types of failure that have been described. Single-event upsets (SEUs), which cause soft errors, are handled by radiation (SEU) hardening techniques, including error detection and correction. Other examples of such hardening are (1) process-related structural modifications, which try to make the layout more tolerant of single-event upsets; (2) current-monitoring and current-sensing techniques with associated error correction; and (3) design changes in the storage cells and sense amplifiers, a proactive technique for minimizing vulnerability to such errors. All these approaches are described in Chapter 3. They are useful primarily in safeguarding against soft errors. To a limited extent, they can also correct a combination of hard and soft errors, especially if the total number of erroneous bits per memory word is very small. Correcting multiple faulty bits per word requires a large number of check bits (see Chapter 4) to encode each word, which incurs a very large area overhead and memory performance penalty.
One shortcoming of conventional error correction is that although the corrected word is sent to the processor reading the memory, the error is not corrected in the memory itself. Erroneous data therefore accumulate in memory locations over time and can exhaust the limits of error-correction coding. One technique for reducing such error accumulation is periodic memory system maintenance, using error scrubbing and other techniques that restore the correct data in the erroneous memory bits. These approaches, together with the coding schemes used for memory error correction, are described in Chapter 4.
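The difference between correcting on read and scrubbing can be illustrated with a small sketch. It assumes a textbook Hamming(7,4) single-error-correcting code and a toy memory model; the names `EccMemory`, `flip`, and `scrub` are hypothetical and are not taken from any scheme described in this book.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (int 0..15) as a 7-bit codeword (list, positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    code = [0] * 8                              # index 0 unused
    code[3], code[5], code[6], code[7] = d      # data bits
    code[1] = code[3] ^ code[5] ^ code[7]       # check bit covering 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]       # check bit covering 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]       # check bit covering 4, 5, 6, 7
    return code[1:]


def hamming74_correct(word):
    """Return (corrected copy, syndrome); syndrome is the flipped position or 0."""
    code = [0] + list(word)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of the positions holding a 1
    if syndrome:
        code[syndrome] ^= 1                     # assumes at most one bad bit
    return code[1:], syndrome


def hamming74_data(word):
    """Extract the 4 data bits from a 7-bit codeword."""
    code = [0] + list(word)
    bits = [code[3], code[5], code[6], code[7]]
    return sum(b << i for i, b in enumerate(bits))


class EccMemory:
    """Toy memory: each location stores one Hamming(7,4) codeword."""

    def __init__(self, size):
        self.cells = [hamming74_encode(0) for _ in range(size)]

    def write(self, addr, nibble):
        self.cells[addr] = hamming74_encode(nibble)

    def read(self, addr):
        # Conventional correction: the reader gets corrected data,
        # but the stored word is left untouched.
        corrected, _ = hamming74_correct(self.cells[addr])
        return hamming74_data(corrected)

    def flip(self, addr, pos):
        """Inject an upset at codeword position pos (1..7)."""
        self.cells[addr][pos - 1] ^= 1

    def scrub(self):
        """Write corrected words back; return how many locations were fixed."""
        fixed = 0
        for addr, word in enumerate(self.cells):
            corrected, syndrome = hamming74_correct(word)
            if syndrome:
                self.cells[addr] = corrected
                fixed += 1
        return fixed
```

In this model, a single upset is masked on every read but stays in the array. A second upset in the same word before any scrub defeats the code, whereas calling `scrub()` between the two upsets keeps each word within single-error range.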
Error-detection and error-correction techniques are not very useful if the number of hard errors per word becomes very large. This can happen either because of certain kinds of faults present at the time of manufacture (for example, row decoder faults that affect all the bits in a word) or because of error accumulation over time during field use. In such cases, faulty memory locations need to be replaced with fault-free redundant locations to guarantee reliable behavior: the faulty rows and/or columns in the memory are switched out, and fault-free redundant rows and/or columns are switched in. This can be achieved in two ways. One approach, called hard repair, comprises blowing on-chip fuse/antifuse devices in the decoders at the time of the manufacturing test, using laser or electrical means, and connecting redundant rows and columns. This approach cannot be employed easily during field operation or for embedded memories, which motivates an alternative approach called soft repair. Soft repair involves logically switching out faulty elements by diverting read and write accesses from faulty elements to fault-free redundant elements, either during manufacture or during operation of the device in the field.
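The access diversion behind soft repair can be pictured as a small lookup table placed in front of the array. The following is a minimal sketch under simplifying assumptions: row-level repair only, spare rows that are themselves fault-free, and a hypothetical class name `SoftRepairMemory`.

```python
class SoftRepairMemory:
    """Toy row-addressed memory with spare rows and a logical remap table."""

    def __init__(self, rows, spare_rows):
        self.main = [0] * rows          # regular rows (some may be faulty)
        self.spares = [0] * spare_rows  # redundant rows, assumed fault-free
        self.remap = {}                 # faulty row -> index of its spare row

    def repair(self, faulty_row):
        """Divert a faulty row to the next free spare; False if none is left."""
        if faulty_row in self.remap:
            return True                 # already repaired
        if len(self.remap) >= len(self.spares):
            return False                # spares exhausted
        self.remap[faulty_row] = len(self.remap)
        return True

    def write(self, row, value):
        if row in self.remap:
            self.spares[self.remap[row]] = value
        else:
            self.main[row] = value

    def read(self, row):
        if row in self.remap:
            return self.spares[self.remap[row]]
        return self.main[row]
```

Because the table is ordinary state rather than a blown fuse, `repair()` can be called at manufacturing test or later in the field, which is the property that distinguishes soft repair from hard repair in the discussion above.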
Such repair and reconfiguration may be done in conjunction with either external testing or a built-in self-test (BIST). When the BIST sequence is augmented with a repair-and-reconfiguration sequence, the result is what is popularly known as built-in self-repair (BISR). Repair and BISR require algorithms not only for detecting faults in memory devices, but also for locating these faults. Fault location and repair algorithms and architectures, and hard and soft repair, are discussed in Chapter 2. In Chapter 6 we describe the circuit and layout issues underlying a design automation approach for built-in self-repairable RAMs.
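As a rough illustration of why BISR needs fault location and not just detection, the sketch below runs a simple march sequence (write 0s; ascending read 0, write 1; descending read 1, write 0) over a toy RAM with stuck-at bits, records which rows fail, and then assigns spare rows. The fault model, the names `FaultyRam`, `march_locate`, and `self_repair`, and the row-only repair policy are all assumptions made for this example, not the algorithms of Chapter 2.

```python
class FaultyRam:
    """Hypothetical word-oriented RAM in which some bits are stuck at 0 or 1."""

    def __init__(self, rows, width, stuck=None):
        self.width = width
        self.data = [0] * rows
        self.stuck = dict(stuck or {})  # (row, bit) -> stuck value (0 or 1)

    def write(self, row, value):
        self.data[row] = value & ((1 << self.width) - 1)

    def read(self, row):
        value = self.data[row]
        for (r, bit), v in self.stuck.items():
            if r == row:                # force each stuck bit to its stuck value
                value = (value | (1 << bit)) if v else (value & ~(1 << bit))
        return value


def march_locate(ram, rows):
    """Run the march sequence and return the sorted list of failing rows."""
    zeros, ones = 0, (1 << ram.width) - 1
    faulty = set()
    for r in range(rows):               # element 1: write 0s (any order)
        ram.write(r, zeros)
    for r in range(rows):               # element 2: ascending, read 0 then write 1
        if ram.read(r) != zeros:
            faulty.add(r)
        ram.write(r, ones)
    for r in reversed(range(rows)):     # element 3: descending, read 1 then write 0
        if ram.read(r) != ones:
            faulty.add(r)
        ram.write(r, zeros)
    return sorted(faulty)


def self_repair(ram, rows, spare_rows):
    """Test, then map each failing row to a spare; None if spares run out."""
    faulty = march_locate(ram, rows)
    if len(faulty) > spare_rows:
        return None                     # not repairable with the available spares
    return {row: spare for spare, row in enumerate(faulty)}
```

A pass/fail test would only report that the array is bad; the repair step needs the list of failing rows that `march_locate` returns. A practical scheme would also test the spares and consider column replacement, both of which are omitted here.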
In Chapter 5 we study the modeling and analysis of manufacturing yield and the factors that affect yield. Yield may be thought of as being equivalent to reliability and fault tolerance immediately after manufacture. It depends not only on the manufacturing process parameters (e.g., number of masking steps, maturity of the process, and scaling) but also on redundancy and repair, which allow a certain percentage of faulty chips to be usable after repair. Chapter 5 also describes the relationship between yield and reliability and summarizes various yield management techniques.
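The effect of redundancy on yield can be sketched with a deliberately simple model. The example assumes that defects follow a Poisson distribution with mean equal to chip area times defect density, that each defect can be repaired by exactly one spare element, and that the spares themselves are defect-free. These are illustrative assumptions only; the function name `poisson_yield` and its parameters are hypothetical and do not represent the yield models of Chapter 5.

```python
import math


def poisson_yield(area_cm2, defects_per_cm2, spares=0):
    """Probability that a chip has no more defects than it has spares.

    Assumes Poisson-distributed defects and one spare per defect.
    With spares=0 this reduces to the classical exp(-A * D) yield.
    """
    mean_defects = area_cm2 * defects_per_cm2
    return sum(
        math.exp(-mean_defects) * mean_defects ** k / math.factorial(k)
        for k in range(spares + 1)
    )


# Usage: compare yield without and with a few spare elements.
baseline = poisson_yield(1.0, 1.0)
with_two_spares = poisson_yield(1.0, 1.0, spares=2)
```

Under these assumptions, every added spare turns the chips with exactly that many defects from rejects into usable parts, so the computed yield never decreases as `spares` grows.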