- Introduction
- Impact of Scaling on Reliability
- Defects, Faults, Errors, and Reliability
- Reliability and Quality Testing and Measurement
- Reliability Characterization
- Reliability Prediction Procedures
- Reliability Simulation Tools
- Mechanisms for Permanent Device Failure
- Safeguarding Against Failures
- Concluding Remarks
1.4 Reliability and Quality Testing and Measurement
Reliability and quality of memory devices are evaluated by a set of tests that try to do two things: (1) verify the present functional behavior of each device, the yield of the manufacturing process, and human handling of devices through the supply chain to the consumer, and (2) identify potential problems in the future behavior of each device. Functional testing, on the other hand, deals only with the problem of verifying the correctness of the RAM operation at the time of testing. Functional fault modeling and testing have been described in our earlier book [281] and also by other researchers (see, e.g., [469]), and will not be described here. For high reliability and quality, we need a set of tests for detecting potential future failures, together with functional tests for verifying the present behavior. Some of these testing and measurement schemes are described in the following sections.
1.4.1 AQL measurement
One parameter used to quantify the quality of an integrated circuit is the average quality level (AQL), defined as the ratio of the number of defective ICs received by the customer to the total number of ICs delivered to the customer. A low AQL corresponds to high quality. Many factors, not all technology-related, may influence the AQL. Human mishandling, electrostatic discharge, or electrical overstress (ESD/EOS) may increase the value of AQL. Low fault coverage by the manufacturing test may also increase AQL, because such a test will fail to screen out a number of defective ICs. If the manufacturing process is very good and only a few defects are likely to occur, the AQL is expected to be low. For a constant defect density, small chips are less prone to defects than larger ones. Hence, for a given AQL, lower fault coverage can be tolerated for smaller chips than for larger ones. Similarly, for a given AQL, relatively low fault coverage can be tolerated if the manufacturing process is very mature and only a few defects are likely to occur.
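The dependence of AQL on chip size and fault coverage can be illustrated with a short numerical sketch. The AQL ratio below follows the definition given above. The Poisson yield model, Y = exp(-A·D), and the Williams-Brown defect-level relation, DL = 1 - Y^(1-T), are not taken from this section; they are standard textbook models assumed here only for illustration, and the chip areas and defect density are hypothetical values.

```python
import math

def aql(defective_received: int, total_delivered: int) -> float:
    """AQL as defined in the text: defective ICs received / ICs delivered."""
    if total_delivered <= 0:
        raise ValueError("total_delivered must be positive")
    return defective_received / total_delivered

def poisson_yield(area_cm2: float, defect_density_per_cm2: float) -> float:
    """Assumed Poisson yield model: Y = exp(-A * D)."""
    return math.exp(-area_cm2 * defect_density_per_cm2)

def shipped_defect_fraction(yield_fraction: float, fault_coverage: float) -> float:
    """Assumed Williams-Brown relation: DL = 1 - Y**(1 - T)."""
    return 1.0 - yield_fraction ** (1.0 - fault_coverage)

def required_coverage(yield_fraction: float, target_defect_fraction: float) -> float:
    """Invert the relation above: T = 1 - ln(1 - DL) / ln(Y)."""
    if yield_fraction >= 1.0:
        return 0.0  # a perfect process needs no screening in this model
    coverage = 1.0 - math.log(1.0 - target_defect_fraction) / math.log(yield_fraction)
    return max(0.0, coverage)

# Hypothetical example: same defect density, two chip sizes, same target.
density = 0.5                      # defects per cm^2 (assumed)
small = poisson_yield(0.25, density)
large = poisson_yield(1.00, density)
target = 0.001                     # 0.1% defective parts shipped (assumed)
print(required_coverage(small, target), required_coverage(large, target))
```

Under these assumptions the smaller chip meets the same target with lower fault coverage than the larger one, which is the qualitative point made in the paragraph above.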
1.4.2 Burn-in testing
To ensure reliability, IC manufacturers have used high-temperature and high-voltage burn-in tests (also known as stress tests) to accelerate device failure and, consequently, weed out weak devices. Burn-in allows test performance data to be collected over the life of the device by accelerating the infant mortality period with high temperature and high voltage. In such tests, devices are batched together and baked at a high temperature in a controlled fashion for several hours and then tested individually on a single-device tester. Such a system is called a dynamic burn-in system. To better utilize the hours spent in the burn-in oven, some of these systems are equipped with functional testing at the high temperature, thereby providing more data on failures and reducing the cost of device handling. Such a system is called a monitored burn-in system.
Dynamic burn-in systems apply extreme voltages and temperatures while exercising the inputs of the device under test (DUT). The operator of such a system removes the devices from the oven periodically and tests them on a single-device tester. This procedure is labor-intensive and involves extensive device handling. The operator must first stop the burn-in cycle and wait for the devices to cool down. Burn-in boards (BIBs) are then removed and devices are extracted from the boards. These devices are placed in handling tubes and mounted on a single-device tester. Functional tests are then applied to the devices. The devices are then sorted according to failure types, and the good devices are plugged back into the BIBs and reinstalled in the system. This entire process is not only very expensive but is also prone to device damage due to electrostatic discharge. Moreover, the large number of steps reduces throughput and increases the testing time.
Monitored burn-in systems equipped with functional testing represent a more economical approach. Functional testing of devices in the burn-in chamber eliminates several handling steps, thereby reducing labor and the chance of damaging devices. At the end of the burn-in cycle, the number of failed devices is monitored to check the effectiveness of burn-in. If the number of failures is high, the cycle can be extended automatically. Monitoring the number of failed devices before and after the extended cycle can be used to predict the mean time between failures.
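The decision to extend a burn-in cycle and the resulting failure-rate estimate can be sketched as follows. This is a minimal illustration, not the procedure of any particular burn-in system: the function names, the failure-fraction threshold, and the numbers are hypothetical. The estimate assumes a constant failure rate over the extended interval and applies to the stress conditions only; no acceleration-factor conversion to field conditions is attempted.

```python
from typing import Optional

def should_extend(failed: int, total: int, max_failure_fraction: float) -> bool:
    """Extend the cycle when the observed failure fraction is above the limit."""
    return failed / total > max_failure_fraction

def mtbf_estimate(survivors_at_start: int,
                  failures_during: int,
                  interval_hours: float) -> Optional[float]:
    """Point estimate of MTBF (in hours, at stress conditions).

    Assumes a constant failure rate during the interval, so that
    MTBF ~ accumulated device-hours / number of failures. Devices that
    fail are assumed, on average, to fail halfway through the interval.
    """
    if failures_during == 0:
        return None  # no failures observed: no point estimate available
    device_hours = (survivors_at_start - failures_during / 2) * interval_hours
    return device_hours / failures_during

# Hypothetical run: 1000 devices, 30 failed in the base cycle.
if should_extend(failed=30, total=1000, max_failure_fraction=0.02):
    # 970 survivors enter a 24-hour extension; 4 more fail.
    print(mtbf_estimate(survivors_at_start=970, failures_during=4,
                        interval_hours=24.0))
```

Comparing the failure counts before and after the extension in this way gives only a rough figure; a production system would typically fit a lifetime distribution rather than rely on a single interval.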
1.4.3 IDDQ and IDD testing
Another metric for quality and reliability of CMOS devices is the quiescent current drawn from the power supply, known as IDDQ. IDDQ testing is performed by measuring this quiescent current for each functional test vector, using high-speed current monitors for fast measurement. In some sense, IDDQ testing can be viewed as a quality- and reliability-driven enhanced functional testing approach. It may be recalled from the literature on RAM testing algorithms (see [281, 469]) that in functional testing, binary test patterns are applied at normal clock frequencies. Each such test pattern is a bit vector of voltage levels presented to the address, data, and control pins of the RAM under test. For fault-free devices, the quiescent current during functional testing and normal device operation should only be a subthreshold leakage current and its nominal value should be only a few microamperes. Abnormally high values of IDDQ are associated with many common defects in CMOS, such as gate oxide shorts and bridging faults. Many latent defects, such as gate oxide pinholes [162], can also be detected by IDDQ testing. These defects may not cause immediate functional failure, but could produce faulty behavior during field use.
The effectiveness of IDDQ testing for SRAM chips, especially for short circuit and open circuit defects, has been demonstrated by Meershoek et al. [292]. Soden et al. [413] have shown that IDDQ measurements can in some cases detect stuck-open faults at drain interconnects, particularly at the output nodes of RAM address decoders employing NOR gates. For example, an open interconnect on the drain of a pulldown NMOS transistor of an address decoder may cause the output to be put in a high-impedance state. If the previous state of this drain node was a logic 1, then all the access transistors on the word line driven by this NOR decoder will remain spuriously selected when they should have been turned off. This would cause a high value of IDDQ for the memory access cycle, and possibly also for some future access cycles.
If the floating drain node has a time constant small enough compared to the test vector clock period, the increased value of IDDQ can be monitored in the same test clock in which the vector that causes the high-impedance state is applied. Therefore, IDDQ testing requires very efficient and sensitive current-monitoring circuits. Meershoek et al. [292] used a special op-amp-controlled current mirror circuit to measure the current.
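The per-vector screening described above amounts to comparing each quiescent current measurement against a limit. The sketch below shows that comparison only; the function name, the 10 µA limit, and the sample measurements are hypothetical, and a real tester would derive the limit from the characterized leakage of fault-free devices.

```python
from typing import List, Sequence

def iddq_screen(measurements_ua: Sequence[float], limit_ua: float) -> List[int]:
    """Return the indices of test vectors whose IDDQ exceeds the limit.

    measurements_ua[i] is the quiescent supply current, in microamperes,
    measured after test vector i has settled.
    """
    return [i for i, current in enumerate(measurements_ua) if current > limit_ua]

# Hypothetical readings: vector 2 draws far more than a few microamperes.
readings = [2.1, 1.8, 250.0, 2.3]
failing_vectors = iddq_screen(readings, limit_ua=10.0)
print(failing_vectors)  # [2]
```

A device would be rejected, or flagged as a reliability risk, if any vector appears in the returned list, even when its logical outputs are correct.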
Apart from IDDQ monitoring, researchers have also investigated monitoring of the dynamic (or switching) power supply current IDD. Su and Makki [432] studied the relationship between the dynamic power supply current IDD and various RAM defects that induce pattern-sensitive faults. The defects they considered include cell open faults, caused by opens in the wiring, at the gate, or along the drain-source path, and cell short faults, caused by breakdown of the insulating layer separating nodes of different cells. Each class of defects results in a characteristic response of the dynamic power supply current IDD. The various fault models for the defects under consideration are illustrated in Figure 1.6.
Figure 1.6. Defect models for IDD testing; courtesy [432] © 1992 Kluwer Academic Publishers
1.4.4 Parametric testing
Parametric tests are performed to measure the analog characteristics of device parameters at the input and output pins. Dc parametric tests measure static parameters such as voltages and currents under steady-state conditions. For a RAM chip, dc parametric testing and measurement include: (1) a continuity or contact test for electrostatic protection diodes, for each electrical path to and from the chip; (2) measurement of input leakage currents, static and dynamic power supply currents, and input switching thresholds; and (3) measurement of output voltages and impedance values. Ac parametric testing includes measurement of dynamic parameters such as rise and fall times and slopes of output signals, setup and hold times of input signals, and signal propagation delays, when a set of functional test vectors is applied to the device. These tests may detect unhealthy deviation of device parameters from the prescribed values. Sometimes, such deviation may be caused by latent layout defects (such as leaky transistors, gate oxide shorts, etc.). These defects may not cause failure immediately after manufacture but could later develop into observable failures during field use and necessitate expensive field replacement for the failing devices.
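Checking measured parameters against their prescribed ranges can be sketched as a simple limit comparison. The parameter names, units, and limits below are hypothetical placeholders, not values from any data sheet.

```python
from typing import Dict, Tuple

def out_of_spec(measured: Dict[str, float],
                limits: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return the measured parameters that fall outside their (low, high) limits."""
    violations = {}
    for name, (low, high) in limits.items():
        value = measured[name]
        if not (low <= value <= high):
            violations[name] = value
    return violations

# Hypothetical dc and ac limits for one RAM chip.
limits = {
    "input_leakage_ua": (0.0, 1.0),     # dc: input leakage current
    "output_high_v": (2.4, 5.5),        # dc: output voltage, logic 1
    "output_rise_ns": (0.0, 5.0),       # ac: output rise time
    "propagation_delay_ns": (0.0, 20.0) # ac: signal propagation delay
}
measured = {
    "input_leakage_ua": 3.2,            # too leaky: possible latent defect
    "output_high_v": 3.1,
    "output_rise_ns": 4.0,
    "propagation_delay_ns": 18.5,
}
print(out_of_spec(measured, limits))    # {'input_leakage_ua': 3.2}
```

A device that passes functional testing but appears in this report is a candidate for the latent defects mentioned above.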
1.4.5 Dynamic testing
Dynamic testing verifies the timing behavior of the memory device by applying a functional test with a very fast clock and a very slow clock. The fast clock test would verify that the device is not too slow, and the slow clock test would verify that the device is not too fast. For example, a fast clock test may detect a slow sense amplifier or decoder operation, and a slow clock test may detect excessive leakage currents from DRAM cells, some flaw in the self-timing nature of the sense amplifier strobes, or a potential hold time or double-clocking problem if the RAM output is sent to a register.
Figure 1.7 gives an example of the self-timing relationship necessary between the various control signals in a folded bit-line DRAM column, such as the sense amplifier strobes, ØS1 and ØS2, to ensure correct behavior. Note that for DRAMs, most of these internal strobes are generated asynchronously by adding delays to clock edges. The minimum delays necessary from precharge to word line active and to reference word line active (TPW), from word line active to ØS1 active (TWS1), and from ØS1 active to ØS2 active (TS1S2) are shown in the figure. Apart from these timing constraints, there are additional constraints with respect to the clock signal and RAS and CAS signals. If the memory device is too fast (e.g., if the strobes ØS1 and ØS2 do not have adequate delay between them), then the read operation may be faulty.
Figure 1.7. Folded-bit-line DRAM column and the timing requirements between the various strobes
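The three minimum-delay constraints named above (TPW, TWS1, TS1S2) can be checked against observed edge times as in the following sketch. The class and field names and the nanosecond values are hypothetical; the additional constraints involving the clock, RAS, and CAS signals are not modeled.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StrobeEdges:
    """Times (ns) at which each signal becomes active in one read cycle."""
    precharge: float
    word_line: float
    s1: float   # first sense amplifier strobe
    s2: float   # second sense amplifier strobe

@dataclass
class MinDelays:
    """Required minimum delays (ns) between successive edges."""
    t_pw: float     # precharge -> word line active
    t_ws1: float    # word line active -> first strobe active
    t_s1s2: float   # first strobe active -> second strobe active

def timing_violations(edges: StrobeEdges, mins: MinDelays) -> List[str]:
    """Return the names of the constraints whose actual delay is too short."""
    checks = [
        ("TPW", edges.word_line - edges.precharge, mins.t_pw),
        ("TWS1", edges.s1 - edges.word_line, mins.t_ws1),
        ("TS1S2", edges.s2 - edges.s1, mins.t_s1s2),
    ]
    return [name for name, actual, required in checks if actual < required]

# Hypothetical "too fast" device: the second strobe follows the first too closely.
edges = StrobeEdges(precharge=0.0, word_line=5.0, s1=8.0, s2=9.0)
mins = MinDelays(t_pw=4.0, t_ws1=2.0, t_s1s2=2.0)
print(timing_violations(edges, mins))  # ['TS1S2']
```

An empty list means all three self-timing constraints are met for that cycle; any entry marks a delay that is too short and could lead to a faulty read.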
The presence of timing problems in the device affects its reliability during field operation and may necessitate field replacement, depending on the environment in which the device operates.