- Kernel Parameters Run Amok
- The Quick Fix
- Tools
- syslog
- Removing Bottlenecks
- Summary
7.2 The Quick Fix
Despite our preventive measures, let us suppose a server does get itself into a jam and email backs up. Further suppose that we know the problem is that the disk on which the queue resides is not fast enough to handle the load. We may already have another disk ready and an outage window scheduled to perform the upgrade when the system will be less busy, perhaps after most people have gone home from work for the day. Once we can take the machine down, we plan to carefully back up the data on the queue disk, verify the backup, add the second disk, stripe the old and new disks together using software RAID, bring the system back up, test it, restore the backed-up data, verify that this restoration went well, restart services, monitor them for a while, and call it a success. All in all, this strategy sounds like a well-reasoned upgrade plan.
The question is, What should we do now? The upgrade window may be hours (or perhaps days) away, the system is running slowly at this moment, and users or management may be asking if something can be done in the short term. Sometimes a quick fix is possible. If the server normally serves other functions, perhaps they can be suspended temporarily. With a POP server, perhaps incoming email could be turned off or at least dialed back long enough so that users can read the email they already have. Temporarily turning off lower-priority services is a reasonable reaction to a short-term performance crunch.
In some circumstances, one might be tempted to try to make short-term alterations to the server to get through the crisis. One could attempt to move older messages out of the queue and into another queue to expedite processing of the main queue. One could lower the RefuseLA parameter in the sendmail.cf file to try to reduce the load on the system. Many other things could be attempted as well. In reality, these attempts at short-term fixes rarely help. Usually, it's best to just let the server work its way out of a jam.
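For concreteness, the kind of knob being described here is a single option line in the configuration file. A minimal illustration, assuming a sendmail 8.x style sendmail.cf and using an arbitrary value, might look like this:

    # sendmail.cf option (illustrative value only): refuse new SMTP
    # connections whenever the system load average exceeds this threshold.
    O RefuseLA=8

Lowering the number makes the server turn connections away sooner; as noted above, that reduces the load on the disk only by reducing the amount of mail the server accepts.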
Some assistance, such as rotating the queue or perhaps changing the queue sort order to be less resource intensive, can prove beneficial, but most of the other problems won't be mitigated by just stirring the pot. For example, if one wants to move messages from one queue to another queue on a different disk, what operations must happen on the busy disk? The files will be located, read, and then unlinked. This is exactly the same load that will be put on the disk if the message is delivered. If the message will be delivered on the next attempt, we gain nothing by trying to move it. If the message will not be delivered for a while, we can lower the total number of operations on the disk by rotating the queues, and then suspending or reducing the processing of the old queue temporarily. Performing a queue rotation requires far fewer disk operations while deferring or reducing the number of attempts that will be made to deliver queued messages.
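As a sketch of what such a rotation can look like, the steps below assume a stock layout with the queue in /var/spool/mqueue; the paths, ownership, and service commands are illustrative and vary by platform:

    # Briefly stop the listening daemon while the directories are swapped.
    /etc/init.d/sendmail stop

    # Set the backed-up queue aside and create a fresh, empty queue,
    # matching the ownership and permissions of the original directory.
    mv /var/spool/mqueue /var/spool/omqueue
    mkdir /var/spool/mqueue
    chown root /var/spool/mqueue
    chmod 700 /var/spool/mqueue

    # Resume normal service against the empty queue.
    /etc/init.d/sendmail start

    # Drain the old queue gently (for example, a single pass run by hand or
    # from cron during a quiet period) so that it does not compete with
    # current traffic for the busy disk.
    /usr/sbin/sendmail -OQueueDirectory=/var/spool/omqueue -q

Because the old messages now live in a directory that the regular queue runners never scan, they stop adding to the per-run overhead on the busy disk, which is where the savings in disk operations comes from.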
Similarly, attempting to reduce the number of processes, the maximum load average on the system, or otherwise trying to choke off one resource in order to reduce the load on another typically arises from a spurious assumption. One may be able to reduce the load on the queue disks, for example, by reducing the number of sendmail processes that run on a server. However, reducing the load on the disk doesn't solve the problem, because the load on the disk is a symptom of the problem. The real problem is that more email is coming in than the server can handle. In this case, having a saturated disk is a good thing: it means that the disk is processing data as fast as it can. If we lower the amount of data it processes, the server will process less email. The external demand on the busy server will not decline because of our actions; if anything, it will increase, because we have voluntarily decreased the server's ability to process data, which is the last thing we want to do.
Some administrators might voice concerns that a system under saturation load will run less efficiently and, therefore, process fewer messages per unit time than a less busy server does. With some types of systems, this concern is well founded, but two points make this possibility less of an issue for the sorts of email systems discussed in this book than for other types of systems.
The first point is that while there exist a large number of fixed resource pools on an email server (CPU, memory, disk I/O, and so on), each process on that server remains largely independent. Thus one process running slowly generally does not cause another process to run slowly, other than through the side effect that both may compete for a slice of the same fixed resource pie. For example, if a system is running so slowly that the process table fills and the master sendmail daemon can't fork off a new process to handle a new incoming connection, this issue doesn't cause a currently running sendmail process to stop working. These events are largely independent of one another. It doesn't matter to a remote email server whether an SMTP session with the busy server couldn't be established because the master daemon cannot fork a child process or whether the connection is rejected by policy to avoid loading the server.
On the other hand, if, for example, sendmail were a multithreaded process running on the server with one thread handling each connection, and it didn't have internal protection against running out of memory, then the process running out of available memory could affect any or all other sendmail threads of execution, which could have catastrophic consequences. Fortunately, today's UNIX versions do a remarkably good job of isolating the effects of one process on another. If yet another process is competing for a fixed resource, that conflict may cause the other processes using that resource to run more slowly, but the resource will almost always be allocated fairly and the total throughput of the system will stay roughly constant, which is what we really want to happen.
The second point is that even if the system is processing less total data in saturation than it would under a more carefully controlled load, maneuvering the system to achieve higher throughput is very tricky. If some threshold, such as MaxDaemonChildren, is lowered too little, it will have no effect on the system's total throughput. If it is lowered too much, resources will go unused, which will lower aggregate throughput, a disaster. The sweet spot between these two extremes is often very narrow, hard to find, and, worst of all, time dependent. That is, the right value for MaxDaemonChildren might differ depending on whether a queue runner has just started, how large the messages currently being processed are, or how user behavior contributes to the total server load.
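For reference, the threshold mentioned above is, like RefuseLA, just an option line in sendmail.cf (it can also be set from an m4 configuration via confMAX_DAEMON_CHILDREN); the value shown is purely illustrative, and, as argued here, finding a genuinely better value while the server is under load is the hard part:

    # sendmail.cf option (illustrative value only): cap the number of child
    # processes the listening daemon will fork to handle incoming connections.
    O MaxDaemonChildren=40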
In summary:
- When a server gets busy, the most important thing is to find the real cause of the problem and schedule a permanent fix for it at the earliest convenient moment.
- In the meantime, one might be able to do some things to help out in the short term, such as temporarily diverting resources away from lower-priority tasks.
- Making configuration changes to overcome short-term problems is difficult at best, and will often cause the total amount of data processed by the server to go down, not up, which is not desirable.
- Because an email server generally allocates limited resources fairly, even when saturated with requests, the best course of action is often to let the server regulate its own resources, as it will likely do so more efficiently than it would with human intervention.
- When a real fix for a saturated email server can't be implemented immediately, it's usually better to let the server stay saturated and, hopefully, work its way out of a jam rather than to try to interfere.