How SOA Records Control DNS
The SOA record controls how fast updated zones propagate from the master to the slave servers, and how long resource records (RRs) are cached in caching servers before they are flushed. Both of these affect your ability to make "instant" changes to the zones you maintain.
Consider two scenarios: moving a web server and moving a DNS server. How instant these changes need to be depends on how critical you consider the services. If you run DNS for an e-commerce site, everything is likely to be considered critical, even if the powers that be want it all done cheaply. You need to be able to tell them what must be done to make the changes work with DNS.
Moving a Service
Let's consider moving a web server from one hosting facility to another. Depending on how many machines provide the service, the whole service may be offline for "a while"; by moving the machines one at a time, you may be able to keep the service running throughout the move. Either way, you want DNS to serve the new address as soon as the service is up at the new site. I'll be moving www.penguin.bv. This is an extract of the penguin.bv zone showing the affected records:
```
$TTL 604800 ; 7 days
;
@       3600    SOA     ns.penguin.bv. hostmaster.penguin.bv. (
                        2000041300 ; serial
                        86400      ; refresh, 24h
                        7200       ; retry, 2h
                        3600000    ; expire, 1000h
                        172800     ; minimum, 2 days
                        )
                NS      ns
                NS      ns.herring.bv.
; WEB, both http://www.penguin.bv/ and
; http://penguin.bv/ with A records
www             A       192.168.55.3
; People often send mail to webmaster@www.domain
                MX      10 mail
                MX      20 mail.herring.bv.
                HINFO   PC Tunes
@               A       192.168.55.3
```
Because the default TTL for the penguin.bv zone is seven days, I need to start the work more than seven days ahead of time. The first step is to reduce the TTL of the web server's A records. How low should it go? Remember, every cached RR stays in the cache for the full TTL from the moment it is cached. Moving www.penguin.bv means turning off the machine, hauling it into a car, driving for 10 minutes, and bringing it up at the new site with a new address, for a total of 20 to 30 minutes. The zone will be updated with the new address right before the machine is turned off, so within 20 minutes after that, all users should have the new address. A TTL between 5 and 10 minutes seems appropriate; I'll use 10 minutes. So, at least seven days (the old TTL) before the machine is moved, I set these values on the A records that belong to the web server:
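The timing rule above is simple arithmetic, sketched here in Python for a hypothetical move date. `plan_ttl_change` is an illustrative name, not part of any DNS tool, and the sketch assumes every server starts serving the reduced TTL the moment it is published; slave propagation, discussed below, adds to these times.

```python
from datetime import datetime, timedelta

def plan_ttl_change(move_at: datetime, old_ttl: int, new_ttl: int) -> dict:
    """Key times for changing an address at move_at (hypothetical helper).

    old_ttl and new_ttl are in seconds.
    """
    # A cache that fetched the record just before the TTL was lowered keeps
    # it for old_ttl seconds, so the lower TTL must be published that early.
    lower_ttl_by = move_at - timedelta(seconds=old_ttl)
    # After the new address is published, an old cached copy survives at most
    # new_ttl seconds (ignoring any slave that still serves the old zone).
    stale_until = move_at + timedelta(seconds=new_ttl)
    return {"lower_ttl_by": lower_ttl_by, "stale_until": stale_until}

# Hypothetical move date; old TTL 7 days, new TTL 10 minutes.
plan = plan_ttl_change(datetime(2000, 5, 10, 9, 0), old_ttl=604800, new_ttl=600)
print(plan["lower_ttl_by"])  # 2000-05-03 09:00:00
print(plan["stale_until"])   # 2000-05-10 09:10:00
```

In practice you would lower the TTL a little earlier than `lower_ttl_by` to leave room for mistakes and slow slaves.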
```
www     600     A       192.168.55.3
...
@       600     A       192.168.55.3
```
Once the servers are serving this version, caches will hold these records for at most 10 minutes. Of course, I changed the serial number as well. Then I reloaded the server and checked the logs. So did you, right?
The second problem is that all the slave servers must pick up the update immediately, or they will continue serving the old records. Clients that query a lagging slave will keep getting, and caching, the old A records, to their frustration and, not incidentally, yours too.
If you have full access to the slave servers, this is not much of a problem; a simple trick suffices. Log into each slave, remove its copy of the updated zone's file, and restart named. This causes named to request the zone from the master immediately. If you do not control the slave servers, which is probably much more common, you need another way to force the transfer.
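The serial 2000041300 in the zone extract looks like a date-based YYYYMMDDnn value (a date plus a two-digit revision). Assuming that convention, which is common but not required by DNS (all DNS requires is that the serial increases), the next serial can be computed as in this sketch; `next_serial` is a hypothetical helper.

```python
from datetime import date

def next_serial(current: int, today: date) -> int:
    """Next zone serial under an assumed YYYYMMDDnn convention."""
    # Today's date with revision 00, e.g. 2000-04-29 -> 2000042900.
    candidate = int(today.strftime("%Y%m%d")) * 100
    # The serial must increase: if the zone was already edited today
    # (or the serial is ahead of the calendar), just add one.
    return candidate if candidate > current else current + 1

print(next_serial(2000041300, date(2000, 4, 29)))  # 2000042900
print(next_serial(2000042900, date(2000, 4, 29)))  # 2000042901
```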
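Whichever method you use, it is worth confirming that each slave actually serves the new serial. From the command line, `dig @server penguin.bv soa` shows it. As an alternative, the following Python sketch sends a bare SOA query over UDP using only the standard library and pulls the serial out of the RFC 1035 wire-format answer. It is a minimal sketch: no TCP fallback, no retries, no EDNS, and the function names are made up for illustration.

```python
import socket
import struct

SOA, IN = 6, 1  # RR type and class codes from RFC 1035

def build_soa_query(zone: str, query_id: int = 0x1234) -> bytes:
    """Encode a non-recursive query for the zone's SOA record."""
    # Header: ID, flags 0 (standard query, RD clear), QDCOUNT 1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", SOA, IN)

def skip_name(msg: bytes, offset: int) -> int:
    """Return the offset just past a (possibly compressed) domain name."""
    while True:
        length = msg[offset]
        if length == 0:
            return offset + 1
        if (length & 0xC0) == 0xC0:  # two-byte compression pointer ends the name
            return offset + 2
        offset += 1 + length

def parse_soa_serial(msg: bytes) -> int:
    """Extract the serial from the first SOA record in the answer section."""
    flags, qdcount, ancount = struct.unpack("!HHH", msg[2:8])
    if flags & 0x000F:
        raise ValueError(f"server returned RCODE {flags & 0x000F}")
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(msg, offset) + 4  # skip QTYPE and QCLASS
    for _ in range(ancount):
        offset = skip_name(msg, offset)
        rtype, _rclass, _ttl, rdlength = struct.unpack("!HHIH", msg[offset:offset + 10])
        offset += 10
        if rtype == SOA:
            pos = skip_name(msg, offset)  # MNAME
            pos = skip_name(msg, pos)     # RNAME
            return struct.unpack("!I", msg[pos:pos + 4])[0]
        offset += rdlength
    raise ValueError("no SOA record in the answer section")

def soa_serial(server: str, zone: str, timeout: float = 3.0) -> int:
    """Ask one name server (by IPv4 address) for its serial of the zone."""
    query = build_soa_query(zone)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        response, _ = sock.recvfrom(4096)
    if response[:2] != query[:2]:
        raise ValueError("response ID does not match the query")
    return parse_soa_serial(response)

# Example (needs network access; 192.168.226.3 is the slave that appears in
# the log output in the NOTIFY section, and the list of servers is up to you):
# for server in ["192.168.226.3"]:
#     print(server, soa_serial(server, "penguin.bv"))
```

Run it against the master and every slave after the reload; any server still reporting the old serial has not transferred the zone yet.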
Zone Transfer by NOTIFY
With NOTIFY, the master tells its slaves that a zone has changed, prompting them to check the serial number and transfer the zone right away instead of waiting for the refresh interval. The trouble is that the NOTIFY request travels by UDP and may be dropped by the network. The "U" in UDP stands for "User," not "Unreliable," although it might as well. Additionally, if a slave does not support NOTIFY, you're out of luck in any case. This is also one instance where NOTIFY's very sensible delay becomes frustrating: you can't tell whether your server is still waiting out the delay or whether the NOTIFY request got lost. Fortunately, you can enable more logging in named.conf so you can see everything that happens:
```
logging {
        channel my_log {
                file "/var/log/named.db";
                severity dynamic;
                print-category yes;
                print-severity yes;
        };
        category notify { my_log; };
        category xfer-out { my_log; };
};
```
Run "ndc restart" to pick up the configuration change. Then update the serial number in the zone, run "ndc trace 3" to set the debug level to 3, and run "tail -f /var/log/named.db" to see what happens when you finally run "ndc reload" to load the updated zone:
```
29-Apr-2000 16:39:50.897 ns_notify(penguin.bv, IN, SOA): ni 0x400bf728, zp 0x80e188c, delay 24
29-Apr-2000 16:40:14.899 sysnotify(penguin.bv, IN, SOA)
29-Apr-2000 16:40:14.901 Sent NOTIFY for "penguin.bv IN SOA" (penguin.bv); 1 NS, 1 A
29-Apr-2000 16:40:14.916 Received NOTIFY answer from 192.168.226.3 for "penguin.bv IN SOA"
29-Apr-2000 16:40:15.084 zone transfer (AXFR) of "penguin.bv" (IN) to [192.168.226.3].8478
```
Actually, there will be more output unless you "grep" the tail output, but the preceding lines contain the interesting bits. First, named decides that a NOTIFY is in order and delays it by 24 seconds. When the delay is up, the NOTIFY is sent. A response is promptly received, and shortly afterward the slave transfers the zone. This is what is supposed to happen for each and every slave server. If it does not, an "ndc restart" should suffice, because named issues NOTIFY requests when it starts, in case a zone changed since the last reload or restart. In this manner, you should be able to get the slaves reloaded promptly.
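The relative timing in such a log is easier to follow as offsets. This Python sketch parses lines in the timestamp format shown above (taken from this BIND 8 output; other versions may log differently), keeps the NOTIFY and transfer messages, and prints the time elapsed since the first one. `notify_events` is a hypothetical helper name, and the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Matches the "29-Apr-2000 16:39:50.897 message" layout of the log above.
LINE = re.compile(r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) (.*)$")

def notify_events(lines):
    """Yield (timestamp, message) for NOTIFY- and transfer-related log lines."""
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        stamp, message = m.groups()
        if re.search(r"notify|zone transfer", message, re.IGNORECASE):
            yield datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S.%f"), message

sample = """\
29-Apr-2000 16:39:50.897 ns_notify(penguin.bv, IN, SOA): ni 0x400bf728, zp 0x80e188c, delay 24
29-Apr-2000 16:40:14.899 sysnotify(penguin.bv, IN, SOA)
29-Apr-2000 16:40:14.901 Sent NOTIFY for "penguin.bv IN SOA" (penguin.bv); 1 NS, 1 A
29-Apr-2000 16:40:14.916 Received NOTIFY answer from 192.168.226.3 for "penguin.bv IN SOA"
29-Apr-2000 16:40:15.084 zone transfer (AXFR) of "penguin.bv" (IN) to [192.168.226.3].8478
""".splitlines()

events = list(notify_events(sample))
start = events[0][0]
for when, message in events:
    print(f"+{(when - start).total_seconds():7.3f}s  {message}")
```

For this sample, the transfer appears about 24.2 seconds after named scheduled the NOTIFY, matching the announced delay of 24. If a slave never shows a transfer line, it is a candidate for the "ndc restart" or the methods below.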
Zone Transfer by Other Methods
If some of your slaves do not implement NOTIFY and you do not have full access to them, you need to make the slaves check the zone for updates often enough that the transfer happens in time. That is what the refresh value controls. If you want the zone transferred within 10 minutes of an update, set the refresh period to 10 minutes. But be sure to do so more than one old refresh interval before the change takes place, so that every slave has picked up the new, shorter refresh interval in time. When moving an important server, a refresh period of one minute would not be out of place. If you plan ahead, this is quite possibly the simplest method in any case. But do increase the refresh period again afterward.
The last technique is to call the admins of the slave servers and arrange for them to be available when the move is made. Then phone around and have each admin force a transfer by removing the zone file from the slave server and restarting named, as described earlier. This works best if there aren't many of them to call.
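The "one old refresh interval" rule can be checked with a worst-case calculation. This Python sketch assumes slaves poll only on their refresh timer (no NOTIFY) and that every check and transfer succeeds, so the retry timer never comes into play; `latest_slave_update` and the dates are hypothetical.

```python
from datetime import datetime, timedelta

def latest_slave_update(published_at: datetime, change_at: datetime,
                        old_refresh: int, new_refresh: int) -> datetime:
    """Worst-case time by which every slave has transferred the changed zone.

    published_at: when the zone with the lowered refresh was loaded on the master
    change_at:    when the zone with the real change is loaded
    old_refresh, new_refresh: refresh values in seconds
    """
    # A slave that checked just before published_at keeps the old refresh
    # until its next check, up to old_refresh seconds later.
    old_cycle_done = published_at + timedelta(seconds=old_refresh)
    # A slave already on the new refresh sees the change within new_refresh.
    new_cycle_done = change_at + timedelta(seconds=new_refresh)
    return max(old_cycle_done, new_cycle_done)

# Refresh lowered from 24h to 1 minute two days ahead of the change:
print(latest_slave_update(datetime(2000, 5, 8, 9, 0),
                          datetime(2000, 5, 10, 9, 0), 86400, 60))  # 2000-05-10 09:01:00
# Lowered only one hour ahead: a slave may lag almost a full old interval.
print(latest_slave_update(datetime(2000, 5, 10, 8, 0),
                          datetime(2000, 5, 10, 9, 0), 86400, 60))  # 2000-05-11 08:00:00
```

The second case shows why lowering the refresh at the last minute does not help: slaves still on the old 24-hour cycle will not even see the new refresh value in time.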