14.2 The NFS
A primary advantage of using NFS in your cluster is that it is readily available and part of most Linux distributions. Its familiarity also is an advantage to most UNIX or Linux system administrators, who are able to deploy NFS almost in their sleep. The knowledge needed to tune and scale an NFS server and the associated clients, however, is not as readily available or applied as you might think.
The implementation of features in NFS also varies between Linux distributions, between versions of a particular distribution, and between the NFS client and server software components. Some of the features that we may be used to having on proprietary UNIX systems are not yet available on Linux, or are only partially complete. NFS on Linux can be a moving target in some respects.
One example of this type of incomplete functionality lies in the area of NFS read and write block sizes, which may default to 1 KB on some versions and may not be set larger than 8 KB on others, limiting potential performance. Another area is the support of NFS over TCP, which is enabled for Fedora Core 1, but not for some earlier versions of Red Hat, like 8.0.
Because NFS is a client-server implementation of a networked file system, setup and tuning are necessary on both the client and the server to obtain optimum performance. Understanding the major NFS software components and the complete NFS data path can help you select the best operating parameters for your installation. I will cover these topics in upcoming sections.
14.2.1 Enabling NFS on the Server
Once the local physical file systems are built on your NFS server, there are several steps necessary to make them available to NFS clients in the network. Before starting the NFS server subsystem, there are a number of options that we need to consider for proper operation. We also need to pay attention to potential security issues.
The NFS server processes are started from the /etc/init.d/nfs and /etc/init.d/nfslock files. Both of these services may be enabled with the customary chkconfig commands:
# chkconfig nfs on
# chkconfig nfslock on
The behavior of NFS on your server may be controlled by the /etc/sysconfig/nfs file. This file contains the options used by the /etc/init.d/nfs file, and usually does not exist by default.
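Because the file does not normally exist, you create it yourself and populate it with simple shell variable assignments, one per line. A minimal sketch follows, using two of the variables discussed later in this section; the values shown are placeholders, and the exact set of variables honored depends on your distribution's init script:

# /etc/sysconfig/nfs -- local NFS server settings (sketch)
RPCNFSDCOUNT=64    # placeholder; see the discussion of nfsd threads below
TUNE_QUEUE=yes     # see the socket queue tuning discussion below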
The /etc/init.d/nfs file uses three programs that are essential to the NFS server subsystem: /usr/sbin/rpc.mountd, /usr/sbin/rpc.nfsd, and /usr/sbin/exportfs. The rpc.mountd daemon accepts the remote mount requests for the server's exported file systems. The rpc.nfsd process is a user-space program that starts the nfsd kernel threads from the nfsd.o module; these threads handle the local file system I/O on behalf of the client systems.
The exportfs command is responsible for reading the /etc/exports file and making the exported file system information available to both the kernel threads and the mount daemon. Issuing the exportfs -a command will take the information in /etc/exports and write it to /var/lib/nfs/xtab.
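After you edit /etc/exports on a running server, the export list can be refreshed without restarting NFS. A typical sequence (a sketch using standard exportfs flags) is to re-export everything and then list what is currently offered:

# exportfs -ra
# exportfs -v

The -r flag resynchronizes /var/lib/nfs/xtab with the current contents of /etc/exports, and -v lists the exported directories along with their active options.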
The number of nfsd threads started in the kernel determines how many client requests can be handled simultaneously. The default of eight threads is almost never enough for a heavily used NFS server. To increase the number of nfsd threads started, you can set a variable in the /etc/sysconfig/nfs file:
RPCNFSDCOUNT=128
The exact number of threads to start depends on a lot of factors, including client load, hardware speed, and the I/O capabilities of the server.
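One rough way to judge whether more threads are needed (assuming your kernel exposes the nfsd statistics interface) is to look at the th line in /proc/net/rpc/nfsd on the server:

# grep th /proc/net/rpc/nfsd

The first number on that line is the current thread count and the second is the number of times all threads have been busy at once; if the second number climbs steadily, the thread pool is probably too small. Another indicator is socket queue overflows, discussed shortly.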
The nfsd threads receive client requests on port 2049 for either UDP/IP or TCP/IP. Which transport is used as a default depends on the Linux distribution and version, and some versions do not have NFS over TCP available at all. Performance is better over UDP, provided the network is robust.
One way to tell if you need more nfsd threads is to check for NFS socket queue overflows. Unfortunately, this method will tell you only if some socket on the system has overflowed, not the specific NFS socket. If the system is primarily an NFS server, you can infer that the most likely socket to overflow is the NFS socket. Check for this situation with
# netstat -s | grep overflow
    145 times the listen queue of a socket overflowed
If you are seeing socket overflows on the server, then try increasing the number of nfsd threads you are starting on your server. If the NFS server's input socket queue overflows, then a client request packet is dropped. This will force a client retry, which should show up in client NFS statistics available with the nfsstat command (on a client system):
# nfsstat -rc
Client rpc stats:
calls      retrans    authrefrsh
20         0          0
The number of hits in the retrans counter would potentially reflect any server issues with socket overflows, although there are other reasons for this number to increase, such as the end-to-end latency between the client and server or the server being too busy to reply within the time-out period set by the client. I discuss setting the client request time-out values when I talk about mount parameters.
You can easily tell how many nfsd threads are currently running on the system with the ps command. For example,
# ps -ef | grep nfs | grep -v grep
root       855     1  0 Mar18 ?        00:00:00 [nfsd]
root       856     1  0 Mar18 ?        00:00:00 [nfsd]
root       857     1  0 Mar18 ?        00:00:00 [nfsd]
root       858     1  0 Mar18 ?        00:00:00 [nfsd]
root       859     1  0 Mar18 ?        00:00:00 [nfsd]
root       860     1  0 Mar18 ?        00:00:00 [nfsd]
root       861     1  0 Mar18 ?        00:00:00 [nfsd]
root       862     1  0 Mar18 ?        00:00:00 [nfsd]
This shows the eight default nfsd threads executing on a server system. Oops. The fact that daemons are kernel threads is indicated by the brackets that surround the "process" name in the ps output.
14.2.2 Adjusting NFS Mount Daemon Protocol Behavior
The rpc.mountd process provides the ability to talk to clients using multiple NFS protocol versions. The most recent protocol, NFS protocol version 3 (NFS PV3), is supported, along with the older protocol version 2 (NFS PV2). Which protocol is used is selected based on the NFS client's mount request, or on the options supplied to the nfsd threads when they are started. The rpc.nfsd command is used to pass the protocol options to the kernel threads when they are started.
If you examine the man page for rpc.nfsd, you will see two options that control the available NFS protocols: --no-nfs-version and --nfs-version. Either of these options may be followed by the value 2 or 3 to disallow or force a particular NFS protocol version respectively. The /etc/sysconfig/nfs file provides two variables that translate into these options.
The MOUNTD_NFS_V2 variable and the MOUNTD_NFS_V3 variable may be set to the values no, yes, or auto. A value of no disallows the associated protocol version, a value of yes enables the associated protocol version, and a value of auto allows the start-up script to offer whichever versions are compiled into the kernel. An example /etc/sysconfig/nfs file that disallows NFS PV2 and enables only NFS PV3 would contain
MOUNTD_NFS_V2=no
MOUNTD_NFS_V3=yes
It is better to offer all available protocol versions and to control which one is used with the client mount parameters, unless there are interoperability issues between the client and server.
It is possible to determine which version of the NFS protocol a client system is using by looking at statistics from the nfsstat command. This command, in addition to providing retry information, will show the number of NFS PV3 and PV2 RPC requests that have been made to the server. An example output from this command is
# nfsstat -c
Client rpc stats:
calls      retrans    authrefrsh
20         0          0

Client nfs v2:
null         getattr      setattr      root         lookup       readlink
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
read         wrcache      write        create       remove       rename
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
link         symlink      mkdir        rmdir        readdir      fsstat
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%

Client nfs v3:
null         getattr      setattr      lookup       access       readlink
0         0% 14       70% 0         0% 0         0% 0         0% 0         0%
read         write        create       mkdir        symlink      mknod
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
remove       rmdir        rename       link         readdir      readdirplus
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
fsstat       fsinfo       pathconf     commit
0         0% 6        30% 0         0% 0         0%
The statistics kept by nfsstat are reset at each reboot or when the command is run with the zero-the-counters option. This command is
# nfsstat -z
It is also possible to use the nfsstat command to examine the server statistics for all clients being serviced. If there are clients using one or the other or both protocol versions, this is visible in the command's output, because the RPC calls are divided into groups according to protocol version. An example of the server statistics from nfsstat follows:
# nfsstat -s
Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1228       0          0          0          0

Server nfs v2:
null         getattr      setattr      root         lookup       readlink
1       100% 0         0% 0         0% 0         0% 0         0% 0         0%
read         wrcache      write        create       remove       rename
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
link         symlink      mkdir        rmdir        readdir      fsstat
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%

Server nfs v3:
null         getattr      setattr      lookup       access       readlink
1         0% 212      17% 0         0% 83        6% 141      11% 0         0%
read         write        create       mkdir        symlink      mknod
709      57% 0         0% 0         0% 0         0% 0         0% 0         0%
remove       rmdir        rename       link         readdir      readdirplus
0         0% 0         0% 0         0% 0         0% 22        1% 38        3%
fsstat       fsinfo       pathconf     commit
3         0% 18        1% 0         0% 0         0%
As you can see from the output of nfsstat, the server and client are both using NFS PV3 to communicate. Furthermore, by looking at the server, you can see that all clients are using NFS PV3, because there are no RPC calls listed for the PV2 categories. I am getting a little ahead of myself, though, because the nfsstat statistics are available only after NFS has been started. We aren't quite ready for that yet.
On an NFS server with a considerable number of clients, the maximum number of open file descriptors allowed by the NFS mount daemon may need to be adjusted. The rpc.mountd process uses the --descriptors option to set this, and the default number is 256. This value has to be passed with the RPCMOUNTD_OPTS variable in /etc/sysconfig/nfs:
RPCMOUNTD_OPTS='--descriptors=371'
This number represents the number of file handles allowed by the server; one is maintained per NFS mount. It is not the number of files that may be opened by the clients.
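If you are unsure whether the default of 256 is sufficient, a rough check is to count the mount entries that mountd is currently tracking in /var/lib/nfs/rmtab:

# wc -l /var/lib/nfs/rmtab

Each line corresponds to a client mount of an exported directory; if the count approaches the descriptor limit, raise the limit as shown above.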
14.2.3 Tuning the NFS Server Network Parameters
The /etc/sysconfig/nfs file allows tuning the input queue size for the NFS sockets. This is done by temporarily altering the system default parameters for the networking subsystem, specifically the TCP parameters. To understand how this is done, we need to look at the general parameter tuning facility provided by the /sbin/sysctl command.
This command reads the contents of the /etc/sysctl.conf file and applies the parameters listed there to the appropriate system parameters. The application of kernel parameter changes is usually done at start-up time by the init scripts, but may also be done on a "live" system to alter the default kernel parameter values.
Using the sysctl command, temporary changes may be applied to the system's kernel parameters, and these "tunes" may be permanently applied by adding them to the /etc/sysctl.conf configuration file. The NFS start-up script allows us to tune only the NFS socket parameter, instead of making global changes. The commands to do this are listed from the /etc/init.d/nfs start-up file:
# Get the initial values for the input sock queues
# at the time of running the script.
if [ "$TUNE_QUEUE" = "yes" ]; then
    RMEM_DEFAULT=`/sbin/sysctl -n net.core.rmem_default`
    RMEM_MAX=`/sbin/sysctl -n net.core.rmem_max`
    # 256kb recommended minimum size based on SPECsfs
    # NFS benchmarks
    [ -z "$NFS_QS" ] && NFS_QS=262144
fi

[... intermediate commands deleted ...]

case "$1" in
  start)
        # Start daemons.
        # Apply input queue increase for nfs server
        if [ "$TUNE_QUEUE" = "yes" ]; then
            /sbin/sysctl -w net.core.rmem_default=$NFS_QS >/dev/null 2>&1
            /sbin/sysctl -w net.core.rmem_max=$NFS_QS >/dev/null 2>&1
        fi

[... intermediate commands deleted ...]

        # reset input queue for rest of network services
        if [ "$TUNE_QUEUE" = "yes" ]; then
            /sbin/sysctl -w net.core.rmem_default=$RMEM_DEFAULT >/dev/null 2>&1
            /sbin/sysctl -w net.core.rmem_max=$RMEM_MAX >/dev/null 2>&1
        fi
The script temporarily modifies the values of the net.core.rmem_max and net.core.rmem_default parameters, starts the nfsd threads, then restores the old values of the parameters. This has the effect of modifying the receive memory parameters for only the socket created by the nfsd threads to receive NFS requests, on port 2049.
These network parameter values are also available in the files /proc/sys/net/core/rmem_max and /proc/sys/net/core/rmem_default. Note the similarity between the file path name in /proc and the parameter names used with the sysctl command. They are indeed the same data; there are just several methods of modifying the parameter value. We could alter the values (globally!) for these parameters by executing
# echo '262144' > /proc/sys/net/core/rmem_default
# echo '262144' > /proc/sys/net/core/rmem_max
Any changes we made this way would alter the defaults for all new sockets that are created on the system, until we replace the values.
You can crash or hang your system by making the wrong adjustments, so exercise caution when tuning parameters this way. It is best to try values temporarily before committing them to the /etc/sysctl.conf file or to /etc/sysconfig/nfs; entering untested parameters in /etc/sysctl.conf can render your system unbootable (except in single-user mode). To perform these tuning operations when NFS is started, we can set the following values in the file /etc/sysconfig/nfs:
TUNE_QUEUE=yes
NFS_QS=350000
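If, after testing, you decide the larger values are safe for every socket on the system (not just the NFS socket), the equivalent permanent, global settings would go into /etc/sysctl.conf and look like this sketch:

net.core.rmem_default = 262144
net.core.rmem_max = 262144

Remember that entries in /etc/sysctl.conf apply to all new sockets, whereas the TUNE_QUEUE mechanism confines the change to the NFS socket.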
14.2.4 NFS and TCP Wrappers
If you use TCP wrappers (/etc/hosts.deny and /etc/hosts.allow) to tighten security in the cluster's networks, you need to make special provisions for NFS. There are a number of services that must be allowed access for proper NFS operation. One of the standard ways to use TCP wrappers is to deny any service that is not explicitly enabled, so if you use this approach, you need to modify the /etc/hosts.allow file on the NFS server to contain
mountd: .cluster.local
portmap: .cluster.local
statd: .cluster.local
The client systems must be able to access these services on the server to mount the server's exported file systems and to maintain information about the advisory locks that have been created. Even though the binary executables for the daemons are named rpc.mountd and rpc.statd, the service names must be used in this file.
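For completeness, the deny-everything-by-default policy mentioned earlier is normally implemented with a catch-all entry in the server's /etc/hosts.deny file, something like:

ALL: ALL

With that entry in place, only the services explicitly listed in /etc/hosts.allow (including the three NFS-related entries above) will accept connections from the cluster network.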
14.2.5 Exporting File Systems on the NFS Server
The format of the /etc/exports file on Linux systems is a little different than that of other UNIX systems. The file provides the necessary access control information to the NFS mount daemon, determining which file systems are available and which clients may access them. The exportfs command takes the information from the /etc/exports file and creates the information in the /var/lib/nfs/xtab and /var/lib/nfs/rmtab files, which are used by mountd and the kernel to enable client access to the exported directories.
Each line in the /etc/exports file contains a directory to export, followed by a whitespace-separated list of clients that may access it; each client may have a list of export options enclosed in parentheses. The client specifications may contain a variety of information, including a specific host name, wild cards in a fully qualified host name, a network address and net mask, and combinations of these elements. An example file for our cluster might be
/scratch    cs*.cluster.local(rw,async)
/admin      10.3.0.0/21(rw)
/logs       ms*.cluster.local(rw) cs*.cluster.local(rw)
An entry in the /var/lib/nfs/xtab file for an exported directory with no options specified is listed (for each system with a mount) as
/kickstart cs01.cluster.local(ro,sync,wdelay,hide,secure,root_squash,no_all_squash,subtree_check,secure_locks,mapping=identity,anonuid=-2,anongid=-2)
This shows the default values for the options in the /etc/exports entry:
- ro: Read-only access.
- sync: Synchronous updates (don't acknowledge NFS operations until data is committed to disk).
- wdelay: Delay writes to see if related sequential writes may be "gathered."
- hide: Require file systems mounted within other exported directories to be individually mounted by the client.
- secure: Require requests to originate from network ports less than 1024.
- root_squash: Map root (UID or GID 0) requests to the anonymous UID and GID.
- no_all_squash: Do not map UIDs or GIDs other than root to the anonymous UID and GID.
- subtree_check: Check to see whether accessed files are in the appropriate file system and an exported system tree.
- secure_locks: Require authentication of locking requests.
- mapping=identity: User access permission mapping is based on the identity specified by the user's UID and GID.
- anonuid=-2, anongid=-2: Specify the anonymous UID and GID as the value -2.
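These defaults can, of course, be overridden per client in /etc/exports. As an illustrative sketch only (the host names here are hypothetical, following the cluster naming used earlier), an administrative node could be granted read-write access with root squashing disabled, while the compute slices get read-only access:

/admin    ms01.cluster.local(rw,no_root_squash,sync) cs*.cluster.local(ro)

Options such as no_root_squash should be handed out sparingly, because they allow a remote root user full access to the exported directory.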
14.2.6 Starting the NFS Server Subsystem
Now that we have all the various files and configuration information properly specified, tuning options set, and file systems built and ready for export, we can actually start the NFS server subsystem and hope that the clients can access the proper file systems. This is done in the normal manner:
# service nfs start
# service nfslock start
You should be able to verify that the [nfsd] and [lockd] kernel threads are shown by the ps command. Entering the command
# exportfs
should show the exported file systems on the server with the proper export options. You should also be able to locate processes for rpc.mountd, rpc.statd, portmap, and rpc.rquotad in the ps output. With the configuration parameters and required processes in place, we may proceed to mounting the server directories from a client system.
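Another quick check is to ask the portmapper which RPC services have registered on the server; mountd, nfs, nlockmgr, and status should all appear in the listing:

# rpcinfo -p localhost

If any of these are missing, the corresponding daemon or kernel thread did not start, and clients will fail to mount file systems or obtain locks.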
14.2.7 NFS Client Mount Parameters
The client mount parameters for NFS can control the behavior of the client and server interaction to a large degree. These parameters are shown in Table 14-1.
Table 14-1. Important NFS Client Mount Parameters
| NFS Mount Parameter | Default | Description |
|---|---|---|
| rsize | rsize=1024 | The number of bytes to request in an NFS read operation (maximum for NFS PV2 is 8,192 bytes) |
| wsize | wsize=1024 | The number of bytes to write in an NFS write operation (maximum for NFS PV2 is 8,192 bytes) |
| timeo | timeo=7 | Tenths of a second between RPC retries (NFS minor time-outs); the value is doubled on each time-out until either 60 seconds or the number of minor time-outs specified in retrans is reached |
| retrans | retrans=3 | The number of minor time-outs that can occur before a major NFS time-out |
| soft | | If a major time-out occurs, report an error and cease retrying the operation (this can result in data corruption! Soft is bad, bad, bad!) |
| hard | | If a major NFS time-out occurs, report an error and continue to retry; this is the default and the only setting that preserves data integrity |
| intr | | Similar behavior to soft; do not use unless you have good reasons! |
| nfsvers=<2\|3> | nfsvers=2 | Allow mounting with either NFS PV2 or PV3; PV2 is the default |
| tcp | | Mount the file system using TCP as the transport; support may or may not be present for the TCP transport |
| udp | | Mount the file system using UDP as the transport, which is the default |
Several important points need to be made about NFS mount parameters and their default values.
- The default size for rsize and wsize (1,024 bytes) will almost certainly have an adverse impact on NFS performance. There are a number of points where these settings can affect the performance of NFS servers and clients: file system read and write bandwidth, network utilization, and physical read/modify/write behavior (if the NFS block size matches the file system fragment size). In general, the 8-KB maximum for NFS PV2 is a minimum size for efficiency, with a 32,768-byte block being a better option if supported by the NFS PV3 implementation.
- The values for timeo and retrans implement an exponential back-off algorithm that can drastically affect client performance. This is especially true if the NFS server is busy or the network is unreliable or heavily loaded. With the default values, timeo=7 and retrans=3, the first sequence will be 0.70 seconds (minor time-out), 1.40 seconds (minor time-out), 2.80 seconds (minor time-out), and 5.60 seconds (major time-out). The second sequence doubles the initial value of timeo and continues.
- The default behavior for a mount is hard and should be left that way unless the remote file system is read-only. NFS is meant to substitute for a local file system, and most applications are unaware that the data they are accessing is over a network. When soft mounts return errors on reads or writes, most applications behave badly (at best) and can lose or corrupt data. The hard setting causes the client to retry the operation forever, so that an NFS server that goes "missing" because of a crash or reboot will pick up the operation where it left off on returning. This is extremely important NFS behavior that is little understood.
- The default NFS transport is UDP. This transport is not connection oriented and will perform better than TCP in stable, reliable networks. NFS implements its own error-recovery mechanism on top of UDP (see the retrans and timeo parameters), so unless the mount is over a WAN connection, the preferred transport is UDP. Some proprietary UNIX operating systems, like Solaris, default to using TCP as the NFS transport.
- The default mount protocol is NFS PV2, which is less efficient than NFS PV3. The NFS PV3 protocol supports large files (more than 2 GB), larger block sizes (up to 32 KB), and safe asynchronous writes (acknowledgment of physical commit to disk is not required on every write operation). Using NFS PV3 is extremely important to overall NFS performance.
The mount parameters for NFS file systems, whether in /etc/fstab or specified in automounter maps, need to have the following options specified for best performance:
rsize=32768,wsize=32768,nfsvers=3
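As a concrete sketch, an /etc/fstab entry on a client using these options might look like the following (the server name and path are taken from the autofs examples later in this section and are only placeholders for your own):

fs01:/raid0/scratch  /scratch  nfs  rsize=32768,wsize=32768,nfsvers=3,hard,udp  0 0

The hard and udp options merely make the defaults explicit, for the reasons discussed in the list above.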
14.2.8 Using autofs on NFS Clients
Maintaining the NFS mounts in a /etc/fstab file on multiple systems can become an exercise in frustration. A more centralized method of maintaining this information is to use the NIS subsystem to distribute the auto.master map and other submaps to the client systems. The NFS client system may then be configured to use the autofs subsystem to mount file systems automatically when required and unmount them when idle.
Unmounting the NFS file system when idle removes the direct dependency between the client and server that can cause retries if the remote file system's server becomes unavailable. This behavior is supported by "indirect" maps, unlike the direct map entries that simulate an NFS mount created by an entry in the /etc/fstab file. A side benefit is that indirect mounts can be changed without rebooting an NFS client using them. This level of indirection allows the system manager to change the location (server, directory path, and so on) of the data without affecting the client's use of the data.
An auto.master map from either NIS or a local file provides the names of pseudo directories that are managed by a specific map and the autofs subsystem on the client. The settings in the /etc/nsswitch.conf file for the automount service define the order of precedence for locating the map information. For example,
automount: nis files
Thus, autofs will examine the NIS information for the maps first, then local files located in the /etc directory. A simple example of a master map would be
/data    auto.data    nfsvers=3,rsize=32768,wsize=32768
/home    auto.home    nfsvers=3,rsize=32768,wsize=32768
Notice that the mount options in the master map, which override the submap options, are in the third field in the master map data. This is different from the options in the submaps, which are the second field in the data. I use the convention auto.<directory> to name the map that instructs autofs how to mount the information under the pseudo directory named <directory>. An example auto.data map might contain the entries:
scratch    fs01:/raid0/scratch
proj       fs02:/raid5/projects
bins       fs02:/raid5/executables
Any time a client system references a path, such as /data/proj, the autofs subsystem will mount the associated server directory at the appropriate location. When the directory is no longer in use, it is dismounted after a time-out.
Once the autofs configuration in /etc/nsswitch.conf is complete, and the NIS or local map files are present, the autofs subsystem may be started with
# chkconfig autofs on # service autofs start
You should be able to cd to one of the directories controlled by the indirect maps to verify their proper operation. I do not cover direct maps here, because I consider them to be evil.
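To spot-check an indirect map such as auto.data, simply reference a path beneath the managed directory and see whether the expected server file system appears; a quick example using the map shown earlier:

# cd /data/proj
# df .

If autofs is working, df should report fs02:/raid5/projects as the file system mounted at /data/proj.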
14.2.9 NFS Summary
Although NFS is not a parallel file system, its ease of implementation can make it a choice for smaller clusters. With proper tuning and attention to the NFS data path between client and server, NFS can offer adequate performance. Certainly, most UNIX or Linux system administrators have experience with implementing NFS, but tuning this network file system is not always trivial or obvious. ("You can tune a file system, but you can't tune a fish," as stated by the UNIX man page for tunefs.)
There are a number of areas that can be tuned on an NFS server, and the examples that we covered in this section illustrate a general method for setting Linux kernel parameters. The use of the sysctl command and its configuration file /etc/sysctl.conf allow any parameters to be modified at boot time. A cautious approach to tuning is to set the parameter values temporarily with either /proc or the sysctl command before permanently committing them.
In general, NFS mounts in the /etc/fstab file and direct mounts using autofs are a bad idea, because they create direct dependencies between the client systems and the server being mounted. If you need to take the NFS server down for service, the client RPC retry behavior for hard mounts will preserve data integrity, but will cause infinite retries until the server returns. It is better to let autofs unmount unused file systems and avoid the direct dependencies.