1.3 Network Troubleshooting Tools
At this point, you should be very familiar with the network topology separating the NFS client and server, and have verified that IP packets are taking the appropriate route through the network in both directions. If you have not yet performed these critical steps, refer to Section 1.1 "Analyze Network Layout" for instructions on how to collect this information. If, after running throughput tests with the tools described in Section 1.2 "Measure Network Throughput Capabilities," you believe your network has a throughput problem, it is time to troubleshoot the network itself.
KEY IDEA
Common Causes of Dropped Network Packets
In many cases, network throughput problems are caused by packets being dropped somewhere on the network, either by the NFS client or server system itself or at some intermediate point in the network separating the two systems. Some of the more common reasons network packets are dropped include:
Defective hardware (e.g. network interface cards, cables, switch ports, etc.)
Mismatching configuration settings between interface cards and switch equipment. The most common configuration issue is where one side of a connection is set to half-duplex and the other side to full-duplex, causing "late" collisions to be logged on the half-duplex side and FCS or CRC errors logged on the full-duplex side.7
Network interconnect device buffer memory exhaustion (described in Section 10.4)
UDP socket overflows occurring on the NFS server, indicating that not enough daemons are running to handle the inbound requests for a specific port
The goal of this phase of the investigation is to determine if the network throughput problem is affecting all IP traffic or only NFS. In some cases, the only tools that can detect these types of problems are external analyzers and reporting tools specific to your network hardware. HP does provide a number of software-based tools to help detect and analyze network problems. Two frequently used network troubleshooting tools are netstat(1) and lanadmin(1M).
1.3.1 netstat -s
The netstat(1) command can display statistics for network interfaces and protocols, list active network connections, print routing tables, etc. When executed with the "-s" option, netstat returns a complete list of all network transport statistics (TCP, UDP, IP, ICMP, and IGMP) arranged by protocol.
A portion of "netstat -s" output is shown in Figure 1.9 and Figure 1.10. These two screen shots illustrate how this single command returns an enormous amount of information about the underlying network protocols, including the number of TCP packets sent and received, the number of UDP socket overflows and checksum failures that occurred, etc. Also readily available are statistics such as the total number of IP packets received, the ICMP port unreachable and source quench messages generated, etc. All of this information can be extremely useful when troubleshooting a network or protocol layer problem.
Figure 1.9 netstat -s Output Showing TCP Statistics
Figure 1.10 netstat -s Output Including UDP and IP Statistics
Using netstat and diff to troubleshoot a network problem
Listed below is an example of how to use the "netstat -s" command to help determine if packets are being lost somewhere in your network. The underlined steps are the commands you type.
1. Initialize the "before" file with the current date and time.
# date > netstat.before
2. Collect a baseline set of netstat -s statistics on both the NFS client and server and append the output to the "before" file created in step 1.
# netstat -s >> netstat.before
3. Perform a test that exhibits the performance problem using the TCP protocol (such as ttcp or netperf).
# ttcp -stp9 -n 100000 server
4. Initialize the "after" file with the current date and time.
# date > netstat.after
5. Collect a second set of netstat -s statistics on both the NFS client and server and append the output to the "after" file created in step 4.
# netstat -s >> netstat.after
6. Locate any differences between the "before" and "after" netstat outputs to identify which statistics were incrementing during the test.
# diff netstat.before netstat.after
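The date and netstat pairs in this procedure can be folded into one small shell function so that the same sequence is repeated identically on the client and the server. The following is only a sketch: snapshot_stats is a hypothetical helper name, not an HP-UX command, and it simply runs whatever statistics command it is given.

```shell
#!/bin/sh
# Sketch only: snapshot_stats is a hypothetical helper, not an HP-UX utility.
# Usage: snapshot_stats OUTFILE COMMAND [ARGS...]
snapshot_stats() {
    outfile=$1
    shift
    # Steps 1 and 4: start the file with the current date and time.
    date > "$outfile" || return 1
    # Steps 2 and 5: append the output of the statistics command.
    "$@" >> "$outfile"
}

# Typical sequence, using the commands from the procedure above:
#   snapshot_stats netstat.before netstat -s
#   ttcp -stp9 -n 100000 server
#   snapshot_stats netstat.after netstat -s
#   diff netstat.before netstat.after
```

Because the command is passed as an argument, the same helper could also capture a narrower set of statistics without changing the rest of the procedure.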
An example of the type of output returned by diff(1) from such an exercise is shown in Figure 1.11. Some of the TCP statistics to monitor are "data packets retransmitted," "completely duplicate packets," and "segments discarded for bad checksum." If these statistics are steadily increasing over time, this indicates that packet loss is occurring somewhere in the network or that a possible network hardware problem is causing TCP checksum failures.
Figure 1.11 Comparing "before" and "after" netstat Outputs
Also of concern would be an increasing number of UDP "bad checksums," as this could indicate that an IP level device in the network (perhaps a router performing IP fragmentation) is not correctly fragmenting UDP datagrams as it forwards them. These datagrams would be discarded by the receiving system, forcing the sending system to re-send this data. UDP "socket overflows" usually indicate that an application, such as NFS, is receiving requests on a particular UDP socket faster than it can process them, and is consequently discarding requests.8
Of the IP statistics reported, "fragments dropped" and "fragments dropped after timeout" would indicate packet loss is occurring in the network.
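When a saved netstat -s file runs to several screens, it can help to pull out only the counters named above. The sketch below does this with awk while keeping the protocol heading above each match. The function name key_counters is hypothetical, and the script assumes that protocol headings are unindented lines ending in a colon (such as "tcp:"), which may not hold on every release.

```shell
#!/bin/sh
# Sketch only: key_counters is a hypothetical helper name.
# Assumption: protocol headings are unindented lines ending in ":".
# Usage: key_counters FILE   (FILE holds saved "netstat -s" output)
key_counters() {
    awk '
        # Remember the most recent protocol heading, for example "tcp:".
        /^[^ \t].*:$/ { header = $0; next }

        # Counters discussed in this section.
        /data packets? retransmitted|completely duplicate packets?|bad checksum|socket overflows|fragments dropped/ {
            if (header != "") { print header; header = "" }   # print each heading once
            print
        }
    ' "$1"
}
```

Running it against both the "before" and "after" files gives two short listings that are easier to compare by eye than the full outputs.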
Using netstat and beforeafter to troubleshoot a network problem
Although the diff(1) procedure does locate the differences between netstat outputs, interpreting diff output can be cumbersome. You have to manually subtract the "before" numbers from the "after" numbers, and because the diff output omits the protocol heading lines, you must also carefully confirm which statistics you are interpreting. For example, netstat returns "# packets received" under the TCP heading and "# total packets received" under the IP heading. Confusing these two values could lead to incorrect conclusions about the health of your network.
To simplify this procedure of interpreting multiple sets of netstat output, HP developed a tool called beforeafter. This program takes two netstat -s output files and compares them against each other. The output from beforeafter looks identical to netstat -s output except that the statistics represent only the differences between the "before" and "after" files. The tool is available at: ftp://ftp.cup.hp.com/dist/networking/tools/beforeafter.tar.gz.
The procedure for using the beforeafter tool is the same as the diff method outlined earlier, with the exception of step 6. Instead of using the diff command to compare the files, the beforeafter tool should be used as follows:
# beforeafter netstat.before netstat.after
Figure 1.12 contains an example of the beforeafter output. Notice the output looks identical to that of "netstat -s"; however the statistics reported are not cumulative totals but instead represent the differences between the "before" and "after" files.
Figure 1.12 beforeafter Comparing netstat -s Statistics
The beforeafter tool even calculates the difference in wall-clock time between the dates contained in the "before" and "after" files. Looking at the date line in Figure 1.12 you can see that the "after" file was collected 2 minutes and 28 seconds after the "before" file. The test caused this system to send 600503 packets containing 819239972 bytes of data, and to receive 57535 packets carrying acknowledgements for 819239975 bytes.
Although the diff output and the beforeafter output reveal the same information, locating and quantifying the key statistics is much easier using the beforeafter tool.
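To illustrate the idea behind this kind of comparison, the following awk sketch subtracts every number in the "before" file from the matching number in the "after" file and keeps the protocol headings. It is not the HP beforeafter tool and does not reproduce its behavior (for example, it does not compute the elapsed time). The function name netstat_delta is hypothetical, and the script assumes that counter lines begin with a number and that any other line is a heading.

```shell
#!/bin/sh
# Sketch only: netstat_delta is a hypothetical helper, not the beforeafter tool.
# Assumption: counter lines begin with a number; other lines are headings.
# Usage: netstat_delta BEFORE_FILE AFTER_FILE
netstat_delta() {
    awk '
        # Replace every run of digits with "#" so that a line can be matched
        # between the two files regardless of its counter values.
        function template(s) { gsub(/[0-9]+/, "#", s); return s }

        # Headings (date stamp, "tcp:", ...): track them, echo those of file 2.
        !/^[ \t]*[0-9]/ {
            section = $0
            if (NR != FNR) print
            next
        }

        # First file: remember each counter line under its heading.
        NR == FNR { before[section, template($0)] = $0; next }

        # Second file: rewrite each number as (after - before).
        {
            prev = before[section, template($0)]
            cur = $0
            out = ""
            while (match(cur, /[0-9]+/)) {
                out = out substr(cur, 1, RSTART - 1)
                a = substr(cur, RSTART, RLENGTH)
                cur = substr(cur, RSTART + RLENGTH)
                b = 0
                if (match(prev, /[0-9]+/)) {
                    b = substr(prev, RSTART, RLENGTH)
                    prev = substr(prev, RSTART + RLENGTH)
                }
                out = out sprintf("%.0f", a - b)
            }
            print out cur
        }
    ' "$1" "$2"
}
```

Keying each line on its heading plus its text is what keeps "packets received" under TCP from being confused with "total packets received" under IP. A counter that appears only in the "after" file is reported with its full value.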
1.3.2 netstat -p <protocol>
Once you have identified a subset of the netstat -s output that is of particular interest, you can limit the statistical output to a single protocol by using the "-p <protocol>" syntax. An example is shown in Figure 1.13. Confining the output to a single protocol can greatly simplify the process of identifying specific protocol-related problems compared to analyzing screens full of "netstat -s" output.
Figure 1.13 netstat -p Output
1.3.3 netstat -r
As discussed earlier in the "Analyze Network Layout" section, it is critical to understand the path NFS packets take through the network as they move between the client and server. We saw how utilities such as traceroute and "ping -o" display the various hops taken by packets going between two network nodes. If the traceroute or "ping -o" output reveals that packets are not taking the route you expect them to, you should verify the routing tables on both the client and server to make sure they are correct. Ensuring the accuracy of the routing tables is especially important on systems with multiple network interfaces, where outbound packets potentially have several paths to their final destination.
On HP systems, the "netstat -r" command is used to display the routing tables. Figure 1.14 shows an example of this. In this example, the "-n" (do not resolve IP addresses to hostnames) and "-v" (verbose) options were used. Included in the output is the interface name associated with each IP address, as well as the PMTU (Path Maximum Transmission Unit) size for each interface. The MTU information can be very useful in environments where the NFS client and server are on different physical networks and packet fragmentation or translation needs to occur (for example FDDI to Ethernet).
Figure 1.14 netstat -r Displaying Network Routing Tables
1.3.4 netstat -i
In large customer environments, particularly those where HP's MC/ServiceGuard9 product is used, it is not uncommon for NFS client and server systems to have multiple network paths to each other for redundancy reasons. In some cases the primary and backup interfaces are not equivalent in terms of their bandwidth capabilities. For example, the systems might use a Gigabit Ethernet interface as their primary connection and have a 100BT interface available as a backup connection. In these environments, sufficient care must be taken when configuring the NFS mount points to ensure that the traffic flows across the faster interface whenever possible. However, even with careful preparation, there is always a possibility that NFS traffic sent between the clients and servers will mistakenly use the slower interface.
A quick and easy way to verify which interface the majority of network traffic is using is to issue the "netstat -i" command and examine the inbound and outbound packet counts for all configured interfaces. Figure 1.15 provides an example of this output.
Figure 1.15 netstat -in Output
By monitoring the inbound and outbound packet rates of the interfaces, you can quickly determine if an unusually high amount of network traffic is using what should be an "idle" or "backup" interface. If this appears to be happening, a network trace can be taken to determine the hostnames of the remote systems that are sending requests to the slower interface.
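The comparison of per-interface packet counts can be scripted. The sketch below reads "netstat -in" style output and prints, for each interface, the combined inbound and outbound packet count and its share of the total. The function name iface_share is hypothetical. The script assumes the first line is a header containing columns named "Ipkts" and "Opkts" and that every data row has the same number of whitespace-separated fields as the header. Check this against the output on your own release before relying on it.

```shell
#!/bin/sh
# Sketch only: iface_share is a hypothetical helper name.
# Assumptions: line 1 is a header naming "Ipkts" and "Opkts" columns, and
# every data row has the same number of fields as the header.
# Usage: netstat -in | iface_share      or      iface_share FILE
iface_share() {
    awk '
        # Locate the packet-count columns by name rather than by position.
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Ipkts") in_col = i
                if ($i == "Opkts") out_col = i
            }
            next
        }
        # Sum inbound and outbound packets for each interface.
        in_col && out_col {
            pkts[$1] = $in_col + $out_col
            total += pkts[$1]
            order[++n] = $1
        }
        # Report: interface name, packet count, percentage of all packets.
        END {
            for (i = 1; i <= n; i++)
                printf "%s %.0f %.1f%%\n", order[i], pkts[order[i]], (total ? 100 * pkts[order[i]] / total : 0)
        }
    ' "$@"
}
```

Keep in mind that these counters are cumulative, so a single sample reflects all traffic since the counters were last reset. To see which interface is carrying traffic during a particular test, take one sample before and one after the test and compare the counts.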
1.3.5 lanadmin(1M)
As stated earlier, dropped packets on the network can occur if there are problems with the network interface cards, cables, or connectors. While many hardware-based problems can only be detected and identified with external analyzers, HP-UX provides several software-based tools to help monitor the health of the interfaces. The commands available for checking the state of any specific interface card will vary based on interface type (e.g. FDDI, Gigabit Ethernet, etc.). However, the lanadmin(1M) utility applies to all network links and it should be queried first.
The lanadmin command allows a system administrator to display many useful statistics kept by the LAN driver subsystem, regardless of the interface type. Figure 1.16 shows a sample screen output returned by lanadmin.
Figure 1.16 lanadmin Output
By reviewing this information you can learn a great deal about how the queried interface is configured and whether it has been logging any errors at the driver layer. For example, the output shown in Figure 1.16 indicates that this interface card is a 10/100BT card known to the system as device "lan0," the card is enabled and active, it is running at a speed of 100 Mbits/second with an MTU size of 1500, and it is currently configured to run in full-duplex mode. In some cases, this information alone can be enough to determine the cause of a network performance problem (for example, where the LAN interface and network switch configurations do not match with regard to speed and duplex settings).
Also available in the lanadmin output are various error counts, collision rates, total inbound and outbound packet counts, etc. By monitoring these counters you can, with the assistance of an HP support representative, try to make a qualified determination as to whether a hardware problem exists somewhere in your network. When software-based analysis tools fail to identify the problem, external tools can be used to provide the definitive view of the traffic patterns on the network and to isolate a device that is losing packets.
KEY IDEA
The Importance of Patching LAN, Transport, and Network Drivers
HP continually strives to improve the quality of HP-UX by distributing software patches containing both defect fixes and functionality enhancements. Many of these fixes and enhancements can significantly improve the performance and behavior of critical system components, such as LAN Common, the Network Transports (TCP, UDP, IP), and the various Network Link Driver subsystems (100BT, 1000BT, FDDI, Token Ring, etc.).
Since NFS relies heavily upon the stability and performance of the network, it is strongly recommended that the latest LAN, Transport, and network link driver patches be installed on every HP-UX system in order to take advantage of these improvements. Contact HP support to obtain a current set of patches for your specific operating system. You can also generate a current patch list using the tools available at HP's IT Resource Center: http://itrc.hp.com.
For a detailed discussion on the importance of keeping your HP-UX NFS client and server systems patched with current code, refer to Appendix B "Patching Considerations."