TCP State Model
TCP is a reliable transport layer protocol that offers a full-duplex, connection-oriented byte stream service. The robustness of TCP makes it appropriate for wide area IP networks, where there is a higher chance of packet loss or reordering. What really complicates TCP are the flow control and congestion control mechanisms. These mechanisms often interfere with each other, so proper tuning is critical for high-performance networks. We start by explaining the TCP state machine, then describe in detail how to tune TCP, depending on the actual deployment. We also describe how to scale the TCP connection-handling capacity of servers by increasing the size of the TCP connection state data structures.
FIGURE 8 presents an alternative view of the TCP state engine.
FIGURE 8 TCP State Engine Server and Client Node
This figure shows the server and client socket API at the top, and the TCP module with the following three main states:
Connection Setup
This includes the collection of substates that collectively set up the socket connection between the two peer nodes. In this phase, the set of tunable parameters includes:
tcp_ip_abort_cinterval: the time a connection can remain in half-open state during the initial three-way handshake, just prior to entering an established state. This is used on the client connect side.
tcp_ip_abort_linterval: the time a connection can remain in half-open state during the initial three-way handshake, just prior to entering an established state. This is used on the server passive listen side.
For a server, there are two trade-offs to consider:
Long Abort Intervals The longer the abort interval, the longer the server waits for the client to complete the socket connection setup. This can result in increased kernel memory consumption and possibly kernel memory exhaustion, because each client socket connection requires state information that uses approximately 12 kilobytes of kernel memory. Remember that kernel memory is not swappable, so as the number of connections increases, both the memory consumed and the time needed for connection lookups increase. Hackers exploit this fact to mount Denial of Service (DoS) attacks, in which attacking clients constantly send only SYN packets to a server, eventually tying up all kernel memory and preventing legitimate clients from connecting. A rough memory-sizing sketch follows this list.
Short Abort Intervals If the interval is too short, valid clients on slow connections, or clients that go through slow proxies and firewalls, could be aborted prematurely. A shorter interval reduces exposure to DoS attacks, but slow, legitimate clients might also be mistakenly terminated.
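To make the long-interval trade-off concrete, the following back-of-the-envelope sketch (written in Python purely for illustration) estimates how much non-swappable kernel memory half-open connections can pin down. The roughly 12-kilobyte-per-connection figure comes from the discussion above; the SYN arrival rate and the interval values are invented examples, not measurements.

    # Rough sizing sketch, not a Solaris tool: worst-case kernel memory pinned by
    # half-open connections when no handshake ever completes.
    KBYTES_PER_CONNECTION = 12          # approximate kernel state per connection (from the text above)

    def halfopen_memory_kb(syn_rate_per_sec, abort_interval_ms):
        """Kernel memory (KB) held by half-open connections: arrival rate
        multiplied by the time each entry is retained, times per-entry cost."""
        outstanding = syn_rate_per_sec * (abort_interval_ms / 1000.0)
        return outstanding * KBYTES_PER_CONNECTION

    if __name__ == "__main__":
        # Hypothetical example: 500 SYNs/sec that never complete, comparing a
        # 3-minute abort interval with a shortened 60-second interval.
        for interval_ms in (180000, 60000):
            kb = halfopen_memory_kb(500, interval_ms)
            print("abort interval %d ms -> ~%.0f MB of kernel memory" % (interval_ms, kb / 1024.0))

Even this crude model shows why a server exposed to SYN floods benefits from a shorter abort interval.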
Connection Established
This includes the main data transfer state (the focus of our tuning explanations in this article). The tuning parameters for congestion control, latency, and flow control will be described in more detail. FIGURE 8 shows two concurrent processes that read and write to the bidirectional full-duplex socket connection.
Connection Shutdown
This includes the set of substates that work together to shut down the connection in an orderly fashion. We will see important tuning parameters related to memory. Tunable parameters include:
tcp_time_wait_interval: how long a TCP connection remains in the TIME_WAIT state (the 2MSL timeout) before shutting down and freeing its resources. If this value is too high, the socket holds on to resources; on a busy server the port and memory may be desperately needed, and they will not be freed until this time has expired. However, if this value is too short, old packets from the connection that are still lingering in the network (for example, delayed by routing changes) might arrive after the ports have been reused and be misinterpreted by a new connection. A back-of-the-envelope sizing sketch appears below.
tcp_fin_wait_2_flush_interval: how long this side waits in FIN_WAIT_2 for the remote side to close its half of the connection by sending a FIN. There are cases where the remote side crashes and never sends a FIN, so to free up resources this value puts a limit on the time the remote side has to close the socket. This means that half-open sockets cannot remain open indefinitely.
NOTE
tcp_close_wait is no longer a tunable parameter. Instead, use tcp_time_wait_interval.
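The tcp_time_wait_interval trade-off can be quantified with a similar sketch. It applies Little's law: the number of sockets lingering in TIME_WAIT is roughly the connection close rate multiplied by the interval, and each of those sockets still holds a port and its TCP state in non-swappable kernel memory. The close rate and interval values below are illustrative assumptions only.

    def time_wait_sockets(closes_per_sec, time_wait_interval_ms):
        """Average number of sockets sitting in TIME_WAIT at any instant
        (Little's law: arrival rate x time spent in the state)."""
        return closes_per_sec * (time_wait_interval_ms / 1000.0)

    if __name__ == "__main__":
        # Hypothetical busy server closing 1,000 connections per second,
        # comparing a 4-minute interval with a 1-minute interval.
        for interval_ms in (240000, 60000):
            n = time_wait_sockets(1000, interval_ms)
            print("tcp_time_wait_interval %d ms -> ~%d sockets in TIME_WAIT" % (interval_ms, n))

On such a server, the difference between a four-minute and a one-minute interval is the difference between hundreds of thousands and tens of thousands of lingering sockets.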
TCP Tuning on the Sender Side
TCP tuning on the sender side controls how much data is injected into the network toward the remote client end. Several concurrent schemes complicate tuning, so to make the behavior easier to understand, we separate the various components and then describe how these mechanisms work together. We describe two phases: Startup and Steady State. Startup Phase tuning is concerned with how fast we can ramp up sending packets into the network. Steady State Phase tuning is concerned with other facets of TCP communication, such as tuning timers, maximum window sizes, and so on.
Startup Phase
In Startup Phase tuning, we describe how the TCP sender initially starts to send data on a particular connection. One of the issues with a new connection is that there is no information about the capabilities of the network pipe, so the sender starts by blindly injecting packets at a faster and faster rate until it understands the capabilities of the pipe and adjusts accordingly. Manual TCP tuning is required to change macro behavior, such as when we have very slow pipes, as in wireless, or very fast pipes, such as 10 Gbit/sec. Sending a maximum-sized initial burst has proven disastrous. It is better to slowly increase the rate at which traffic is injected, based on how well the traffic is absorbed. This is similar to starting from a standstill on ice: if we initially floor the gas pedal, we will skid, and then it is hard to move at all; if, on the other hand, we start slowly and gradually increase speed, we can eventually reach a very fast speed. In networking, the key concept is that we do not want to fill buffers. We want to inject traffic as close as possible to the rate at which the network and the target receiver can service the incoming traffic.
During this phase, the congestion window is much smaller than the receive window. This means the sender controls the traffic injected toward the receiver by computing the congestion window and capping the amount of outstanding data at the smaller of the two windows. Any minor bursts can be absorbed by queues. FIGURE 9 shows what happens during a typical TCP session starting from idle.
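As a minimal sketch of this capping (illustrative values only, not taken from any particular implementation), the effective send window is simply the smaller of the congestion window and the receiver's advertised window:

    MSS = 1460                      # bytes; a typical segment size, used here only as an example

    def send_window(cwnd_bytes, rwnd_bytes):
        """Effective send window: the smaller of the congestion window and the
        receiver's advertised window."""
        return min(cwnd_bytes, rwnd_bytes)

    # Early in slow start the congestion window (a few segments) is the limiting
    # factor, even though the receiver advertises a much larger window.
    print(send_window(4 * MSS, 48 * 1024))   # -> 5840 bytes, capped by the congestion window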
FIGURE 9 TCP Startup Phase
The sender does not know the capacity of the network, so it starts by slowly sending more and more packets into the network, estimating the state of the network by measuring the arrival of ACKs and the computed round-trip times (RTT). This results in a self-clocking effect. In FIGURE 9, we see the congestion window initially start at a minimum size of one maximum segment size (MSS), as negotiated in the three-way handshake during the socket connection phase. The congestion window grows for every ACK that returns within the timeout, roughly doubling every round trip, and it is capped by the TCP tunable variable tcp_cwnd_max or grows until a timeout occurs. At that point, the internal variable ssthresh is set to half of the congestion window in effect at the time of the loss. ssthresh is the threshold below which the congestion window grows exponentially; once the window exceeds ssthresh, it grows additively, as shown in FIGURE 9. When a timeout occurs, the packet is retransmitted and the cycle repeats.
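The growth pattern just described can be sketched with a simplified model: exponential growth below ssthresh, additive growth above it, and a halved ssthresh with a restart from one segment after a timeout. The segment size, the cap standing in for tcp_cwnd_max, and the loss schedule are all invented for illustration; this is a model of the behavior, not the actual TCP implementation.

    MSS = 1460                      # example maximum segment size, in bytes
    CWND_MAX = 64 * 1024            # stand-in for the tcp_cwnd_max cap

    def simulate(round_trips, loss_at):
        """Return the congestion window (bytes) after each round trip."""
        cwnd, ssthresh = MSS, CWND_MAX
        history = []
        for rtt in range(round_trips):
            if rtt in loss_at:                     # timeout: back off and restart slow start
                ssthresh = max(cwnd // 2, 2 * MSS)
                cwnd = MSS
            elif cwnd < ssthresh:
                cwnd = min(cwnd * 2, CWND_MAX)     # slow start: roughly double per round trip
            else:
                cwnd = min(cwnd + MSS, CWND_MAX)   # congestion avoidance: add one segment per round trip
            history.append(cwnd)
        return history

    if __name__ == "__main__":
        # Invented loss schedule: a single timeout at round trip 8.
        for rtt, window in enumerate(simulate(20, loss_at={8})):
            print("RTT %2d: cwnd = %6d bytes" % (rtt, window))

Printing or plotting the values shows the ramp-up and back-off shape discussed above.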
FIGURE 9 shows that there are three important TCP tunable parameters:
tcp_slow_start_initial: sets up the initial congestion window just after the socket connection is established.
tcp_slow_start_after_idle: initializes the congestion window after a period of inactivity. Since there is now some knowledge about the capabilities of the network, we can take a shortcut and grow the congestion window without starting over from the initial minimum, which would be unnecessarily conservative.
tcp_cwnd_max: places a cap on the running maximum congestion window. If the receive window is increased, then tcp_cwnd_max should be increased to the receive window size so that the congestion window can grow that large.
In different types of networks, you can tune these values slightly to affect the rate at which you can ramp up. If you have a small network pipe, you want to reduce the rate at which packets are injected, whereas if you have a large pipe, you can fill it faster and inject packets more aggressively. The sketch below shows how the initial window setting changes the ramp-up time.
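As a rough illustration of why the initial window matters (a sketch with arbitrary numbers, assuming the initial window is expressed in segments of one MSS), the number of round trips of doubling needed to reach a target window is about log2 of the ratio between the target and the initial window:

    import math

    MSS = 1460                      # example segment size in bytes

    def rtts_to_reach(target_bytes, initial_segments):
        """Round trips of doubling needed for the congestion window to grow
        from initial_segments * MSS up to target_bytes."""
        initial = initial_segments * MSS
        if initial >= target_bytes:
            return 0
        return int(math.ceil(math.log(float(target_bytes) / initial, 2)))

    if __name__ == "__main__":
        target = 64 * 1024          # say we want to fill a 64-KB window
        for init in (1, 2, 4):      # candidate initial-window settings, in segments
            print("initial cwnd of %d segment(s): ~%d round trips to reach %d bytes"
                  % (init, rtts_to_reach(target, init), target))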
Steady State Phase
In the Steady State Phase, after the connection has completed the initial startup phase and stabilized, tuning is limited to reducing delays due to network and client congestion. An average condition must be used because there are always some fluctuations in the network and in client load that can be absorbed. When tuning TCP in this phase, we look at the following network properties:
Propagation Delay This is the time it takes one packet to traverse the network, and it is primarily influenced by distance. In WANs, tuning is required to keep the pipe as full as possible by increasing the number of allowable outstanding (unacknowledged) packets.
Link Speed This is the bandwidth of the network pipe. Tuning guidelines for a 56 kbit/sec dial-up connection differ from those for a 10 Gbit/sec optical local area network (LAN). The sketch following this list shows how link speed and propagation delay combine.
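To see how these two properties interact, the following sketch computes the bandwidth-delay product, the amount of data that must be in flight to keep the pipe full. The link speeds and round-trip times are illustrative examples, not measurements.

    def bdp_bytes(link_bits_per_sec, rtt_ms):
        """Bytes that must be in flight (unacknowledged) to keep the pipe full:
        link speed multiplied by round-trip time."""
        return link_bits_per_sec / 8.0 * (rtt_ms / 1000.0)

    if __name__ == "__main__":
        examples = [
            ("56 kbit/s dial-up, 200 ms RTT", 56e3, 200),
            ("100 Mbit/s LAN, 1 ms RTT", 100e6, 1),
            ("10 Gbit/s WAN, 60 ms RTT", 10e9, 60),
        ]
        for name, speed, rtt in examples:
            print("%s: ~%.0f KB in flight to fill the pipe" % (name, bdp_bytes(speed, rtt) / 1024.0))

A dial-up link needs only a kilobyte or two in flight, while a long-haul 10 Gbit/sec path needs tens of megabytes, which is why window and outstanding-packet limits must be tuned per network type.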
In short, tuning is adjusted according to the type of network and its key properties: propagation delay, link speed, and error rate. TCP actually self-adjusts to these properties in some instances by measuring the return of acknowledgments. We will look at various emerging network technologies (optical WAN, LAN, wireless, and so on) and describe how to tune TCP accordingly.