Small Packet TCP Traffic Performance
This section studies the gigabit performance of TCP with small packet network traffic. The ce interface cards are used for the evaluation, on the same hardware and software test bed described in "Bulk Transfer Traffic Performance." Note that a request/response type of traffic is not investigated; rather, the traffic is a continuous unidirectional flow of small packets.
Effect of Nagle's Algorithm and Deferred Acknowledgment
As discussed in "Small Packet Traffic Issues," Nagle's algorithm plays an important role in the transmission of small packets. Because Nagle's algorithm asks the sender to accumulate data while a small unacknowledged packet is outstanding, data written by the application may not be put on the wire as soon as it arrives in TCP. Meanwhile, systems at the receiving end typically enable deferred acknowledgment in the hope of achieving optimal throughput in the bulk transfer case. Hence, if an application tries to send a series of small messages (less than 1,460 bytes each), these messages may not be delivered immediately but instead with visible delays.
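As a rough illustration of the sender-side rule (not taken from the Solaris TCP implementation), Nagle's send decision can be sketched as follows; the names queued_bytes, small_unacked, and mss are illustrative:

    #include <stdbool.h>

    /* Decide whether TCP may transmit now or must keep accumulating. */
    static bool nagle_can_send_now(unsigned int queued_bytes,
                                   bool small_unacked,
                                   unsigned int mss)
    {
        /* A full-size (MSS) segment may always be transmitted. */
        if (queued_bytes >= mss)
            return true;
        /* A small segment may go out only when no small unacknowledged
         * segment is outstanding; otherwise the data is held back. */
        return !small_unacked;
    }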
TABLE 2 Performance of Small Packet Traffic When the Sender Turns On Nagle's Algorithm and the Receiver Enables Deferred Acknowledgment

Message (bytes) | On-wire packet payload (bytes) | Packet rate (packets/sec) | Throughput (Mbps)
60 | 60, 1440 | 13198 | 79.20
100 | 100, 1400 | 15224 | 91.35
250 | 250, 1250 | 26427 | 181.56
536 | 536, 1072 | 36548 | 235.09
1024 | 1024 | 42417 | 347.49
1460 | 1460 | 57146 | 667.48
This section explains what may be happening, but first looks at the raw packet rate the ce interface can deliver when the sender uses Nagle's algorithm and the receiver enables deferred acknowledgment. Note that the receiver sets tcp_deferred_acks_max to eight in this case. TABLE 2 lists the packet rate and throughput for a ce card when the server only sends packets. Both the throughput and the packet rate go up as the message size increases. However, the packet rate for a 60-byte payload is less than one-quarter of the packet rate for a 1,460-byte payload.

To understand what causes the low packet rate for small packets, the snoop utility in the Solaris 8 OE was used. FIGURE 8 shows what was found in the 100-byte payload case. After the network ramps up, that is, beyond the slow start [4] phase, the following cycle is seen. First, the server (machine S) sends the client (machine C) a packet with 100 bytes of payload. Machine S cannot continue sending without waiting for an ACK packet from machine C because the last packet it sent carried less than 1,460 bytes of payload. Machine C, however, is waiting for more packets from S to reduce the number of ACK packets per data packet. In the meantime, machine S accumulates data from the application. Finally, the amount of unsent data in machine S reaches 1,500 bytes (15 messages of 100 bytes), that is, above the 1,460-byte MSS, and machine S sends out a packet with 1,400 bytes of payload. Note that machine S will not fragment the buffered messages to form a packet with a full 1,460-byte payload. When machine C receives the packet with the 1,400-byte payload, it immediately sends machine S an ACK packet, and the cycle restarts.
FIGURE 8 Sending Process Of Packets With 100-byte Payload When Nagle's Algorithm is Enabled at the Sender (S) and Deferred Acknowledgment is Enabled at the Receiver (C)
Now one question pops up: why doesn't C wait until it has received an amount of data no smaller than eight times the MSS before it sends out an ACK? Obviously, this would make the interlocking scenario even worse, since machine S would have to accumulate another 1,500 bytes before it could send the next packet with 1,400 bytes of payload. The Solaris 8 OE handles this situation gracefully by enforcing the following rule: the receiver sends out an ACK packet immediately if both of the following conditions are true (a code sketch follows the two conditions):
A non-MSS segment arrived.
The amount of unacknowledged data is not a multiple of MSS.
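As a rough illustration of this rule, the following sketch (illustrative names, not the Solaris source) captures the decision:

    #include <stdbool.h>

    /* Send an ACK at once instead of deferring it when both of the
     * conditions listed above hold. Names are illustrative only. */
    static bool ack_immediately(unsigned int seg_len,
                                unsigned int unacked_bytes,
                                unsigned int mss)
    {
        bool non_mss_segment  = (seg_len != mss);            /* condition 1 */
        bool not_mss_multiple = (unacked_bytes % mss != 0);  /* condition 2 */
        return non_mss_segment && not_mss_multiple;
    }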
Note that the Solaris OE enforces this rule to address the end-of-connection case, but it also affects this experiment.
With this scenario explained, the relationship between the packet rate and the throughput is as follows:

Throughput (bytes/sec) = Packet_Rate * (floor(1460/message_size) + 1) * message_size / 2

where the floor function discards the fractional part of its argument; multiplying the result by eight converts it to bits per second for comparison with the Mbps figures in TABLE 2.
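As a quick check of this relationship, the short program below (a sketch, not part of the original test suite) predicts the throughput for the 100-byte row of TABLE 2; 15,224 packets per second yield roughly 91.3 Mbps, close to the measured 91.35 Mbps.

    #include <math.h>
    #include <stdio.h>

    /* Predicted throughput in Mbps, given the measured packet rate
     * (packets per second) and the application message size (bytes). */
    static double predicted_mbps(double packet_rate, double message_size)
    {
        double bytes_per_cycle =
            (floor(1460.0 / message_size) + 1.0) * message_size;
        double bytes_per_sec = packet_rate * bytes_per_cycle / 2.0;
        return bytes_per_sec * 8.0 / 1e6;
    }

    int main(void)
    {
        /* 100-byte messages at 15,224 packets per second (TABLE 2). */
        printf("%.2f Mbps\n", predicted_mbps(15224.0, 100.0));
        return 0;
    }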
The packet rate is mostly determined by the system's capability to execute the system calls that move data from user address space to kernel address space. Obviously, to obtain better performance for small packets, the system must disable either Nagle's algorithm or deferred acknowledgment. However, disabling deferred acknowledgment may hurt bulk transfer performance, and the opportunity to piggyback acknowledgments on data packets from machine C to machine S, if machine C has any, may be lost. Hence, the preferred approach is to disable Nagle's algorithm only on the sender. The two ways to achieve this goal are:
Set tcp_naglim_def to one with the ndd command (for example, ndd -set /dev/tcp tcp_naglim_def 1). TCP then sends a packet downstream immediately when it receives a message from the application. If the communication involves only Sun servers and workstations, an ACK packet is delivered after the server transmits two packets.
In the application, set the TCP_NODELAY socket option (with setsockopt) on the sockets that carry small-packet traffic, as sketched below. Only the application can know whether it will be communicating using small packets, so it makes sense for applications that use small packets to disable Nagle's algorithm on the particular sockets that need it. Disabling Nagle's algorithm system-wide is not preferred.
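A minimal sketch of the second approach follows; the function name make_nodelay_socket is illustrative and error handling is abbreviated:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>   /* TCP_NODELAY */

    /* Create a TCP socket and disable Nagle's algorithm on this socket only. */
    int make_nodelay_socket(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return -1;
        }

        int one = 1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
            perror("setsockopt(TCP_NODELAY)");

        return fd;
    }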
TABLE 3 shows the new packet rates for the same list of message sizes as TABLE 2 after Nagle's algorithm is disabled. The value of tcp_maxpsz_multiplier is set to 10 to produce the numbers in this table. The new packet rates for payloads of 60 bytes to 1,024 bytes increase by up to three times, to a level close to the packet rate of the 1,460-byte payload (packets of Ethernet MTU size). Note that even though the packet rates are higher, the actual throughput is lower than in TABLE 2 because each packet now carries only as much payload as the message size. However, no visible pauses are observed during the transmission. The throughput and packet rate change little whether deferred acknowledgment is enabled or not. Since disabling deferred acknowledgment means higher overhead per data packet and the loss of the opportunity to piggyback acknowledgments, disabling this feature is not recommended.
TABLE 3 Performance of Small Packet Traffic When the Sender Turns Off Nagle's Algorithm

Payload (bytes) | Packet rate, deferred ACK on (packets/sec) | Throughput (Mbps) | Packet rate, deferred ACK off (packets/sec) | Throughput (Mbps)
60 | 37224 | 25.90 | 39010 | 26.36
100 | 41987 | 38.15 | 43736 | 41.15
250 | 41297 | 102.87 | 42193 | 99.76
536 | 43724 | 188.66 | 41861 | 180.15
1024 | 42576 | 348.78 | 41024 | 336.07
1460 | 57527 | 671.92 | 57554 | 672.24
Packet Rate Versus Message Size
Traditionally, the packet processing cost is divided into two parts: the cost of processing a packet of minimal size and the cost of moving data from the kernel buffer to the user buffer [7]. The former is called the per-packet cost and the latter the per-byte cost. Under this model, larger packets always take longer to process. The model was developed when the bandwidth of the system backplane was low. However, the current Sun™ Fireplane interconnect can sustain 9.6 gigabytes per second, which may make the per-byte cost negligible. As a result, the per-packet cost can dominate the processing of each packet, making the processing times for packets of different sizes very close.
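In code form, the traditional model is simply a fixed per-packet term plus a per-byte term; the constants below are hypothetical placeholders, not measurements from the systems discussed here:

    /* processing_time(n) = per_packet_cost + per_byte_cost * n
     * The constants are hypothetical placeholders for illustration only. */
    static double processing_time_us(unsigned int payload_bytes)
    {
        const double per_packet_us = 10.0;   /* hypothetical fixed cost per packet */
        const double per_byte_us   = 0.001;  /* hypothetical copy cost per byte */
        return per_packet_us + per_byte_us * (double)payload_bytes;
    }

When the per-byte term is small, as a fast interconnect makes it, the total varies little between a 60-byte and a 1,460-byte packet, which is the behavior examined next.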
To see the new relationship between packet size and packet rate, the following experiment was conducted (a sketch of the sender loop follows the list):
Measure packet rates with payload ranging from one byte to 1,460 bytes.
Tune the system so that one system call from the user application to send a message corresponds to one packet on the wire. This is done by disabling Nagle's algorithm (setting tcp_naglim_def to 1).
Tune the system so that only one packet is moved between kernel components and between kernel and user applications. This is done by setting tcp_maxpsz_multiplier to 1 on the sending side (server) and setting tcp_deferred_acks_max to 1 on the receiving side (client).
The server only transmits and only one CPU is enabled.
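A minimal sketch of the sender side of such a test, assuming an already-connected socket descriptor and the tunings listed above (the function name blast_messages is illustrative):

    #include <string.h>
    #include <unistd.h>

    /* Write fixed-size messages in a tight loop over a connected TCP socket.
     * With tcp_naglim_def and tcp_maxpsz_multiplier set to 1 as described
     * above, each write() should correspond to one packet on the wire. */
    static void blast_messages(int fd, size_t msg_size, long count)
    {
        char buf[1460];
        memset(buf, 'x', sizeof(buf));
        if (msg_size > sizeof(buf))
            msg_size = sizeof(buf);
        for (long i = 0; i < count; i++) {
            if (write(fd, buf, msg_size) < 0)
                break;
        }
    }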
FIGURE 9 shows how the packet rates change when the payload varies from one byte to 1,460 bytes. The numbers shown in this figure are expected to be lower than those shown in TABLE 2 because of the special settings described above. The packet rates for message sizes of 180 bytes and smaller stay very close for the ce card. The packet rates for message sizes of 250 bytes and beyond are also very close to one another. However, the packet rate for a message size of 1,460 bytes (full-size Ethernet frames on the wire) is only 25 percent lower than those of 180-byte or smaller messages. TABLE 4 shows the percentage of CPU time spent in user mode and in kernel-to-user copy for some of the preceding test cases. Not surprisingly, the copy cost is below 10 percent across the board. These findings indicate that the cost associated with copying data in the operating system (which grows more than 350 times as the payload increases from four bytes to 1,460 bytes) is not dominant on the Sun Fire 6800 platform. It is the cost of processing each packet that affects performance most.
TABLE 4 Percentage of CPU Time for User-Mode and Kernel-to-User Copy

Payload (bytes) | Percent user time | Percent kernel-to-user copy
100 | 9 | 1
180 | 9 | 2
250 | 9 | 3
536 | 8 | 5
1024 | 6 | 5
1460 | 6 | 7
FIGURE 9 Packet Rate When the Payload Varies From One Byte to 1,460 Bytes