- 6.1 About Ethernet
- 6.2 About Hubs, Switches, and Routers
- 6.3 About TCP/IP
- 6.4 About Packets
- 6.5 About Remote Procedure Calls (RPCs)
- 6.6 Slop
- 6.7 Observing Network Traffic
- 6.8 Sample RPC Message Definition
- 6.9 Sample Logging Design
- 6.10 Sample Client-Server System Using RPCs
- 6.11 Sample Server Program
- 6.12 Spinlocks
- 6.13 Sample Client Program
- 6.14 Measuring One Sample Client-Server RPC
- 6.15 Postprocessing RPC Logs
- 6.16 Observations
- 6.17 Summary
- Exercises
6.7 Observing Network Traffic
In this chapter we will observe and measure some simple RPCs. In contrast to observing local CPU, memory, and disk activity, it takes two connected machines and two sets of software to observe network traffic. Rather than just observing isolated packets, we will observe an RPC system that has client software, server software, multi-packet RPC request and response messages, multiple server threads, and overlapped client calls. As usual, we wish to observe in enough detail to detect anomalous dynamics.
Figure 6.8 shows one example of such dynamics, captured by RPC logs and Dapper [Sigelman 2010]. Using the style of our single-RPC diagram from Figure 6.6, the notched lines in Figure 6.8 show the time layout of 93 parallel RPCs, similar to the top-level RPCs shown as timeless light green arcs in Figure 6.7.
Figure 6.8 Diagram of ~100 RPCs at top level of a single web-search RPC. The “pyxxxx” notations to the right of each line are individual server names.
The very top notched blue line labeled pyge65 in Figure 6.8 shows an incoming single RPC requesting a single web search on server pyge65. It takes about 160 msec total. Underneath it are the outbound sub-RPCs that this initial call spawns, directed at other servers. You can see the variation in sub-RPC transaction latency, and you can see that the 99th-percentile slowest parallel RPC, the one on pyhr29, determines the response time of the overall web-search RPC. You can also see that understanding and removing the sources of long latency can speed up this example by about a factor of 2, from 160 msec total to about 80 msec.
Look at the spawned RPCs. First, at the far upper left there is a short, barely visible call to server pyhr19 (the yellow line just under the “m” in “0ms”) to check for a previously cached immediate answer. Using cached previous search answers noticeably speeds up identical searches. Then a blue call to pyde35 is a canary request [Dean 2010]. Only when that returns successfully, i.e., without crashing pyde35, are the other RPCs issued. If you have a request that hits a code bug that crashes a server, and you will, the canary strategy results in crashing just one server instead of thousands. In Figure 6.8 the canary returns, so the 90-odd parallel calls to pygj11 .. pygk39 are started. Not shown are the 20-odd RPCs that each of these spawns in turn, about 2,000 in total, corresponding to the dark blue arcs in Figure 6.7.
Only when the slowest of these parallel calls, the one to pyhr29, returns does the initial web-search RPC complete. At the very lower right are two parallel calls (yellow lines) to update duplicate cached results on pyej23 and on pyhr19. These actually occur after the initial RPC completes. The vertical white line at 50 msec is just a time grid.
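The overall shape of this transaction, a cache check, then a canary, then a wide fan-out whose completion is gated by its slowest member, is a common datacenter pattern. The sketch below is not the book's sample code, just a minimal illustration of that control flow; the SearchShard and CheckCache names and the simulated latencies are made up.

```cpp
// Minimal sketch of the cache-check / canary / fan-out pattern in Figure 6.8.
// Not the book's sample code; SearchShard() and CheckCache() are hypothetical
// stand-ins, and the simulated per-shard latencies are made up.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Pretend sub-RPC: "search" one shard and return its latency in msec.
static double SearchShard(int shard) {
  int msec = 20 + (shard * 37) % 60;            // stand-in for variable server time
  std::this_thread::sleep_for(std::chrono::milliseconds(msec));
  return msec;
}

// Pretend cache lookup for a previously computed answer (always misses here).
static bool CheckCache(const std::string& query) { (void)query; return false; }

static double WebSearch(const std::string& query, int num_shards) {
  if (CheckCache(query)) return 0.0;            // cached answer: done immediately

  // Canary: send the request to a single server first. If the request hits a
  // bug that crashes the server, we lose one machine, not num_shards of them.
  if (SearchShard(0) < 0) return -1.0;          // canary failed; do not fan out

  // Fan out to the remaining shards in parallel.
  std::vector<std::future<double>> pending;
  for (int shard = 1; shard < num_shards; ++shard)
    pending.push_back(std::async(std::launch::async, SearchShard, shard));

  // The caller cannot respond until the slowest shard responds, so the overall
  // latency is governed by the maximum, not the average, of the shard latencies.
  double slowest = 0.0;
  for (auto& f : pending) slowest = std::max(slowest, f.get());
  return slowest;
}

int main() {
  printf("overall fan-out latency ~%.0f msec\n", WebSearch("hello", 93));
  return 0;
}
```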
If you look carefully at the canary call to pyde35, you will notice that the request message takes over 10 msec to go from client user-mode code to server user-mode code, and the response message also takes over 10 msec to go from server to client. This slop is much longer than the slop time for most of the subsequent RPCs, so we have our first unexpected source of excess latency. For datacenter networks within a single building, the delay through the routers of the hardware switching fabric rarely exceeds 20 usec. So a delay that is 500x longer than that can only be a software, not hardware, delay, somewhere on the client or server in either user code or kernel code. We examine such delays in Part IV.
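To make the arithmetic concrete, the fragment below estimates per-message slop from the four user-mode timestamps an RPC log can record, in the spirit of Section 6.6. It is only a sketch: it assumes the client and server clocks have already been aligned to a common timebase, assumes a 1 Gb/s link (about 125 bytes per usec), and uses made-up field names and values rather than this chapter's actual log format.

```cpp
// Minimal sketch (not the book's log format): estimate per-message slop for one
// RPC from four user-mode timestamps, assuming the two machines' clocks are
// already aligned to a common timebase and a ~1 Gb/s link (~125 bytes/usec).
#include <cstdint>
#include <cstdio>

struct RpcTimestamps {       // hypothetical field names; times in usec
  int64_t t1_req_send;       // client user code sends the request
  int64_t t2_req_rcv;        // server user code receives the request
  int64_t t3_resp_send;      // server user code sends the response
  int64_t t4_resp_rcv;       // client user code receives the response
  int64_t req_bytes;         // request message length
  int64_t resp_bytes;        // response message length
};

// Wire transmission time in usec at ~125 bytes/usec (1 Gb/s).
static int64_t TransmitUsec(int64_t bytes) { return bytes / 125; }

int main() {
  // Made-up numbers resembling the slow canary call: ~10 msec each way.
  RpcTimestamps t = {0, 10500, 11200, 21900, 2000, 4000};

  int64_t req_slop  = (t.t2_req_rcv  - t.t1_req_send)  - TransmitUsec(t.req_bytes);
  int64_t resp_slop = (t.t4_resp_rcv - t.t3_resp_send) - TransmitUsec(t.resp_bytes);

  // Hardware switching delay is rarely more than ~20 usec, so slop values in
  // the thousands of usec point at software delay on the client or server.
  printf("request slop = %lld usec, response slop = %lld usec\n",
         (long long)req_slop, (long long)resp_slop);
  return 0;
}
```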
If you look carefully, the 93 parallel calls do not all start at exactly the same time; there is a slight tilt to the nearly vertical left edge. Their start times increment by about 6 usec each, reflecting the CPU time to create and send each RPC, so launching the entire fan-out takes roughly half a millisecond. The leftmost notch on each line shows that almost all the RPC requests arrive at their corresponding server program fairly quickly, except the calls to pyhr29 and pyfi22, which take over 20 msec to arrive. This is another latency mystery to be resolved.
The rightmost notch on each line shows that almost all the RPC responses are sent only a short time before they arrive back at the client program, so there is no latency mystery for those.
Initially, however, we would be more interested in the exceptionally slow response times of pyhr29 and pygj11, since they delay the overall response time of the initial RPC by about 70 msec. Understanding those delays requires observing what is happening on each of those CPUs, applying our observation tools and thought to each in turn. The same kind of transaction logs that were used to create Figure 6.8, but gathered on pyhr29 and pygj11, can show the dynamics of their delays. The general pattern is: observe to focus on the big issues and ignore the inconsequential artifacts, examine the important artifacts in more detail, resolve them, and then repeat.
The good news is that our picture of the RPC activity on just one machine has revealed two message-delivery latency mysteries and has pinpointed exactly which other two machines contribute to the overall slow response time for this one web search, and at what time of day, to the microsecond. Looking at multiple such web searches over a few tens of seconds will reveal whether pyhr29 and pygj11 are always slow or just happened to be slow for this one observation.
Our goal in this chapter is to capture enough information about each RPC to be able to draw diagrams like Figure 6.8 and then use those to track down root causes for delays. In later chapters, especially Chapter 26, we will add tools for observing the underlying reasons for the delays that our RPC diagrams reveal.
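As a rough indication of what "enough information" means, a per-RPC log record might carry something like the fields sketched below. This is only a hypothetical illustration; the actual message definition and logging design used in this chapter appear in Sections 6.8 and 6.9, and all field names here are made up.

```cpp
// Hypothetical per-RPC log record: roughly the information needed to draw a
// diagram like Figure 6.8. The real design appears in Sections 6.8 and 6.9.
#include <cstdint>
#include <string>

struct RpcLogRecord {
  uint64_t rpcid;            // unique id for this RPC
  uint64_t parent_rpcid;     // id of the RPC that spawned it (0 if top level)
  std::string client;        // e.g., "pyge65"
  std::string server;        // e.g., "pyhr29"
  std::string method;        // name of the remote procedure called
  int64_t  t1_req_send;      // usec: client sends request
  int64_t  t2_req_rcv;       // usec: server receives request
  int64_t  t3_resp_send;     // usec: server sends response
  int64_t  t4_resp_rcv;      // usec: client receives response
  uint32_t req_bytes;        // request message length
  uint32_t resp_bytes;       // response message length
  uint32_t status;           // success or error code
};
```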
But before exploring the observation of network dynamics, we need to describe the sample “database” RPC system in a little more detail.