Sending a Message to a Destination CA
Definition of Requester and Responder
In the specification, the QP SQ Logic that is sending request packets is referred to as the requester, while the QP RQ Logic that is receiving them is referred to as the responder.
Example Scenario Assumptions
This example scenario assumes that the QPs being used were created earlier in time. It also makes the following assumptions:
The companion QPs in the two CAs are both of the Reliable Connected (RC) type (see "QP Type" on page 41).
The start PSN assigned to the SQ Logic in CA "X" is 100, while the start PSN assigned to the SQ Logic in CA "Y" is 2000 (see "SQ Start PSN (Packet Sequence Number)" on page 41).
The two QPs have just been created and set up and have not yet sent any packets to each other.
The ePSN assigned to the RQ Logic in CA "X" is therefore 2000, and the ePSN assigned to the RQ Logic in CA "Y" is 100 (see "RQ Logic's Expected PSN (ePSN)" on page 41).
The example operation will cause the transmission of a message from the local memory of CA "X" to the local memory of CA "Y".
The message is 5KB in size and the maximum data payload size is 2KB (see "Maximum Data Payload Size" on page 42). This means that the entire message will require the transmission of three Send request packets: a "Send First" request packet with a data payload of 2KB, one "Send Middle" request packet with a data payload of 2KB, and a "Send Last" request packet with a data payload of 1KB.
The Ack Retry Count assigned to the SQ Logic in both QPs is 7 (see "Ack Timeout and Missing Packet(s) Retry Counter" on page 44).
The RNR Retry Count assigned to the SQ Logic in both QPs is 7 (see "Receiver Not Ready (RNR) Retry Count" on page 45).
The destination CA is in the same subnet as the source CA. Packets therefore only require a Local Route Header (LRH) containing the SLID and DLID addresses. They do not require a Global Route Header (GRH) containing an SGID and DGID (see "Global Source/Destination Addresses" on page 45).
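The packet count in the assumptions above follows directly from the message size and the maximum data payload size. A minimal Python sketch of that arithmetic is shown here; the function name `segment_send` is hypothetical, and the "Send Only" case for a message that fits in one packet is outside this example scenario.

```python
from typing import List, Tuple

KB = 1024


def segment_send(message_len: int, pmtu: int) -> List[Tuple[str, int]]:
    """Return (opcode, payload_bytes) for each request packet of one Send."""
    if pmtu <= 0:
        raise ValueError("maximum data payload size must be positive")
    if message_len <= pmtu:
        # Not part of this example: a message that fits in a single packet
        # is carried by one "Send Only" request packet.
        return [("Send Only", message_len)]
    packets: List[Tuple[str, int]] = []
    remaining = message_len
    while remaining > 0:
        payload = min(pmtu, remaining)
        if not packets:
            opcode = "Send First"      # first packet of the message
        elif remaining > pmtu:
            opcode = "Send Middle"     # more data follows this packet
        else:
            opcode = "Send Last"       # this packet drains the message
        packets.append((opcode, payload))
        remaining -= payload
    return packets


# The example scenario: a 5KB message with a 2KB maximum data payload size.
packets = segment_send(5 * KB, 2 * KB)
for opcode, payload in packets:
    print(f"{opcode}: {payload} bytes")
```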
Step One: Posting the Message Receive Request
The process of creating a QP is described later in this book. In this example scenario, the requester QP's SQ Logic (item L) in CA "X" will read a 5KB message from its local memory and send it to the target responder QP's RQ Logic (item Q). Upon receipt of the message data, the responder RQ Logic will use the top entry (WQE) currently posted to its RQ (item S) to determine where to write the incoming message data in its CA's local memory. This presupposes that software associated with the responder QP in CA "Y" has already posted a WR to that QP's RQ. The software application (item V) takes the following actions to post such a request:
The software application makes a call to the OS memory management routine to request the allocation of a buffer (or multiple buffers) to hold the next expected incoming message. This brings up the question of how it would know in advance the size of the next inbound message. The specification doesn't cover this issue, but it can be handled in a number of ways:
This information could have been received in a previous message.
The application may have device- or application-specific knowledge of the entity with whom it will trade messages and would "know" the size of messages to be expected.
Alternatively, the software application associated with the example CA "Y" QP may have no idea of the size of yet-to-arrive incoming messages and will therefore not post any WRs to the RQ in advance. Instead, when the QP's RQ Logic detects the first packet of an inbound Send operation (as an example), it could return an RNR Nak packet (see "Receiver Not Ready (RNR) Retry Count" on page 45) to the sender's SQ Logic. HCA "Y" could then set a status bit in a CA-specific register to indicate that a WR should be posted to that QP's RQ to handle the expected retry (i.e., retransmission) of the Send request packet. Software associated with CA "Y" is invoked by an interrupt, checks the status, and, as a result, posts a WR to the indicated QP's RQ to handle the retry.
The example scenario assumes that software will post a WR to the QP's RQ (item S) prior to the receipt of the first request packet of the Send operation. The WR is supplied to CA "Y" by executing a Post Receive Request verb call and supplying the following input parameters:
The QP handle that identifies the QP to whose RQ the WR is to be posted. The handle was returned when the Create QP verb was called.
A unique 64-bit WR ID. This WR ID will be deposited in the Completion Queue Entry (CQE; pronounced "cookie") that is created on the RQ's CQ (Completion Queue; item T) once all of the message data has been written to CA "Y's" local memory.
The operation type: in this example, it is a Receive operation.
A Scatter Buffer List identifying the main memory buffer(s) to which the inbound message data will be written.
Upon receipt of the WR, the Post Receive Request verb causes the WR to be posted to the next entry in the QP's RQ (item S). Once posted on the RQ, the WR is referred to as a WQE (Work Queue Entry; pronounced "wookie").
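The input parameters of the Post Receive Request verb call can be pictured as a small record that is appended to the QP's RQ. The following Python sketch is a toy model only: the class names, the `post_receive` function, and the buffer addresses are hypothetical and do not represent any real verbs API.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class ReceiveWR:
    wr_id: int                            # unique 64-bit WR ID
    scatter_list: List[Tuple[int, int]]   # (address, length) of each buffer
    opcode: str = "Receive"               # the operation type


@dataclass
class QueuePair:
    handle: int                           # handle returned by Create QP
    rq: Deque[ReceiveWR] = field(default_factory=deque)


def post_receive(qp: QueuePair, wr: ReceiveWR) -> None:
    """Post the WR to the next RQ entry; from here on it is a WQE."""
    if not 0 <= wr.wr_id < 2**64:
        raise ValueError("WR ID must fit in 64 bits")
    qp.rq.append(wr)


# Hypothetical values: two 4KB buffers, enough for the 5KB example message.
qp_y = QueuePair(handle=7)
post_receive(qp_y, ReceiveWR(wr_id=1,
                             scatter_list=[(0x1000, 4096), (0x3000, 4096)]))
capacity = sum(length for _, length in qp_y.rq[0].scatter_list)
print(f"RQ depth {len(qp_y.rq)}, buffer capacity {capacity} bytes")
```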
Step Two: Posting the Message Send Request
Software can cause a QP's SQ in a CA to transmit a message transfer request to a QP's RQ in another CA in the following manner:
The software application (item B) builds the message to be transferred in memory local to the CA (item C). In the illustration, the software application writes the data that comprises the message into one or more buffers (memory ranges) in main memory.
The software application then executes a Post Send Request verb call (item D) and supplies the verb with a Work Request (WR) consisting of the following parameters (note that this is not an all-inclusive list):
A QP identifier in the form of the handle returned when the Create QP verb was called earlier.
A unique 64-bit WR ID. This WR ID will be deposited in the CQE that is created once the message Send operation has been completed.
The operation type: in this example, it is a Send operation.
A Gather Buffer List identifying the main memory buffer(s) that hold the example 5KB message to be sent to the remote CA.
Optionally, a 32-bit immediate data value can be supplied. Upon receipt of the entire 5KB message, the remote CA's QP RQ Logic (item Q) deposits this value in the CQE it creates on the RQ's associated CQ (item T). This data value can be used to inform software associated with the destination CA regarding the nature of the message it just received.
Upon receipt of the WR, the Post Send Request verb causes the WR to be posted to the next entry in the QP's SQ (item J). Once posted on the SQ, the WR is referred to as a WQE.
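The Gather Buffer List lets one message be assembled from several separate memory ranges. As a rough illustration, the Python sketch below models main memory as a byte array and a Send WR as a small record; the names `SendWR` and `gather`, the buffer offsets, and the immediate data value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SendWR:
    wr_id: int                            # unique 64-bit WR ID
    gather_list: List[Tuple[int, int]]    # (offset, length) of each buffer
    opcode: str = "Send"                  # the operation type
    immediate: Optional[int] = None       # optional 32-bit immediate data


def gather(memory: bytearray, gather_list: List[Tuple[int, int]]) -> bytes:
    """Concatenate the buffers named by the gather list, in list order."""
    return b"".join(memory[off:off + length] for off, length in gather_list)


# Software builds the 5KB message in two non-contiguous buffers.
memory = bytearray(16 * 1024)
memory[0:3072] = b"A" * 3072
memory[8192:8192 + 2048] = b"B" * 2048

wr = SendWR(wr_id=42, gather_list=[(0, 3072), (8192, 2048)], immediate=0xCAFE)
message = gather(memory, wr.gather_list)
print(len(message), "bytes gathered")
```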
Step Three: 'Send First' Request Packet Sent
The QP's SQ Logic (item L) starts processing the top SQ entry. The WQE specifies a multi-packet Send operation to send the 5KB message from this HCA's local memory (i.e., main memory) to a QP within a remote CA.
The SQ Logic determines the maximum amount of data that can be placed in each request packet by checking the Maximum Data Payload Size value (aka PMTU) stored in the QP's Context. In this case, it is 2KB. The SQ Logic uses the WR's Gather Buffer List and reads the first portion of the message (2KB of data) from main memory.
The SQ Logic sets the Opcode field in the first request packet to "Send First", thereby indicating to the remote QP's RQ Logic that this is the first packet of a multi-packet Send operation. It should be noted that the target QP's RQ Logic will not know the length of the Send operation until it has received the "Send Last" request packet.
The SQ Logic sets the PSN field in the first request packet to the SQ Logic's cPSN. Our example assumes that this is the first packet sent by this QP since the QP was created. The cPSN is therefore set to the start PSN assigned to the SQ Logic, 100 (see "Example Scenario Assumptions" on page 47).
The SQ Logic sets the Destination QP (DestQP) field in the first request packet to the destination QP's QPN (supplied from the local QP's Context).
When the QP's SQ Logic forwards the request packet to port X for transmission, it supplies the port with the offset to add to its base LID address in forming the LID address to be placed in the packet's SLID field. See "Source Port's LID Address" on page 45.
The SQ Logic sets the DLID field in the first request packet to the DLID address of the destination CA port behind which the target QP resides. This LID address is supplied from this QP's Context.
The SQ Logic sets the SL (Service Level) field in the first request packet to the SL from the QP Context (see "Desired Local Quality of Service" on page 43).
If the requested operation (in this case, a Send operation) had a global destination (i.e., the destination CA is in a different subnet and the packets therefore have to cross one or more routers), the SQ Logic would insert the following global address information into the respective fields of the first request packet's GRH. The values would be supplied from the QP Context (see "Global Source/Destination Addresses" on page 45):
TClass.
Flow Label.
Hop Limit.
Index into HCA port's GUIDInfo attribute table. The index is supplied to the port's Link Layer so it can choose the desired SGID address from the port's GUID list (stored in the port's GUIDInfo attribute table).
The DGID (see "Global Source/Destination Addresses" on page 45) of the destination CA port (item Y).
In this example, however, it should be noted that both the source and destination CAs reside in the same subnet. The request and response packets will therefore not contain a GRH.
The first request packet of the Send operation is sent to the Network Layer and then forwarded to the Link Layer of the HCA port (port X) identified in the QP Context. In addition, at this point the SQ Logic also does the following:
Updates its nPSN by incrementing it from 100 to 101. This is the PSN that will be inserted in the next request packet sent.
Begins awaiting the receipt of the "Send First" request packet's corresponding Ack packet. The receipt of the first Ack packet is covered in "Step Four: First Ack Packet Returned" on page 53.
Begins to form the next request packet (a "Send Middle") to send to the Link Layer for transmit. The transmission of the "Send Middle" request packet is covered in "Step Five: 'Send Middle' Request Packet Sent and Ack Returned" on page 55.
Upon receipt of the first packet from the Network Layer, the port's Link Layer takes the following actions:
It adds the offset from the port's base LID address (supplied from the QP Context) to the base LID address and inserts the resulting LID address in the request packet's SLID field (see "Source Port's LID Address" on page 45).
If the destination CA resides in a different subnet (i.e., it's a global destination), then the 128-bit SGID address is formed as follows:
The port's Link Layer forms the SGID address by combining the port's assigned 64-bit GIDPrefix attribute (aka Subnet ID) with the 64-bit GUID selected by the index into HCA port's GUIDInfo attribute table (the index is sourced from this QP's Context; see "Global Source/Destination Addresses" on page 45).
It uses the SL (supplied from the QP Context; see "Desired Local Quality of Service" on page 43) to perform a lookup in the port's SLtoVLMappingTable attribute. The entry selected (1-of-16) identifies which of the port's Link Layer transmit buffers (referred to as a Virtual Lane, or VL) the packet is placed in. As will be seen later (in "QoS within the Subnet: SL and VLs" on page 617), during configuration the SM set up the port's SLtoVLMappingTable attribute to map each of the 16 possible SL values to a specific Link Layer transmit buffer. The SM also set up an arbitration scheme that assigns a level of importance to each of these transmit buffers. This defines the order in which the transmit buffers get to transmit packets to the port's Physical Layer.
The request packet is posted in the Link Layer VL transmit buffer selected in the previous step.
When that VL transmit buffer's turn for packet transmission has come (based on the port's VL transmit buffer arbitration mechanism), the port's Link Layer forwards the request packet to the port's Physical Layer for transmission. The packet's VL (Virtual Lane) field identifies the respective VL receive buffer that is to receive the packet on the other end of the physical link immediately connected to this port (either a port on an intervening switch or router, or the target port on the destination CA).
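The Link Layer address formation and VL selection just described reduce to an addition, a 128-bit concatenation, and a 16-entry table lookup. The Python sketch below illustrates them with made-up values; the function names, the LID, the GUID, and the table contents are hypothetical, and the sketch assumes the GID prefix occupies the upper 64 bits of the SGID.

```python
from typing import Sequence


def form_slid(base_lid: int, lid_offset: int) -> int:
    """SLID = the port's base LID plus the offset from the QP Context."""
    return base_lid + lid_offset


def form_sgid(gid_prefix: int, guid: int) -> int:
    """128-bit SGID: 64-bit GID prefix (upper half) plus 64-bit GUID."""
    return (gid_prefix << 64) | guid


def select_vl(sl_to_vl_table: Sequence[int], sl: int) -> int:
    """Pick the VL transmit buffer for a packet from its 4-bit SL."""
    if len(sl_to_vl_table) != 16:
        raise ValueError("the table needs one entry per SL value (16)")
    return sl_to_vl_table[sl]


slid = form_slid(base_lid=0x0040, lid_offset=2)
sgid = form_sgid(gid_prefix=0xFE80000000000000, guid=0x0002C90300001234)
sl_to_vl = [sl % 4 for sl in range(16)]   # hypothetical mapping onto 4 VLs
vl = select_vl(sl_to_vl, sl=5)
print(hex(slid), f"{sgid:032x}", vl)
```

In the example scenario only the SLID and VL steps apply, because both CAs are in the same subnet and no GRH is built.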
The request packet byte stream from the port's (item X) Link Layer is encoded into 10-bit characters by the port's Physical Layer, is converted into a serial bit stream, and is transmitted over the wire.
The request packet traverses one or more links until it arrives at the destination port (item Y). Each switch along the way uses the packet's DLID field to perform a lookup in its Forwarding Table to determine through which of its ports the packet must be transmitted to move it towards the destination port (item Y). A switch's Forwarding Table is set up by the configuration software at startup time.
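The hop-by-hop forwarding decision can be modeled as one table lookup per switch. The following Python sketch traces a packet through a made-up two-switch fabric; the switch names, port numbers, DLID value, and the `trace_path` helper are all hypothetical.

```python
from typing import Dict, List, Tuple


def trace_path(tables: Dict[str, Dict[int, int]],
               links: Dict[Tuple[str, int], str],
               first_switch: str,
               dlid: int) -> List[str]:
    """Follow a packet from switch to switch until it leaves the switches."""
    hops: List[str] = []
    node = first_switch
    while node in tables:                  # still inside a switch
        out_port = tables[node][dlid]      # Forwarding Table lookup by DLID
        hops.append(f"{node}:port{out_port}")
        node = links[(node, out_port)]     # whatever is cabled to that port
    hops.append(node)                      # the destination port
    return hops


DLID_Y = 0x0050                            # hypothetical LID of port Y
tables = {"SW1": {DLID_Y: 3}, "SW2": {DLID_Y: 1}}
links = {("SW1", 3): "SW2", ("SW2", 1): "CA-Y port"}
path = trace_path(tables, links, "SW1", DLID_Y)
print(" -> ".join(path))
```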
The destination port's Physical Layer deserializes the data, decodes the 10-bit characters into an 8-bit byte stream, and sources the request packet's byte stream to the port's Link Layer.
The port's Link Layer address decode logic decodes the request packet's DLID field and determines that this is the destination port.
The port's Link Layer accepts the packet's byte stream into the VL receive buffer indicated by the packet's VL field. This is the VL that was chosen by the Link Layer at the other end of the physical link (item X) that is connected to this port.
The request packet is forwarded to the port's Network Layer.
The Network Layer passes the packet to the RQ Logic (item Q) of the QP targeted by the request packet's DestQP field.
The QP's RQ Logic compares the request packet's PSN (100 in this case) to its current ePSN value (100) to determine if the request packet has the expected PSN.
Assuming that the packet's PSN = ePSN, packet processing continues.
If the packet's PSN is greater than the ePSN, the RQ Logic schedules a PSN Sequence Error Nak packet to be sent back to the remote QP's SQ Logic and the requested operation (a Send in this case) is not executed (i.e., the packet's data payload is not written to this CA's local memory) by the receiving QP's RQ Logic.
If the packet's PSN falls within the range of PSNs for request packets that were previously received (i.e., it's a duplicate request packet), the RQ Logic does not re-write the packet's data payload to memory, but it does schedule an Ack packet to be returned.
The QP's RQ Logic checks the packet's opcode to determine if a RQ WQE is required to determine where the message is to be written in CA "Y's" local memory. If it is a Send (which this is) or an RDMA Write With Immediate (covered later), a RQ WQE is required. If the QP's RQ currently does not have any WQEs posted, the RQ Logic schedules an RNR Nak packet to be sent back to the sender's SQ Logic (item L) and the requested operation (in this case, a Send) is not executed by the receiving QP's RQ Logic. Assuming that there is at least one WQE posted to the RQ (and in this case, there is), the packet's processing continues.
The QP's RQ Logic (item Q) checks the packet's opcode to ensure it makes sense. In this case, it should be a "Send First", not a "Send Middle" or some other nonsense opcode that is out of sequence. Assuming the opcode makes sense, the packet's processing continues.
The RQ Logic (item Q) schedules a positive Ack packet to be returned to the sender's QP's SQ Logic (item L). Its transmission and subsequent arrival back at the sender's SQ Logic is covered in "Step Four: First Ack Packet Returned" on this page.
The RQ Logic uses the information in the WQE at the top of the RQ (item S) to determine where to write the packet's data payload in the CA's local memory. Earlier in time ("Step One: Posting the Message Receive Request" on page 48), software associated with this CA (i.e., CA "Y") executed a Post Receive Request verb call and posted a WR to the QP's RQ.
The request packet's data payload is written to the CA's local memory (item W) using the Scatter Buffer List from the RQ WQE.
The RQ Logic updates the memory address pointer in the RQ WQE to point to where it left off (and, therefore, where the data payload of the next packet of the Send operation is to be written when it arrives).
The RQ Logic updates its ePSN (currently = 100) to ePSN + 1 and awaits the arrival of the next packet of the Send operation.
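The RQ Logic's first two decisions for an inbound request packet, the PSN check and the RQ WQE check, can be summarized in a few lines. The Python sketch below is a simplified model under stated assumptions: it treats the PSN as a 24-bit value that wraps, and it treats the half of the PSN space behind the ePSN as "previously received". The function names and return strings are hypothetical, and the opcode check and memory write are omitted.

```python
from typing import Tuple

PSN_MASK = 0xFFFFFF        # assumption: PSNs are 24 bits wide and wrap
HALF_RANGE = 1 << 23       # assumption: half the space counts as "behind"


def classify_request_psn(psn: int, epsn: int) -> str:
    """Compare an inbound request packet's PSN with the RQ Logic's ePSN."""
    ahead = (psn - epsn) & PSN_MASK
    if ahead == 0:
        return "expected"
    if ahead < HALF_RANGE:
        return "out_of_sequence"   # PSN greater than ePSN: packet(s) missing
    return "duplicate"             # PSN of a previously received packet


def respond(psn: int, epsn: int, rq_depth: int) -> Tuple[str, int]:
    """Return (response to schedule, new ePSN) for a Send request packet."""
    kind = classify_request_psn(psn, epsn)
    if kind == "out_of_sequence":
        return ("NAK PSN Sequence Error", epsn)   # payload is not written
    if kind == "duplicate":
        return ("ACK", epsn)                      # re-Ack, do not re-write
    if rq_depth == 0:
        return ("RNR NAK", epsn)                  # no RQ WQE is posted
    return ("ACK", (epsn + 1) & PSN_MASK)         # accept and advance ePSN


print(respond(psn=100, epsn=100, rq_depth=1))
```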
Step Four: First Ack Packet Returned
Upon receipt of the "Send First" request packet, the responder QP's RQ Logic (item Q) schedules a positive Ack packet to be sent back to the requester QP's SQ Logic (item L). The Ack packet's PSN is the same one contained in the "Send First" request packet. The transmission of the Ack packet involves the following steps:
The responder QP's RQ Logic forwards the Ack packet to the CA's Network Layer which, in turn, forwards it to the Link Layer of the port that received the request packet (item Y).
The Ack packet does not contain a data payload field. It does, however, contain the Acknowledge opcode and an Acknowledge Extended Transport Header (AETH) field. The opcode indicates that this is an Ack packet, and the AETH indicates the type of Ack packet: positive Ack, or negative Ack (Nak). If it's a Nak packet, the reason for the Nak is also indicated.
The Ack packet's DestQP field is loaded with the QPN (QP Number) that identifies the QP that sourced the request packet (this QPN is sourced from the receiving QP's Context).
In this example, the AETH indicates that this is a positive Ack packet.
The SL used in the Ack packet must be the same as the one used in the request packet.
The SLID and DLID fields received in the request packet are swapped in the Ack packet for the return journey.
Using the Ack packet's SL value, the port's (item Y) Link Layer performs a lookup in its SLtoVLMappingTable attribute to determine which of the Link Layer's VL transmit buffers to post the Ack packet in for transmission.
The Ack packet is posted in the selected VL transmit buffer.
When that VL transmit buffer's turn for packet transmission has come (based on the VL transmit buffer arbitration mechanism), the port's Link Layer forwards the Ack packet to the port's Physical Layer for transmission. The Ack packet's VL (Virtual Lane) field identifies the respective VL receive buffer that is to receive the packet on the other end of the physical link immediately connected to this port (either a port on an intervening switch or the target port on the destination CA).
The Ack packet's 8-bit byte stream from the port's (item Y) Link Layer is encoded into 10-bit characters by the port's Physical Layer, is converted into a serial bit stream, and is transmitted over the wire.
The Ack packet traverses one or more links until it arrives at the destination port (item X, the port that originally sourced the request packet into the fabric).
The destination port's Physical Layer deserializes the data, decodes the 10-bit characters into an 8-bit byte stream, and sources the Ack packet's byte stream to the port's Link Layer.
The port's Link Layer address decode logic decodes the DLID field and determines that this is the destination port.
The port's Link Layer accepts the Ack packet byte stream into the VL receive buffer indicated by the Ack packet's VL field. This is the VL chosen by the Link Layer at the other end of the physical link connected to this port.
The Ack packet is forwarded to the Network Layer.
The Network Layer passes the Ack packet to the SQ Logic (item L) of the QP targeted by the Ack packet's DestQP field. This is the same SQ Logic that originally generated the corresponding request packet.
The SQ Logic examines the AETH to determine if this is a positive or negative Ack. In this case, it is a positive Ack.
The SQ Logic compares the Ack packet's PSN to determine which of the following is true (in the example, the first case is true; the other three possibilities are covered later in this book):
Ack packet's PSN = PSN of the oldest unAck'd request packet (i.e., the Ack packet's PSN = 100). In this example it is equal, so the requester QP's SQ Logic rolls up the lower end of its unAck'd request window by one (in other words, the oldest unAck'd request packet is now = 101).
Ack packet's PSN is > the SQ Logic Start PSN but < the PSN of the oldest unAck'd request packet (i.e., it's a duplicate Ack packet).
Ack packet's PSN is > the PSN of the oldest unAck'd request packet but less than the high-water mark the SQ Logic has reached in issuing new request packets.
Ack packet's PSN is < the SQ Logic Start PSN or > the high-water mark the SQ Logic has reached in issuing new request packets (i.e., it's an invalid Ack packet).
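The four-way comparison listed above can be written as a small classifier. The Python sketch below is a simplification: it uses plain integer comparisons and ignores PSN wraparound, its handling of the exact boundary values (an Ack PSN equal to the start PSN or to the high-water mark) is an assumption rather than something stated in the text, and the function name and return strings are hypothetical.

```python
def classify_ack(ack_psn: int, start_psn: int,
                 oldest_unacked: int, high_water: int) -> str:
    """Classify an Ack PSN against the SQ Logic's unAck'd request window."""
    if ack_psn == oldest_unacked:
        return "oldest_unacked"        # roll the window's lower end up by one
    if start_psn <= ack_psn < oldest_unacked:
        return "duplicate"             # that request was already Ack'd
    if oldest_unacked < ack_psn <= high_water:
        return "ahead_of_oldest"       # within the window, past the oldest
    return "invalid"                   # outside anything this SQ has issued


# "Send First" (100) and "Send Middle" (101) have been issued so far.
case = classify_ack(ack_psn=100, start_psn=100,
                    oldest_unacked=100, high_water=101)
print(case)
```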
Step Five: 'Send Middle' Request Packet Sent and Ack Returned
It should be noted that the requester QP's SQ Logic doesn't wait for the Ack for the just-issued request packet to arrive before it launches the next request packet into the fabric.
The requester QP's SQ Logic continues as follows:
Using the top entry on the SQ (item J), it reads the next 2KB of message data from main memory using the Gather Buffer List in the WQE on top of the SQ.
It adjusts the WQE's read pointer to point to where it left off.
It places the next sequential PSN (101) in the "Send Middle" request packet.
It transmits the request packet to the responder QP's RQ Logic with a "Send Middle" opcode.
Upon receipt of the request packet, the responder QP's RQ Logic (item Q) takes the following actions:
The responder QP's RQ Logic compares the request packet's PSN (101 in this case) to its current ePSN value (101) to determine if the request packet has the expected PSN:
Assuming that the packet's PSN = ePSN, the packet's processing continues (see step 2 below).
If the packet's PSN is greater than the ePSN, the RQ Logic schedules a PSN Sequence Error Nak packet to be sent back to the remote QP's SQ Logic and the requested operation (a Send in this case) is not executed (i.e., the packet's data payload is not written to this CA's local memory) by the receiving QP's RQ Logic.
If the packet's PSN falls within the range of PSNs for request packets that were previously received (i.e., it's a duplicate request packet), the RQ Logic does not re-write the packet's data payload to memory, but it does schedule an Ack packet to be returned.
The QP's RQ Logic (item Q) checks the packet's opcode to ensure it makes sense. In this case, it should be a "Send Middle" or a "Send Last", not a "Send First" or some other nonsense opcode. Assuming the opcode makes sense, the packet's processing continues.
The packet's 2KB data payload is written to the CA's local memory (item W) using the updated Scatter Buffer List pointer from the top RQ WQE.
The RQ Logic updates the memory address pointer in the RQ WQE to point to where it left off (and, therefore, where the data payload of the next request packet of the Send operation is to be written when it arrives).
The RQ Logic updates its ePSN to ePSN + 1 (102) and awaits the arrival of the next packet of the Send operation.
The responder QP's RQ Logic schedules a positive Ack packet to be sent back to the requester QP's SQ Logic. The Ack packet's PSN is the same one contained in the "Send Middle" request packet just received. If this were a longer message Send operation, the steps listed in this section would be repeated for each of the "Send Middle" packets.
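The opcode sanity check performed by the RQ Logic amounts to asking whether the current opcode may legally follow the previous one. A minimal Python sketch covering only the Send opcodes is given below; the table and function names are hypothetical, other operation types are left out, and "Send Only" (a single-packet message) is included for completeness even though it does not occur in this example.

```python
from typing import Dict, Optional, Set

START_OF_MESSAGE: Set[str] = {"Send First", "Send Only"}
INSIDE_MESSAGE: Set[str] = {"Send Middle", "Send Last"}

# Which opcodes may follow the previously received one (None = no packet yet).
VALID_NEXT: Dict[Optional[str], Set[str]] = {
    None: START_OF_MESSAGE,
    "Send First": INSIDE_MESSAGE,
    "Send Middle": INSIDE_MESSAGE,
    "Send Last": START_OF_MESSAGE,
    "Send Only": START_OF_MESSAGE,
}


def opcode_in_sequence(previous: Optional[str], current: str) -> bool:
    """True if `current` makes sense after `previous`."""
    return current in VALID_NEXT[previous]


# The three request packets of the example message, in arrival order.
previous: Optional[str] = None
checks = []
for opcode in ("Send First", "Send Middle", "Send Last"):
    checks.append(opcode_in_sequence(previous, opcode))
    previous = opcode
print(checks)
```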
Step Six: 'Send Last' Request Packet Sent
The requester QP's SQ Logic (item L) continues as follows:
It reads the final 1KB of message data from the last buffer identified in the Gather Buffer List of the WQE on top of the SQ (item J). The data payload of this last packet of the message Send operation therefore contains 1KB of data.
The next sequential PSN (102) is placed in the request packet.
It sends the request packet with a "Send Last" opcode.
Upon receipt of the "Send Last" request packet, the responder QP's RQ Logic (item Q) takes the following actions:
The responder QP's RQ Logic compares the request packet's PSN (102 in this case) to its current ePSN value (102) to determine if the request packet has the expected PSN.
Assuming that the packet's PSN = ePSN, the packet's processing continues (see step 2 below).
If the packet's PSN is greater than the ePSN, the RQ Logic schedules a PSN Sequence Error Nak packet to be sent back to the remote QP's SQ Logic and the requested operation (a Send in this case) is not executed (i.e., the packet's data payload is not written to this CA's local memory) by the receiving QP's RQ Logic.
If the packet's PSN falls within the range of PSNs for request packets that were previously received (i.e., it's a duplicate request packet), the RQ Logic does not re-write the packet's data payload to memory, but it does schedule an Ack packet to be returned.
The QP's RQ Logic (item Q) checks the packet's opcode to ensure it makes sense. In this case, it should be a "Send Middle" or a "Send Last" (which it is), not a "Send First" or some other nonsense opcode. Assuming the opcode makes sense, the packet's processing continues.
The packet's 1KB data payload is written to the CA's local memory (item W) using the updated pointer in the Scatter Buffer List from the top WQE on the RQ (item S).
All packets of the message send operation have now been received and written to the CA's local memory, so the RQ Logic updates its ePSN to ePSN + 1 (103) and awaits the arrival of the first request packet of the next message transfer operation.
The top WQE is retired from the RQ (item S) and a CQE is created on the CQ associated with the RQ (item T). This CQE contains the completion status of the message receive operation. In addition, if the "Send Last" request packet contained an ImmDtETH (Immediate Data Extended Transport Header), the 32-bit immediate data value it contains is stored in the CQE.
This completes the receipt of the message (but the final Ack is yet to be sent; see the next section).
CA "Y" could be designed to generate an interrupt whenever a CQE is posted to a CQ associated with any QP's SQ or RQ.
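The way the RQ Logic spreads successive payloads across the Scatter Buffer List, and then completes the receive, can be sketched as follows. This Python model is illustrative only: memory is a byte array, the "pointer" kept in the WQE is a running byte count, and the class name, function names, buffer addresses, and CQE layout are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReceiveWQE:
    wr_id: int
    scatter_list: List[Tuple[int, int]]   # (address, length) of each buffer
    written: int = 0                      # bytes of the message placed so far


def scatter_write(memory: bytearray, wqe: ReceiveWQE, payload: bytes) -> None:
    """Write one packet's payload where the previous packet left off."""
    skip, data = wqe.written, payload
    for addr, length in wqe.scatter_list:
        if skip >= length:                # this buffer is already full
            skip -= length
            continue
        chunk = data[:length - skip]
        memory[addr + skip:addr + skip + len(chunk)] = chunk
        data, skip = data[len(chunk):], 0
        if not data:
            break
    if data:
        raise ValueError("message exceeds the posted receive buffers")
    wqe.written += len(payload)


def complete_receive(cq: list, wqe: ReceiveWQE,
                     immediate: Optional[int] = None) -> None:
    """Create the CQE once the last packet of the message is in memory."""
    cq.append({"wr_id": wqe.wr_id, "status": "success",
               "immediate": immediate})


memory = bytearray(16 * 1024)
wqe = ReceiveWQE(wr_id=1, scatter_list=[(0x1000, 4096), (0x3000, 4096)])
cq: list = []
for payload in (b"a" * 2048, b"b" * 2048, b"c" * 1024):   # First, Middle, Last
    scatter_write(memory, wqe, payload)
complete_receive(cq, wqe)
print(wqe.written, cq)
```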
Step Seven: Final Ack Returned
The responder QP's RQ Logic schedules a positive Ack packet to be sent back to the requester QP's SQ Logic (item L). The Ack packet's PSN (102) is the same one contained in the "Send Last" request packet. Upon arrival at the requester QP's SQ Logic, the SQ Logic takes the following actions:
The SQ Logic examines the Ack packet's AETH field to determine if this is a positive or negative Ack. In this case, it is a positive Ack.
Since this Ack is ack'ing the "Send Last" request packet, the SQ Logic takes the following actions:
The top WQE is retired from the SQ (item J).
A CQE is created on the CQ associated with the SQ (item G). This CQE contains the completion status of the message send operation.
This completes the message send operation.
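On the requester side, only the Ack for the "Send Last" request packet retires the SQ WQE and produces a CQE. The short Python sketch below models that rule for the example's three positive Acks; the WQE fields, the CQE layout, and the function name are hypothetical.

```python
from collections import deque


def handle_positive_ack(sq: deque, cq: list, ack_psn: int) -> bool:
    """Retire the top SQ WQE if this Ack covers its "Send Last" packet."""
    wqe = sq[0]
    if ack_psn != wqe["last_psn"]:
        return False                      # message not yet fully Ack'd
    sq.popleft()                          # retire the top WQE from the SQ
    cq.append({"wr_id": wqe["wr_id"], "status": "success"})
    return True


# The 5KB message used PSNs 100 ("Send First") through 102 ("Send Last").
sq = deque([{"wr_id": 42, "first_psn": 100, "last_psn": 102}])
cq: list = []
results = [handle_positive_ack(sq, cq, psn) for psn in (100, 101, 102)]
print(results, cq)
```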
Figure 3-4: Example Scenario (left half of illustration)
Figure 3-5: Example Scenario (right half of illustration)