A network device includes a port and packet processing circuitry. The port is to connect to a network. The packet processing circuitry is to transmit a data packet to the network, and, after transmitting the data packet, transmit to the network an Implicit Loss Indication (ILI) packet that (i) references the data packet and (ii) is provisioned to traverse a same route via the network as the data packet.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A network device, comprising: a port, to connect to a network; and packet processing circuitry, to: transmit a data packet to the network; and after transmitting the data packet, transmit to the network an Implicit Loss Indication (ILI) packet that (i) references the data packet and (ii) is provisioned to traverse a same route via the network as the data packet.
2. The network device according to claim 1, wherein the ILI packet is smaller than the data packet.
3. The network device according to claim 1, wherein the packet processing circuitry is to assign the ILI packet a Layer-2 (L2) header and a Layer-3 (L3) header that match the L2 header and the L3 header of the data packet.
4. The network device according to claim 1, wherein the packet processing circuitry is to assign the ILI packet a Base Transport Header (BTH) that refers to a BTH of the data packet.
5. The network device according to claim 1, wherein the ILI packet references the data packet by indicating a Packet Serial Number (PSN) of the data packet.
6. The network device according to claim 1, wherein the packet processing circuitry is to transmit to the network at least one additional ILI packet that references the data packet and is provisioned to traverse the same route via the network as the data packet.
7. The network device according to claim 1, wherein the ILI packet references both the data packet and one or more other data packets.
8. A network device, comprising: a port, to connect to a network; and packet processing circuitry, to: receive from the network an Implicit Loss Indication (ILI) packet that references a data packet; check whether the data packet referenced by the ILI packet was received before the ILI packet; and in response to finding that the data packet referenced by the ILI packet was not received before the ILI packet, request retransmission of the data packet.
9. The network device according to claim 8, wherein the packet processing circuitry is to discard the ILI packet in response to finding that the data packet referenced by the ILI packet was received.
10. The network device according to claim 8, wherein the packet processing circuitry is to request the retransmission by sending a negative acknowledgement (NACK).
11. The network device according to claim 8, wherein the ILI packet references both the data packet and one or more other data packets.
12. The network device according to claim 8, wherein the packet processing circuitry is to exclude the ILI packet from at least one authentication check applied to data packets.
13. A method, comprising: transmitting a data packet to the network; and after transmitting the data packet, transmitting to the network an Implicit Loss Indication (ILI) packet that (i) references the data packet and (ii) is provisioned to traverse a same route via the network as the data packet.
14. The method according to claim 13, wherein the ILI packet is smaller than the data packet.
15. The method according to claim 13, further comprising transmitting to the network at least one additional ILI packet that references the data packet and is provisioned to traverse the same route via the network as the data packet.
16. The method according to claim 13, wherein the ILI packet references both the data packet and one or more other data packets.
17. A method, comprising: receiving from the network an Implicit Loss Indication (ILI) packet that references a data packet; checking whether the data packet referenced by the ILI packet was received before the ILI packet; and in response to finding that the data packet referenced by the ILI packet was not received before the ILI packet, requesting retransmission of the data packet.
18. The method according to claim 17, and comprising discarding the ILI packet in response to finding that the data packet referenced by the ILI packet was received.
19. The method according to claim 17, wherein requesting the retransmission comprises sending a negative acknowledgement (NACK).
20. The method according to claim 17, and comprising excluding the ILI packet from at least one authentication check applied to data packets.
Complete technical specification and implementation details from the patent document.
The present disclosure relates generally to packet communication, and particularly to methods and systems for loss indication in network devices.
Some packet communication networks are lossy by design. In a lossy network, network devices will occasionally drop packets. Packet drops may occur, for example, when a buffer or queue becomes full or when the required bandwidth on a link or port exceeds the available bandwidth. Lossy network protocols typically include retransmission mechanisms in which a destination network device detects missing packets and requests a source network device to retransmit them.
An embodiment that is described herein provides a network device including a port and packet processing circuitry. The port is to connect to a network. The packet processing circuitry is to transmit a data packet to the network, and, after transmitting the data packet, transmit to the network an Implicit Loss Indication (ILI) packet that (i) references the data packet and (ii) is provisioned to traverse a same route via the network as the data packet.
Typically, the ILI packet is smaller than the data packet. In some embodiments, the packet processing circuitry is to assign the ILI packet a Layer-2 (L2) header and a Layer-3 (L3) header that match the L2 header and the L3 header of the data packet. In some embodiments, the packet processing circuitry is to assign the ILI packet a Base Transport Header (BTH) that refers to a BTH of the data packet. In some embodiments, the ILI packet references the data packet by indicating a Packet Serial Number (PSN) of the data packet.
In an embodiment, the packet processing circuitry is to transmit to the network at least one additional ILI packet that references the data packet and is provisioned to traverse the same route via the network as the data packet. In an embodiment, the ILI packet references both the data packet and one or more other data packets.
There is additionally provided, in accordance with an embodiment that is described herein, a network device including a port and packet processing circuitry. The port is to connect to a network. The packet processing circuitry is to receive from the network an Implicit Loss Indication (ILI) packet that references a data packet, to check whether the data packet referenced by the ILI packet was received before the ILI packet, and, in response to finding that the data packet referenced by the ILI packet was not received before the ILI packet, to request retransmission of the data packet.
In some embodiments, the packet processing circuitry is to discard the ILI packet in response to finding that the data packet referenced by the ILI packet was received. In some embodiments, the packet processing circuitry is to request the retransmission by sending a negative acknowledgement (NACK). In an embodiment, the ILI packet references both the data packet and one or more other data packets. In an embodiment, the packet processing circuitry is to exclude the ILI packet from at least one authentication check applied to data packets.
There is additionally provided, in accordance with an embodiment that is described herein, a method including transmitting a data packet to the network. After transmitting the data packet, an Implicit Loss Indication (ILI) packet is transmitted to the network. The ILI packet (i) references the data packet and (ii) is provisioned to traverse a same route via the network as the data packet.
There is also provided, in accordance with an embodiment that is described herein, a method including receiving from the network an Implicit Loss Indication (ILI) packet that references a data packet. A check is performed whether the data packet referenced by the ILI packet was received before the ILI packet. In response to finding that the data packet referenced by the ILI packet was not received before the ILI packet, retransmission of the data packet is requested.
The present disclosure will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
Packet retransmission mechanisms are problematic since they incur excessive latency and complexity. The latency incurred by retransmission includes, among others, the time needed for the destination network device to detect that a packet was lost, and the Round-Trip Time (RTT) needed for the destination network device to request retransmission and for the source network device to retransmit the lost packet. Detecting a lost packet is especially slow and difficult when the network does not guarantee in-order delivery of packets from the source network device to the destination network device, since the arrival of a packet that follows a missing packet does not necessarily imply that the missing packet is lost.
Embodiments that are described herein provide improved techniques that enable a network device to detect loss of packets simply, quickly and reliably.
In some embodiments, after transmitting a data packet, the source network device transmits an additional packet referred to herein as an Implicit Loss Indication (ILI) packet. The ILI packet (i) references the data packet and (ii) is provisioned to travel the same route via the network as the data packet. For example, the ILI packet may be generated with the same Layer-2 (L2) and Layer-3 (L3) headers as the data packet, ensuring that network elements will forward the data packet and the ILI packet over the same route. To minimize bandwidth overhead, the ILI packet is typically much smaller than the data packet it references.
Since the ILI packet is transmitted after the data packet and traverses the same route, it will typically arrive after the data packet even if the network does not guarantee in-order packet delivery. Therefore, if the destination network device receives an ILI packet that was not preceded by a corresponding data packet, it can immediately conclude that the data packet has been lost.
Thus, in some embodiments, upon receiving an ILI packet, the destination network device checks whether the data packet referenced by the ILI packet was already received. If so, the destination network device may discard the ILI packet. If the data packet was not received before the ILI packet, the destination network device immediately requests the source network device to retransmit the data packet in question.
The disclosed technique is simple to implement and provides fast and reliable detection of lost packets. Although the transmission of ILI packets incurs some inevitable bandwidth overhead, this penalty is small due to the small size of the ILI packets, and is typically well worth the gain in packet-loss detection performance. The disclosed technique can be used with any lossy network protocol. The embodiments described herein refer mainly to Remote Direct Memory Access (RDMA) networks, in which case the source network device is referred to as a “requestor” and the destination network device is referred to as a “responder”.
FIG. 1 is a block diagram that schematically illustrates a packet communication system 20 using Implicit Loss Indication (ILI) packets, in accordance with an embodiment that is described herein. System 20 comprises a requestor Network Interface Controller (NIC) 24A and a responder NIC 24B that communicate over a network 28. NIC 24A serves a host 32A, and NIC 24B serves a host 32B. Hosts 32A and 32B may comprise, for example, Central Processing Units (CPUs), Graphics Processing Units (GPUs) or any other suitable computing platform.
In the present example, network 28 is an Ethernet network and NICs 24A and 24B communicate in accordance with the RDMA protocol. Generally, however, requestor NIC 24A and responder NIC 24B are regarded herein as non-limiting examples of a source network device and a destination network device, respectively. In alternative embodiments, the network devices may comprise, for example, Data Processing Units (DPUs, also referred to as “smart NICs”).
Typically, NICs 24A and 24B are similar or identical in design, and their roles as “requestor” and “responder” apply to a specific RDMA transaction. Each of the NICs may serve as a requestor for some transactions, and as a responder for other transactions.
The disclosed techniques can also be used in various other suitable types of networks. Network 28 is typically a lossy network, and does not necessarily guarantee in-order delivery of packets between NICs 24A and 24B. For example, network 28 may employ multipathing techniques in which the packets sent from NIC 24A to NIC 24B are distributed across multiple different routes that may differ in latency.
Each of NICs 24A and 24B comprises a host interface 36 for communicating with its respective host (32A or 32B), a network interface 40 for communicating over network 28, and packet processing circuitry 44 for performing the various processing tasks of the NIC. Host interfaces 36 may communicate with the hosts using any suitable interface or protocol, e.g., over a peripheral bus such as Peripheral Component Interconnect express (PCIe) or Nvlink. Alternatively, host interfaces 36 may comprise Chip-to-Chip (C2C) or Die-to-Die (D2D) links such as Ground-Referenced Signaling (GRS), Low Power Interface (LPI) or Low Latency Interface (LLI). Network interfaces 40 are also referred to as the ports of the respective network devices.
Packet processing circuitry 44 comprises a transmit (TX) pipeline 48, a receive (RX) pipeline 52, and an ILI module 56. TX pipeline 48 generates and processes outbound packets, i.e., packets transmitted to network 28. RX pipeline 52 processes inbound packets, i.e., packets received from network 28. ILI module 56 carries out the processing relating to ILI packets, as described in detail below.
The configuration of system 20 and NICs 24A and 24B, as illustrated in FIG. 1, are example configurations chosen purely for the sake of conceptual clarity. In alternative embodiments, any other suitable configuration can be used.
For example, the network device that generates the ILI packet need not necessarily be the network device serving the source host. By the same token, the network device that detects loss of the data packet using the ILI packet, and requests retransmission, need not necessarily be the network device serving the destination host. In other words, the disclosed technique can be used between intermediate network devices, e.g., network switches or routers, within network 28.
NICs 24A and 24B may be implemented using suitable hardware, such as in one or more Application-Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs), using software, or using a combination of hardware and software elements. Elements that are not mandatory for understanding of the disclosed techniques have been omitted from the figure for the sake of clarity.
In some embodiments, some NIC functions described herein may be implemented in a general-purpose processor, which is programmed in software to carry out the functions described herein. The software may be downloaded to the processor in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
Referring again to FIG. 1, an inset at the bottom of the figure illustrates a data packet 60 and a corresponding ILI packet 64. Both packets are transmitted from requestor NIC 24A via network 28 to responder NIC 24B. ILI packet 64 is transmitted after (typically immediately after) data packet 60, to enable responder NIC 24B to detect whether data packet 60 was lost.
In some embodiments, data packet 60 and ILI packet 64 each comprises a Layer-2 (L2) header 68, a Layer-3 (L3) header 72, a Base Transport Header (BTH) 76, and a payload 80. L2 header 68 may comprise, for example, a Medium Access Control (MAC) header. L3 header 72 may comprise, for example, an Internet Protocol (IP) header.
In the present example, ILI packet 64 is considerably smaller than data packet 60, e.g., due to the much smaller size of payload 80. The small size of ILI packet 64 serves two purposes: (i) reducing the extra bandwidth consumed by the ILI packet, and (ii) reducing the likelihood that the ILI packet itself will be dropped. In an example embodiment, the size of data packet 60 is 4 Kbytes, whereas the size of ILI packet 64 is 64 bytes. Alternatively, any other suitable packet sizes can be used.
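By way of a non-limiting illustration, the bandwidth overhead implied by the example packet sizes can be estimated with the short sketch below. The sketch assumes that 4 Kbytes means 4096 bytes and that one ILI packet follows every data packet; the function name and its parameters are hypothetical and are not part of the disclosure.

```python
def ili_overhead(data_bytes: int, ili_bytes: int, ilis_per_data: int = 1) -> float:
    """Extra bytes sent as ILI packets, as a fraction of the data-packet bytes."""
    return (ili_bytes * ilis_per_data) / data_bytes


# Example sizes from the description: 4 Kbytes (assumed 4096 bytes) and 64 bytes.
overhead = ili_overhead(4096, 64)
print(f"ILI overhead: {overhead:.4%}")
```

Under these assumptions the overhead is roughly 1.6% of the data bytes; sending ILI packets only for selected data packets would reduce it further.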
In some embodiments, L2 header 68 of ILI packet 64 is identical to L2 header 68 of data packet 60; and L3 header 72 of ILI packet 64 is identical to L3 header 72 of data packet 60. This condition ensures that the network switches or routers of network 28 will forward data packet 60 and ILI packet 64 over the same route. More generally, any other header field values, which ensure that data packet 60 and ILI packet 64 will travel the same route, can be used.
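The following sketch illustrates why matching headers can lead to a matching route. It assumes, purely for illustration, that a network element selects among several routes by hashing the L2 and L3 headers; the description does not specify how the network selects routes, and the function name, header bytes and route count are hypothetical.

```python
import hashlib


def select_route(l2_header: bytes, l3_header: bytes, num_routes: int) -> int:
    """Pick a route index from a hash of the L2 and L3 headers (illustrative only)."""
    digest = hashlib.sha256(l2_header + l3_header).digest()
    return int.from_bytes(digest[:4], "big") % num_routes


# Hypothetical header bytes; the ILI packet copies them from the data packet.
data_l2, data_l3 = b"mac-dst|mac-src", b"ip-src|ip-dst"
ili_l2, ili_l3 = data_l2, data_l3

# Identical inputs to a deterministic selection function give the same route.
same_route = select_route(data_l2, data_l3, 8) == select_route(ili_l2, ili_l3, 8)
```

Any deterministic route selection that depends only on the copied header fields would behave the same way in this respect.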
ILI packet 64 references data packet 60. In the present context, the term “references” means that the ILI packet comprises information that enables NIC 24B to determine uniquely the identity of the corresponding data packet. In an example embodiment, data packet 60 comprises a Packet Serial Number (PSN). The PSN may be specified, for example, in BTH 76 of the data packet. ILI packet 64 may reference data packet 60 by specifying the PSN of data packet 60, e.g., as part of BTH 76 of the ILI packet. In alternative embodiments, ILI packet 64 may reference data packet 60 in any other suitable way.
In some embodiments, ILI packet 64 has a unique opcode that identifies it as an ILI packet.
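A minimal sketch of ILI packet generation along these lines is given below. The packet layout, the field names and the opcode values are assumptions made for illustration only; the sketch also assumes an empty ILI payload, whereas an actual ILI packet may carry a small payload.

```python
from dataclasses import dataclass, replace

ILI_OPCODE = 0xF0   # hypothetical opcode value identifying an ILI packet
DATA_OPCODE = 0x04  # hypothetical opcode value for a data packet


@dataclass(frozen=True)
class Packet:
    l2_header: bytes
    l3_header: bytes
    opcode: int     # assumed to be carried in the BTH
    psn: int        # Packet Serial Number, assumed to be carried in the BTH
    payload: bytes


def make_ili(data: Packet) -> Packet:
    """Build an ILI packet that keeps the L2/L3 headers and PSN of `data`."""
    return replace(data, opcode=ILI_OPCODE, payload=b"")


data = Packet(b"l2", b"l3", DATA_OPCODE, psn=1007, payload=bytes(4096))
ili = make_ili(data)
```

Copying the headers and the PSN while changing only the opcode and payload keeps the reference to the data packet unambiguous and keeps the ILI packet small.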
In some embodiments, data packet 60 comprises a Cyclic Redundancy Check (CRC) (e.g., an Invariant Cyclic Redundancy Check, ICRC, used in InfiniBand) that is calculated over at least some of the packet for detecting errors. In one embodiment, ILI module 56 of requestor NIC 24A recalculates the CRC over at least part of ILI packet 64, and inserts the recalculated CRC into the ILI packet. In this embodiment, responder NIC 24B may validate the CRC of the received ILI packet to ensure it is correct. In an alternative embodiment, ILI module 56 of requestor NIC 24A does not recalculate the CRC for ILI packet 64 (e.g., requestor NIC 24A may simply reuse the CRC of data packet 60). In this embodiment, the ILI packet will not have a CRC that matches its content, but this may be tolerable since the ILI packet is not an actual data packet. When using the latter embodiment, responder NIC 24B should refrain from validating the CRCs of received ILI packets.
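The two CRC-handling embodiments can be contrasted with the sketch below. It uses the general-purpose CRC-32 from the Python standard library as a stand-in and does not implement the InfiniBand ICRC; the function names and the 4-byte trailing CRC layout are assumptions made for illustration.

```python
import zlib


def build_ili_bytes(ili_body: bytes, data_crc: int, recalculate: bool) -> bytes:
    """Append a 4-byte CRC: recalculated over the ILI body, or reused from the data packet."""
    crc = zlib.crc32(ili_body) if recalculate else data_crc
    return ili_body + crc.to_bytes(4, "big")


def responder_accepts(ili_bytes: bytes, validate_crc: bool) -> bool:
    """Check the trailing CRC only when the responder is configured to validate it."""
    if not validate_crc:
        return True
    body, crc = ili_bytes[:-4], int.from_bytes(ili_bytes[-4:], "big")
    return zlib.crc32(body) == crc
```

A responder that validates CRCs would reject an ILI packet carrying a reused CRC, which is why the two sides need to agree on which embodiment is in use.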
FIG. 2 is a flow chart that schematically illustrates a method for packet communication using ILI packets, in accordance with an embodiment that is described herein. The method begins with TX pipeline 48 of requestor NIC 24A sending a data packet to responder NIC 24B, at a data packet transmission stage 90. Following the data packet, TX pipeline 48 of requestor NIC 24A sends an ILI packet to responder NIC 24B, at an ILI packet transmission stage 94. In the configuration of FIG. 1, the ILI packet is generated by ILI module 56 and provided to TX pipeline 48 for transmission. As explained above, ILI module 56 generates the ILI packet so as to (i) reference the data packet, and (ii) travel the same route as the data packet.
At an ILI packet reception stage 98, responder NIC 24B receives the ILI packet. (This stage may or may not be preceded by reception of the data packet.) At a checking stage 102, ILI module 56 of responder NIC 24B checks whether the data packet referenced by the ILI packet was already received. If so, ILI module 56 discards the ILI packet and the method terminates, at a termination stage 106.
Otherwise, i.e., if ILI module 56 of responder NIC 24B finds that the referenced data packet was not received before the ILI packet, the ILI module initiates a retransmission request, at a retransmission requesting stage 110. The retransmission request may have any suitable format that indicates the identity of the lost data packet to requestor NIC 24A.
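The responder-side decision described above can be sketched as follows. The sketch assumes that the responder tracks the PSNs of the data packets it has already received in a set, and that the ILI packet references a single PSN; the function name and the returned action labels are hypothetical.

```python
from typing import Optional, Tuple


def on_ili_received(referenced_psn: int, received_psns: set) -> Tuple[str, Optional[int]]:
    """Decide what to do with an ILI packet that references `referenced_psn`."""
    if referenced_psn in received_psns:
        # The data packet arrived before the ILI packet: nothing was lost.
        return ("discard", None)
    # The ILI packet arrived without its data packet: treat the packet as lost.
    return ("request_retransmission", referenced_psn)


received = {100, 101, 103}
action_ok = on_ili_received(101, received)
action_lost = on_ili_received(102, received)
```

No timer is involved in this decision; the arrival of the ILI packet itself is the trigger.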
In an example embodiment, the retransmission request is a NACK packet indicating the PSN of the lost data packet. In another example embodiment, the retransmission request comprises a bitmap that references a block of packets and indicates which packets in the block were lost and need to be retransmitted. This sort of retransmission request is sometimes referred to as “block ACK” or “block NACK”. Alternatively, any other suitable type of retransmission request can be used. In an embodiment, ILI module 56 provides the retransmission request to TX pipeline 48 of responder NIC 24B for transmission. TX pipeline 48 sends the retransmission request to network 28, addressed to requestor NIC 24A.
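A bitmap-based retransmission request of this kind might be encoded as in the sketch below. The bit layout (bit i set when the packet with PSN base_psn + i is missing) and the function names are assumptions made for illustration; the description does not define a specific bitmap format.

```python
def build_block_nack(base_psn: int, block_size: int, received_psns: set) -> int:
    """Return a bitmap in which bit i is set when packet base_psn + i is missing."""
    bitmap = 0
    for i in range(block_size):
        if base_psn + i not in received_psns:
            bitmap |= 1 << i
    return bitmap


def lost_psns(base_psn: int, block_size: int, bitmap: int) -> list:
    """Decode the bitmap back into the PSNs that need to be retransmitted."""
    return [base_psn + i for i in range(block_size) if (bitmap >> i) & 1]


# A block of eight packets starting at PSN 100, with PSNs 102 and 106 missing.
bitmap = build_block_nack(100, 8, {100, 101, 103, 104, 105, 107})
missing = lost_psns(100, 8, bitmap)
```

A single request of this form can cover several lost packets in one block, instead of one NACK per lost PSN.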
The method flow of FIG. 2 is an example flow that is depicted purely for the sake of conceptual clarity. Alternatively, the disclosed techniques can be implemented using any other suitable flow.
For example, in some embodiments, packet processing circuitry 44 of requestor NIC 24A generates two or more ILI packets that reference the same data packet. This feature increases the likelihood that at least one of the ILI packets will reach responder NIC 24B, at the cost of some additional bandwidth and packet generation overhead.
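The benefit of sending several ILI packets can be estimated with the sketch below. It assumes that each ILI packet is dropped independently with the same probability, which is a simplification: drops along a shared route may well be correlated. The function name and the 1% drop probability are hypothetical.

```python
def prob_at_least_one_ili_arrives(drop_prob: float, num_ilis: int) -> float:
    """Probability that at least one of `num_ilis` ILI packets is delivered,
    assuming independent drops with probability `drop_prob` each."""
    return 1.0 - drop_prob ** num_ilis


# Hypothetical per-packet drop probability of 1%.
one_ili = prob_at_least_one_ili_arrives(0.01, 1)
two_ilis = prob_at_least_one_ili_arrives(0.01, 2)
```

Under the independence assumption, the probability that every ILI packet is lost shrinks geometrically with the number of ILI packets, while the bandwidth cost grows only linearly.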
As another example, in some embodiments, packet processing circuitry 44 of requestor NIC 24A generates a single ILI packet that references two or more data packets (and transmits the ILI packet after all the referenced data packets). Upon receiving this ILI packet, responder NIC 24B checks which of the referenced data packets were previously received. If any of the referenced data packets did not arrive before the ILI packet, NIC 24B may decide that the data packet was lost and request retransmission.
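For an ILI packet that references several data packets, the responder-side check might look like the sketch below. It assumes that the ILI packet carries an explicit list of referenced PSNs and that received PSNs are tracked in a set; both choices, and the function name, are illustrative assumptions.

```python
def missing_from_ili(referenced_psns: list, received_psns: set) -> list:
    """Return the referenced PSNs that had not arrived when the ILI packet did."""
    return [psn for psn in referenced_psns if psn not in received_psns]


# One ILI packet covering PSNs 10..13, of which only 10 and 12 have arrived.
to_retransmit = missing_from_ili([10, 11, 12, 13], {10, 12})
```

An empty result corresponds to discarding the ILI packet; a non-empty result lists the data packets for which retransmission may be requested.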
As yet another example, in some embodiments, packet processing circuitry 44 of requestor NIC 24A generates ILI packets selectively, only for certain data packets. For example, when a certain message is conveyed by multiple data packets, ILI packets may be generated only for one or more data packets that carry the end of the message, so as to protect against "tail drops". Generally, packet processing circuitry 44 of requestor NIC 24A may use any other suitable criterion for selecting which data packets to protect using ILI packets.
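One possible selection criterion, protecting only the tail of a message, is sketched below. The function name and the `tail_count` parameter are hypothetical, and the sketch assumes that the PSNs of a message are available as an ordered list.

```python
def psns_to_protect(message_psns: list, tail_count: int = 1) -> list:
    """Select the last `tail_count` data packets of a message for ILI protection."""
    if tail_count <= 0:
        return []
    return message_psns[-tail_count:]


# A message carried by four data packets; only the final packet gets an ILI packet.
protected = psns_to_protect([5, 6, 7, 8])
```

Loss of an earlier packet in the message is revealed by the later packets or ILI packets that follow it, whereas nothing follows the last packet, which is why the tail is the natural candidate for protection.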
In some embodiments, packet processing circuitry 44 of responder NIC 24B performs certain authentication checks on received data packets before processing them and/or before forwarding their data to host 32B. In an embodiment, packet processing circuitry 44 of responder NIC 24B excludes ILI packets from one or more of these authentication checks, e.g., an ICRC check. The protocol can be defined to selectively check or refrain from checking the ICRC.
FIG. 3 is a block diagram that schematically illustrates a computing system 1000, e.g., a data center or a High-Performance Computing (HPC) cluster, in accordance with an embodiment that is described herein. System 1000 comprises a plurality of subsystems, e.g., multiple processing devices coupled to each other, multiple network devices, and multiple networks, according to at least one embodiment. Computing system 1000 is designed with multiple integrated circuits (referred to as processing devices), where each integrated circuit can include one or more CPUs and GPUs, forming a powerful and flexible architecture.
The various processing devices are interconnected via an NVLink or other high-speed interconnect, enabling high-speed communication between the subsystems, and are also connected through a NIC or DPU to ensure efficient data transfer across computing system 1000 and to one or more external networks 1030, 1036. In the present example, system 1000 comprises a packet switch 1048 that connects NIC/DPU 1028 to network 1030, and a packet switch 1050 that connects NIC/DPU 1032 to network 1036.
The coupling of processing devices through NVLink allows for seamless data exchange and parallel processing, enhancing overall computational performance. The processing devices are connected to multiple networks through one or more network interface cards (NICs) or DPUs, enabling the system to handle complex, multi-network tasks with high bandwidth and low latency. This configuration is highly suitable for demanding applications that require significant processing power, such as artificial intelligence (AI), machine learning (ML), and data-intensive computing, while ensuring robust connectivity and scalability across various networked environments. The integrated circuits of the computing system 1000 can include one or more CPUs and one or more GPUs.
FIG. 3 also demonstrates an example of a multi-GPU architecture. As illustrated in the figure, computing system 1000 includes a processing device 1002 with a multi-GPU architecture. In particular, processing device 1002 may be a system-on-chip and includes multiple subsystems such as a CPU 1006, a GPU 1008, and a GPU 1010. CPU 1006 can be coupled to GPU 1008 via a die-to-die (D2D) or chip-to-chip (C2C) interconnect 1012, such as a Ground-Referenced Signaling interconnect (GRS interconnect). CPU 1006 can be coupled to GPU 1010 via a D2D or C2C interconnect 1014. CPU 1006 can also couple to GPU 1008 and GPU 1010 via PCIe interconnects.
CPU 1006 can be coupled to one or more NICs or DPUs, which are coupled to one or more networks. For example, as illustrated in FIG. 3, CPU 1006 is coupled to a first NIC/DPU 1026, which is coupled to a network 1030. CPU 1006 is also coupled to a second NIC/DPU 1028, which is coupled to network 1030 via switch 1048. NIC/DPU 1026 and NIC/DPU 1028 can be coupled to network 1030 over Ethernet (ETH), NVLINK or InfiniBand (IB) connections, for example.
Computing system 1000 also includes a processing device 1004 with a multi-GPU architecture. In particular, processing device 1004 includes multiple subsystems including a CPU 1016, a GPU 1018, and a GPU 1020. CPU 1016 can be coupled to GPU 1018 via a D2D or C2C interconnect 1022. CPU 1016 can be coupled to GPU 1020 via a D2D or C2C interconnect 1024. CPU 1016 can also couple to GPU 1018 and GPU 1020 via PCIe interconnects. CPU 1016 can be coupled to one or more NICs or DPUs, which are coupled to one or more networks. For example, as illustrated in FIG. 3, CPU 1016 is coupled to a first NIC/DPU 1032, which is coupled to a network 1036. CPU 1016 is also coupled to a second NIC/DPU 1034, which is coupled to network 1036 via switch 1050. NIC/DPU 1032 and NIC/DPU 1034 can be coupled to network 1036 over Ethernet (ETH), NVLINK or InfiniBand (IB) connections.
In at least one embodiment, processing device 1002 and processing device 1004 can communicate with each other via a NIC/DPU 1038, such as over PCIe interconnects. Processing device 1002 and processing device 1004 can also communicate with each other over a high-bandwidth communication interconnect 1040, such as an NVLink interconnect or other high-speed interconnects.
In various embodiments, any of the network devices of system 1000, e.g., any of NICs/DPUs 1026, 1028, 1032, 1034 and 1038, and/or any of switches 1048 and 1050, may use ILI packets in accordance with the techniques described herein. The packet switches in FIG. 3 may comprise, for example, Nvidia Quantum-2 switches. The NICs/DPUs in the figure may comprise, for example, Nvidia Bluefield DPUs.
Although the embodiments described herein mainly address lossy network protocols, the methods and systems described herein can also be used in lossless protocols in which packets may still be dropped, for example due to bit-flipping. Further alternatively, the disclosed techniques can be used in any other suitable application.
It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.
October 15, 2024
April 16, 2026