Network Switch with Integrated Gradient Aggregation for Distributed Machine Learning

Published: February 25, 2025

Assignee: not available in the USPTO data

Inventors: William Brad MATTHEWS, Puneet AGARWAL


Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A network switching apparatus, comprising: a plurality of communication interfaces configured to connect to specific computing devices in a network, including compute devices of a distributed learning system; packet-switching logic configured to receive data units via the communication interfaces; machine learning logic configured to: determine, in the data units, particular data units containing data from a machine learning model being trained against a training data set; based on information contained in the particular data units: identify one or more actions to be performed on the data in the particular data units; perform the one or more actions on the data in the particular data units; aggregate results from the one or more actions; and return the aggregated results to one or more of the compute devices.

2. The apparatus of claim 1, wherein the one or more actions include one or more of: a reduction operation, a broadcast operation, a scatter operation, a collective action, or a gather operation.

3. The apparatus of claim 1, wherein the one or more actions include one or more sub-actions.

4. The apparatus of claim 1, wherein the machine learning logic includes a machine learning subsystem, the machine learning subsystem including: a data buffer configured to store data from the particular data units; one or more processing queues whose nodes point to specific data containers in the data buffer; a compute controller configured to coordinate the processing of the data from the data buffer using a compute engine, based on the one or more processing queues; the compute engine configured to process the data from the data buffer using the one or more actions to generate the aggregated results.

5. The apparatus of claim 4, wherein the data buffer is shared between the packet-switching logic and the machine learning logic, the data units excluding the particular data units being buffered in the data buffer while awaiting processing by one or more packet processors.

6. The apparatus of claim 1, wherein the machine learning logic is embedded in traffic management logic of the network switching apparatus.

7. The apparatus of claim 1, wherein the machine learning logic is embedded in or coupled to an ingress packet processor of the network switching apparatus.

8. The apparatus of claim 1, wherein the data units are TCP/IP packets, cells, or frames, and the switch device is a level 2 switch, wherein each of the communication interfaces include an Ethernet port.

9. The apparatus of claim 1, wherein the particular data units include identifiers indicating epochs with which the data therein is associated, wherein the machine learning logic is configured to automatically aggregate results associated with particular epochs based on their identifiers.

10. The apparatus of claim 1, wherein each of the aggregated results is associated with a different epoch and is a sum or average of all data associated with that epoch.

11. The apparatus of claim 1, further comprising: a compute memory configured to aggregate data elements as the data elements are being written to the memory, the compute memory including a compute block configured to, when writing a value of a data element to a particular address, aggregate a running total previously stored at the particular address for the data element with the value of the data element, and to write a result of the aggregating over the running total at the particular address; wherein the aggregate results comprises writing each data element of the data to an address associated with the data element in the compute memory as the particular data units are received, wherein the aggregated results comprise each running total of each data element in the data.

12. The apparatus of claim 1, wherein the aggregate results comprises, for each of a plurality of data sets within the data, sending particular data units having a same data set identifier from a data buffer to a compute engine, the compute engine configured to perform one or more reduction operations between each of the particular data units to produce an aggregated data unit, wherein returning the aggregated results comprises forwarding each aggregated data unit for each of the plurality of data sets to the packet-switching logic, with destination data identifying each of the one or more of the compute devices.

13. The apparatus of claim 12, wherein, for a given set of the plurality of data sets, the compute engine is configured to aggregate different subsets of the particular data units with an intermediate result over a number of clock cycles, each subset comprising a data unit stored in a different buffer memory bank.
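As an illustration of the compute memory recited in claim 11, the following is a minimal software sketch of a memory that aggregates on write. It is not the patented hardware; the names `ComputeMemory`, `write`, and `aggregate_into_memory` are hypothetical, and the assumption that each data element's address is simply its index within the data unit is made only for this example.

```python
class ComputeMemory:
    """Toy model of a memory whose write path aggregates (hypothetical names)."""

    def __init__(self, size):
        # One running total per address, initially zero.
        self.cells = [0.0] * size

    def write(self, address, value):
        # Read the running total at the address, add the incoming value,
        # and write the result back over the running total.
        self.cells[address] += value
        return self.cells[address]

    def read_all(self):
        # The aggregated results are the running totals themselves.
        return list(self.cells)


def aggregate_into_memory(memory, data_units):
    """Write each element of each data unit as the units arrive.

    Assumption for this sketch: the address of a data element is its
    index within the data unit.
    """
    for unit in data_units:
        for address, value in enumerate(unit):
            memory.write(address, value)
    return memory.read_all()


# Three compute devices each send a three-element gradient vector.
gradients_from_workers = [
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 0.5],
    [2.0, 0.0, 1.0],
]
totals = aggregate_into_memory(ComputeMemory(3), gradients_from_workers)
print(totals)  # [3.5, 2.5, 4.5]
```

In this model no separate reduction pass is needed: once the last data unit has been written, the memory already holds the aggregated result for every data element.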

14. A method, comprising: receiving, at a switch device, data units via communication interfaces coupled to a plurality of compute devices, one or more of the data units containing data from a machine learning model being trained against a training data set; determining, in the data units, particular data units containing data from a machine learning model being trained against a training data set; based on information contained in the particular data units: identifying one or more actions to be performed on the data in the particular data units; performing the one or more actions on the data in the particular data units; aggregating results from the one or more actions; and returning the aggregated results to one or more of the compute devices.

15. The method of claim 14, wherein the one or more actions include one or more of: a reduction operation, a broadcast operation, a scatter operation, a collective action, or a gather operation.

16. The method of claim 14, wherein the one or more actions include one or more sub-actions.

17. The method of claim 14, wherein the data units are TCP/IP packets, cells, or frames, and the switch device is a level 2 switch, wherein each of the communication interfaces include an Ethernet port.

18. The method of claim 14, wherein the particular data units include identifiers indicating epochs with which the data therein is associated, wherein the machine learning logic is configured to automatically aggregate results associated with particular epochs based on their identifiers.

19. The method of claim 14, wherein each of the aggregated results is associated with a different epoch and is a sum or average of all data associated with that epoch.
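The epoch-based aggregation of claims 9, 10, 18, and 19 can be sketched as follows. This is an illustrative assumption rather than the claimed implementation: each data unit is modeled as a hypothetical `(epoch_id, vector)` pair, and the function name `aggregate_by_epoch` and its `mode` parameter are invented for the example.

```python
from collections import defaultdict


def aggregate_by_epoch(data_units, mode="sum"):
    """Group data units by epoch identifier and reduce each group.

    data_units: iterable of (epoch_id, vector) pairs (hypothetical layout).
    mode: "sum" or "average", the two aggregates named in claims 10 and 19.
    Returns a dict mapping each epoch_id to its aggregated vector.
    """
    sums = {}
    counts = defaultdict(int)
    for epoch_id, vector in data_units:
        if epoch_id not in sums:
            sums[epoch_id] = list(vector)
        else:
            # Element-wise addition into the running total for this epoch.
            sums[epoch_id] = [a + b for a, b in zip(sums[epoch_id], vector)]
        counts[epoch_id] += 1

    if mode == "sum":
        return sums
    if mode == "average":
        return {e: [x / counts[e] for x in v] for e, v in sums.items()}
    raise ValueError("mode must be 'sum' or 'average'")


# Data units from two epochs arrive interleaved.
units = [
    (0, [1.0, 2.0]),
    (1, [4.0, 4.0]),
    (0, [3.0, 6.0]),
    (1, [2.0, 0.0]),
]
epoch_sums = aggregate_by_epoch(units, mode="sum")
epoch_means = aggregate_by_epoch(units, mode="average")
print(epoch_sums)   # {0: [4.0, 8.0], 1: [6.0, 4.0]}
print(epoch_means)  # {0: [2.0, 4.0], 1: [3.0, 2.0]}
```

Keying the running totals by epoch identifier lets data units from different epochs arrive in any order without being mixed into the same result.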

20. One or more non-transitory computer-readable media, storing one or more sequences of instructions, wherein execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform: receiving, at a switch device, data units via communication interfaces coupled to a plurality of compute devices, one or more of the data units containing data from a machine learning model being trained against a training data set; determining, in the data units, particular data units containing data from a machine learning model being trained against a training data set; based on information contained in the particular data units: identifying one or more actions to be performed on the data in the particular data units; performing the one or more actions on the data in the particular data units; aggregating results from the one or more actions; and returning the aggregated results to one or more of the compute devices.
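The steps shared by independent claims 1, 14, and 20 (determine which data units carry machine learning data, identify the action, perform it, aggregate, and return the result to the compute devices) can be sketched in software as below. This is a simplified model under stated assumptions, not the patented design: the `DataUnit` fields, the `action` marker used to recognize machine learning traffic, the `ACTIONS` table, and the `process` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataUnit:
    src: str
    dst: str
    payload: List[float]
    # Hypothetical header field: None marks ordinary traffic, a string
    # names the action requested for machine learning data.
    action: Optional[str] = None


def reduce_sum(payloads):
    """Element-wise sum across payloads (one example of a reduction)."""
    return [sum(column) for column in zip(*payloads)]


# Hypothetical table mapping action names to operations.
ACTIONS = {"reduce_sum": reduce_sum}


def process(data_units, workers):
    """Split traffic, run the requested action, and fan the result out.

    Returns (forwarded, returned): ordinary data units to be switched as
    usual, and result data units addressed to each compute device.
    """
    forwarded = []
    pending = {}
    for unit in data_units:
        if unit.action is None:
            forwarded.append(unit)  # normal packet-switching path
        else:
            pending.setdefault(unit.action, []).append(unit.payload)

    returned = []
    for action, payloads in pending.items():
        result = ACTIONS[action](payloads)
        for worker in workers:
            returned.append(DataUnit("switch", worker, result, action))
    return forwarded, returned


units = [
    DataUnit("w0", "switch", [1.0, 2.0], "reduce_sum"),
    DataUnit("h1", "h2", [9.0]),
    DataUnit("w1", "switch", [3.0, 4.0], "reduce_sum"),
]
forwarded, returned = process(units, workers=["w0", "w1"])
print(len(forwarded))                 # 1
print([u.payload for u in returned])  # [[4.0, 6.0], [4.0, 6.0]]
```

Here the aggregated vector is sent once to each compute device, so each device receives one result instead of exchanging gradients with every peer.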

Patent Metadata

Filing Date: Unknown

Publication Date: February 25, 2025

Inventors: William Brad MATTHEWS, Puneet AGARWAL
