US-12166829

Artificial intelligence workload migration for planet-scale artificial intelligence infrastructure service

Published: December 10, 2024

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

The disclosure herein describes platform-level migration of deep learning training (DLT) jobs from a checkpointed state between a source node and a destination node. The checkpointing is performed by capturing GPU state (e.g., device state) and CPU state (e.g., host state). The GPU state includes GPU data (e.g., model parameters, optimizer state, etc.) that is located in the GPU, and GPU context (e.g., the default stream in the GPU, various handles created by libraries). Restoring the DLT job on the destination node involves resuming processing on a destination GPU from the same checkpointed state.
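The checkpoint-and-resume flow described in the abstract can be illustrated with a minimal sketch. It is a toy model, not the patented implementation: `ToyJob`, `checkpoint`, `restore`, and the two `run_*` helpers are hypothetical names, plain Python lists stand in for GPU memory, and GPU context (streams, library handles) is not modeled at all. The sketch only shows the idea that device state and host state are captured together, so that a job resumed on the destination continues exactly where the source left off.

```python
import pickle
import random
from dataclasses import dataclass, field


@dataclass
class ToyJob:
    """Hypothetical stand-in for a DLT job; not the patent's data structures."""
    # "Device state": data that would live in GPU memory.
    params: list = field(default_factory=lambda: [0.0, 0.0])
    momentum: list = field(default_factory=lambda: [0.0, 0.0])  # optimizer state
    # "Host state": data that would live in the CPU process.
    step: int = 0
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def train_step(self):
        # Toy momentum update driven by the host-side random generator.
        for i in range(len(self.params)):
            grad = self.params[i] - self.rng.random()
            self.momentum[i] = 0.9 * self.momentum[i] + grad
            self.params[i] -= 0.1 * self.momentum[i]
        self.step += 1


def checkpoint(job):
    """Capture device and host state together as one byte string."""
    snapshot = {
        "device": {"params": job.params, "momentum": job.momentum},
        "host": {"step": job.step, "rng": job.rng.getstate()},
    }
    return pickle.dumps(snapshot)


def restore(blob):
    """Rebuild the job on the destination from the checkpoint bytes."""
    snapshot = pickle.loads(blob)
    job = ToyJob()
    job.params = snapshot["device"]["params"]
    job.momentum = snapshot["device"]["momentum"]
    job.step = snapshot["host"]["step"]
    job.rng.setstate(snapshot["host"]["rng"])
    return job


def run_with_migration(total_steps, migrate_at):
    """Train on a source job, migrate once, and finish on the destination."""
    source = ToyJob()
    for _ in range(migrate_at):
        source.train_step()
    destination = restore(checkpoint(source))  # simulated node-to-node transfer
    for _ in range(total_steps - migrate_at):
        destination.train_step()
    return destination


def run_uninterrupted(total_steps):
    """Reference run with no migration, for comparison."""
    job = ToyJob()
    for _ in range(total_steps):
        job.train_step()
    return job
```

Comparing `run_with_migration(10, 4)` against `run_uninterrupted(10)` shows why host state matters: if the random generator state were left out of the checkpoint, the resumed job would draw different values and the two runs would diverge.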

Patent Claims

9 claims

Legal claims defining the scope of protection, as filed with the USPTO.

2. The computerized method of claim 1, implementing a barrier so that each node is independently able to be checkpointed.

3. The computerized method of claim 2, wherein implementing the barrier includes executing a meta-all-reduce command.

7. The computerized method of claim 1, wherein the checkpointed state includes GPU data including model parameters and an optimizer state.

8. The computerized method of claim 1, wherein the migration is performed from a plurality of proxy nodes to a destination node.

10. The computerized method of claim 1, wherein the resumption of processing of the DLT job from the checkpointed state on the destination node comprises resuming the processing of the DLT job on a second GPU and a second CPU of the destination node that are different from the GPU and the CPU, respectively, of the plurality of source nodes.

13. The system of claim 12, wherein the instructions executed by the processor cause the processor to implement the barrier by executing a meta-all-reduce command.

17. The system of claim 11, wherein the checkpointed state includes GPU data including model parameters and an optimizer state.

18. The system of claim 11, wherein the migration is performed from a plurality of proxy nodes to a destination node.

19. The system of claim 11, wherein the resumption of processing of the DLT job from the checkpointed state on the destination node comprises resuming the processing of the DLT job on a second GPU and a second CPU of the destination node that are different from the GPU and the CPU, respectively, of the plurality of source nodes.
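Claims 2, 3, and 13 refer to a barrier implemented by executing a meta-all-reduce command. This listing does not define that command, so the following is only a sketch under a stated assumption: that the meta-all-reduce is a collective over small control flags (rather than tensor data) that lets every worker learn, at the same step, that a checkpoint has been requested. `MetaAllReduce`, `worker`, and `simulate` are hypothetical names, and threads stand in for nodes.

```python
import threading


class MetaAllReduce:
    """Toy all-reduce over control flags; an assumption, not the patent's API."""

    def __init__(self, world_size):
        self._flags = [False] * world_size
        self._barrier = threading.Barrier(world_size)

    def any_requested(self, rank, flag):
        self._flags[rank] = flag
        self._barrier.wait()   # every rank has contributed its flag
        result = any(self._flags)
        self._barrier.wait()   # every rank has read before the next round
        return result


def worker(rank, reducer, request_step, max_steps, stopped_at):
    for step in range(max_steps):
        # A rank raises its flag at the step where it was asked to checkpoint.
        wants_checkpoint = request_step.get(rank) == step
        if reducer.any_requested(rank, wants_checkpoint):
            # All ranks see the same reduced value, so all stop at this step.
            stopped_at[rank] = step
            return
    stopped_at[rank] = max_steps


def simulate(world_size=4, max_steps=10, request_step=None):
    """Return the step at which each rank stopped."""
    if request_step is None:
        request_step = {2: 5}  # rank 2 is asked to checkpoint at step 5
    reducer = MetaAllReduce(world_size)
    stopped_at = [None] * world_size
    threads = [
        threading.Thread(
            target=worker,
            args=(rank, reducer, request_step, max_steps, stopped_at),
        )
        for rank in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stopped_at
```

With the default arguments, only rank 2 requests a checkpoint, yet every rank stops at step 5. Under the assumption above, that shared stopping point is what would allow each node to be checkpointed independently.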

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F H04L G06N G06T

Patent Metadata

Filing Date

June 7, 2023

Publication Date

December 10, 2024
