The disclosure herein describes platform-level migration of deep learning training (DLT) jobs, in a checkpointed state, from a source node to a destination node. The checkpointing is performed by capturing GPU state (e.g., device state) and CPU state (e.g., host state). The GPU state includes GPU data (e.g., model parameters, optimizer state, etc.) that is located in the GPU, and GPU context (e.g., the default stream in the GPU, various handles created by libraries). Restoring the DLT job on the destination node involves resuming processing on a destination GPU from the same checkpointed state.
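As an illustration only, the split between GPU state and CPU state described above could be modeled as follows. This is a minimal sketch, not the disclosed implementation: the names `GpuState`, `Checkpoint`, `capture_checkpoint`, and `restore_checkpoint` are hypothetical, plain Python lists stand in for device memory, and JSON stands in for whatever transport format a real platform would use.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class GpuState:
    # GPU data: values that live in device memory (lists stand in for tensors).
    model_parameters: dict
    optimizer_state: dict
    # GPU context: names of streams/handles to recreate on the destination GPU.
    context_handles: list = field(default_factory=list)


@dataclass
class Checkpoint:
    gpu_state: GpuState  # device state
    cpu_state: dict      # host state, e.g. the current training step


def capture_checkpoint(gpu_state: GpuState, cpu_state: dict) -> str:
    """Serialize device and host state on the source node (hypothetical helper)."""
    return json.dumps(asdict(Checkpoint(gpu_state, cpu_state)))


def restore_checkpoint(blob: str) -> Checkpoint:
    """Rebuild the same state on the destination node (hypothetical helper)."""
    raw = json.loads(blob)
    return Checkpoint(GpuState(**raw["gpu_state"]), raw["cpu_state"])
```

In this sketch the destination node calls `restore_checkpoint` on the serialized string and then resumes training from the recorded step with identical parameters and optimizer state.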
Legal claims defining the scope of protection, as filed with the USPTO.
2. The computerized method of claim 1, further comprising implementing a barrier so that each node is independently able to be checkpointed.
3. The computerized method of claim 2, wherein implementing the barrier includes executing a meta-all-reduce command.
7. The computerized method of claim 1, wherein the checkpointed state includes GPU data including model parameters and an optimizer state.
8. The computerized method of claim 1, wherein the migration is performed from a plurality of proxy nodes to a destination node.
10. The computerized method of claim 1, wherein the resumption of processing of the DLT job from the checkpointed state on the destination node comprises resuming the processing of the DLT job on a second GPU and a second CPU of the destination node that are different from the GPU and the CPU, respectively, of the plurality of source nodes.
13. The system of claim 12, wherein the instructions executed by the processor cause the processor to implement the barrier by executing a meta-all-reduce command.
17. The system of claim 11, wherein the checkpointed state includes GPU data including model parameters and an optimizer state.
18. The system of claim 11, wherein the migration is performed from a plurality of proxy nodes to a destination node.
19. The system of claim 11, wherein the resumption of processing of the DLT job from the checkpointed state on the destination node comprises resuming the processing of the DLT job on a second GPU and a second CPU of the destination node that are different from the GPU and the CPU, respectively, of the plurality of source nodes.
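The barrier recited in claims 3 and 13 is implemented by executing a meta-all-reduce command. The claims do not spell out its semantics, so the following is only a sketch under stated assumptions: workers are simulated in a single process, each worker contributes a flag saying whether it has reached a quiescent point, and the flags are combined with a minimum reduction so that every worker sees the same decision. The functions `all_reduce` and `barrier_rounds` are hypothetical names introduced for illustration.

```python
def all_reduce(values, op):
    """Simulated all-reduce: every worker receives op applied over all contributions."""
    result = values[0]
    for v in values[1:]:
        result = op(result, v)
    return [result] * len(values)


def barrier_rounds(pending_work):
    """Count the rounds until all workers agree that it is safe to checkpoint.

    pending_work[i] is the number of steps worker i still needs before it is
    quiescent (an assumption made for this simulation).
    """
    pending = list(pending_work)
    rounds = 0
    while True:
        rounds += 1
        # Each worker reports 1 if it is quiescent, 0 otherwise.
        flags = [1 if p == 0 else 0 for p in pending]
        # A min-reduction yields 1 only when every worker reported 1.
        agreed = all_reduce(flags, min)
        if agreed[0] == 1:
            return rounds
        # Workers that are not yet quiescent make one step of progress.
        pending = [max(0, p - 1) for p in pending]
```

Because every worker receives the same reduced value, each one can decide locally when to checkpoint, which matches the intent in claim 2 that each node is independently able to be checkpointed.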
June 7, 2023
December 10, 2024