Mechanism for Process Migration on a Massively Parallel Computer

Published: February 5, 2013

Assignee: not available in the USPTO data we have

Inventors: Charles Jens Archer, David L. Darrington, Patrick Joseph McCarthy, Amanda Peters, Albert Sidelnik

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of migrating a process running on a first compute node of a parallel computing system having a plurality of compute nodes, comprising: quiescing a data communications network connecting the plurality of compute nodes; while the data communications network is quiesced: identifying a process identifier (ID) associated with the process running on the first compute node; identifying a network address associated with the first compute node; flushing, from a mapping data structure maintained by the parallel system, a first entry mapping the identified process ID to the identified network address; transmitting a message to the plurality of compute nodes to flush a local cache of mappings between process IDs and network addresses; migrating the process running on the first compute node to a second compute node, of the plurality of compute nodes of the parallel computing system, wherein the migrating is performed upon: (i) detecting a network congestion by recording a number of network packets that pass through any of six network ports of the first compute node, and (ii) predicting a hardware failure for the first compute node; and updating the mapping data structure maintained by the parallel system to include a second entry mapping the identified process ID to a network address of the second compute node, wherein the data communication network is a three-dimensional torus and the network address of the first and second compute node is a respective <x, y, z> coordinate position of the first and second compute node within the three-dimensional torus.

2. The method of claim 1, further comprising, after updating the mapping data structure maintained by the parallel system to include the second entry, restarting the data communication network.
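
The ordering of steps in claims 1 and 2 can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the patented implementation: the `ParallelSystem` class, its fields, the `migrate` method, and the log labels are all hypothetical names introduced here, and the actual transfer of process state between nodes is not modeled.

```python
from dataclasses import dataclass, field

Coord = tuple[int, int, int]  # <x, y, z> position of a node in the 3D torus


@dataclass
class ParallelSystem:
    """Toy model (hypothetical) of the state touched during a migration."""
    rank_to_coord: dict[int, Coord]             # system-wide mapping structure
    node_caches: dict[Coord, dict[int, Coord]]  # per-node local mapping caches
    network_quiesced: bool = False
    log: list[str] = field(default_factory=list)

    def migrate(self, source: Coord, target: Coord) -> int:
        # Quiesce the network so no message is routed with a stale mapping.
        self.network_quiesced = True
        self.log.append("quiesce")
        # Identify the process ID (for example an MPI rank) on the source node.
        rank = next(r for r, c in self.rank_to_coord.items() if c == source)
        # Flush the old process ID -> network address entry.
        del self.rank_to_coord[rank]
        self.log.append("flush-entry")
        # Tell every compute node to flush its local cache of mappings.
        for cache in self.node_caches.values():
            cache.clear()
        self.log.append("flush-caches")
        # Move the process (state transfer is intentionally not modeled).
        self.log.append("migrate")
        # Record the new entry pointing at the target node's coordinates.
        self.rank_to_coord[rank] = target
        self.log.append("update-entry")
        # Claim 2: restart the network once the mapping has been updated.
        self.network_quiesced = False
        self.log.append("restart")
        return rank


system = ParallelSystem(
    rank_to_coord={0: (0, 0, 0), 1: (1, 0, 0)},
    node_caches={
        (0, 0, 0): {1: (1, 0, 0)},
        (1, 0, 0): {0: (0, 0, 0)},
        (2, 0, 0): {},
    },
)
moved = system.migrate(source=(1, 0, 0), target=(2, 0, 0))
```

In this sketch the caches are simply emptied, so each node would have to look the mapping up again on its next send; how the real system repopulates caches is not stated in the claims.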

3. The method of claim 1, wherein the process ID is a Message Passing Interface (MPI) rank.

4. The method of claim 1, wherein the first compute node is migrated to alleviate network congestion on the data communications network connecting the plurality of compute nodes.

5. The method of claim 1, further comprising, selecting the second compute node from the plurality of compute nodes in order to alleviate network congestion on the data communications network connecting the plurality of compute nodes.

6. The method of claim 1, wherein the second compute node is selected to optimize the mapping data structure maintained by the parallel system.
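
The torus addressing and the six-port congestion check recited in claim 1 can be sketched as follows. The function names and the `threshold` parameter are assumptions made for illustration; the claims say only that packet counts through the six network ports are recorded, not how a congestion decision is derived from them.

```python
Coord = tuple[int, int, int]  # <x, y, z> position of a node in the 3D torus


def torus_neighbors(node: Coord, dims: Coord) -> list[Coord]:
    """Return the six neighbors of a node in a 3D torus, with wraparound.

    Each neighbor corresponds to one of the node's six network ports
    (x+, x-, y+, y-, z+, z-).
    """
    x, y, z = node
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]


def is_congested(port_packet_counts: list[int], threshold: int) -> bool:
    """Flag congestion if any of the six ports exceeds a packet threshold.

    The threshold rule is a hypothetical policy, not taken from the claims.
    """
    if len(port_packet_counts) != 6:
        raise ValueError("expected one packet count per torus port (6 total)")
    return any(count > threshold for count in port_packet_counts)


corner_neighbors = torus_neighbors((0, 0, 0), dims=(4, 4, 4))
busy = is_congested([12, 8, 950, 3, 0, 7], threshold=500)
```

Because the torus wraps around in every dimension, even a node at coordinate 0 has six neighbors, which is why every compute node has exactly six ports to monitor.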

7. A non-transitory computer-readable storage medium containing a program which, when executed, performs an operation of migrating a process running on a first compute node of a parallel computing system having a plurality of compute nodes, the operation comprising: quiescing a data communications network connecting the plurality of compute nodes; while the data communications network is quiesced: identifying a process identifier (ID) associated with the process running on the first compute node; identifying a network address associated with the first compute node; flushing, from a mapping data structure maintained by the parallel system, a first entry mapping the identified process ID to the identified network address; transmitting a message to the plurality of compute nodes to flush a local cache of mappings between process IDs and network addresses; migrating the process running on the first compute node to a second compute node, of the plurality of compute nodes of the parallel computing system, wherein the migrating is performed upon: (i) detecting a network congestion by recording a number of network packets that pass through any of six network ports of the first compute node, and (ii) predicting a hardware failure for the first compute node; and updating the mapping data structure maintained by the parallel system to include a second entry mapping the identified process ID to a network address of the second compute node, wherein the data communication network is a three-dimensional torus and the network address of the first and second compute node is a respective <x, y, z> coordinate position of the first and second compute node within the three-dimensional torus.

8. The computer-readable storage medium of claim 7, wherein the operations further comprise, after updating the mapping data structure maintained by the parallel system to include the second entry, restarting the data communication network.

9. The computer-readable storage medium of claim 7, wherein the process ID is a Message Passing Interface (MPI) rank.

10. The computer-readable storage medium of claim 7, wherein the first compute node is migrated to alleviate network congestion on the data communications network connecting the plurality of compute nodes.

11. The computer-readable storage medium of claim 7, further comprising, selecting the second compute node from the plurality of compute nodes in order to alleviate network congestion on the data communications network connecting the plurality of compute nodes.

12. The computer-readable storage medium of claim 7, wherein the second compute node is selected to optimize the mapping data structure maintained by the parallel system.

13. A parallel computing system, comprising: a plurality of compute nodes, each having at least a processor and a memory, wherein the plurality of compute nodes is configured to execute a parallel computing task, and wherein a process executing on each compute node is identified by a respective process identifier (ID); an input/output (I/O) node having a processor and a memory, wherein the I/O node is configured to maintain a mapping data structure that maps the process ID for the process running on a given compute node to a network address of the given compute node; a data communications network connecting the plurality of compute nodes, and connecting the plurality of compute nodes to the I/O node; and a service node having at least a processor and a memory, wherein the memory of the service node includes a program which, when executed by the processor of the service node, migrates the process running on a first compute node of the parallel computing system to a second compute node of the parallel computing system by performing an operation, the operation comprising: quiescing a data communications network connecting the plurality of compute nodes; while the data communications network is quiesced: identifying a process identifier (ID) associated with the process running on the first compute node; identifying a network address associated with the first compute node; flushing, from a mapping data structure maintained by the parallel system, a first entry mapping the identified process ID to the identified network address; transmitting a message to the plurality of compute nodes to flush a local cache of mappings between process IDs and network addresses; migrating the process running on the first compute node to a second compute node, of the plurality of compute nodes of the parallel computing system, wherein the migrating is performed upon: (i) detecting a network congestion by recording a number of network packets that pass through any of six network ports of the first compute node, and (ii) predicting a hardware failure for the first compute node; and updating the mapping data structure maintained by the parallel system to include a second entry mapping the identified process ID to a network address of the second compute node, wherein the data communication network is a three-dimensional torus and the network address of the first and second compute node is a respective <x, y, z> coordinate position of the first and second compute node within the three-dimensional torus.

14. The parallel computing system of claim 13, wherein the program is further configured to, after updating the mapping data structure maintained by the parallel system to include the second entry, restart the data communication network.

15. The parallel computing system of claim 13, wherein the process ID is a Message Passing Interface (MPI) rank.

16. The parallel computing system of claim 13, wherein the first compute node is migrated to alleviate network congestion on the data communications network connecting the plurality of compute nodes.

17. The parallel computing system of claim 13, wherein the program is further configured to select the second compute node from the plurality of compute nodes in order to alleviate network congestion on the data communications network connecting the plurality of compute nodes.

18. The method of claim 1, wherein the hardware failure of the first compute node is predicted based on at least one factor comprising: (i) a CPU temperature, (ii) an L3 parity error, and (iii) a torus and tree retransmit.

19. The computer-readable storage medium of claim 7, wherein the hardware failure of the first compute node is predicted based on at least one factor comprising: (i) a CPU temperature, (ii) an L3 parity error, and (iii) a torus and tree retransmit.

20. The parallel computing system of claim 13, wherein the hardware failure of the first compute node is predicted based on at least one factor comprising: (i) a CPU temperature, (ii) an L3 parity error, and (iii) a torus and tree retransmit.
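
The failure-prediction factors listed in claims 18 to 20 could feed a simple rule such as the one below. This is only a sketch: the `NodeHealth` record, the limit constants and their values, and the rule that any single factor over its limit predicts a failure are all assumptions. The claims name the factors but give no thresholds or combination logic, and reading "a torus and tree retransmit" as retransmit counts on the torus and tree networks is likewise an interpretation.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Hypothetical per-node health sample covering the three claimed factors."""
    cpu_temp_c: float        # (i) CPU temperature, in degrees Celsius
    l3_parity_errors: int    # (ii) L3 parity errors observed
    torus_retransmits: int   # (iii) retransmits on the torus network
    tree_retransmits: int    # (iii) retransmits on the tree network


# Illustrative limits only; real values would come from hardware data.
MAX_CPU_TEMP_C = 85.0
MAX_L3_PARITY_ERRORS = 10
MAX_RETRANSMITS = 1000


def predicts_failure(health: NodeHealth) -> bool:
    """Return True if at least one factor exceeds its assumed limit."""
    retransmits = health.torus_retransmits + health.tree_retransmits
    return (
        health.cpu_temp_c > MAX_CPU_TEMP_C
        or health.l3_parity_errors > MAX_L3_PARITY_ERRORS
        or retransmits > MAX_RETRANSMITS
    )


healthy = NodeHealth(cpu_temp_c=60.0, l3_parity_errors=0,
                     torus_retransmits=5, tree_retransmits=2)
overheating = NodeHealth(cpu_temp_c=92.5, l3_parity_errors=0,
                         torus_retransmits=5, tree_retransmits=2)
```

Under these assumed limits, `predicts_failure(healthy)` is false and `predicts_failure(overheating)` is true, so only the second node would be a candidate for having its process migrated.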

Patent Metadata

Filing Date

Unknown

Publication Date

February 5, 2013

Inventors

Charles Jens Archer

David L. Darrington

Patrick Joseph McCarthy

Amanda Peters

Albert Sidelnik
