A checkpoint of a parallel program is taken in order to provide a consistent state of the program in the event the program is to be restarted. Each process of the parallel program is responsible for taking its own checkpoint; however, the timing of when each process takes its checkpoint is the responsibility of a coordinating process. During checkpointing, various data are written to a checkpoint file. These data include, for instance, in-transit message data, a data section, file offsets, signal state, executable information, stack contents and register contents. The checkpoint file can be stored in either local or global storage. When it is stored in global storage, migration of the program is facilitated. When a parallel program is to be restarted, each process of the program initiates its own restart. The restart logic restores the process to the state at which the checkpoint was taken.
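As a rough illustration of the per-process checkpoint file described above, the following Python sketch writes a few of the listed items to a file and reads them back on restart. It is not the patented implementation: the JSON layout and the names `write_checkpoint`, `restore_checkpoint`, `data_section`, `file_offsets` and `messages` are assumptions made for this example, and stack, register and signal state are omitted because they cannot be captured portably from Python.

```python
import json
import os
import tempfile


def write_checkpoint(path, data_section, file_offsets, in_transit_messages):
    """Write one process's checkpoint to its own file (hypothetical layout)."""
    record = {
        "data_section": data_section,
        "file_offsets": file_offsets,
        # An empty list stands in for the "no messages" indication.
        "messages": list(in_transit_messages),
    }
    with open(path, "w") as f:
        json.dump(record, f)


def restore_checkpoint(path):
    """Copy the checkpointed data from the file back into memory."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    # The directory plays the role of local or global storage.
    storage = tempfile.mkdtemp()
    ckpt = os.path.join(storage, "rank0.ckpt")
    write_checkpoint(ckpt, {"iteration": 42}, {"out.log": 1024}, ["msg to rank 1"])
    print(restore_checkpoint(ckpt))
```

If the directory holding the file is reachable from every computing unit, the same `restore_checkpoint` call could run on a different unit from the one that wrote the file, which is the migration case the abstract mentions.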
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method of checkpointing parallel programs, said method comprising: taking a checkpoint of a parallel program, said parallel program comprising a plurality of processes, and wherein said taking a checkpoint comprises: writing, by a process of said plurality of processes, message data to a checkpoint file corresponding to said process, said message data including an indication that there are no messages, or including one or more in-transit messages between said process writing the message data and one or more other processes of said plurality of processes.
2. The method of claim 1, wherein said taking a checkpoint further includes writing, by a process of said plurality of processes, at least one of a data section, signal state and one or more file offsets to a checkpoint file corresponding to said process writing said at least one of said data section, said signal state and said one or more file offsets.
3. The method of claim 1, wherein said taking a checkpoint further includes writing, by a process of said plurality of processes, at least one of executable information, stack contents and register contents to a checkpoint file corresponding to said process writing said at least one of said executable information, said stack contents and said register contents.
4. The method of claim 1, wherein said writing of said message data to said checkpoint file is performed without logging said message data to a log file.
5. The method of claim 1, wherein said checkpoint file is stored in local storage accessible by said process.
6. The method of claim 1, wherein said checkpoint file is stored in global storage accessible by said plurality of processes of said parallel program.
7. The method of claim 1, further comprising restoring said process that wrote said message data to said checkpoint file, wherein said restoring comprises copying said message data from said checkpoint file to memory of a computing unit executing said process.
8. The method of claim 7, wherein said computing unit executing said process is a different computing unit from when said checkpoint was taken by said process.
9. The method of claim 1, wherein said taking a checkpoint further comprises taking a checkpoint by a number of processes of said plurality of processes, wherein said taking a checkpoint by said number of processes comprises writing data to a number of checkpoint files, wherein each process of said number of processes takes a corresponding checkpoint.
10. The method of claim 9, further comprising coordinating the taking of said corresponding checkpoints by said number of processes.
11. The method of claim 10, wherein said coordinating comprises: sending a ready message from each process of said number of processes to a coordinating task indicating readiness to take said corresponding checkpoint; and providing, by said coordinating task to said each process, a message indicating that said corresponding checkpoint is to be taken, said providing occurring after receipt of said ready message from said each process.
12. The method of claim 11, wherein said coordinating further comprises: sending a done message from said each process to said coordinating task indicating completion of said corresponding checkpoint; and forwarding, by said coordinating task to said each process, a commit message indicating that said corresponding checkpoint is to be committed, said forwarding occurring after receipt of said done message from said each process.
13. The method of claim 12, further comprising: committing, by each process of said number of processes, to said corresponding checkpoint; and deleting, by each process of said number of processes, any previous corresponding checkpoint information, after committing to said corresponding checkpoint.
14. A method of checkpointing parallel programs, said method comprising: taking a checkpoint by a process of a parallel program, said taking a checkpoint comprising: writing to a data section of said process at least one of a signal state and one or more file offsets; subsequently, writing said data section to a checkpoint file corresponding to said process; writing message data to said checkpoint file, said message data including an indication that there are no messages, or including one or more in-transit messages between said process and one or more other processes of said parallel program; and writing at least one of executable information, stack contents and register contents to said checkpoint file.
15. The method of claim 14, wherein said taking a checkpoint further comprises at least one of stopping message traffic of said process and blocking signals of said process, prior to writing to said data section.
16. The method of claim 15, wherein said parallel program has a plurality of processes, and wherein said taking a checkpoint is performed by each process of said plurality of processes.
17. The method of claim 16, further comprising restoring said parallel program, said restoring using the checkpoints taken by said plurality of processes.
18. A method of restoring parallel programs, said method comprising: restarting one or more processes of a parallel program on one or more computing units, wherein at least one process of said one or more processes is restarted on a different computing unit from the computing unit that was previously used to take at least one checkpoint for said at least one process; and copying data stored in one or more checkpoint files corresponding to said one or more restarted processes into memory of said one or more computing units executing said one or more restarted processes, wherein said data restores said one or more restarted processes to an earlier state.
19. The method of claim 18, wherein said one or more checkpoint files are stored in global storage accessible by said one or more computing units.
20. A method of checkpointing parallel programs, said method comprising: indicating, by a process of a parallel program, that said process is ready to take a checkpoint; receiving, by said process, an indication to take said checkpoint; taking said checkpoint, wherein said taking said checkpoint comprises having said process copy data from memory associated with said process to a checkpoint file corresponding to said process; and indicating, by said process, completion of said taking of said checkpoint.
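The ready / take / done / commit exchange recited in claims 11 to 13 and claim 20 can be illustrated with a small simulation. The sketch below is one possible reading of that exchange, not the patented implementation: threads and in-memory queues stand in for processes and their message channels, a dictionary copy stands in for writing a checkpoint file, and the names `coordinator`, `process` and `run` are hypothetical. It also assumes the coordinating task waits for every process before replying, which is one interpretation of the claim language.

```python
import queue
import threading


def coordinator(n, inbox, outboxes):
    """Hypothetical coordinating task for n processes."""
    # Wait for a ready message from every process, then tell all to checkpoint.
    for _ in range(n):
        kind, _rank = inbox.get()
        assert kind == "ready"
    for out in outboxes:
        out.put("take")
    # Wait for a done message from every process, then tell all to commit.
    for _ in range(n):
        kind, _rank = inbox.get()
        assert kind == "done"
    for out in outboxes:
        out.put("commit")


def process(rank, inbox, to_coord, state, committed):
    """One process of the parallel program taking its own checkpoint."""
    to_coord.put(("ready", rank))
    assert inbox.get() == "take"
    pending = dict(state)  # stands in for writing the checkpoint file
    to_coord.put(("done", rank))
    assert inbox.get() == "commit"
    # Committing replaces, and thereby discards, the previous checkpoint.
    committed[rank] = pending


def run(n):
    """Run one coordinated checkpoint round and return the committed states."""
    to_coord = queue.Queue()
    outboxes = [queue.Queue() for _ in range(n)]
    committed = {rank: None for rank in range(n)}
    threads = [threading.Thread(target=coordinator, args=(n, to_coord, outboxes))]
    for rank in range(n):
        threads.append(threading.Thread(
            target=process,
            args=(rank, outboxes[rank], to_coord, {"step": 10 + rank}, committed),
        ))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return committed


if __name__ == "__main__":
    print(run(3))
```

Because no process commits until the coordinating task has heard a done message from all of them, a failure partway through the round leaves every process with its previous committed checkpoint still in place.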
October 29, 1998
May 21, 2002