Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program into multiple parallel threads are described. In some embodiments, the systems and apparatuses execute a method of original code decomposition and/or generated thread execution.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method comprising: executing original code on a first processor core; placing a second processor core into a detect phase, wherein in the detect phase the second processor core is to detect an indication to switch into a different, cooperative execution mode with the first processor core; during execution of the original code, profiling the original code to generate cooperative code from the original code to be cooperatively executed by the first and second processor cores as a first thread on the first processor core and a second thread on the second processor core; detecting the indication to switch into a different, cooperative execution mode; executing the generated cooperative code in the first and second processor cores, wherein during execution of the generated cooperative code, the first and second threads do not communicate; and halting execution of the generated cooperative code in the first and second processor cores upon a violation and rolling back to a last commit point.
2. The method of claim 1, further comprising: arming the first processor core to enter into a different execution mode upon hitting the indication to switch.
3. The method of claim 1, wherein profiling the original code comprises gathering information about loads, stores, and branches for a set amount of instructions.
4. The method of claim 1, further comprising: halting execution of the generated cooperative code in the first and second processor cores upon a successful completion of the generated cooperative code.
5. The method of claim 1, wherein executing the generated cooperative code in the first and second processor cores comprises: executing two threads in separation; buffering memory loads and stores using wrapper hardware; checking the buffered memory loads and stores for possible violations; and atomically committing a state to provide forward process while maintaining memory ordering.
6. The method of claim 1, further comprising: upon halting execution of the generated cooperative code in the first and second processor cores upon a violation, executing the original code in the first processor core, and placing the second processor core into a detect phase, wherein in the detect phase the second processor core is to detect an indication to switch into a different, cooperative execution mode with the first processor core.
7. An apparatus comprising: a first processor core and a second processor core to execute cooperative code upon a detection of an entry point in original code running on the first processor core, wherein the entry point is a beginning point in the original code which corresponds to a part of dynamic execution of the original code and wherein the cooperative code is a threaded version of the original code along with possible entry points comprising a first thread to execute on the first processor core and a second thread to execute on the second processor core, wherein during execution of the generated cooperative code, the first and second threads do not communicate; and a hardware wrapper to: detect a hot region of the original code, wherein a hot region of code is a portion of code which corresponds to the part of dynamic execution of the original code, profile the hot region of code of the original code, in the first processor core, to generate the cooperative code, buffer memory loads and stores executed by the first and second processing cores, check the buffered memory loads and stores for possible violations, and atomically commit a state to provide forward progress while maintaining memory ordering.
8. The apparatus of claim 7, further comprising: a mid-level cache to merge an execution state of the cooperative code.
9. The apparatus of claim 7, further comprising: a last level cache.
10. The apparatus of claim 7, wherein the hardware wrapper is to discard the buffered memory loads and stores upon an abort.
11. The apparatus of claim 10, wherein the abort is found upon a store or store violation.
12. The apparatus of claim 10, wherein the abort is found upon a load or store violation.
13. The apparatus of claim 10, wherein upon the abort the first processing core is rolled back to a last commit point.
14. The apparatus of claim 7, wherein the first processing core is to execute the original code until the entry point is reached.
15. The apparatus of claim 7, wherein the first processing core is armed after the hardware wrapper has profiled the original code to enter into a different execution mode upon hitting an indication to switch.
16. The apparatus of claim 7, wherein the hardware wrapper is to profile the original code by gathering information about loads, stores, and branches for a set amount of instructions.
17. The apparatus of claim 7, wherein the hardware wrapper is to detect the hot region of the original code by detecting an instruction pointer of the hot region in a hardware table of accessed hot region instruction pointers.
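The buffer, check, and commit-or-abort flow recited in claims 5, 7, and 10 through 13 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the claimed hardware: the names `SpeculativeThread`, `find_violation`, and `run_region` are hypothetical, the two threads are run one after the other rather than on two cores, and a violation is assumed to mean that both threads stored to the same address, or that one thread loaded an address the other stored to. The claims themselves do not define how a violation is detected.

```python
from dataclasses import dataclass, field


@dataclass
class SpeculativeThread:
    """Hypothetical per-thread buffers; shared memory is untouched until commit."""
    reads: set = field(default_factory=set)
    writes: dict = field(default_factory=dict)

    def load(self, memory, addr):
        # A thread sees its own buffered store first; otherwise it reads
        # committed memory and records the address as a buffered load.
        if addr in self.writes:
            return self.writes[addr]
        self.reads.add(addr)
        return memory.get(addr, 0)

    def store(self, addr, value):
        # Stores are buffered, never written directly to shared memory.
        self.writes[addr] = value


def find_violation(t0, t1):
    """Assumed conflict rule: overlapping addresses between the two buffers."""
    if set(t0.writes) & set(t1.writes):
        return "store/store"
    if (t0.reads & set(t1.writes)) or (t1.reads & set(t0.writes)):
        return "load/store"
    return None


def run_region(memory, body0, body1):
    """Run both thread bodies in separation, then commit or abort.

    Returns (memory_after, violation). On a violation the buffers are
    discarded and the unchanged memory stands in for the last commit point.
    """
    t0, t1 = SpeculativeThread(), SpeculativeThread()
    body0(t0, memory)
    body1(t1, memory)
    violation = find_violation(t0, t1)
    if violation:
        return memory, violation
    # Commit: both write buffers become visible together in a new state.
    committed = dict(memory)
    committed.update(t0.writes)
    committed.update(t1.writes)
    return committed, None
```

In this model, two bodies that touch disjoint addresses commit together, while a body that stores to an address the other body loads causes `run_region` to return the original memory along with the kind of violation found, which mirrors falling back to the last commit point.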
June 6, 2017
July 28, 2020