Apparatus, system and methods are provided for performing speculative data prefetching in a chip multiprocessor (CMP). Data is prefetched by a helper thread that runs on one core of the CMP while a main program runs concurrently on another core of the CMP. Data prefetched by the helper thread is provided to the helper core. For one embodiment, the data prefetched by the helper thread is pushed to the main core. It may or may not be provided to the helper core as well. A push of prefetched data to the main core may occur during a broadcast of the data to all cores of an affinity group. For at least one other embodiment, the data prefetched by a helper thread is provided, upon request from the main core, to the main core from the helper core's local cache.
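The two embodiments summarized above (push to the main core versus supply upon request from the helper core's local cache) can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the patented hardware: the `Core` class, the `MEMORY` dictionary, and the functions `helper_prefetch`, `main_load`, and `run` are hypothetical names introduced only for illustration, and each private cache is modeled as a plain dictionary. For simplicity the model always keeps a copy in the helper core's cache, although the abstract notes that this is optional in the push embodiment.

```python
from dataclasses import dataclass, field

# Hypothetical backing store standing in for higher-level memory.
MEMORY = {addr: addr * 10 for addr in range(8)}


@dataclass
class Core:
    name: str
    cache: dict = field(default_factory=dict)  # private cache: addr -> data
    memory_fetches: int = 0  # how often this core went to memory

    def fetch_from_memory(self, addr):
        self.memory_fetches += 1
        return MEMORY[addr]


def helper_prefetch(helper, main, addr, push):
    """Helper thread touches addr ahead of the main thread."""
    if addr not in helper.cache:
        data = helper.fetch_from_memory(addr)
        helper.cache[addr] = data
        if push:
            # Push embodiment: the data lands in the main core's private
            # cache without the main core asking for it.
            main.cache[addr] = data


def main_load(main, helper, addr):
    """Main thread load; returns (data, where the data was found)."""
    if addr in main.cache:
        return main.cache[addr], "main cache"
    if addr in helper.cache:
        # On-request embodiment: the miss is served from the helper
        # core's local cache instead of higher-level memory.
        main.cache[addr] = helper.cache[addr]
        return main.cache[addr], "helper cache"
    main.cache[addr] = main.fetch_from_memory(addr)
    return main.cache[addr], "memory"


def run(push):
    """Prefetch one address on the helper, then load it on the main core."""
    main, helper = Core("main"), Core("helper")
    helper_prefetch(helper, main, addr=3, push=push)
    return main_load(main, helper, addr=3), main.memory_fetches
```

Calling `run(push=True)` models the push embodiment and `run(push=False)` models the on-request embodiment. In both cases the main core in this model never goes to higher-level memory for the prefetched address; the difference is only whether the load hits in its own cache or is served by the helper core's cache.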
1. An apparatus comprising: an integrated circuit including: decode logic adapted to decode a trigger instruction from a main thread instruction stream into a decoded trigger instruction; a first processor core adapted to execute the main thread instruction stream; a first private cache coupled to the first processor core, the first private cache being adapted to be private to the first processor core; a second processor core adapted to execute a helper thread instruction stream in response to the first processor core executing the decoded trigger instruction from the main thread; a second private cache coupled to the second processor core, the second private cache being adapted to be private to the second processor core; and control logic adapted to prefetch data into the first private cache coupled to the first processor core in response to the second processor core executing the helper thread instruction stream.
2. The apparatus of claim 1, wherein control logic adapted to prefetch data into the first private cache coupled to the first processor core in response to the second processor core executing the helper thread instruction stream comprises: the second processor core being adapted to execute an instruction from the helper thread, the instruction to be included in both the main thread and the helper thread; the second private cache to initiate a fetch of the data from a higher-level memory in response to a miss of the data in the second private cache; and control logic adapted to push the data into the first private cache without a request from the first processor core in response to the fetch of the data.
3. The apparatus of claim 2, wherein: control logic adapted to push the data into the first private cache in response to the fetch of the data comprises the control logic being adapted to broadcast the data to an affinity group of processor cores including at least the first processor core.
4. The apparatus of claim 2, wherein: control logic adapted to push the data into the first private cache in response to the fetch of the data comprises the control logic being adapted to unicast the data to the first processor core.
5. The apparatus of claim 1, wherein: the instruction comprises a delinquent instruction from the main thread.
6. A non-transitory machine-readable medium including program code, which when executed by a machine, causes the machine to perform the operations of: executing, with a first processor core in the machine, a trigger instruction from a main thread instruction stream; caching information for the main thread instruction stream in a first cache that is private to the first processor core; executing a helper thread instruction stream with a second processor colocated with the first processor core on an integrated circuit in response to executing the trigger instruction from the main thread instruction stream; fetching data in response to executing the helper thread instruction stream; and pushing the data into the first private cache without a request from the first processor core in response to fetching the data.
7. The machine readable medium of claim 6, wherein: pushing the data into the first private cache without a request from the first processor core in response to fetching the data comprises broadcasting the data to an affinity group of processor cores including at least the first processor core.
8. The apparatus of claim 6, wherein: pushing the data into the first private cache without a request from the first processor core in response to fetching the data comprises unicasting the data to the first processor core.
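The distinction the claims draw between broadcasting pushed data to an affinity group and unicasting it to the first processor core can be sketched as a choice of target set. The following is an illustrative model only, assuming each private cache is a dictionary keyed by core identifier; `push_prefetched`, `demo`, the core numbering, and the address `0x40` are hypothetical and do not appear in the claims.

```python
def push_prefetched(caches, affinity_group, main_core, addr, data, mode):
    """Push data into private caches without a request from any target.

    caches: dict mapping core id -> private cache dict (hypothetical model).
    mode: "unicast" targets only main_core; "broadcast" targets every
    core in affinity_group, which must include main_core.
    """
    if mode == "unicast":
        targets = [main_core]
    elif mode == "broadcast":
        if main_core not in affinity_group:
            raise ValueError("affinity group must include the main core")
        targets = list(affinity_group)
    else:
        raise ValueError(f"unknown mode: {mode}")
    for core in targets:
        caches[core][addr] = data
    return targets


def demo(mode):
    """Push one line on a four-core model; return the cores that hold it."""
    caches = {core: {} for core in range(4)}  # four cores, empty caches
    push_prefetched(caches, affinity_group=[0, 1, 2], main_core=0,
                    addr=0x40, data="line", mode=mode)
    return sorted(core for core, cache in caches.items() if 0x40 in cache)
```

In this model `demo("unicast")` leaves the line only in core 0's cache, while `demo("broadcast")` places it in the caches of cores 0, 1, and 2; core 3, which is outside the affinity group, is untouched in both cases.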
October 21, 2010
December 13, 2011