One embodiment of the present invention enables threads executing on a processor to locally generate and execute work within that processor by way of work queues and command blocks. A device driver, as an initialization procedure for establishing memory objects that enable the threads to locally generate and execute work, generates a work queue, and sets a GP_GET pointer of the work queue to the first entry in the work queue. The device driver also, during the initialization procedure, sets a GP_PUT pointer of the work queue to the last free entry included in the work queue, thereby establishing a range of entries in the work queue into which new work generated by the threads can be loaded and subsequently executed by the processor. The threads then populate command blocks with generated work and point entries in the work queue to the command blocks to effect processor execution of the work stored in the command blocks.
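The initialization procedure described above can be illustrated with a minimal host-side sketch. It is not the driver's actual code: `CommandBlock`, `WorkQueue`, `init_work_queue`, and `free_range_length` are hypothetical names, and the sketch assumes that the range from GP_GET to GP_PUT is inclusive at both ends.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical command block: an opaque list of commands.
struct CommandBlock {
    std::vector<unsigned> commands;
};

// Hypothetical work queue: each entry points at a command block.
struct WorkQueue {
    std::vector<const CommandBlock*> entries;
    std::atomic<std::size_t> gp_get{0};  // first entry in the work queue
    std::size_t gp_put = 0;              // last free entry in the work queue
};

// Sketch of the initialization step: GP_GET is set to the first entry and
// GP_PUT to the last free entry, which bounds the range threads may fill.
void init_work_queue(WorkQueue& q, std::size_t num_entries) {
    assert(num_entries > 0);
    q.entries.assign(num_entries, nullptr);
    q.gp_get.store(0);
    q.gp_put = num_entries - 1;
}

// Number of entries in the range [GP_GET, GP_PUT]; the inclusive
// interpretation is an assumption of this sketch.
std::size_t free_range_length(const WorkQueue& q) {
    return q.gp_put - q.gp_get.load() + 1;
}
```

Under these assumptions, a queue initialized with 16 entries ends up with GP_GET at index 0 and GP_PUT at index 15, so all 16 entries are available for work generated by the threads.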
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method for generating work within a parallel processing subsystem, the method comprising: generating a first command block that includes one or more entries; generating instructions to be executed by the parallel processing subsystem; loading the instructions into N command blocks; generating one additional command block; loading a Semaphore Acquire command into the single entry of the additional command block; determining, via atomic increment of a first pointer and comparison to a second pointer, that at least N+1 entries in the plurality of entries of a work queue are free and that an addition of the first pointer and N+1 does not exceed the second pointer; pointing N entries in the plurality of entries of the work queue, starting at the result of the atomic increment, to the first N command blocks; and pointing an Nth+1 entry in the plurality of entries of a work queue to the additional command block.
2. The method of claim 1, further comprising: determining that no additional work remains to be generated; and releasing a semaphore.
3. The method of claim 2, wherein releasing the semaphore comprises writing a value in an area of a memory of the parallel processing subsystem to which the semaphore is directed.
4. The method of claim 1, wherein the first pointer and the second pointer shadow Peripheral Component Interconnect Express (PCI-E)-based pointers that are inaccessible and control the manner in which the parallel processing subsystem executes work included in the work queue.
5. The method of claim 2, further comprising: determining that all other threads executing in the parallel processing subsystem have completed generating work within the parallel processing subsystem; and returning execution control back to a central processing unit (CPU) that is in communication with the parallel processing subsystem.
6. The method of claim 1, wherein determining that at least N+1 entries in the plurality of entries are free comprises comparing the first pointer against the second pointer.
7. The method of claim 1, wherein the parallel processing subsystem is a graphics processing unit (GPU).
8. The computer-implemented method of claim 1, wherein each of the first pointer and the second pointer is accessible to threads executing on the parallel processing subsystem, and each of a third pointer and a fourth pointer is not accessible to threads executing on the parallel processing subsystem.
9. The computer-implemented method of claim 8, wherein the third pointer comprises an index of a first available entry in the plurality of entries, and the fourth pointer comprises an index of a last entry of the plurality of entries.
10. The computer-implemented method of claim 1, further comprising: determining, via atomic increment of the first pointer and comparison to the second pointer, that at least N+1 entries in the plurality of entries of the work queue are not free; and in response to determining that the at least N+1 entries are not free, inserting into the work queue an entry pointing to a command block that includes a Wait for Idle command, a Semaphore Release command, and the Semaphore Acquire command.
11. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to generate work within a parallel processing subsystem, by performing the steps of: generating a first command block that includes one or more entries; generating instructions to be executed by the parallel processing subsystem; loading the instructions into N command blocks; generating one additional command block; loading a Semaphore Acquire command into the single entry of the additional command block; determining, via atomic increment of a first pointer and comparison to a second pointer, that at least N+1 entries in the plurality of entries of a work queue are free and that an addition of the first pointer and N+1 does not exceed the second pointer; pointing N entries in the plurality of entries of the work queue, starting at the result of the atomic increment, to the first N command blocks; and pointing an Nth+1 entry in the plurality of entries of a work queue to the additional command block.
12. The non-transitory computer-readable storage medium of claim 11, further comprising: determining that no additional work remains to be generated; and releasing the semaphore.
13. The non-transitory computer-readable storage medium of claim 12, wherein releasing the semaphore comprises writing a value in an area of a memory of the parallel processing subsystem to which the semaphore is directed.
14. The non-transitory computer-readable storage medium of claim 11, wherein the first pointer and the second pointer shadow Peripheral Component Interconnect Express (PCI-E)-based pointers that are inaccessible and control the manner in which the parallel processing subsystem executes work included in the work queue.
15. The non-transitory computer-readable storage medium of claim 12, further comprising: determining that all other threads executing in the parallel processing subsystem have completed generating work within the parallel processing subsystem; and returning execution control back to a central processing unit (CPU) that is in communication with the parallel processing subsystem.
16. The non-transitory computer-readable storage medium of claim 11, wherein determining that at least two entries in the plurality entries are free comprises comparing the first pointer against the second pointer.
17. The non-transitory computer-readable storage medium of claim 11, wherein the parallel processing subsystem is a graphics processing unit (GPU).
18. A computing device, comprising: a parallel processor configured to launch one or more threads, wherein each thread is configured to: generate a first command block that includes one or more entries; generate instructions to be executed by the parallel processing subsystem; load the instructions into N command blocks; generate one additional command block; load a Semaphore Acquire command into the single entry of the additional command block; determine, via atomic increment of a first pointer and comparison to a second pointer, that at least N+1 entries in the plurality of entries of a work queue are free and that an addition of the first pointer and N+1 does not exceed the second pointer; point N entries in the plurality of entries of the work queue, starting at the result of the atomic increment, to the first N command blocks; and point an Nth+1 entry in the plurality of entries of a work queue to the additional command block.
19. The computing device of claim 18, wherein the thread is further configured to: determine that no additional work remains to be generated; and release the semaphore.
20. The computing device of claim 19, wherein releasing the semaphore comprises writing a value in an area of a memory of the parallel processor to which the semaphore is directed.
21. The computing device of claim 18, wherein the first pointer and the second pointer shadow Peripheral Component Interconnect Express (PCI-E)-based pointers that are inaccessible and control the manner in which the parallel processor executes work included in the work queue.
22. The computing device of claim 19, further comprising: determining that all other threads executing in the parallel processor have completed generating work within the parallel processor; and returning execution control back to a central processing unit (CPU) that is in communication with the parallel processing subsystem.
23. The computing device of claim 18, wherein determining that at least two entries in the plurality entries are free comprises comparing the first pointer against the second pointer.
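The reservation step recited in claims 1, 11, and 18 can be illustrated with a rough sketch: an atomic increment of a first pointer, a comparison against a second pointer, and then pointing N+1 consecutive entries at the N work blocks and the Semaphore Acquire block. This is only one possible reading written in host-side C++, not the patented implementation. The names `CommandBlock`, `WorkQueue`, `reserve_entries`, and `submit_work` are hypothetical, the second pointer is assumed to hold the index of the last free entry, and the fallback path of claim 10 is not modeled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical command block: a list of command names.
struct CommandBlock {
    std::vector<std::string> commands;
};

// Hypothetical work queue. first_ptr and second_ptr stand in for the
// "first pointer" and "second pointer" of the claims.
struct WorkQueue {
    std::vector<const CommandBlock*> entries;
    std::atomic<std::size_t> first_ptr{0};
    std::size_t second_ptr;  // assumed: index of the last free entry

    explicit WorkQueue(std::size_t size)
        : entries(size, nullptr), second_ptr(size - 1) {}
};

// Try to reserve n + 1 consecutive entries (n work blocks plus one
// Semaphore Acquire block). On success, `start` receives the value the
// first pointer held before the atomic increment.
bool reserve_entries(WorkQueue& q, std::size_t n, std::size_t& start) {
    const std::size_t needed = n + 1;
    const std::size_t old_value = q.first_ptr.fetch_add(needed);
    // Literal reading of the claim: first pointer + (N+1) must not
    // exceed the second pointer.
    if (old_value + needed > q.second_ptr) {
        // The pointer has already advanced; handling this case
        // (claim 10) is outside the scope of this sketch.
        return false;
    }
    start = old_value;
    return true;
}

// Point N entries at the work blocks and entry N+1 at the acquire block.
bool submit_work(WorkQueue& q, const std::vector<CommandBlock>& work,
                 const CommandBlock& acquire) {
    std::size_t start = 0;
    if (!reserve_entries(q, work.size(), start)) {
        return false;
    }
    assert(start + work.size() < q.entries.size());
    for (std::size_t i = 0; i < work.size(); ++i) {
        q.entries[start + i] = &work[i];
    }
    q.entries[start + work.size()] = &acquire;
    return true;
}
```

Because each caller receives a distinct starting index from the atomic increment, concurrent threads in this sketch never write to the same entries. Whether the bound should be inclusive or exclusive depends on how the second pointer is defined, which the claims leave open.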
October 26, 2012
November 8, 2016