9916162

Using a Global Barrier to Synchronize Across Local Thread Groups in General Purpose Programming on GPU

Published: March 13, 2018
Assignee: not available in USPTO data we have
Inventors: Niraj Gupta
Patent Claims
14 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A system comprising:
  one or more transceivers;
  a host processor in communication with the one or more transceivers;
  a system memory associated with the host processor;
  a processor, in communication with the system memory, to receive a workload from the host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group, the processor including:
    a first plurality of processing elements to receive and process a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads,
    a second plurality of processing elements to receive and process a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, and
    a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the global barrier to enable the workload to be partitioned into the plurality of kernels and to synchronize the processing of the workload across the first thread group and the second thread group, wherein the plurality of kernels are to be processed concurrently and in parallel.
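As a rough illustration of what claim 1 describes, the sketch below models two thread groups on a CPU with Python threads and uses a single `threading.Barrier` shared by both groups as a stand-in for the global barrier. This is only an analogy under stated assumptions: the patent concerns a graphics processor, and the names `run_workload`, `GROUP_SIZE`, and `NUM_GROUPS` are hypothetical and do not come from the patent.

```python
import threading

GROUP_SIZE = 4   # threads per thread group (illustrative value)
NUM_GROUPS = 2   # the "first" and "second" thread group of claim 1


def run_workload(data):
    """Two-phase computation over len(data) == GROUP_SIZE * NUM_GROUPS items.

    Phase 1: every thread squares its own element.
    Phase 2: every thread reads a value written by a thread in the OTHER
    group, which is only safe once all threads in all groups finished phase 1.
    """
    total = GROUP_SIZE * NUM_GROUPS
    squared = [None] * total
    result = [None] * total
    # One barrier shared by every group: models the claimed global barrier.
    global_barrier = threading.Barrier(total)

    def worker(group_id, local_id):
        gid = group_id * GROUP_SIZE + local_id
        squared[gid] = data[gid] * data[gid]           # phase 1
        global_barrier.wait()                          # cross-group sync point
        partner = (gid + GROUP_SIZE) % total           # slot owned by other group
        result[gid] = squared[gid] + squared[partner]  # phase 2

    threads = [threading.Thread(target=worker, args=(g, l))
               for g in range(NUM_GROUPS) for l in range(GROUP_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Without the barrier, a thread in one group could read `squared[partner]` before the other group has written it, which is the cross-group hazard a global barrier is meant to remove.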

2. The system of claim 1, wherein the first thread group and the second thread group form a global thread group.

3. The system of claim 1, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.

4. The system of claim 3, wherein the determination is made without polling.
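Claims 3 and 4 can be pictured with a notification-based completion counter: each thread reports that it has finished, and a waiter is woken by a signal rather than by repeatedly checking a flag. The sketch below is a software analogy only, assuming Python's `threading.Condition`; the class `CompletionCounter` and the function `run_groups` are hypothetical names, and the patent does not say the hardware works this way.

```python
import threading


class CompletionCounter:
    """Tracks completions across thread groups and wakes waiters by
    notification instead of having them spin on a shared flag."""

    def __init__(self, expected):
        self._expected = expected
        self._done = 0
        self._cond = threading.Condition()

    def arrive(self):
        with self._cond:
            self._done += 1
            if self._done == self._expected:
                self._cond.notify_all()  # last thread signals completion

    def wait_all(self, timeout=None):
        with self._cond:
            # Sleeps until notified; returns True once every thread arrived.
            return self._cond.wait_for(
                lambda: self._done == self._expected, timeout)


def run_groups(group_sizes):
    """Run one thread per (group, local id) and wait for all of them."""
    counter = CompletionCounter(sum(group_sizes))
    finished = []
    lock = threading.Lock()

    def worker(group_id, local_id):
        with lock:
            finished.append((group_id, local_id))  # the thread's "work"
        counter.arrive()

    threads = [threading.Thread(target=worker, args=(g, l))
               for g, size in enumerate(group_sizes) for l in range(size)]
    for t in threads:
        t.start()
    all_done = counter.wait_all(timeout=5.0)
    for t in threads:
        t.join()
    return all_done, sorted(finished)
```

The waiter in `wait_all` blocks inside the condition variable and consumes no cycles while waiting, which is the practical difference from a polling loop.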

5. The system of claim 1, wherein the processor is a graphics processor.

6. An apparatus comprising:
  a graphics processor to receive a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group, the graphics processor including:
    a first plurality of processing elements to receive and process a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads,
    a second plurality of processing elements to receive and process a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, and
    a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the global barrier to enable the workload to be partitioned into the plurality of kernels and to synchronize the processing of the workload across the first thread group and the second thread group, wherein the plurality of kernels are to be processed concurrently and in parallel.

7. The apparatus of claim 6, wherein the first thread group and the second thread group form a global thread group.

8. The apparatus of claim 6, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.

9. A method comprising:
  receiving, at a graphics processor, a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group;
  receiving, at a first plurality of processing elements, a first thread group having plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads,
  receiving, at a second plurality of processing elements, a second thread group having plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads,
  synchronizing, at a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the processing of the workload across the first thread group and the second thread group, wherein the global barrier enables the workload to be partitioned into the plurality of kernels and the plurality of kernels are to be processed concurrently and in parallel.
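The method steps of claim 9 (partition the workload into kernels, hand each kernel to its own thread group, then synchronize globally) can be sketched as follows. The example also contrasts a per-group local barrier with the cross-group global barrier. It is a CPU-side model under stated assumptions: `partition` and `process` are hypothetical helpers, the workload length is assumed to be divisible by the number of kernels, and nothing here reflects the actual GPU implementation.

```python
import threading


def partition(workload, num_kernels):
    """Split the workload into equal chunks, one per kernel.
    Assumes len(workload) is divisible by num_kernels."""
    size = len(workload) // num_kernels
    return [workload[i * size:(i + 1) * size] for i in range(num_kernels)]


def process(workload, num_kernels=2):
    kernels = partition(workload, num_kernels)
    group_size = len(kernels[0])
    scratch = [[None] * group_size for _ in range(num_kernels)]
    partial = [0] * num_kernels
    out = [[None] * group_size for _ in range(num_kernels)]
    # Local barrier: synchronizes the threads of a single group.
    local_barriers = [threading.Barrier(group_size) for _ in range(num_kernels)]
    # Global barrier: synchronizes every thread of every group.
    global_barrier = threading.Barrier(group_size * num_kernels)

    def thread_body(k, t):
        scratch[k][t] = kernels[k][t] * 2          # per-thread work
        local_barriers[k].wait()                   # group k has filled scratch[k]
        if t == 0:
            partial[k] = sum(scratch[k])           # one partial sum per group
        global_barrier.wait()                      # all partial sums are ready
        total = sum(partial)                       # needs data from all groups
        out[k][t] = scratch[k][t] * 100 // total   # integer percent of the total

    threads = [threading.Thread(target=thread_body, args=(k, t))
               for k in range(num_kernels) for t in range(group_size)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return out
```

The local barrier is enough for the per-group partial sum, but the final step depends on results from every group, so it has to wait at the global barrier.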

10. The method of claim 9, wherein the first thread group and the second thread group form a global thread group.

11. The method of claim 9, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.

12. At least one non-transitory computer readable storage medium comprising a set of instructions which, if executed by a graphics processor, cause a computer to:
  receive, at a processor, a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group;
  receive, at a first plurality of processing elements, a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads,
  receive, at a second plurality of processing elements, a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads,
  synchronize, at a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the processing of the workload across the first thread group and the second thread group, wherein the global barrier enables the workload to be partitioned into the plurality of kernels and the plurality of kernels are to be processed concurrently and in parallel.

13. The at least one non-transitory computer readable storage medium of claim 12, wherein the first thread group and the second thread group form a global thread group.

14. The at least one non-transitory computer readable storage medium of claim 12, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.

Patent Metadata

Filing Date: Unknown
Publication Date: March 13, 2018
Inventors: Niraj Gupta

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “USING A GLOBAL BARRIER TO SYNCHRONIZE ACROSS LOCAL THREAD GROUPS IN GENERAL PURPOSE PROGRAMMING ON GPU” (9916162). https://patentable.app/patents/9916162
