A method for scheduling tasks from a program executed by a multi-processor core system is disclosed. The method includes a scheduler that groups a plurality of tasks, each having an assigned priority, by priority in a task group. The task group is assembled with other task groups having identical priorities in a task group queue. A hierarchy of task group queues is established based on priority levels of the assigned tasks. Task groups are assigned to one of a plurality of worker threads based on the hierarchy of task group queues. Each of the worker threads is associated with a processor in the multi-processor system. The tasks of the task groups are executed via the worker threads according to the order in the hierarchy.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A multi-thread system, comprising:
2. The multi-thread system of, wherein the atomic exchange function is an atomic increment function.
3. The multi-thread system of, wherein the atomic exchange function is an atomic add function.
4. The multi-thread system of, wherein second processor core is configured to, based at least upon the determination that the first task group is completed, send a task group status signal to the first processor core, the task group status indicating that the first task group is completed.
5. The multi-thread system of, wherein the first processor core is configured, based at least upon the task group status signal, to schedule a second plurality of tasks of a second task group to be concurrently executed on a second plurality of worker threads, the second plurality of worker threads comprising one or more worker threads of the first plurality of worker threads.
Complete technical specification and implementation details from the patent document.
The present application is a continuation of U.S. patent application Ser. No. 17/569,275, filed on Jan. 5, 2022, now allowed, which is a continuation of U.S. patent application Ser. No. 15/011,127, filed on Jan. 29, 2016, now U.S. Pat. No. 11,249,807, which is a continuation of U.S. patent application Ser. No. 14/077,899, filed on Nov. 12, 2013, now U.S. Pat. No. 9,250,953, issued on Feb. 2, 2016, each of which is hereby incorporated by reference herein in its entirety.
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates generally to scheduling tasks in a multi-thread system, and more particularly, to a task scheduler that orders program tasks in task groups and task group queues for execution by worker threads in a multi-core system.
Current processing systems have multiple processing cores to provide parallel processing of computational tasks, which increases the speed of completing such tasks. For example, specialized processing chips such as graphics processing units (GPUs) have been employed to perform complex operations such as rendering graphics. A GPU is understood as a specialized processing circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. GPUs may include hundreds if not thousands of processing cores, since graphics processing may be massively parallelized to speed rendering of graphics in real time. GPUs perform various graphics processing functions by performing calculations related to 3D graphics. These include accelerating memory-intensive work such as texture mapping and rendering polygons, and performing geometric calculations such as the rotation and translation of vertices into different coordinate systems. GPUs may also support programmable shaders, which can manipulate vertices and textures, oversampling and interpolation techniques to reduce aliasing, and very high-precision color spaces.
In multi-core systems, it is desirable to perform multi-threading in order to accomplish parallel processing of programs. Multi-threading is a widespread programming and execution model that allows multiple software threads to exist within the context of a single process. These software threads share the resources of the multi-core system, but are able to execute independently. Multi-threading can also be applied to a single process to enable parallel execution on a multi-core system. This advantage of a multi-threaded program allows it to operate faster on computer systems that have multiple CPUs, CPUs with multiple cores, or across a cluster of machines because the threads of the program naturally lend themselves to concurrent execution.
A task scheduler is a program or a module of a program that is responsible for accepting, ordering, and scheduling portions of the program to be executed on one or more threads that are executed by the cores in a multi-core system. These portions of a program are typically referred to as tasks. In any multi-thread capable system, scheduling and executing tasks requires synchronization. This synchronization introduces a serial point at which a multi-thread system effectively behaves as a single-thread system, and the resulting effect on performance is explained by Amdahl's law. Amdahl's law states that if P is the proportion of a program that can be made parallel (i.e., benefit from parallelization), and (1−P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is S(N)=1/((1−P)+P/N).
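As a minimal sketch of Amdahl's law in its usual form, S(N) = 1/((1−P) + P/N), the following function computes the bound. The function name and parameter types are illustrative assumptions, not part of the disclosure.

```cpp
#include <cstddef>

// Amdahl's law: upper bound on speedup with n processors when a fraction p
// of the program can be parallelized. Assumes 0 <= p <= 1 and n >= 1.
double amdahl_speedup(double p, std::size_t n) {
    const double serial = 1.0 - p;                      // part that stays serial
    const double parallel = p / static_cast<double>(n); // part divided across n processors
    return 1.0 / (serial + parallel);
}
```

For instance, with P = 0.95 and N = 8 the bound is roughly 5.9, and no number of processors can push it past 1/(1−P) = 20, which is why the serial portion introduced by synchronization matters so much.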
Presently, there are three synchronization mechanisms employed by computer programs to address ordering or serialization issues that arise when using multiple threads to parallelize program execution. The first is atomic instructions or operations, which are the least costly in terms of the number of CPU cycles required to synchronize an operation. The second, and next most expensive, mechanism is typically referred to as “lockless,” in which one or more atomic instructions are used to synchronize data and program operation. The third comprises mutual exclusion objects (mutexes), critical sections, and locks. These mechanisms are typically used to guard a region of a program against simultaneous access by multiple threads. Not only are these mechanisms the most expensive, they also suffer from an additional issue: if a thread is pre-empted in its execution while it owns the lock, it can serialize the program for a significant amount of time.
In addition to the cost of the serialization mechanism another factor must also be considered, namely, simultaneous accesses to that specific mechanism. This is typically referred to as “contention” and is directly related to the number of users, tasks, or threads attempting to synchronize the same portion of a program. Contention issues reduce the speed of execution because cores must wait for the completion of other tasks by other cores.
Therefore, to maximize the potential of a multi-thread system to run a program in parallel, the serial tasks managed by a task scheduler must be minimized. In smaller scale multi-thread systems, concurrent execution is relatively simple. For example, a program with 500-1000 tasks on four worker threads (e.g., one thread for graphics, one thread for artificial intelligence, etc.) will not encounter serious contention issues. However, as the number of tasks increases from more complex issues and the number of cores increases (e.g., 20,000 tasks on eight cores or more with hyper-threading), contention is a major issue in maximizing the parallel execution of the program.
The number of CPU cycles required to be executed during the synchronization is also a consideration. In the case of atomic operations, the CPU can only serialize a small amount of data (typically 4 to 8 bytes), for which the cost may only be the number of CPU cycles required to execute the instruction in addition to the number of cycles required to propagate the data change. However, in the case of mutexes and critical sections, not only is the atomic penalty incurred (since they are implemented using atomics), but they are also commonly used to perform much more complex work that cannot be expressed with a single instruction. This additional complexity of work will incur many more CPU cycles, which in turn will increase the cost of the synchronization.
In this way, the overall cost of synchronization or the amount of serial execution can be described or computed as “TotalCost=Synchronization Mechanism Cost*CPU Cycles*Amount of Contention.” To reduce serialization to a minimum it is therefore required to consider and reduce the total cost of synchronization.
Thus, there is a need for a task scheduler that minimizes the amount of serial execution of program tasks in assigning threads to cores for parallel execution in a multi-core system. There is also a need for a task scheduler that organizes tasks in task groups and task group queues, which are in turn organized in a hierarchy for assignment to worker threads. There is a further need for a task scheduler that efficiently uses workers to perform tasks in parallel while minimizing locks. There is also a need for a task scheduler that minimizes the amount of contention a multi-core system incurs when multiple worker threads are attempting to acquire the same lock.
According to one example, a task scheduler for scheduling a plurality of tasks of a program to be executed on one or more worker threads is disclosed. The task scheduler includes a task group component that creates task groups by assigning each of the plurality of tasks to a task group. A task group queue component organizes the task groups according to a predetermined criterion in a task group queue and creates a hierarchy of task group queues. A worker thread pool includes a group of worker threads each associated with one of a plurality of processor cores. A scheduler logic component assigns the worker threads in the worker thread pool to execute the task group queues according to the hierarchy of task group queues.
Another example is a method for scheduling tasks in a multi-core system. A plurality of tasks, each having an assigned priority, is grouped by priority in a task group. The task group is assembled with other task groups having identical priorities in a task group queue. A hierarchy of task group queues is established based on priority levels of the assigned tasks. Task groups are assigned to one of a plurality of worker threads based on the hierarchy of task group queues. Each of the worker threads is associated with a processor core in the multi-core system. The tasks of the task groups are executed via the worker threads according to the order in the hierarchy.
Another example is a non-transitory, machine readable medium having stored thereon instructions for scheduling tasks for execution by a plurality of processor cores. The stored instructions comprise machine executable code, which, when executed by at least one machine processor, causes the machine processor to group a plurality of tasks, each having an assigned priority, by priority in a task group. The instructions cause the machine processor to assemble the task group with other task groups having identical priorities in a task group queue. The instructions cause the machine processor to establish a hierarchy of task group queues based on priority levels of the assigned tasks. Task groups are assigned to one of a plurality of worker threads based on the hierarchy of task group queues. Each of the worker threads is associated with a processor core of the plurality of processor cores. The instructions cause the machine processor to execute the tasks of the task groups via the worker threads according to the order in the hierarchy.
Additional aspects of the invention will be apparent to those of ordinary skill in the art in view of the detailed description of various embodiments, which is made with reference to the drawings, a brief description of which is provided below.
While the invention is susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular forms disclosed. Rather, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
An example multi-core processing system includes a multi-core processor, an active worker thread pool, an inactive worker thread pool, and a task scheduler. The multi-core processor may be any device that includes multiple processing cores, such as a multi-core CPU, GPU, or APU.
The multi-core processor executes a program by distributing the tasks in the program among worker threads from the worker thread pool. The programs executed by the multi-core processor are segmented into tasks, and a programmer assigns a priority to each task. The task scheduler reduces serialization of a program and minimizes total processing cost when scheduling and executing tasks in a program executed by the multi-core processor, as will be explained below. The task scheduler provides a hierarchy of task group queues where task groups are organized according to a predetermined criterion such as priority. Each task group is a collection of tasks from the program. The task group queues allow for ordered access to task groups that have been submitted for execution. The task scheduler manages worker threads in the worker thread pool that are scheduled to execute tasks. The task scheduler is responsible for the state of the worker threads and the logic associated with assignment of the worker threads to task queues, task groups, and tasks.
In this example, the multi-core processor includes four processing cores. Each of the processing cores in this example is hyper-threaded and may therefore include multiple hardware threads. This allows each of the cores to run multiple software threads simultaneously. In this example, two hardware threads are assigned to each of the processing cores. It is to be understood that processing cores with more than two hardware threads, or with a single hardware thread, may be used. It is also to be understood that a multi-core system may include many more cores than the four cores of this example.
As explained above, the task scheduler schedules a plurality of tasks of a program to be executed on one or more worker threads that are each associated with a core of the multi-core processor. The task scheduler includes a task group component that creates task groups by assigning each of the plurality of tasks to a task group. The task scheduler also includes a task group queue component that organizes the task groups according to a predetermined criterion in a task group queue and creates a hierarchy of task group queues. The task scheduler includes scheduler logic that organizes tasks that are assigned to each logical worker thread in the worker thread pool and the corresponding hardware thread in a processing core of the multi-core processor. As will be explained below, the task scheduler orders the tasks in a hierarchy of task group queues for execution by the cores of the multi-core processor. In this example, the active worker pool includes eight logical worker threads. Each of the logical worker threads is assigned to one of the hardware threads of the processing cores. As will be explained below, the scheduler logic assigns tasks to each of the worker threads, which, in combination, may execute the tasks of a program in parallel on the corresponding processor cores. The inactive worker pool includes worker threads that do not have a corresponding hardware thread assigned from the hardware cores and are therefore inactive. When an active worker thread is finished or idle, worker threads from the inactive worker pool may be activated and assigned to a hardware thread. In this case, the active worker thread associated with the hardware thread may be deactivated, and the now deactivated worker thread would be assigned to the inactive worker pool.
In this example, the task scheduler software is run on one of the processor cores to manage the execution of a program by the multi-core processor. However, the task scheduler and corresponding hierarchy of tasks may run on a separate processor such as a CPU or an ASIC. The task scheduler instruction set may also be transferred from one core to another of the multi-core processor. The scheduler logic of the task scheduler is typically employed by the worker threads to determine tasks to execute. Once a worker thread in the worker pool completes an assigned task, the worker thread will execute the scheduler logic to determine the next task to be executed, as will be explained in detail below.
A flow diagram in the drawings shows the process of ordering tasks in task groups and ordering task groups in task group queues by the task scheduler. An example task scheduling process is performed by the task scheduler. The task scheduling process includes a series of task groups, which are organized into different task group queues by a predetermined criterion such as priority level. In this example, the task group queues are organized into a hierarchical task group queue, which organizes the task group queues into three different priority task group queues, ranging from the highest priority task group queue to the lowest priority task group queue. The task group queues are all grouped into one of the three priority queues according to the priority of a task group. Although there are three levels of priority in this example, it is to be understood that any number of priority levels may be used to organize the task group queues. The task scheduler process accesses a worker pool, which includes the available worker threads from the active worker pool of the multi-core system. The task scheduler also includes the scheduler logic, which is used to organize the order of the tasks for performance by worker threads in the worker pool.
The hierarchy of task group queues organizes the task group queues according to priority. The use of task group queues allows for ordered access to task groups that have been submitted by the user for execution. Each of the task groups is a collection of tasks to be executed. The software thread pool or worker thread pool is a group of software threads that exist and may be scheduled to execute tasks. The task scheduler is responsible for managing the state of the software or worker threads, and the scheduler logic is associated with assignment of worker threads to the task queues, task groups, and tasks within the task groups.
These components are utilized together to generate a framework in which a hierarchy of synchronization may be expressed, in which serialization is minimized and ordering of work can be maintained for a multi-core system. This hierarchy allows utilization of a layered approach to synchronization in which mechanisms with a much lower cost and complexity may be applied to multi-core systems, such as the multi-core processor, to allow parallel processing. For example, atomic add (Atomic_Add), atomic increment (Atomic_Increment), atomic exchange add (Atomic_Exchange), and atomic compare and exchange (Atomic_CompareAndExchange) instructions may all be used to coordinate the acquiring of a task without requiring “locking.” These mechanisms may be considered lockless synchronization primitives. An example for a linked list of tasks within a task group is the following:
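The original listing is not reproduced in this text. As a hedged sketch of how a task might be acquired from a linked list without a lock, the following uses the C++ standard `std::atomic` compare-and-exchange in place of the Atomic_CompareAndExchange primitive named above. The `Task` and `TaskGroup` layouts are assumptions made for illustration only, and the sketch assumes that tasks are only removed (never re-inserted or freed) while workers are running, which sidesteps the ABA problem.

```cpp
#include <atomic>

// Hypothetical task node; field names are illustrative, not from the source.
struct Task {
    void (*run)(void*);  // work to perform
    void* arg;           // argument passed to run
    Task* next;          // next task in the group's list
};

// Hypothetical task group holding the head of a singly linked task list.
struct TaskGroup {
    std::atomic<Task*> head{nullptr};
};

// Pop the first task without taking a lock; returns nullptr when the group
// is empty. Nodes must stay alive until the whole group has completed.
Task* AcquireTask(TaskGroup& group) {
    Task* task = group.head.load(std::memory_order_acquire);
    while (task != nullptr &&
           !group.head.compare_exchange_weak(task, task->next,
                                             std::memory_order_acquire,
                                             std::memory_order_acquire)) {
        // On failure compare_exchange_weak stores the current head in `task`,
        // so the loop simply retries with the fresh value.
    }
    return task;
}
```

Only the single pointer `head` is synchronized, which matches the point made in the text that the amount of data to be serialized within a task group can be kept very small.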
In addition, even if a “lock” technique was used, acquiring a task within a specific group is the only critical consideration and since the scope of work is much smaller, the lock needs to be held for fewer processor execution cycles, which results in less opportunity for contention.
The task group is the lowest level component used within the scheduling hierarchy produced by the process and is designed to allow the user to associate a set of 1 to n tasks from the program with the task group, which may then be executed by the worker threads. Users may also specify the maximum number of worker threads allowed to execute tasks within the task group. By altering both the number of tasks in the task group and the maximum number of worker threads, the user is able to configure a task group that will have minimal contention.
For example, the expected execution time for a simple task, “AddValueToField( )” may be very small, such as 50 cycles. Acquiring the task by a worker thread through an “AcquireTask( )” command, even if lockless, may take 100 cycles. Therefore the likelihood of contention being an issue is super-linear with regard to the number of worker threads in the task group, as the worker threads in this group will spend more time acquiring tasks than executing them. So in this case, instead of having eight worker threads in one task group, it will be more efficient to have one worker thread assigned to each of eight task groups. The inverse is also true, in that if the task “AddValueToField( )” takes several thousand cycles, then the likelihood of contention drops dramatically and it will be more efficient to have eight worker threads in one task group.
In addition, each worker thread has two unique IDs that may be used by the tasks. The first ID is an Application ID and is guaranteed unique for each worker thread in the application. The second is the TaskGroupID and is guaranteed unique to each worker within a specific task group. In this way the user can use either the TaskGroupID, or the ApplicationID as a key to separating the workload of the task groups. Also the task group itself can be referenced to add more context to the task, which allows for even finer grained sequence control.
Contention refers to multiple threads trying to access the same resources, not just locks and synchronization mechanisms. An example of contention is if a user has values they would like to sum in the following code.
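The original listing is not reproduced in this text. A minimal sketch of such a high-contention task, assuming a single shared accumulator named `SharedSum` (the name follows the text) and a hypothetical task body `SumValuesTask`, might look like this:

```cpp
#include <atomic>
#include <cstddef>

// One accumulator shared by every worker thread (assumed layout).
std::atomic<long> SharedSum{0};

// Hypothetical task body: add each value in the range to the shared total.
// Every fetch_add targets the same memory location, so the more workers
// run this task concurrently, the more they contend on that location.
void SumValuesTask(const int* values, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        SharedSum.fetch_add(values[i], std::memory_order_relaxed);
    }
}
```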
With a task designed like this example, the more worker threads executing tasks, the more contention will result from executing the instruction “&SharedSum,” which adds the sum, as only one hardware thread can write to it at a time. This would be a case of high contention. The example scheduler resolves this problem by the following instructions.
The task scheduler works with two identifiers associated with tasks, TaskGroupID and ApplicationID. Either the TaskGroupID or ApplicationID identifiers, or both, may be used as keys to allow the separation of data. The TaskGroupID identifier is unique to the task group, whereas the ApplicationID identifier is unique to the application. Since the number of workers in a task group may be limited, there are fewer TaskGroupIDs to handle and therefore the user may have finer grained control over how they use the key to reference specific data or use regions of program memory. If the user were to use just the ApplicationID of the worker thread, the user would need to handle a greater range of values, which may be less optimal or convenient for them. This is especially true as the number of possible worker threads increases. Hence the TaskGroupID better constrains the problem for the user.
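The original listing is not reproduced in this text. One way the idea could be sketched, assuming each worker is handed its TaskGroupID as a zero-based index, is to give every worker its own partial sum and combine the partial sums once at the end. The `PartialSum` type, the function names, and the 64-byte cache-line size are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// One partial sum per worker, padded to an assumed 64-byte cache line so
// that neighbouring slots do not share a line (avoids false sharing).
struct alignas(64) PartialSum {
    long value = 0;
};

// Hypothetical task body: accumulate into the slot selected by the worker's
// TaskGroupID, so no two workers ever write to the same location.
void SumValuesPerWorkerTask(std::vector<PartialSum>& partial,
                            std::size_t taskGroupId,
                            const int* values, std::size_t count) {
    long local = 0;
    for (std::size_t i = 0; i < count; ++i) {
        local += values[i];
    }
    partial[taskGroupId].value += local;
}

// Run once after all tasks have finished, for example from a completion
// signal function, to produce the final total.
long CombinePartialSums(const std::vector<PartialSum>& partial) {
    long total = 0;
    for (const PartialSum& p : partial) {
        total += p.value;
    }
    return total;
}
```

Because the number of workers in a task group is bounded, the `partial` vector only needs as many slots as the group's maximum worker count.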
Users may also specify a signal function that will be called once all tasks have been completed within a group by the last remaining worker thread. This allows for re-entrant task groups, as well as the dynamic building and execution of task group graphs. The signal function is a way for the user to specify specific code to be executed once all tasks in a task group have completed. The signal function is only run once per completion of all tasks in the task group. An example of a signal function is shown in the following pseudo-code:
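The original pseudo-code is not reproduced in this text. The sketch below shows one plausible shape for it: a user-supplied signal function sets a shared flag, and the last worker to finish a task is detected with an atomic decrement. The flag name `bSumValuesTaskGroupComplete` follows the text; `TaskGroup`, `remaining`, and `OnTaskCompleted` are hypothetical names.

```cpp
#include <atomic>

// Flag the application can poll elsewhere to learn that the group is done.
std::atomic<bool> bSumValuesTaskGroupComplete{false};

// User-supplied signal function.
void SumValuesSignal() {
    bSumValuesTaskGroupComplete.store(true, std::memory_order_release);
}

// Hypothetical task group bookkeeping.
struct TaskGroup {
    std::atomic<int> remaining{0};  // tasks not yet completed
    void (*signal)() = nullptr;     // called once when remaining reaches zero
};

// Called by a worker after it finishes one task of the group. fetch_sub
// returns the previous value, so exactly one caller observes 1; that caller
// is the last remaining worker and runs the signal function.
void OnTaskCompleted(TaskGroup& group) {
    if (group.remaining.fetch_sub(1, std::memory_order_acq_rel) == 1 &&
        group.signal != nullptr) {
        group.signal();
    }
}
```

Elsewhere in the application, the caller would check `bSumValuesTaskGroupComplete.load(std::memory_order_acquire)` before reading the results of the group.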
Then somewhere else in the application,
In this example, the value “bSumValuesTaskGroupComplete” assigned by the user is shared and therefore the scheduler alerts the user when the tasks are complete. Another example is where the user may set up the task group to be re-entrant or cyclical as shown below.
The user may also dynamically build a graph or tree of task groups. For task groups A, B and C in the below example, when task group A is complete, the completion signal “SignalA( )” is sent and task group B is added. When task group B is complete, the completion signal “SignalB( )” is sent and task group C is added.
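The original listing is not reproduced in this text. A minimal sketch of the chaining, assuming a stand-in `Scheduler` type that merely records the order in which task groups are submitted (the type, `gScheduler`, and `AddTaskGroup` are hypothetical), could be:

```cpp
#include <string>
#include <vector>

// Stand-in for the task scheduler: records submitted task groups by name.
struct Scheduler {
    std::vector<std::string> submitted;
    void AddTaskGroup(const std::string& name) { submitted.push_back(name); }
};

Scheduler gScheduler;

// Completion signal for task group A: submit task group B.
void SignalA() { gScheduler.AddTaskGroup("B"); }

// Completion signal for task group B: submit task group C.
void SignalB() { gScheduler.AddTaskGroup("C"); }
```

The application submits only task group A; B and C enter the scheduler later, from inside the signal functions, so the graph is built while it executes.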
In this way task group C may be dynamically added once task groups A and B are both completed. Dependencies like this are representative of simple graphs of execution, but more elaborate ones may be constructed.
Due to the ability of task groups to reside in prioritized task group queues that can have stricter ordering rules, the amount of data required to be synchronized between worker threads executing within the task group is minimized. This minimization allows for a much less costly synchronization mechanism to be employed to ensure proper ordering of tasks within the task group. In addition, by aggregating a series of tasks within a group, an additional level of ordering or priority can be considered by the task scheduler.
The below is an example of pseudo code for a user (programmer) to populate a task group and add the task group to the task scheduler.
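The original pseudo code is not reproduced in this text. As a hedged sketch only, populating a task group and handing it to a scheduler might look as follows; every type and member name here (`Priority`, `TaskGroup`, `TaskScheduler`, `AddTaskGroup`, `MakeCounterGroup`) is an assumption made for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

enum class Priority { High, Medium, Low };

// Hypothetical task group: a set of tasks plus the settings the text describes.
struct TaskGroup {
    std::vector<std::function<void()>> tasks;
    std::size_t maxWorkers = 1;            // cap on workers inside this group
    Priority priority = Priority::Medium;  // selects the task group queue
    std::function<void()> signal;          // optional completion callback
};

// Hypothetical scheduler front end: one FIFO of task groups per priority.
struct TaskScheduler {
    std::vector<TaskGroup> queues[3];
    void AddTaskGroup(TaskGroup group) {
        const std::size_t level = static_cast<std::size_t>(group.priority);
        queues[level].push_back(std::move(group));
    }
};

// User code: build a group of four small tasks that bump a counter.
TaskGroup MakeCounterGroup(int* counter) {
    TaskGroup group;
    for (int i = 0; i < 4; ++i) {
        group.tasks.push_back([counter] { ++*counter; });
    }
    group.maxWorkers = 2;
    group.priority = Priority::High;
    return group;
}
```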
The task group queue is the next level component used within the scheduling hierarchy and is designed to apply another layer of ordering within the hierarchy established by the task scheduler. The task group queue is responsible for maintaining application state regarding the next level of task ordering and functions as an ordered queue such as a priority heap, FIFO, etc. Task group queues also associate state information that is used to determine which worker threads are allowed access to a particular task group queue. Prioritization of task groups may be determined by any number of user specified factors as may be dictated by the task scheduler. For example, in the case of N task group queues, the priorities could range from highest to lowest, with the N task group queues being assigned to discrete high, medium, and low priority task group queues. To reduce contention between an example four worker threads, all the worker threads may be assigned valid for the highest priority task group queue. Half of the worker threads may be assigned valid for the medium priority task group queue. Only one worker thread may be assigned valid for the lowest priority task group queue. In this way the worker threads distribute themselves to make it less likely that any of them will fight over a particular task group queue. If contention is detected while acquiring a task group queue, a worker thread may move to another associated queue. Instead of different priority levels to arrange the task group queues, other criteria, such as immediate, frame, and background status, may be used for the task group queue breakdown. In this case, tasks having the immediate status could always be checked, tasks having frame status are inspected once per frame, and tasks having background status are inspected once every second.
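The worker-to-queue validity rule in the four-worker example can be sketched as a small predicate. The enum, the function name, and the choice of which specific workers serve the medium and low queues are assumptions; the text only fixes the counts (all, half, one).

```cpp
#include <cstddef>

enum QueuePriority : std::size_t { High = 0, Medium = 1, Low = 2 };

// Returns true if the worker with zero-based index `worker`, out of
// `workerCount` workers, may take task groups from the given queue.
bool IsWorkerValidFor(std::size_t worker, std::size_t workerCount,
                      QueuePriority priority) {
    switch (priority) {
        case High:   return true;                       // every worker
        case Medium: return worker < workerCount / 2;   // half of the workers
        case Low:    return worker == workerCount - 1;  // a single worker
    }
    return false;
}
```

With four workers, this makes workers 0 and 1 valid for the medium queue and worker 3 valid for the low queue, so the lower-priority queues are each contended by fewer threads.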
An example diagram in the drawings illustrates a FIFO task group queue, in which task groups that are placed into the task group queue first are operated on by worker threads first, with task groups added afterwards being operated on by worker threads in order of their addition to the task group queue. The example task group queue has organized task groups by first in first out (FIFO). The scheduler logic controls a plurality of worker threads, four in this example. The tasks have been organized into five task groups. Each of the task groups includes a series of user tasks that are ordered according to the designation in the program written by the user. At the conclusion of the last user task in a task group, a user end signal is encountered that allows the task group queue to proceed to the next task group.
The task group queue hierarchy is the last level of ordering and organization utilized that allows the user and the task scheduler to reduce contention when assigning task group queues to the available worker threads. When a task group is added, its priority level is considered and the task group is then assigned an appropriate task group queue based on that priority. In an application, task groups tend to span more than one priority level, and therefore the priority assignment allows for a reduction in queue contention and therefore total cost. Contention at this level can be considered to be reduced at a maximum by 1/(Number of Total Priorities).
The scheduler logic is aware of these priorities and may appropriately assign worker threads based on the current state of the task group queues, the worker threads, and the scheduler state itself. Priority levels for task group queues need not be fixed. The priority levels may change. For example, the priority level could be decreased if the task group queue is idle or empty for a specified period of time. The priority may be increased if a task group with a significant workload is recently added to a task group queue, or a significant number of task groups are added to a specific task group queue, or if the number of outstanding task groups in a task group queue becomes significant, or if the program itself changes state (e.g., from background to foreground or to minimized). Another example is if some application specific state such as “Paused,” or “Asset Loading,” or “Multi-Player Enabled” in a game application occurs. To that extent, even the number of players in a game might be used to re-prioritize the queues. The priority may be changed based on the current history since it is unlikely that there will be future work, or if a task group queue has not been used, the task group queue could be elevated in priority to service the tasks in the task group queues. Thus, if certain task group queues are underutilized, they may be reprioritized to a higher level so that contention by the worker threads on the task group queues that are currently used is reduced. For example, if task groups are always added to the highest priority task group queue, more contention may occur on that task group queue. If the medium priority task group queue is remapped to the highest priority, then worker threads may be redistributed more evenly between task group queues.
A user does not typically need to interact directly with the task group queues or the task group queue hierarchy as it is automatically performed by the task scheduler. Alternatively, a user may reprioritize queues or assign worker threads via a simple API call provided by the task scheduler, which allows a user to rearrange the queues for assignment to worker threads. This feature may be used when the user is doing something “unexpected” with the system and the current scheduler logic is conflicting with the user's wishes. It may also be used to augment the logic so that the users may tweak performance/reduce contention when the user is going out of the typical expected bounds of the program.
October 14, 2025