Iterating Group Sum of Multiple Accumulate Operations

Published: April 19, 2022

Assignee: not available in USPTO data we have

Technical Abstract

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A system comprising: a scheduler operative to schedule a single instruction, multiple data (SIMD) thread; one or more processors operative to execute the SIMD thread; and logic encoded in one or more non-transitory computer-readable storage media for execution by the one or more processors and when executed operative to cause the one or more processors to perform operations comprising: initiating, by the scheduler, the SIMD thread; fetching, by the one or more processors, a plurality of instructions for the SIMD thread from a memory; determining, by a thread arbiter of the processor, at least one instruction of the plurality of instructions that is a walk instruction block, wherein the walk instruction block includes a walk-endwalk pair of instructions, wherein the walk instruction block includes a GSOMAC (Group Sum of Multiply Accumulate) instruction; iterating a block of instructions within the walk-endwalk pair of instructions of the walk instruction block for a subset of channels of the SIMD thread, wherein the walk-endwalk instructions are responsible for iterating the block of instructions when a size of the SIMD thread is greater than a maximum native SIMD instruction width, and an execution mask is responsible for iterating the block of instructions when the size of the SIMD thread is less than the maximum native SIMD instruction width, wherein the walk instruction block includes a walk size, and wherein the walk size is a number of channels in the subset of channels of the SIMD thread that are processed in the iterating in association with the walk instruction block; providing, by the thread arbiter, the walk instruction block to a code block iterator; and executing, by the thread arbiter, the walk instruction block based on the walk size.

2. The system of claim 1, wherein the GSOMAC instruction includes a GDP (Group Dot Product) operative to convolve an entire M×N block against an (M+a)×(N+b) block to generate an a×b output, wherein M, N, a, and b are all positive integers.

3. The system of claim 1, wherein the GSOMAC instruction includes a GCONV (Group Convolve) operative to convolve an entire M×N block against an M×N×P block to generate a 1×P output, wherein M, N, and P are all positive integers.

4. The system of claim 1, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising: determining, an instruction size of the GSOMAC instruction, wherein the instruction size indicates a number of iterations that the GSOMAC instruction is to be executed.

5. The system of claim 1, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising: reading a first source operand of a plurality of source operands of the GSOMAC instruction from a register file, wherein the first source operand includes one or more terms, wherein each source operand of the plurality of source operands is a register from a corresponding register file that is an input to the instruction; and determining if all terms of the first source operand are zero.

6. The system of claim 5, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising: skipping execution of the GSOMAC instruction and reading a next instruction for execution when all terms of the first source operand are zero.

7. The system of claim 5, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising: selecting a second instruction of the plurality of instructions for execution in an arithmetic logic unit (ALU) pipeline of the processor, and reading the second instruction; and determining whether the second instruction is a GSOMAC instruction with an instruction size, wherein the instruction size indicates a number of iterations that the GSOMAC instruction is to be executed.

8. The system of claim 4, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising: reading a second source operand of the plurality of source operands of the GSOMAC instruction from the register file when all terms of the first source operand are not zero, wherein the second source operand includes a number of sets of one or more terms, wherein the number of sets is the instruction size; determining an instruction mask, wherein the instruction mask includes a plurality of bits, and each bit is determined based which sets of the number of sets of the second operand have all terms of the set being zero.

9. The system of claim 8, wherein each bit of the plurality of bits corresponding to a set of the plurality of sets of the second source operand having all terms of zero are reset, and each bit of the plurality of bits corresponding to a set of the plurality of sets of the second source operand having at least one term non-zero are set.

10. The system of claim 9, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising: executing multiply and accumulate operations of the GSOMAC operation for the iterations which are not disabled (mask bit is set) and skipping the iterations which are disabled (mask bit is reset) based on the instruction mask.

11. The system of claim 10, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising: reading a destination operand of the plurality of operands of the GSOMAC instruction; adding a sum-of-multiply result to the destination operand; writing the multiply-accumulate result back to the destination operand, wherein the destination operand is a register from the register file that is an output of the instruction; wherein the destination operand is read and updated for each iteration, wherein there is a separate destination operand for each iteration.
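As an illustration only, the zero-skipping behavior recited in claims 5 through 11 can be sketched in Python. The function names `instruction_mask` and `gsomac` are hypothetical, plain lists stand in for register-file operands, and the sum-of-multiply step is assumed to be an element-wise product of the first source operand with one set of the second source operand; the claims do not fix these details.

```python
from typing import List, Sequence


def instruction_mask(src2_sets: Sequence[Sequence[int]]) -> List[bool]:
    """One bit per set of the second source operand (hypothetical helper).

    A bit is set when its set has at least one non-zero term and reset
    when every term in the set is zero.
    """
    return [any(term != 0 for term in terms) for terms in src2_sets]


def gsomac(dest: List[int],
           src1: Sequence[int],
           src2_sets: Sequence[Sequence[int]]) -> bool:
    """Sketch of one GSOMAC instruction; returns False if it was skipped.

    len(src2_sets) plays the role of the instruction size, and dest holds
    one separate destination operand per iteration.
    """
    # Skip the whole instruction when the first source operand is all zero.
    if all(term == 0 for term in src1):
        return False

    mask = instruction_mask(src2_sets)
    for i, enabled in enumerate(mask):
        if not enabled:
            continue  # iteration disabled: mask bit is reset
        # Assumed sum-of-multiply: element-wise products, then a sum.
        sum_of_multiply = sum(a * b for a, b in zip(src1, src2_sets[i]))
        # Read the destination, add the result, write it back.
        dest[i] = dest[i] + sum_of_multiply
    return True
```

With `src1 = [1, 2]` and `src2_sets = [[3, 4], [0, 0], [5, 0]]`, the mask has its middle bit reset, so only the first and third destination operands are updated.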

12. The system of claim 1, wherein the SIMD thread includes a dispatch mask, and wherein the dispatch mask indicates which channels of a plurality of channels are enabled and/or indicates which channels of the plurality of channels are disabled at an initial point in time.

13. The system of claim 1, wherein the walk instruction block involves processing data from the subset of the channels of the plurality of channels during execution of an iteration of a block of instructions for the SIMD thread.

14. The system of claim 1, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising generating, by the code block iterator, a walk mask of the SIMD thread, wherein the walk mask indicates which subset of channels that are enabled and/or disabled during a particular walk iteration of executing the plurality of instructions for the SIMD thread.

15. The system of claim 1, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising generating, by the code block iterator, a walk mask of the SIMD thread based at least on the walk size and the execution mask, wherein the execution mask is a mask that is applied when performing the plurality of instructions for the SIMD thread during a particular iteration of executing instructions for the SIMD thread.

16. The system of claim 1, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising utilizing a subset of walk registers of a plurality of walk registers to execute the walk instruction block.

17. The system of claim 1, wherein the logic when executed is further operable to cause the one or more processors to perform operations comprising: initiating the walk instruction block during execution of a first iteration of executing instructions for the SIMD thread; ending the walk instruction block; initiating a second walk instruction block during execution of a second iteration of executing instructions for the SIMD thread; and ending the second walk instruction block.
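The walk iteration recited in claims 1 and 13 through 15 can be pictured with a short Python sketch. It is a rough model under stated assumptions, not the claimed hardware: `walk_masks` and `run_walk_block` are hypothetical names, each walk iteration is assumed to cover a consecutive window of `walk_size` channels, and the walk mask is assumed to be that window combined with the execution mask.

```python
from typing import Callable, Iterator, List


def walk_masks(thread_size: int,
               walk_size: int,
               execution_mask: List[bool]) -> Iterator[List[bool]]:
    """Yield one walk mask per walk iteration (hypothetical helper).

    Assumption: iteration k covers channels k*walk_size up to
    (k+1)*walk_size - 1, and a channel is enabled only if it is in that
    window and also enabled in the execution mask.
    """
    for start in range(0, thread_size, walk_size):
        window = range(start, min(start + walk_size, thread_size))
        yield [ch in window and execution_mask[ch]
               for ch in range(thread_size)]


def run_walk_block(data: List[int],
                   walk_size: int,
                   execution_mask: List[bool],
                   body: Callable[[int], int]) -> List[int]:
    """Apply `body` (the block between walk and endwalk) per walk iteration."""
    out = list(data)
    for mask in walk_masks(len(data), walk_size, execution_mask):
        for ch, enabled in enumerate(mask):
            if enabled:
                out[ch] = body(out[ch])
    return out
```

For an 8-channel thread with a walk size of 4, the block body runs in two walk iterations; a channel that is disabled in the execution mask is left untouched in both.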

18. A computer-implemented method comprising: initiating, by a scheduler, a single instruction, multiple data (SIMD) thread, wherein the scheduler is operative to schedule the SIMD thread; fetching, by the one or more processors, a plurality of instructions for the SIMD thread from a memory; determining, by a thread arbiter of the processor, at least one instruction of the plurality of instructions that is a walk instruction block, wherein the walk instruction block includes a walk-endwalk pair of instructions, wherein the walk instruction block includes a GSOMAC (Group Sum of Multiply Accumulate) instruction; iterating a block of instructions within the walk-endwalk pair of instructions of the walk instruction block for a subset of channels of the SIMD thread, wherein the walk-endwalk instructions are responsible for iterating the block of instructions when a size of the SIMD thread is greater than a maximum native SIMD instruction width, and an execution mask is responsible for iterating the block of instructions when the size of the SIMD thread is less than the maximum native SIMD instruction width, wherein the walk instruction block includes a walk size, and wherein the walk size is a number of channels in the subset of channels of the SIMD thread that are processed in the iterating in association with the walk instruction block; providing, by the thread arbiter, the walk instruction block to a code block iterator; and executing, by the thread arbiter, the walk instruction block based on the walk size.

19. The method of claim 18, wherein the GSOMAC instruction includes a GDP (Group Dot Product) operative to convolve an entire M×N block against an (M+a)×(N+b) block to generate an a×b output, wherein M, N, a, and b are all positive integers.

20. The method of claim 18, wherein the GSOMAC instruction includes a GCONV (Group Convolve) operative to convolve an entire M×N block against an M×N×P block to generate a 1×P output, wherein M, N, and P are all positive integers.
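The operand shapes recited for GDP and GCONV can be made concrete with a small Python sketch. The claims give only the block sizes, so the details here are assumptions: `gdp` and `gconv` are hypothetical names, GDP is modeled as a sliding-window dot product at row offsets 0 to a-1 and column offsets 0 to b-1 (chosen so that the output is a×b as claimed), and the M×N×P block is modeled as a list of P slices of size M×N.

```python
from typing import List

Matrix = List[List[int]]


def gdp(block: Matrix, big: Matrix, a: int, b: int) -> Matrix:
    """Assumed GDP: M x N block against an (M+a) x (N+b) block -> a x b."""
    m, n = len(block), len(block[0])
    return [
        [
            sum(block[i][j] * big[r + i][c + j]
                for i in range(m) for j in range(n))
            for c in range(b)
        ]
        for r in range(a)
    ]


def gconv(block: Matrix, stack: List[Matrix]) -> List[int]:
    """Assumed GCONV: M x N block against P slices of M x N -> 1 x P."""
    m, n = len(block), len(block[0])
    return [
        sum(block[i][j] * plane[i][j] for i in range(m) for j in range(n))
        for plane in stack
    ]
```

With a 2×2 block, `gdp` on a 3×4 input with `a = 1` and `b = 2` returns a single row of two values, and `gconv` on three 2×2 slices returns three values, matching the a×b and 1×P output shapes in the claims.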

Patent Metadata

Filing Date

Unknown

Publication Date

April 19, 2022

Inventors

Satyaki Koneru

Kamaraj Thangam
