Solution to Divergent Branches in a SIMD Core Using Hardware Pointers

Published May 2, 2017

Assignee not available in the USPTO data

Patent Claims

17 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A processor comprising: a plurality of parallel execution lanes within a single instruction multiple data (SIMD) micro-architecture; a plurality of program counter (PC) registers, wherein each PC register is configured to store a PC of an instruction; a size register configured to store a number of instructions in a variable length instruction word (VLIW); a plurality of storage locations, wherein each storage location of the plurality of storage locations is associated with a different lane of the plurality of execution lanes and is programmable to store an identifier corresponding to one of the plurality of PC registers; and control logic configured to: update the size register to indicate N instructions are included in a next VLIW, wherein N is an integer greater than one when a transition from a first instruction to a second instruction represents a divergence point; update N of the PC registers, each of the updated PC registers identifying a different instruction of the next VLIW; fetch a number of instructions indicated by the size register, wherein a separate PC register of the plurality of PC registers is used to identify which instructions are fetched; execute the number of instructions in the plurality of execution lanes simultaneously.

2. The processor as recited in claim 1, wherein the size register is configured to store an integer value.

3. The processor as recited in claim 1, wherein a number of valid identifiers stored in the plurality of storage locations is equal to the number of instructions indicated by the size register.

4. The processor as recited in claim 1, wherein the control logic is further configured to: decode a number of fetched instructions equal to the number of instructions indicated by the size register; and assign a given one of the decoded instructions to a given lane of the plurality of execution lanes based at least in part on an identifier stored in one of the plurality of storage locations associated with the given lane and a PC value stored in a register of the plurality of PC registers.

5. The processor as recited in claim 4, wherein in response to a given one of the plurality of PC registers pointing to a plurality of resource-independent instructions, an associated one of the plurality of execution lanes is further configured to simultaneously execute the plurality of resource-independent instructions.

6. The processor as recited in claim 4, wherein in response to a given lane of the plurality of execution lanes having reached the end of a loop of instructions, the control logic is further configured to: write a sleep state in a corresponding one of the plurality of storage locations responsive to determining at runtime the given lane is scheduled to branch back to the beginning of the loop; and write an exit state in the corresponding one of the plurality of storage locations responsive to determining at runtime the given lane is scheduled to branch out of the loop.

7. The processor as recited in claim 6, wherein in response to the given lane being in the sleep state or the exit state, the control logic is further configured to: halt execution within the given lane; and store at least a next program counter (PC) and a work-item identifier (ID) for a corresponding trace.

8. The processor as recited in claim 7, wherein in response to determining each lane of the plurality of execution lanes is in the sleep state or is in the exit state, the control logic is further configured to restart execution for each lane by branching to a respective stored next PC.

9. The processor as recited in claim 8, wherein in response to each lane of the plurality of execution lanes being halted and at least one lane being in a different state than another lane, the control logic is further configured to restart execution for only lanes in a sleep state by branching to a respective stored next PC.

10. A non-transitory computer readable storage medium storing at least one program configured for execution by at least one processor of a computer system, wherein the at least one program comprising instructions executable to: assign program instructions for execution on a plurality of parallel execution lanes within a single instruction multiple data (SIMD) micro-architecture; maintain a plurality of program counter (PC) registers, wherein each PC register is configured to store a PC of an instruction; maintain a size register configured to store a size of a number of instructions in a variable length instruction word (VLIW); maintain a plurality of storage locations, wherein each storage location of the plurality of storage locations is associated with a different lane of the plurality of execution lanes and is programmable to store an identifier corresponding to one of the plurality of PC registers; update the size register to indicate N instructions are included in a next VLIW, wherein N is an integer greater than one when a transition from a first instruction to a second instruction represents a divergence point; update N of the PC registers, each of the updated PC registers identifying a different instruction of the next VLIW; fetch a number of instructions indicated by the size register, wherein a separate PC register of the plurality of PC registers is used to identify which instructions are fetched; and execute the number of instructions in the plurality of execution lanes simultaneously.

11. The non-transitory computer readable storage medium as recited in claim 10, wherein assigning a program instruction to a given lane of the plurality of parallel execution lanes is based at least in part on a branch direction found at runtime for the given lane at a given divergent point.

12. The non-transitory computer readable storage medium as recited in claim 11, further comprising instructions executable to update the size register to indicate M instructions are in a next VLIW to be fetched.

13. The non-transitory computer readable storage medium as recited in claim 12, further comprising instructions executable to: update each of M PC registers with a corresponding PC that corresponds to a unique one of the M instructions in the next VLIW; and update a plurality of identifiers in the plurality of storage locations mapping each one of the plurality of execution lanes to a given one of the updated PC registers.

14. A method comprising: assigning program instructions for execution on a plurality of parallel execution lanes within a single instruction multiple data (SIMD) micro-architecture; maintaining a plurality of program counter (PC) registers, wherein each PC register is configured to store a PC of an instruction; maintaining a size register configured to store a number of instructions in a variable length instruction word (VLIW); maintaining a plurality of storage locations, wherein each storage location of the plurality of storage locations is associated with a different lane of the plurality of execution lanes and is programmable to store an identifier corresponding to one of the plurality of PC registers; updating the size register to indicate N instructions are included in a next VLIW, wherein N is an integer greater than one when a transition from a first instruction to a second instruction represents a divergence point; updating N of the PC registers, each of the updated PC registers identifying a different instruction of the next VLIW; fetching a number of instructions indicated by the size register, wherein a separate PC register of the plurality of PC registers is used to identify which instructions are fetched; and executing the number of instructions in the plurality of execution lanes simultaneously.

15. The method as recited in claim 14, wherein a number of valid identifiers stored in the plurality of storage locations is equal to the number of instructions indicated by the size register.

16. The method as recited in claim 15, further comprising: decoding a number of fetched instructions equal to the number of instructions indicated by the size register; and assigning a given one of the decoded instructions to a given lane of the plurality of execution lanes based at least in part on an identifier stored in one of the plurality of lane registers associated with the given lane and a PC value stored in a register of the plurality of PC registers.

17. The method as recited in claim 16, wherein in response to detecting a given one of the plurality of PC registers points to a plurality of resource-independent instructions, the method further comprises simultaneously executing the plurality of resource-independent instructions in an associated one of the plurality of execution lanes.
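
As an informal illustration of the mechanism recited in the independent claims and in claims 4, 13 and 16, the following Python sketch models a size register, a set of PC registers, and one identifier per lane. It is not taken from the patent: the class name `SimdCore`, its methods, the dictionary used as instruction memory, and the choice to sort the PCs are all assumptions made for illustration, and real hardware would not be organized this way.

```python
from dataclasses import dataclass, field


@dataclass
class SimdCore:
    """Toy software model (hypothetical) of the claimed registers."""
    program: dict                  # assumed instruction memory: PC -> instruction
    num_lanes: int                 # number of parallel execution lanes
    size: int = 0                  # size register: instructions in the next VLIW
    pc_regs: list = field(default_factory=list)   # PC registers in use
    lane_ids: list = field(default_factory=list)  # per-lane PC-register identifier

    def set_next_vliw(self, lane_targets):
        """Update size register, PC registers and lane identifiers.

        lane_targets[i] is the next PC chosen by lane i at runtime.
        Lanes that diverge produce more than one distinct PC.
        """
        if len(lane_targets) != self.num_lanes:
            raise ValueError("one target PC is required per lane")
        unique_pcs = sorted(set(lane_targets))  # ordering is an assumption
        self.size = len(unique_pcs)             # N > 1 only at a divergence
        self.pc_regs = unique_pcs               # each PC names one instruction
        self.lane_ids = [unique_pcs.index(pc) for pc in lane_targets]

    def fetch(self):
        """Fetch as many instructions as the size register indicates."""
        return [self.program[self.pc_regs[i]] for i in range(self.size)]

    def assign(self):
        """Give each lane the fetched instruction its identifier points to."""
        vliw = self.fetch()
        return [vliw[ident] for ident in self.lane_ids]


# Four lanes reach a branch; lane 1 takes a different path from the others.
core = SimdCore(program={0x10: "add", 0x20: "sub"}, num_lanes=4)
core.set_next_vliw([0x10, 0x20, 0x10, 0x10])
per_lane = core.assign()  # ["add", "sub", "add", "add"]
```

In this sketch the two diverged paths are issued together as a two-instruction VLIW rather than one after the other, which is the behavior the claims describe when N is greater than one.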

Patent Metadata

Filing Date

Unknown

Publication Date

May 2, 2017

Inventors

Reza Yazdani
