Legal claims defining the scope of protection, as filed with the USPTO.
1. An apparatus, comprising: a processor that includes: one or more processing elements having one or more compute units and one or more register files, wherein the one or more register files includes one or more registers divisible into lanes for parallel processing; and one or more mask registers associated with the one or more processing elements, wherein the one or more mask registers include a number of mask bits, such that the lanes have corresponding mask bits, wherein the one or more processing elements are operable to set each of the mask bits to a first state or a second state when a loop count of a loop of an operation associated with an instruction indicates that the operation is in a last iteration of the loop, and are operable to set all of the mask bits to the first state otherwise, wherein the one or more processing elements are operable to enable lanes having corresponding mask bits of the first state to execute the instruction and disable lanes having corresponding mask bits of the second state from executing the instruction.
2. The apparatus of claim 1, wherein the processor further includes one or more predicate registers associated with the one or more processing elements, wherein the one or more predicate registers include a number of predicate bits equal to the maximum number of divisible lanes, such that the lanes have corresponding predicate bits.
3. The apparatus of claim 2, wherein the one or more processing elements are further operable to: perform an operation defined by the instruction in lanes having corresponding predicate bits of a third state; and not perform the operation defined by the instruction in lanes having corresponding predicate bits of a fourth state.
4. The apparatus of claim 3, wherein: the first digital state and the third digital state are a same state; and the second digital state and the fourth digital state are a same state.
5. The apparatus of claim 3, wherein the one or more processing elements are operable to set the predicate bits to the third state or the fourth state based on a condition of the lane that corresponds with the predicate bits.
6. The apparatus of claim 2, wherein a lane has more than one corresponding predicate bit and the one or more processing elements are operable to set all corresponding predicate bits based on a condition of the lane.
7. The apparatus of claim 2, wherein a single consolidated register implements the mask register and the predicate register, the single consolidated register comprising a number of bits configured to serve either as the mask bits or as the predicate bits.
8. The apparatus of claim 1, wherein, for the last iteration of the loop, the one or more processing elements are operable to set the mask bits to the first state or the second state based on a trip count of the loop.
9. The apparatus of claim 1, wherein the processor is operable to perform a reduction operation across the lanes of the one or more of the processing elements.
10. The apparatus of claim 9, wherein the reduction operation comprises one or more of summing across the lanes of the one or more processing elements, moving data across the lanes of the one or more processing elements, performing minimum operations across the lanes of the one or more processing elements, and performing maximum operations across the lanes of the one or more processing elements.
11. The apparatus of claim 1, wherein the processor is operable to generate an address for each of the lanes of the one or more processing elements.
12. The apparatus of claim 1, wherein the one or more processing elements are operable to: set all the mask bits to the first state if the loop count is not equal to one; and set each of the mask bits to the first state or the second state based on a trip count of the loop if the loop count is equal to one.
13. The apparatus of claim 1, wherein the number of mask bits is equal to a maximum number of divisible lanes of the one or more registers.
14. The apparatus of claim 1, wherein the one or more processing elements are further operable to execute the instruction by at least one of the processing lanes.
15. A method to be performed by a processor, the method comprising: issuing an instruction to one or more processing elements which include one or more registers divisible into processing lanes for parallel processing; setting each of mask bits corresponding with the processing lanes to a first state or a second state when a loop count of a loop of an operation associated with the instruction indicates that the operation is in a last iteration of the loop and setting all of the mask bits to the first state otherwise; enabling processing lanes having corresponding mask bits of the first state to execute the instruction; and disabling processing lanes having corresponding mask bits of the second state from executing the instruction.
16. The method of claim 15, wherein a number of mask bits is equal to a maximum number of divisible processing lanes of the one or more registers.
17. The method of claim 15, further comprising: setting predicate bits corresponding with the processing lanes.
18. The method of claim 17, further comprising: performing an operation defined by the instruction in lanes having corresponding predicate bits of a third state; and not performing the operation defined by the instruction in lanes having corresponding predicate bits of a fourth state.
19. The method of claim 18, wherein setting the predicate bits includes setting the predicate bits to the third state or the fourth state based on a condition of the processing lane that corresponds with the predicate bits.
20. The method of claim 18, wherein: the first digital state and the third digital state are a same state; and the second digital state and the fourth digital state are a same state.
21. The method of claim 18, wherein: the first digital state and the fourth digital state are a same state; and the second digital state and the third digital state are a same state.
22. The method of claim 17, wherein: a number of predicate bits is greater than a number of processing lanes, such that a processing lane has more than one corresponding predicate bit; and the setting the predicate bits includes setting the more than one corresponding predicate bit of the processing lane based on a condition of the processing lane.
23. The method of claim 17, wherein a number of predicate bits is equal to the maximum number of divisible processing lanes.
24. The method of claim 15, wherein setting the mask bits corresponding with the processing lanes includes: if the loop count is not equal to one, setting all the mask bits to the first state; if the loop count is equal to one, setting each of the mask bits to the first state or the second state based on a trip count of the loop.
25. The method of claim 24, wherein, when the loop count is equal to one, each of the mask bits are set according to MASK=2^(Remainder((VectorLength+NLanes−1)/NLanes)+1)−1, wherein VectorLength is a vector length of a vectorizable loop associated with the operation, and NLanes is a number of the processing lanes.
26. The method of claim 15, further comprising: executing, by at least one of the processing lanes, the instruction.
27. The method of claim 15, further comprising performing a reduction operation across the lanes of one or more processing elements.
28. The method of claim 15, further comprising performing a reduction operation within at least one of the one or more processing elements.
29. The method of claim 15, further comprising determining the loop count of the operation based on a vector length of a vectorizable loop associated with the operation and a number of the processing lanes.
30. The method of claim 15, further comprising generating an address for each of the lanes of the one or more processing elements.
31. A single instruction, multiple data (SIMD) processor, comprising: a compute array having one or more processing elements that include a register set divisible into a number of SIMD lanes; and one or more mask registers having a number of mask bits, such that each SIMD lane has at least one corresponding mask bit, wherein the one or more processing elements are operable to: set each of the mask bits to a first state or a second state when a loop count of a loop of an operation associated with an instruction indicates that the operation is in a last iteration of the loop, and set all of the mask bits to the first state otherwise, enable the SIMD lanes having corresponding mask bits of the first state to execute the instruction, and disable the SIMD lanes having corresponding mask bits of the second state from executing the instruction.
32. The SIMD processor of claim 31, further comprising: one or more predicate registers having a number of predicate bits equal to a maximum number of divisible SIMD lanes, such that each SIMD lane has at least one corresponding predicate bit.
33. The SIMD processor of claim 32, wherein the one or more processing elements are further operable to: perform an operation defined by the instruction in SIMD lanes having corresponding predicate bits of a third state; and not perform the operation defined by the instruction in SIMD lanes having corresponding predicate bits of a fourth state.
34. The SIMD processor of claim 31, wherein the one or more processing elements are further operable to conditionally execute the instruction in at least one of the SIMD lanes at least based on a state of the mask bits.
35. An apparatus comprising: means for issuing an instruction to one or more processing elements that include one or more registers divisible into processing lanes for parallel processing; means for setting each of mask bits corresponding with the processing lanes to a first state or a second state when a loop count of a loop of an operation associated with the instruction indicates that the operation is in a last iteration of the loop, and setting all of the mask bits to the first state otherwise; means for enabling processing lanes having corresponding mask bits of the first state to execute the instruction; and means for disabling processing lanes having corresponding mask bits of the second state from executing the instruction.
36. The apparatus of claim 35, further comprising: means for setting predicate bits corresponding with the processing lanes.
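The loop-count and mask behavior recited in claims 12, 24, 25, and 29 can be illustrated with a short software model. This is only a sketch under stated assumptions, not the claimed hardware: the function names (loop_count, last_iteration_mask, masked_vector_add) are hypothetical, the "first state" is assumed to be bit value 1 (lane enabled), lane 0 is assumed to map to the least significant mask bit, and the loop count is assumed to count down so that a value of one marks the last iteration.

```python
def loop_count(vector_length: int, n_lanes: int) -> int:
    """Number of SIMD passes needed to cover vector_length elements.

    Assumption: the loop count of claim 29 is a ceiling division of the
    vector length by the number of lanes.
    """
    return (vector_length + n_lanes - 1) // n_lanes


def last_iteration_mask(vector_length: int, n_lanes: int) -> int:
    """Mask per claim 25:
    MASK = 2^(Remainder((VectorLength + NLanes - 1) / NLanes) + 1) - 1.
    """
    active_lanes = (vector_length + n_lanes - 1) % n_lanes + 1
    return (1 << active_lanes) - 1


def masked_vector_add(a, b, n_lanes):
    """Hypothetical driver: element-wise add, n_lanes elements per pass."""
    out = [0] * len(a)
    full_mask = (1 << n_lanes) - 1       # all mask bits in the first state
    count = loop_count(len(a), n_lanes)  # counts down; 1 means last pass
    base = 0
    while count > 0:
        # Claims 12 and 24: partial mask only when the loop count equals one.
        if count == 1:
            mask = last_iteration_mask(len(a), n_lanes)
        else:
            mask = full_mask
        for lane in range(n_lanes):
            if (mask >> lane) & 1:       # lane enabled, executes the add
                out[base + lane] = a[base + lane] + b[base + lane]
            # lanes whose mask bit is 0 are disabled and touch nothing
        base += n_lanes
        count -= 1
    return out
```

For a vector of 10 elements on 4 lanes, the model runs three passes: the first two use the full mask 0b1111, and the last uses 0b0011, so lanes 2 and 3 never read past the end of the data. When the vector length is an exact multiple of the lane count, the formula yields the full mask for the last pass as well.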
January 31, 2017