Legal claims defining the scope of protection, as filed with the USPTO.
1. A circuit for performing neural network computations for a neural network comprising a plurality of neural network layers, the circuit comprising: a matrix computation unit configured to, for each of the plurality of neural network layers: receive a plurality of weight inputs and a plurality of activation inputs for the neural network layer, and generate a plurality of accumulated values based on the plurality of weight inputs and the plurality of activation inputs, wherein the matrix computation unit is configured as a two dimensional systolic array comprising a plurality of cells, wherein the plurality of weight inputs is shifted through a first plurality of cells along a first dimension of the systolic array, and wherein the plurality of activation inputs is shifted through a second plurality of cells along a second dimension of the systolic array; and a vector computation unit communicatively coupled to the matrix computation unit and configured to, for each of the plurality of neural network layers: apply an activation function to each of the plurality of accumulated values for the neural network layer generated by the matrix computation unit to generate a plurality of activated values for the neural network layer.
2. The circuit of claim 1, further comprising: a unified buffer communicatively coupled to the matrix computation unit and the vector computation unit, where the unified buffer is configured to receive and store output from the vector computation unit, and the unified buffer is configured to send the received output as input to the matrix computation unit.
3. The circuit of claim 2, further comprising: a sequencer configured to receive instructions from a host device and generate a plurality of control signals from the instructions, where the plurality of control signals control dataflow through the circuit; and a direct memory access engine communicatively coupled to the unified buffer and the sequencer, where the direct memory access engine is configured to send the plurality of activation inputs to the unified buffer, where the unified buffer is configured to send the plurality of activation inputs to the matrix computation unit, and where the direct memory access engine is configured to read result data from the unified buffer.
4. The circuit of claim 3, further comprising: a memory unit configured to send the plurality of weight inputs to the matrix computation unit, and where the direct memory access engine is configured to send the plurality of weight inputs to the memory unit.
5. The circuit of claim 1, where the two dimensional systolic array is a square array.
6. The circuit of claim 1, where, for a given layer in the plurality of layers, a count of the plurality of activation inputs is greater than a size of the second dimension of the systolic array, and where the systolic array is configured to: divide the plurality of activation inputs into portions, where each portion has a size less than or equal to the size of the second dimension; generate, for each portion of activation inputs, a respective portion of accumulated values; and combine each portion of accumulated values to generate a vector of accumulated values for the given layer.
7. The circuit of claim 1, where, for a given layer in the plurality of layers, a count of the plurality of weight inputs is greater than a size of the first dimension of the systolic array, and where the systolic array is configured to: divide the plurality of weight inputs into portions, where each portion has a size less than or equal to the size of the first dimension; generate, for each portion of weight inputs, a respective portion of accumulated values; and combine each portion of accumulated values to generate a vector of accumulated values for the given layer.
8. The circuit of claim 1, where each cell in the plurality of cells comprises: a weight register configured to store a weight input; an activation register configured to store an activation input and configured to send the activation input to another activation register in a first adjacent cell along the second dimension; a sum-in register configured to store a previously summed value; multiplication circuitry communicatively coupled to the weight register and the activation register, where the multiplication circuitry is configured to output a product of the weight input and the activation input; and summation circuitry communicatively coupled to the multiplication circuitry and the sum-in register, where the summation circuitry is configured to output a sum of the product and the previously summed value, and where the summation circuitry is configured to send the sum to another sum-in register in a second adjacent cell along the first dimension.
9. The circuit of claim 8, where one or more cells in the plurality of cells are each configured to store the respective sum in a respective accumulator unit, where the respective sum is an accumulated value.
10. The circuit of claim 1, where the first dimension of the systolic array corresponds to columns of the systolic array, and where the second dimension of the systolic array corresponds to rows of the systolic array.
11. The circuit of claim 1, where the vector computation unit normalizes each activated value to generate a plurality of normalized values.
12. The circuit of claim 1, where the vector computation unit pools one or more activated values to generate a plurality of pooled values.
13. The circuit of claim 1, where the first dimension of the systolic array corresponds to rows of the systolic array, and where the second dimension of the systolic array corresponds to columns of the systolic array.
14. A method for performing neural network computations for a neural network comprising a plurality of neural network layers using a circuit comprising a matrix computation unit and a vector computation unit coupled to the matrix computation unit, where the matrix computation unit is configured as a two dimensional systolic array comprising a plurality of cells, and wherein the method comprises, for each of the plurality of neural network layers: providing a plurality of weight inputs and a plurality of activation inputs for the neural network layer to the matrix computation unit, comprising: shifting the plurality of weight inputs through a first plurality of cells along a first dimension of the systolic array, and shifting the plurality of activation inputs through a second plurality of cells along a second dimension of the systolic array; generating, using the matrix computation unit, a plurality of accumulated values, wherein the matrix computation unit is configured to receive the plurality of weight inputs and the plurality of activation inputs for the neural network layer and generate the plurality of accumulated values based on the plurality of weight inputs and the plurality of activation inputs; and generating, using the vector computation unit, a plurality of activated values for the neural network layer, wherein the matrix computation unit is configured to apply an activation function to each accumulated value generated by the matrix computation unit to generate a plurality of activated values for the neural network layer.
15. The method of claim 14, further comprising: receiving, by a unified buffer communicatively coupled to the matrix computation unit and the vector computation unit; storing output from the vector computation unit at the unified buffer; sending, from the unified buffer, the received output as input to the matrix computation unit.
16. The method of claim 15, further comprising: receiving, at a sequencer, instructions from a host device and generating a plurality of control signals from the instructions, where the plurality of control signals control dataflow through the circuit; sending, from a direct memory access engine communicatively coupled to the unified buffer and the sequencer, the plurality of activation inputs to the unified buffer; sending, from the unified buffer, the plurality of activation inputs to the matrix computation unit; and reading, at the direct memory access engine, result data from the unified buffer.
17. The method of claim 16, further comprising: sending, at a memory unit, the plurality of weight inputs to the matrix computation unit; sending, from the direct memory access engine, the plurality of weight inputs to the memory unit.
18. The method of claim 14, where the two dimensional systolic array is a square array.
19. The method of claim 14, where, for a given layer in the plurality of layers, a count of the plurality of activation inputs is greater than a size of the second dimension of the systolic array, the method further comprising: dividing, at the systolic array, the plurality of activation inputs into portions, where each portion has a size less than or equal to the size of the second dimension; generating, for each portion of activation inputs and at the systolic array, a respective portion of accumulated values; and combining, at the systolic array, each portion of accumulated values to generate a vector of accumulated values for the given layer.
20. The method of claim 14, where, for a given layer in the plurality of layers, a count of the plurality of weight inputs is greater than a size of the first dimension of the systolic array, the method further comprising: dividing, at the systolic array, the plurality of weight inputs into portions, where each portion has a size less than or equal to the size of the first dimension; generating, for each portion of weight inputs and at the systolic array, a respective portion of accumulated values; and combining, at the systolic array, each portion of accumulated values to generate a vector of accumulated values for the given layer.
21. The method of claim 14, where each cell in the plurality of cells comprises: a weight register configured to store a weight input; an activation register configured to store an activation input and configured to send the activation input to another activation register in a first adjacent cell along the second dimension; a sum-in register configured to store a previously summed value; multiplication circuitry communicatively coupled to the weight register and the activation register, where the multiplication circuitry is configured to output a product of the weight input and the activation input; and summation circuitry communicatively coupled to the multiplication circuitry and the sum-in register, where the summation circuitry is configured to output a sum of the product and the previously summed value, and where the summation circuitry is configured to send the sum to another sum-in register in a second adjacent cell along the first dimension.
22. The method of claim 21, further comprising storing, at one or more cells in the plurality of cells, the respective sum in a respective accumulator unit, where the respective sum is an accumulated value.
23. The method of claim 14, where the first dimension of the systolic array corresponds to columns of the systolic array, and where the second dimension of the systolic array corresponds to rows of the systolic array.
24. The method of claim 14, further comprising normalizing, at the vector computation unit, each activated value to generate a plurality of normalized values.
25. The method of claim 14, further comprising pooling, at the vector computation unit, one or more activated values to generate a plurality of pooled values.
26. The method of claim 14, where the first dimension of the systolic array corresponds to rows of the systolic array, and where the second dimension of the systolic array corresponds to columns of the systolic array.
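
Illustrative sketch (not part of the claims). The dataflow recited in claims 1, 6, 8, and 9 can be modeled in software roughly as follows. This is a minimal Python simulation under stated assumptions, not the patented circuit: the function names (`systolic_matvec`, `layer_forward`, `relu`) are hypothetical, weights are treated as already loaded into the cells' weight registers, the orientation follows claim 10 (sums travel down columns, activations travel across rows), ReLU stands in for the unspecified activation function, and the "combining" of portions in claim 6 is assumed to be element-wise addition.

```python
def systolic_matvec(weights, activations):
    """Cycle-by-cycle model of one pass through the array (hypothetical helper).

    weights[r][c] is the value held in the weight register of cell (r, c).
    Returns one accumulated value per column.
    """
    rows, cols = len(weights), len(weights[0])
    act_out = [[0] * cols for _ in range(rows)]  # activation register of each cell
    sum_out = [[0] * cols for _ in range(rows)]  # sum each cell sent onward last cycle
    accumulators = [0] * cols                    # accumulator units below each column
    for t in range(rows + cols - 1):
        new_act = [[0] * cols for _ in range(rows)]
        new_sum = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if c == 0:
                    # Row r receives its activation at cycle r (skewed input).
                    new_act[r][c] = activations[r] if t == r else 0
                else:
                    # Activation shifts in from the adjacent cell in the same row.
                    new_act[r][c] = act_out[r][c - 1]
                # Sum-in register holds the sum from the adjacent cell above.
                sum_in = sum_out[r - 1][c] if r > 0 else 0
                # Multiplication circuitry, then summation circuitry.
                new_sum[r][c] = sum_in + weights[r][c] * new_act[r][c]
        for c in range(cols):
            # Column c's complete sum leaves the bottom cell at cycle rows - 1 + c.
            if t == rows - 1 + c:
                accumulators[c] = new_sum[rows - 1][c]
        act_out, sum_out = new_act, new_sum
    return accumulators


def relu(x):
    """Assumed activation function; the claims do not name a specific one."""
    return x if x > 0 else 0


def layer_forward(weights, activations, array_rows, activation_fn=relu):
    """Split activations into portions that fit the array, combine, then activate."""
    combined = [0] * len(weights[0])
    for start in range(0, len(activations), array_rows):
        part = systolic_matvec(weights[start:start + array_rows],
                               activations[start:start + array_rows])
        # Assumption: portions are combined by element-wise addition.
        combined = [a + b for a, b in zip(combined, part)]
    # Vector computation unit: apply the activation function to each value.
    return [activation_fn(v) for v in combined]


W = [[1, -2], [3, 4], [5, -6]]
a = [1, -1, 2]
print(systolic_matvec(W, a))              # [8, -18]
print(layer_forward(W, a, array_rows=2))  # [8, 0]
```

In the usage lines, the three activation inputs exceed a two-row array, so `layer_forward` processes them in two portions and adds the partial results before applying the activation function.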
July 18, 2017