US-11321606

Systems, apparatus, methods, and architectures for a neural network workflow to generate a hardware accelerator

Published: May 3, 2022

Assignee: not available in the USPTO data

Inventors: not available in the USPTO data

Technical Abstract

Methods, systems, apparatus, and circuits for dynamically optimizing the circuit for forward and backward propagation phases of training for neural networks, given a fixed resource budget. The circuits comprising: (1) a specialized circuit that can operate on a plurality of multi-dimensional inputs and weights for the forward propagation phase of neural networks; and (2) a specialized circuit that can operate on either gradients and inputs, or gradients and weights for the backward propagation phase of neural networks. The method comprising: (1) an analysis step to obtain the number of operations and the precision of operations in the forward and backward propagation phases of the neural network; (2) a sampling step to obtain the number of zero-valued activations and gradients during the execution of the neural network; (3) a scheduling and estimation step to obtain the runtime for the forward and backward phases of neural network execution using specialized circuits; (4) a builder step to apply the optimal breakdown of resource budget for the forward and backward phases of the neural network to improve the execution of the neural network training for future iterations.
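The resource breakdown described in the abstract can be illustrated with a minimal sketch. It is not the patented analytical model: the cost function (nonzero operations spread evenly over compute units), the exhaustive search over integer splits, and the names `estimate_cycles` and `best_split` are all assumptions made for illustration.

```python
import math

def estimate_cycles(ops, zero_fraction, units):
    """Hypothetical cost model: nonzero ops spread evenly over compute units."""
    effective_ops = ops * (1.0 - zero_fraction)
    return math.ceil(effective_ops / units)

def best_split(total_units, fwd_ops, bwd_ops, fwd_zero=0.0, bwd_zero=0.0):
    """Try every integer split of a fixed budget (total_units >= 2).

    Returns (fwd_units, bwd_units, estimated_cycles) for the split with the
    lowest summed forward + backward cycle estimate.
    """
    best = None
    for fwd_units in range(1, total_units):
        bwd_units = total_units - fwd_units
        cycles = (estimate_cycles(fwd_ops, fwd_zero, fwd_units)
                  + estimate_cycles(bwd_ops, bwd_zero, bwd_units))
        if best is None or cycles < best[2]:
            best = (fwd_units, bwd_units, cycles)
    return best
```

Under this toy model, a pass with more zero-valued data needs fewer effective operations, so the search shifts units toward the other pass. A real flow would replace `estimate_cycles` with the cycle counts produced by the scheduling and estimation step.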

Patent Claims

13 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A data processing system to perform a workflow for generating a hardware accelerator, comprising: memory; and a processor coupled to the memory, the processor is configured to perform a dataflow analysis operation to analyze resource requirements for a dataflow graph of a neural network (NN) model, to perform a resource partitioning operation to analytically split resources of a hardware accelerator, to perform a cycle-accurate scheduling operation to obtain cycle counts for forward and backward passes of the dataflow graph, and to generate a synthesizable hardware accelerator using an optimal resource breakdown.

2. The data processing system of claim 1, wherein the hardware accelerator is implemented on a Field Programmable Gate Array (FPGA).

3. The data processing system of claim 1, wherein performing a dataflow analysis operation to analyze resource requirements for a dataflow graph of the NN model includes iterating, with a dataflow analyzer component, over nodes of the dataflow graph of the NN model and generating a list of pairs including operation type, precision, and operation count for the forward and backward passes of training.

4. The data processing system of claim 1, wherein the processor is further configured to generate a highest precision and a lowest precision for the forward pass of training and to generate, during static analysis, a highest precision and a lowest precision for the backward pass of training.

5. The data processing system of claim 1, wherein the processor is further configured to perform runtime analysis by sampling data propagated in the forward and backward passes of the dataflow graph for numerous iterations using a user-specified batch-size of inputs and to calculate the proportion of zero-valued data in sampled data.

6. The data processing system of claim 1, wherein performing a cycle-accurate scheduling operation to obtain cycle counts for forward and backward passes of the dataflow graph includes dividing the data processing system's look-up table and digital signal processing (DSP) resources into systolic arrays for the forward and backward passes using the optimal resource breakdown, utilizing a two-level hierarchy for organizing on-chip memory, and performing cycle-accurate simulation.

7. The data processing system of claim 1, further comprising: a general purpose array that is configured to compute element-wise data transformations for quantized training.

8. The data processing system of claim 1, wherein the hardware accelerator includes a quantized component for inputs and weights, a mixed precision component for gradients and inputs, and a mixed precision component for gradients and weights.

9. A computer implemented method for quantized neural network (DNN) training comprising: performing a dataflow analysis operation to analyze resource requirements for a dataflow graph of a DNN model; performing, with an analytical model, a resource partitioning operation to analytically split resources of a hardware accelerator; performing a cycle-accurate scheduling operation to obtain cycle counts for forward and backward passes of the dataflow graph; and generating a synthesizable hardware accelerator using an optimal resource breakdown.

10. The computer implemented method of claim 9, wherein performing a dataflow analysis operation to analyze resource requirements for a dataflow graph of a DNN model includes iterating, with a dataflow analyzer component, over nodes of the dataflow graph of the DNN model and generating a list of pairs including operation type, precision, and operation count for the forward and backward passes of training.

11. The computer implemented method of claim 9, further comprising: generating, during static analysis with the dataflow analyzer, a highest precision and a lowest precision for the forward pass of training; and generating, during static analysis with the dataflow analyzer, a highest precision and a lowest precision for the backward pass of training.

12. The computer implemented method of claim 11, further comprising: performing runtime analysis by sampling data propagated in the forward and backward passes of the dataflow graph for numerous iterations using a user-specified batch-size of inputs; and calculating the proportion of zero-valued data in sampled data.

13. The computer implemented method of claim 12, wherein performing a cycle-accurate scheduling operation to obtain cycle counts for forward and backward passes of the dataflow graph includes dividing the hardware accelerator's look-up table and digital signal processing (DSP) resources into systolic arrays for the forward and backward passes using the optimal resource breakdown, utilizing a two-level hierarchy for organizing on-chip memory, and performing cycle-accurate simulation.
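The dataflow analysis and runtime sampling recited in the claims (a list of operation type, precision, and operation count per pass; highest and lowest precision; proportion of zero-valued data) could be sketched as follows. The `Node` record, its field names, and the functions `analyze` and `zero_fraction` are hypothetical and are not taken from the patent; the graph is reduced to a flat list of nodes for brevity.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    op_type: str         # e.g. "conv" or "matmul" (illustrative labels)
    precision_bits: int  # operand bit width
    op_count: int        # number of operations this node performs
    phase: str           # "forward" or "backward"

def analyze(nodes, phase):
    """Aggregate one pass of the graph.

    Returns (entries, highest_precision, lowest_precision), where entries is
    a sorted list of (op_type, precision_bits, op_count) tuples. Raises
    ValueError if no node belongs to the requested phase.
    """
    totals = Counter()
    for node in nodes:
        if node.phase == phase:
            totals[(node.op_type, node.precision_bits)] += node.op_count
    entries = sorted((op, bits, count) for (op, bits), count in totals.items())
    precisions = [bits for _, bits, _ in entries]
    return entries, max(precisions), min(precisions)

def zero_fraction(samples):
    """Proportion of zero-valued entries across sampled batches.

    samples is a non-empty list of batches, each a list of numbers.
    """
    values = [v for batch in samples for v in batch]
    return sum(1 for v in values if v == 0) / len(values)
```

A caller would run `analyze` once per pass during static analysis, then feed activations or gradients captured over several training iterations to `zero_fraction`.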

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06N G06F

Patent Metadata

Filing Date: January 15, 2020

Publication Date: May 3, 2022
