US-12585928-B2

Hardware architecture for introducing activation sparsity in neural network

Published March 24, 2026

Assignee not available in USPTO data we have

Inventors not available in USPTO data we have

Technical Abstract

A hardware accelerator that is efficient at performing computations related to a sparse neural network. The sparse neural network may be associated with a plurality of nodes. An artificial intelligence (AI) accelerator stores, at a memory circuit, a weight tensor and an input activation tensor that corresponds to a node of the neural network. The AI accelerator performs a computation such as convolution between the weight tensor and the input activation tensor to generate an output activation tensor. The AI accelerator introduces sparsity to the output activation tensor by reducing the number of active values in the output activation tensor. The sparse activation may be a K-winner approach, which selects the K largest values in the output activation tensor and sets the remaining values to zero.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. An artificial intelligence accelerator for performing operations related to a neural network, comprising:

. The artificial intelligence accelerator of, wherein the comparator tree circuit is configured to reduce the number of active values in the output activation tensor globally by selecting a limited number of the active values throughout the output activation tensor based on the comparison.

. The artificial intelligence accelerator of, wherein the comparator tree circuit is configured to reduce the number of the active values in the output activation tensor locally by:

. The artificial intelligence accelerator of, wherein the output activation tensor is divided based on the weight tensor of the first node of the neural network, wherein each subset corresponds to a result of the convolution operation with the weight tensor.

. The artificial intelligence accelerator of, wherein the number of active values in the sparsified output activation tensor of the first node is different from the number of active values in a sparsified output activation tensor of the second node of the neural network.

. The artificial intelligence accelerator of, wherein the comparator tree circuit is configured to reduce the number of the active values by:

. The artificial intelligence accelerator of, wherein the sorted number of values are stored as a linked list in a pointer array.

. The artificial intelligence accelerator of, wherein the comparator tree circuit is configured to reduce the number of the active values by:

. The artificial intelligence accelerator of, wherein the comparator tree circuit is configured to reduce the number of active values in the output activation tensor by restricting the output activation tensor to have a structure that defines a distribution of the active values.

. The artificial intelligence accelerator of, wherein the active values are distributed in a block structure in which the output activation tensor is divided into a plurality of blocks, each block comprising a plurality of values, each block being either active or inactive, wherein in an active block, at least one of the values is active, and in an inactive block, all of the values are inactive.

. The artificial intelligence accelerator of, wherein the comparator tree circuit is configured to reduce the number of the active values in the output activation tensor by selecting a number of active blocks, and wherein each selected active block is selected based on an aggregated value of the selected active block.

. The artificial intelligence accelerator of, wherein the comparator tree circuit is configured to apply one or more tie-breaker criteria to determine which one of the active values in the output activation tensor is to be selected.

. A computer-implemented method for processing a neural network, the computer-implemented method comprising:

. The computer-implemented method of, wherein reducing the number of active values in the output activation tensor is performed globally by selecting a limited number of the active values throughout the output activation tensor based on the comparison.

. The computer-implemented method of, wherein reducing the number of active values in the output activation tensor comprises:

. A computing device, comprising:

. The computing device of, wherein artificial intelligence accelerator is configured to reduce the number of the active values in the output activation tensor by:

. The computing device of, wherein reducing the number of active values in the output activation tensor is performed globally by selecting a limited number of active values throughout the output activation tensor.

. The artificial intelligence accelerator of, wherein the comparator tree circuit is configured to retain a predetermined number of the active values by setting the one or more of the active values in the output activation tensor to zero.

. The artificial intelligence accelerator of, wherein the comparator tree circuit comprises find max circuits that selects a highest active value from at least a subset of the active values in the output activation tensor.

Detailed Description

Complete technical specification and implementation details from the patent document.

The present application claims the benefit of U.S. Provisional Patent Application 63/087,644, filed on Oct. 5, 2020, which is hereby incorporated by reference in its entirety.

The present disclosure relates to learning and processing neural networks, and more specifically to hardware architecture that is efficient at performing operations related to sparse neural networks.

The use of artificial neural networks (ANN), or simply neural networks, includes a vast array of technologies. An ANN's complexity, in terms of the number of parameters, is growing exponentially at a faster rate than hardware performance. In many cases, an ANN may have a large number of parameters. Training and inference on these networks are bottlenecked by massive linear tensor operations such as multiplication and convolution. Consequently, a large amount of time and/or resources may be used for both ANN creation (e.g., training) and execution (e.g., inference).

Computing systems that execute ANNs often involve extensive computing operations including multiplication and accumulation. For example, a convolutional neural network (CNN) is a class of machine learning techniques that primarily uses convolution between input data and kernel data, which can be decomposed into multiplication and accumulation operations. Using a central processing unit (CPU) and its main memory to instantiate and execute machine learning systems or models of various configurations is relatively easy because such systems or models can be instantiated with mere updates to code. However, relying solely on the CPU for various operations of these machine learning systems or models would consume significant bandwidth of the CPU as well as increase the overall power consumption.
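As a minimal sketch of the decomposition mentioned above, the following Python function writes a one-dimensional convolution as explicit multiply-accumulate steps. The function name `conv1d_mac` and the sample data are illustrative assumptions and do not come from the patent; as is common in CNN practice, the kernel is applied without flipping (cross-correlation).

```python
def conv1d_mac(inputs, kernel):
    """Valid-mode 1D convolution (no kernel flip) as multiply-accumulate steps."""
    out_len = len(inputs) - len(kernel) + 1
    outputs = []
    for start in range(out_len):
        acc = 0  # accumulator for one output element
        for offset, weight in enumerate(kernel):
            acc += inputs[start + offset] * weight  # one multiply-accumulate
        outputs.append(acc)
    return outputs


signal = [1, 2, 3, 4, 5]   # hypothetical input data
kernel = [1, 0, -1]        # hypothetical kernel data
result = conv1d_mac(signal, kernel)
print(result)  # [-2, -2, -2]
```

Each output element costs one multiplication and one addition per kernel weight, which is the workload a dedicated multiply circuit is meant to absorb.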

Embodiments relate to an artificial intelligence (AI) accelerator for performing operations related to a sparse neural network. The AI accelerator may include a memory circuit configured to store a weight tensor and an input activation tensor that corresponds to a node of the neural network. The AI accelerator may also include a multiply circuit coupled to the memory circuit. The multiply circuit is configured to perform a computation between the weight tensor and the input activation tensor to generate an output activation tensor. The AI accelerator may further include a comparator circuit coupled to the multiply circuit. The comparator circuit is configured to receive the output activation tensor and reduce a number of active values in the output activation tensor.

In one embodiment, the comparator circuit is configured to reduce the number of active values in the output activation tensor globally by selecting a limited number of active values in the entirety of the output activation tensor.

In one embodiment, the comparator circuit is configured to reduce the number of active values in the output activation tensor locally by dividing the output activation tensor into a plurality of subsets and selecting, for each subset, a limited number of active values in the subset.
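The difference between global and local selection can be illustrated in software. This is a sketch under stated assumptions, not the circuit itself: the tensor is a list of rows, each row stands in for one subset, and the helper names (`k_winners`, `k_winners_global`, `k_winners_local`) are hypothetical.

```python
def k_winners(values, k):
    """Keep the k largest values of a flat list and set the rest to zero."""
    if k <= 0:
        return [0] * len(values)
    # Rank positions by value, descending; equal values go to the lower index
    # (an assumed tie-break rule).
    ranked = sorted(range(len(values)), key=lambda i: (-values[i], i))
    keep = set(ranked[:k])
    return [v if i in keep else 0 for i, v in enumerate(values)]


def k_winners_global(tensor, k):
    """Select k winners across the entire 2D tensor (a list of equal-length rows)."""
    flat = [v for row in tensor for v in row]
    sparse = k_winners(flat, k)
    width = len(tensor[0])
    return [sparse[r * width:(r + 1) * width] for r in range(len(tensor))]


def k_winners_local(tensor, k):
    """Select k winners independently inside each subset (here, each row)."""
    return [k_winners(row, k) for row in tensor]


activations = [[5, 1, 4], [2, 9, 8]]
global_out = k_winners_global(activations, 2)
local_out = k_winners_local(activations, 1)
print(global_out)  # [[0, 0, 0], [0, 9, 8]]
print(local_out)   # [[5, 0, 0], [0, 9, 0]]
```

Global selection can leave an entire subset empty, as the first row shows, whereas local selection guarantees every subset keeps its own winners.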

In one embodiment, the output activation tensor is divided based on kernels corresponding to the node of the neural network. Each subset corresponds to a result of the computation with a kernel.

In one embodiment, the number of active values is different for the node and that of a second node.

In one embodiment, the comparator circuit is configured to reduce the number of active values by receiving a stream of values in the output activation tensor, sorting the stream of values to a sorted number of values, and discarding values that are not selected in the sorted number of values.

In one embodiment, the sorted number of values are stored as a linked list in a pointer array.
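One way to picture a sorted set of values kept as a linked list in a pointer array is sketched below. The class `StreamingTopK`, its field names, and the choice to keep the list in ascending order (so that the smallest retained value is always at the head and is the one evicted) are assumptions made for illustration; the patent text here does not specify these details.

```python
class StreamingTopK:
    """Keeps the k largest streamed values as a sorted linked list.

    The list lives in two fixed-size arrays: `vals` holds the values and
    `next_ptr` holds, for each slot, the index of the next-larger slot (-1 = end).
    """

    def __init__(self, k):
        self.k = k
        self.vals = [0] * k
        self.next_ptr = [-1] * k
        self.head = -1   # slot holding the smallest retained value
        self.count = 0

    def _link(self, slot):
        """Splice `slot` into the ascending list by rewriting pointers only."""
        prev, cur = -1, self.head
        while cur != -1 and self.vals[cur] < self.vals[slot]:
            prev, cur = cur, self.next_ptr[cur]
        self.next_ptr[slot] = cur
        if prev == -1:
            self.head = slot
        else:
            self.next_ptr[prev] = slot

    def push(self, value):
        if self.k <= 0:
            return
        if self.count < self.k:
            slot = self.count                # a free slot is still available
            self.count += 1
        elif value > self.vals[self.head]:
            slot = self.head                 # evict the current minimum
            self.head = self.next_ptr[slot]
        else:
            return                           # not a winner: discard
        self.vals[slot] = value
        self._link(slot)

    def winners(self):
        """Return the retained values in ascending order."""
        out, cur = [], self.head
        while cur != -1:
            out.append(self.vals[cur])
            cur = self.next_ptr[cur]
        return out


top = StreamingTopK(3)
for v in [4, 7, 1, 9, 3, 8]:
    top.push(v)
print(top.winners())  # [7, 8, 9]
```

Because values never move between slots, an insertion only rewrites a couple of pointers, which is the property that makes this layout attractive for a streaming design.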

In one embodiment, the comparator circuit is configured to reduce the number of active values by storing values in the output activation tensor in a buffer array, selecting the number of active values, and setting the remaining values in the output activation tensor that are not selected to zero.

In one embodiment, the comparator circuit comprises a comparator tree circuit that is configured to select the highest value from a set of values. The comparator circuit is configured to select the number of active values by repeatedly running the comparator tree circuit until the selected active values reach the number.
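The repeated use of a find-max tree can be modeled in software as follows. This is a behavioral sketch only: `find_max` and `k_winners_by_tree` are hypothetical names, each pairwise comparison stands in for one comparator, and removing the winner from the candidate list is an assumed way of masking it out before the next pass.

```python
def find_max(candidates):
    """Tournament of pairwise comparisons over (value, index) pairs."""
    level = list(candidates)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            nxt.append(a if a[0] >= b[0] else b)  # one comparator
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through to the next level
        level = nxt
    return level[0]


def k_winners_by_tree(values, k):
    """Run the find-max tree repeatedly until k winners have been selected."""
    remaining = [(v, i) for i, v in enumerate(values)]
    output = [0] * len(values)
    for _ in range(min(k, len(values))):
        value, index = find_max(remaining)
        output[index] = value
        remaining.remove((value, index))  # mask out the winner before the next pass
    return output


print(k_winners_by_tree([3, 7, 1, 9, 4], 2))  # [0, 7, 0, 9, 0]
```

Each pass through the tree takes a number of comparison levels that grows with the logarithm of the candidate count, and K passes are needed for K winners.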

In one embodiment, the comparator circuit is configured to reduce the number of active values in the output activation tensor by restricting the output activation tensor to have a structure that defines a distribution of the active values.

In one embodiment, the active values are distributed in a block structure in which the output activation tensor is divided into a plurality of blocks. Each block includes a plurality of values. Each block is either active or inactive. In an active block, at least one of the values is active. In an inactive block, all of the values are inactive.

In one embodiment, the comparator circuit is configured to reduce the number of active values in the output activation tensor by selecting a number of active blocks. Each selected active block is selected based on an aggregated value of the selected active block.
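Block-structured selection can be sketched as below. The function name `block_k_winners` is hypothetical, and using the sum as the aggregated value of a block is an assumption; the text does not say which aggregate is used.

```python
def block_k_winners(values, block_size, num_active_blocks):
    """Keep whole blocks with the largest aggregated value; zero the other blocks."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    # Aggregate each block. The sum is an assumption; other aggregates could be used.
    scores = [sum(block) for block in blocks]
    ranked = sorted(range(len(blocks)), key=lambda b: (-scores[b], b))
    active = set(ranked[:num_active_blocks])
    output = []
    for b, block in enumerate(blocks):
        output.extend(block if b in active else [0] * len(block))
    return output


values = [1, 2, 9, 0, 3, 3, 0, 1]
print(block_k_winners(values, block_size=2, num_active_blocks=2))
# [0, 0, 9, 0, 3, 3, 0, 0]
```

Note that an active block may still contain zeros (the block `[9, 0]` above), which matches the definition that an active block needs only at least one active value.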

In one embodiment, the comparator circuit is configured to reduce the number of active values in the output activation tensor by determining a threshold value for selecting values to remain active, maintaining values in the output activation tensor that are larger than the threshold value, and setting remaining values in the output activation tensor to zeros.

In one embodiment, the comparator circuit is configured to reduce the number of active values by storing values in the output activation tensor in a buffer array, creating a histogram of the values, determining a threshold value based on the histogram, and setting the values that are smaller than the threshold value as zero.
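A histogram-based threshold might be derived as in the following sketch. The function names, the fixed bin count, and the rule of walking the bins from the top until at least K values are covered are all assumptions for illustration. Because the threshold falls on a bin edge, this approach is approximate and may keep more than K values. The input is assumed to be non-empty.

```python
def histogram_threshold(values, k, num_bins=8):
    """Pick a bin-edge threshold such that at least k values are >= the threshold."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        b = min(int((v - lo) / width), num_bins - 1)  # clamp the maximum into the top bin
        counts[b] += 1
    covered = 0
    for b in range(num_bins - 1, -1, -1):  # walk bins from the largest values down
        covered += counts[b]
        if covered >= k:
            return lo + b * width  # lower edge of this bin
    return lo


def sparsify_by_threshold(values, threshold):
    """Set values smaller than the threshold to zero."""
    return [v if v >= threshold else 0 for v in values]


buffer_array = [0, 3, 16, 7, 12, 5, 14, 1]   # hypothetical buffered activations
threshold = histogram_threshold(buffer_array, k=3)
print(threshold)                                       # 12.0
print(sparsify_by_threshold(buffer_array, threshold))  # [0, 0, 16, 0, 12, 0, 14, 0]
```

The appeal of this design is that building the histogram takes a single pass over the buffer and avoids a full sort.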

In one embodiment, one or more values in the output activation tensor are equal. The comparator circuit is configured to apply one or more tie-breaker criteria to determine which of the one or more equal values is selected.
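The text does not name specific tie-breaker criteria, so the sketch below uses position in the tensor as an assumed example: equal values are resolved in favor of either the lowest or the highest index. The function name and the `prefer` parameter are hypothetical.

```python
def select_with_tie_break(values, k, prefer="lowest_index"):
    """Return the sorted indices of the k winners, resolving equal values deterministically."""
    if prefer == "lowest_index":
        key = lambda i: (-values[i], i)
    elif prefer == "highest_index":
        key = lambda i: (-values[i], -i)
    else:
        raise ValueError("unknown tie-break rule: " + prefer)
    ranked = sorted(range(len(values)), key=key)
    return sorted(ranked[:k])


values = [5, 7, 5, 5]
print(select_with_tie_break(values, 2))                          # [0, 1]
print(select_with_tie_break(values, 2, prefer="highest_index"))  # [1, 3]
```

Without such a rule, three equal candidates competing for one remaining slot would make the number of active values unpredictable.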

The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings and specification. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter.

In the following description of embodiments, numerous specific details are set forth in order to provide more thorough understanding. However, note that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.

A preferred embodiment is now described with reference to the figures where like reference numbers indicate identical or functionally similar elements. Also in the figures, the left-most digit of each reference number corresponds to the figure in which the reference number is first used.

Reference in the specification to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

Some portions of the detailed description that follows are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps (instructions) leading to the desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical, magnetic or optical signals capable of being stored, transferred, combined, compared and otherwise manipulated. It is convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. Furthermore, it is also convenient at times, to refer to certain arrangements of steps requiring physical manipulations of physical quantities as modules or code devices, without loss of generality.

However, all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.

Certain aspects of the embodiments include process steps and instructions described herein in the form of an algorithm. It should be noted that the process steps and instructions of the embodiments could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by a variety of operating systems.

Embodiments also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. A computer readable medium is a non-transitory medium that does not include propagation signals and transient waves. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, embodiments are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings as described herein, and any references below to specific languages are provided for disclosure of enablement and best mode of the embodiments.

In addition, the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure set forth herein is intended to be illustrative, but not limiting, of the scope, which is set forth in the claims.

Embodiments relate to the architecture of an artificial intelligence (AI) accelerator that is efficient at processing sparse nodes of a neural network. A sparse node may include a sparse tensor that has a low density of active values. In using a generic processor, the computation operation of a tensor, sparse or dense, may include computing the value in the tensor one by one. However, in a sparse tensor, many values in the tensor are inactive (e.g., zeros) and computation with such inactive values can be skipped. The AI accelerator may perform a sparse activation to reduce the number of active values in an output tensor that corresponds to the output of a node of the neural network. The sparse activation may include a K-winner approach, which keeps a number (K) of the largest values in the tensor and sets the rest of the values to zeros. The AI accelerator may include an architecture that is efficient at performing a K-winner approach.
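The benefit of skipping inactive values can be seen in a small software model. This is only a sketch: `to_sparse` and `sparse_dot` are hypothetical helpers, and the (index, value) pair encoding is an assumed representation of a sparse tensor.

```python
def to_sparse(values):
    """Compress a dense vector into (index, value) pairs for its active entries."""
    return [(i, v) for i, v in enumerate(values) if v != 0]


def sparse_dot(sparse_activations, weights):
    """Dot product that only touches active activations; also counts the MACs used."""
    total = 0
    macs = 0
    for index, value in sparse_activations:
        total += value * weights[index]
        macs += 1
    return total, macs


activations = [0, 0, 3, 0, 0, 0, 2, 0]   # a sparse activation vector
weights = [4, 1, 2, 7, 5, 3, 6, 8]
total, macs = sparse_dot(to_sparse(activations), weights)
print(total, macs)  # 18 2
```

A dense loop over the same vectors would perform eight multiply-accumulate operations to reach the same result, six of them multiplying by zero.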

Example Computing Device Architecture

The corresponding figure is a block diagram of an example computing device for processing one or more sparse neural networks, according to an embodiment. A computing device may be a server computer, a personal computer, a portable electronic device, a wearable electronic device (e.g., a smartwatch), an IoT device (e.g., a sensor), a smart/connected appliance (e.g., a refrigerator), a dongle, a device in edge computing, a device with limited processing power, etc. The computing device may include, among other components, a central processing unit (CPU), an artificial intelligence accelerator (AI accelerator), a graphical processing unit (GPU), system memory, a storage unit, an input interface, an output interface, a network interface, and a bus connecting these components. In various embodiments, the computing device may include additional, fewer or different components.

While some of the components in this disclosure may at times be described in a singular form while other components may be described in a plural form, various components described in any system may include one or more copies of the components. For example, a computing device may include more than one processor such as CPU, AI accelerator, and GPU, but the disclosure may refer to the processors as “a processor” or “the processor.” Also, a processor may include multiple cores.

The CPU may be a general-purpose processor using any appropriate architecture. The CPU retrieves and executes computer code that includes instructions that, when executed, may cause the CPU or another processor, individually or in combination, to perform certain actions or processes that are described in this disclosure. Instructions can be any directions, commands, or orders that may be stored in different forms, such as equipment-readable instructions, programming instructions including source code, and other communication signals and orders. Instructions may be used in a general sense and are not limited to machine-readable codes. The CPU may be used to compile the instructions and also determine which processors may be used to perform certain tasks based on the commands in the instructions. For example, certain machine learning computations may be processed more efficiently using the AI accelerator while other parallel computations may be better processed using the GPU.

The AI accelerator may be a processor that is efficient at performing certain machine learning operations such as tensor multiplications, convolutions, tensor dot products, etc. In various embodiments, the AI accelerator may have different hardware architectures. For example, in one embodiment, the AI accelerator may take the form of field-programmable gate arrays (FPGAs). In another embodiment, the AI accelerator may take the form of application-specific integrated circuits (ASICs), which may include circuits alone or circuits in combination with firmware.

The GPU may be a processor that includes highly parallel structures that are more efficient than the CPU at processing large blocks of data in parallel. The GPU may be used to process graphical data and accelerate certain graphical operations. In some cases, owing to its parallel nature, the GPU may also be used to process a large number of machine learning operations in parallel. The GPU is often efficient at performing the same type of workload many times in rapid succession.

While the processors CPU, AI accelerator, and GPU are illustrated in the figure as separate components, in various embodiments the structure of one processor may be embedded in another processor. For example, one or more examples of the circuitry of the AI accelerator disclosed in different figures of this disclosure may be embedded in a CPU. The processors may also be included in a single chip such as in a system-on-a-chip (SoC) implementation. In various embodiments, the computing device may also include additional processors for various specific purposes. In this disclosure, the various processors may be collectively referred to as “processors” or “a processor.”

System memory includes circuitry for storing instructions for execution by a processor and for storing data processed by the processor. System memory may take the form of any type of memory structure including, for example, dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.), RAMBUS DRAM (RDRAM), static RAM (SRAM) or a combination thereof. System memory usually takes the form of volatile memory.

The storage unit may be a persistent storage for storing data and software applications in a non-volatile manner. The storage unit may take the form of read-only memory (ROM), hard drive, flash memory, or another type of non-volatile memory device. The storage unit stores the operating system of the computing device, various software applications and machine learning models. The storage unit may store computer code that includes instructions that, when executed, cause a processor to perform one or more processes described in this disclosure.

Applications may be any suitable software applications that operate at the computing device. An application may be in communication with other devices via the network interface. Applications may be of different types. In one case, an application may be a web application, such as an application that runs on JavaScript. In another case, an application may be a mobile application. For example, the mobile application may run on Swift for iOS and other APPLE operating systems or on Java or another suitable language for ANDROID systems. In yet another case, an application may be a software program that operates on a desktop operating system such as LINUX, MICROSOFT WINDOWS, MAC OS, or CHROME OS. In yet another case, an application may be a built-in application in an IoT device. An application may include a graphical user interface (GUI) that visually renders data and information. An application may include tools for training machine learning models and/or performing inference using the trained machine learning models.

Machine learning models may include different types of algorithms for making inferences based on the training of the models. Examples of machine learning models include regression models, random forest models, support vector machines (SVMs) such as kernel SVMs, and artificial neural networks (ANNs) such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), autoencoders, long short term memory (LSTM), and reinforcement learning (RL) models. Some of the machine learning models may include a sparse network structure whose details are discussed further below. A machine learning model may be an independent model that is run by a processor. A machine learning model may also be part of a software application. Machine learning models may perform various tasks.

By way of example, a machine learning model may receive sensed inputs representing images, videos, audio signals, sensor signals, data related to network traffic, financial transaction data, communication signals (e.g., emails, text messages and instant messages), documents, insurance records, biometric information, parameters for a manufacturing process (e.g., semiconductor fabrication parameters), inventory patterns, energy or power usage patterns, data representing genes, results of scientific experiments or parameters associated with the operation of a machine (e.g., vehicle operation) and medical treatment data. The machine learning model may process such inputs and produce an output representing, among others, identification of objects shown in an image, identification of recognized gestures, classification of digital images as pornographic or non-pornographic, identification of email messages as unsolicited bulk email (‘spam’) or legitimate email (‘non-spam’), prediction of a trend in a financial market, prediction of failures in a large-scale power system, identification of a speaker in an audio recording, classification of loan applicants as good or bad credit risks, identification of network traffic as malicious or benign, identity of a person appearing in the image, processed natural language processing, weather forecast results, patterns of a person's behavior, control signals for machines (e.g., automatic vehicle navigation), gene expression and protein interactions, analytic information on access to resources on a network, parameters for optimizing a manufacturing process, predicted inventory, predicted energy usage in a building or facility, web analytics (e.g., predicting which link or advertisement users are likely to click), identification of anomalous patterns in insurance records, prediction on results of experiments, indication of illness that a person is likely to experience, selection of contents that may be of interest to a user, indication on prediction of a person's behavior (e.g., ticket purchase, no-show behavior), prediction on election, prediction/detection of adverse events, a string of texts in the image, indication representing topic in text, and a summary of text or prediction on reaction to medical treatments. The underlying representation (e.g., photo, audio, etc.) can be stored in system memory and/or the storage unit.

The input interface receives data from external sources such as sensor data or action information. The output interface is a component for providing the result of computations in various forms (e.g., image or audio signals). The computing device may include various types of input or output interfaces, such as displays, keyboards, cameras, microphones, speakers, antennas, fingerprint sensors, touch sensors, and other measurement sensors. Some input interfaces may directly work with a machine learning model to perform various functions. For example, a sensor may use a machine learning model to infer interpretations of measurements. The output interface may be in communication with humans, robotic agents or other computing devices.

The network interface enables the computing device to communicate with other computing devices via a network. The networks may include, but are not limited to, Local Area Networks (LANs) (e.g., an Ethernet or corporate network) and Wide Area Networks (WANs). When multiple nodes or components of a single node of a machine learning model are embodied in multiple computing devices, information associated with various processes in the machine learning model, such as temporal sequencing, spatial pooling and management of nodes, may be communicated between computing devices via the network interface.

Example Neural Network Architecture

The corresponding figure is a conceptual diagram illustrating an example architecture of a neural network, according to an embodiment. The illustrated neural network shows a generic structure of a neural network. The neural network may represent different types of neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), autoencoders, and long short term memory (LSTM). In various embodiments, customized changes may be made to this general structure. The neural network may also be a hierarchical temporal memory system as described, for example, in U.S. Patent Application Publication No. 2020/0097857, published on May 26, 2020, which is incorporated herein by reference in its entirety.

The neural network includes an input layer, an output layer and one or more hidden layers. The input layer is the first layer of the neural network. The input layer receives input data, such as image data, speech data, text, etc. The output layer is the last layer of the neural network. The output layer may generate one or more inferences in the form of classifications or probabilities. The neural network may include any number of hidden layers. Hidden layers are intermediate layers in the neural network that perform various operations. The neural network may include additional or fewer layers than the example shown in the figure. Each layer may include one or more nodes. The number of nodes in each layer of the neural network shown in the figure is an example only. A node may be associated with certain weights and activation functions. In various embodiments, the nodes in the neural network may be fully connected or partially connected.

Each node in the neural network may be associated with different operations. For example, in a simple form, the neural network may be a vanilla neural network whose nodes are each associated with a set of linear weight coefficients and an activation function. In another embodiment, the neural network may be an example convolutional neural network (CNN). In this example CNN, nodes in one layer may be associated with convolution operations with kernels as weights that are adjustable in the training process. Nodes in another layer may be associated with spatial pooling operations. In yet another embodiment, the neural network may be a recurrent neural network (RNN) whose nodes may be associated with more complicated structures such as loops and gates. In a neural network, each node may represent a different structure and have different weight values and a different activation function.

The corresponding figure is a block diagram illustrating an example general operation of a node in the neural network, according to an embodiment. A node may receive an input activation tensor, which can be an N-dimensional tensor, where N can be greater than or equal to one. The input activation tensor may be the input data of the neural network if the node is in the input layer. The input activation tensor may also be the output of another node in the preceding layer. The node may apply a weight tensor to the input activation tensor in a linear operation, such as addition, scaling, biasing, tensor multiplication, and convolution in the case of a CNN. The result of the linear operation may be processed by a non-linear activation such as a step function, a sigmoid function, a hyperbolic tangent function (tanh), and rectified linear unit functions (ReLU). The result of the activation is an output activation tensor that is sent to a subsequent connected node that is in the next layer of the neural network. The subsequent node uses the output activation tensor as its input activation tensor.
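The general operation of a node, a linear operation followed by a non-linear activation, can be sketched as follows. The function `node_forward`, the matrix-vector form of the linear operation, and the sample numbers are illustrative assumptions; a CNN node would use convolution instead.

```python
import math


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def node_forward(inputs, weights, bias, activation):
    """Linear operation (weighted sum plus bias) followed by a non-linear activation."""
    outputs = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(activation(pre_activation))
    return outputs


input_activation = [1.0, 2.0]            # hypothetical input activation tensor
weight_tensor = [[1.0, -1.0], [0.5, 0.5]]
bias = [0.0, 1.0]
output_activation = node_forward(input_activation, weight_tensor, bias, relu)
print(output_activation)  # [0.0, 2.5]
```

The returned list plays the role of the output activation tensor that the next layer would receive as its input.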

In various embodiments, a wide variety of machine learning techniques may be used in training the neural network. The neural network may be associated with an objective function (also commonly referred to as a loss function), which generates a metric value that describes the objective goal of the training process. The training may intend to reduce the error rate of the model in generating predictions. In such a case, the objective function may monitor the error rate of the neural network. For example, in object recognition (e.g., object detection and classification), the objective function of the neural network may be the training error rate in classifying objects in a training set. Other forms of objective functions may also be used. In various embodiments, the error rate may be measured as cross-entropy loss, L1 loss (e.g., the sum of absolute differences between the predicted values and the actual value), L2 loss (e.g., the sum of squared distances) or their combinations.
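The three error measures named above can be written out directly. This is a minimal sketch with hypothetical function names; the cross-entropy form shown assumes a single correct class and a list of predicted class probabilities.

```python
import math


def l1_loss(predicted, actual):
    """Sum of absolute differences between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual))


def l2_loss(predicted, actual):
    """Sum of squared distances between predicted and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


def cross_entropy_loss(probabilities, target_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probabilities[target_index])


print(l1_loss([1, 2, 3], [1, 4, 0]))  # 5
print(l2_loss([1, 2, 3], [1, 4, 0]))  # 13
print(cross_entropy_loss([0.25, 0.5, 0.25], 1))
```

L2 loss penalizes the largest error more heavily than L1 loss does, which is visible in the example: the error of 3 contributes 9 of the 13.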

Patent Metadata

Filing Date

Unknown

Publication Date

March 24, 2026

Inventors

Unknown
