10678507

Programmable Multiply-Add Array Hardware

Published: June 9, 2020
Assignee: not available in USPTO data we have
Technical Abstract

Patent Claims
14 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A method for specifying functionalities to be performed on a data architecture including N adders and N multipliers configured to receive operands, the method comprising: receiving instructions for the data architecture to operate in one of a multiply-reduce mode or a multiply-accumulate mode, wherein the N multipliers and at least some of the N adders of the data architecture are used both in the multiply-reduce mode and the multiply-accumulate mode; and selecting, based on the instructions, a data flow between the N multipliers and the at least some of the N adders of the data architecture, wherein the N multipliers includes a first multiplier of which output data is provided to a first adder among the at least some of the N adders in the multiply-reduce mode and to a second adder among the at least some of the N adders in the multiply-accumulate mode.

Plain English Translation

This invention relates to a data architecture system designed for efficient computation using N multipliers and N adders. The system addresses the challenge of optimizing hardware resources for different computational modes, specifically multiply-reduce and multiply-accumulate operations, which are commonly used in digital signal processing and machine learning applications. The architecture reuses the same multipliers and adders for both modes, reducing hardware complexity and cost. The method involves receiving instructions to configure the data architecture in either multiply-reduce or multiply-accumulate mode. In multiply-reduce mode, the output of a first multiplier is directed to a first adder, which performs a reduction operation (e.g., summing partial results). In multiply-accumulate mode, the same multiplier's output is instead routed to a second adder, which accumulates results over multiple cycles. The system dynamically selects the data flow between multipliers and adders based on the selected mode, ensuring efficient resource utilization. By reusing hardware components across different computational modes, the invention minimizes redundant circuitry while maintaining flexibility for various mathematical operations. This approach is particularly useful in applications requiring both reduction and accumulation operations, such as matrix multiplications in neural networks or signal processing pipelines. The system ensures high performance with reduced hardware overhead.
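As a rough illustration of the two modes described above, the following Python sketch models the behaviour of an array of N multipliers whose products are either summed into one value (multiply-reduce) or added lane by lane to running totals (multiply-accumulate). It is a behavioural model only. The names `Mode` and `run_array`, and the `acc` argument holding per-lane running totals, are assumptions made for illustration and do not come from the patent.

```python
from enum import Enum
from typing import List, Sequence, Union


class Mode(Enum):
    """Hypothetical encoding of the mode carried by the instructions."""
    MULTIPLY_REDUCE = "multiply-reduce"
    MULTIPLY_ACCUMULATE = "multiply-accumulate"


def run_array(mode: Mode,
              a: Sequence[float],
              b: Sequence[float],
              acc: Sequence[float]) -> Union[float, List[float]]:
    """Behavioural model of N multipliers feeding adders in one of two modes."""
    if not (len(a) == len(b) == len(acc)):
        raise ValueError("a, b and acc must all have length N")

    # The N multipliers are used the same way in both modes.
    products = [x * y for x, y in zip(a, b)]

    if mode is Mode.MULTIPLY_REDUCE:
        # Reduce: the products are summed into a single value.
        return sum(products)

    # Accumulate: each product is added to its own running total
    # (assumed here to be supplied by the caller as `acc`).
    return [p + s for p, s in zip(products, acc)]
```

For example, with `a = [1, 2, 3, 4]` and `b = [5, 6, 7, 8]`, the multiply-reduce mode returns the dot product, while the multiply-accumulate mode returns one updated total per lane.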

Claim 2

Original Legal Text

2. The method of claim 1 , wherein selecting the data flow includes, in response to receiving instructions corresponding to the multiply-reduce mode, selecting a first data flow using the N multipliers and N−1 adders, wherein one of the N adders is not used.

Plain English Translation

This claim narrows the method of claim 1, in which instructions select whether a data architecture with N multipliers and N adders operates in a multiply-reduce mode or a multiply-accumulate mode. Claim 2 specifies that, when the instructions correspond to the multiply-reduce mode, the selected first data flow uses all N multipliers and N−1 of the N adders, so one adder is not used. A multiply-reduce operation multiplies pairs of input elements and then sums the products into a single result; with two-input adders, summing N products takes N−1 additions, which is consistent with one adder being idle. Because the same multipliers and adders also serve the multiply-accumulate mode, the hardware can be reconfigured between modes instead of being duplicated. Operations of this kind are common in digital signal processing and machine learning, for example in matrix computations.

Claim 3

Original Legal Text

3. The method of claim 2 , wherein the first data flow comprises the N−1 adders receiving input resulting from the N multipliers.

Plain English Translation

This claim narrows the multiply-reduce configuration of claim 2. In that configuration the data architecture uses all N multipliers but only N−1 of its N adders, leaving one adder unused. Claim 3 adds that those N−1 adders receive input resulting from the N multipliers: the products generated by the multipliers are what the active adders combine, whether an adder takes a product directly or takes a sum derived from products. The claim does not specify how the N−1 adders are interconnected, although if each adder has two inputs, N−1 additions are exactly enough to sum N values into one result. Multiply-reduce operations of this kind appear in dot products and matrix multiplication, for example in neural network workloads and signal processing.
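The adder count in the multiply-reduce mode can be checked with a small sketch. The code below sums a list of products pairwise and counts how many two-input additions it performs. The pairwise tree arrangement and the function name `tree_reduce` are assumptions chosen for illustration; the claims only state that N−1 adders receive input resulting from the N multipliers, not how those adders are connected.

```python
from typing import List, Tuple


def tree_reduce(products: List[float]) -> Tuple[float, int]:
    """Sum the products pairwise; return (total, two-input additions used)."""
    if not products:
        raise ValueError("need at least one product")

    level = list(products)
    adders_used = 0
    while len(level) > 1:
        next_level = []
        # Pair up neighbouring values; each pair consumes one adder.
        for i in range(0, len(level) - 1, 2):
            next_level.append(level[i] + level[i + 1])
            adders_used += 1
        # With an odd count, the last value passes through unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
    return level[0], adders_used
```

For N products the counter always ends at N−1, which matches the "N multipliers and N−1 adders" wording of the multiply-reduce data flow.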

Claim 4

Original Legal Text

4. The method of claim 1 , wherein selecting the data flow includes, in response to receiving instructions corresponding to the multiply-accumulate mode, selecting a second data flow using the N multipliers and the N adders.

Plain English Translation

This claim narrows the method of claim 1, in which instructions select whether a data architecture with N multipliers and N adders operates in a multiply-reduce mode or a multiply-accumulate mode. Claim 4 specifies that, when the instructions correspond to the multiply-accumulate mode, the selected second data flow uses all N multipliers and all N adders. A multiply-accumulate operation multiplies pairs of input elements and adds each product to an accumulated value, which is the basic step in computing weighted sums. Because the data flow is selected by instruction, the same multipliers and adders can be switched to the multiply-reduce configuration when needed. Multiply-accumulate operations are widely used in digital filtering, matrix operations, and machine learning.

Claim 5

Original Legal Text

5. The method of claim 4 , wherein the second data flow comprises each adder of the N adders receiving an input operand from a corresponding multiplier of the N multipliers.

Plain English Translation

This claim narrows claim 4. In the method of claim 1, instructions select whether a data architecture with N multipliers and N adders operates in a multiply-reduce mode or a multiply-accumulate mode, and under claim 4 the multiply-accumulate mode selects a second data flow that uses all N multipliers and all N adders. Claim 5 adds that, in this second data flow, each of the N adders receives an input operand from a corresponding multiplier, giving a one-to-one pairing of multipliers and adders. Each multiplier produces a product and its paired adder adds that product to another operand, which in a multiply-accumulate operation is typically a running total. The claim does not state where the adder's other operand comes from or whether the connection is buffered. This pairing suits computations such as convolution, digital filtering, and matrix operations, where many independent sums are updated in parallel.
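The one-to-one pairing of multipliers and adders in the multiply-accumulate mode can be sketched as a loop over cycles, with one running total per lane. This is a minimal model under stated assumptions: the function name `multiply_accumulate`, the idea that each adder's second operand is its own previous output, and the zero initial totals are illustrative choices, not details taken from the claims.

```python
from typing import List, Sequence


def multiply_accumulate(a_rows: Sequence[Sequence[float]],
                        b_rows: Sequence[Sequence[float]]) -> List[float]:
    """Accumulate a[i] * b[i] per lane over successive cycles."""
    if not a_rows or len(a_rows) != len(b_rows):
        raise ValueError("need the same non-zero number of a and b rows")

    n = len(a_rows[0])       # number of multiplier/adder lanes (N)
    totals = [0] * n         # assumed: every lane starts from zero

    for a, b in zip(a_rows, b_rows):   # one loop iteration per cycle
        if len(a) != n or len(b) != n:
            raise ValueError("every row must have N elements")
        for i in range(n):
            product = a[i] * b[i]              # multiplier i
            totals[i] = totals[i] + product    # adder i, paired with multiplier i
    return totals
```

Unlike the multiply-reduce mode, the lanes never mix: lane `i` only ever sees products from multiplier `i`, so all N adders are busy on every cycle.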

Claim 6

Original Legal Text

6. An integrated circuit comprising: a data architecture including N adders and N multipliers configured to receive operands, wherein the data architecture receives instructions for selecting a data flow between the N multipliers and at least some of the N adders of the data architecture, the selected data flow including the options: a first data flow using the N multipliers and the N adders to provide a multiply-accumulate mode; and a second data flow to provide a multiply-reduce mode, wherein the N multipliers and the at least some of the N adders are used both in the first data flow and the second data flow, and wherein the N multipliers includes a first multiplier of which output data is provided to a first adder among the at least some of the N adders in the first data flow and to a second adder among the at least some of the N adders in the second data flow.

Plain English Translation

The invention relates to an integrated circuit designed for efficient data processing, particularly in applications requiring flexible data flow configurations. The circuit addresses the need for hardware that can dynamically switch between different computational modes without requiring redundant components, thereby optimizing resource utilization and performance. The integrated circuit includes a data architecture featuring N adders and N multipliers, which receive operands for processing. The architecture is configured to execute instructions that select between two distinct data flow modes: a multiply-accumulate mode and a multiply-reduce mode. In the multiply-accumulate mode, the N multipliers and N adders work together to perform accumulation operations, where the output of each multiplier is fed into a corresponding adder. In the multiply-reduce mode, the same multipliers and a subset of the adders are used to perform reduction operations, such as summing or other reduction functions, on the outputs of the multipliers. The design ensures that the multipliers and adders are reused across both modes, with a specific multiplier's output being directed to different adders depending on the selected mode. This reconfigurable approach minimizes hardware redundancy while supporting versatile computational tasks.
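One way to picture a multiplier feeding different adders in the two data flows is a pair of routing tables. The sketch below uses N = 4 and a purely hypothetical wiring: in the first data flow (multiply-accumulate) multiplier `i` feeds adder `i`, and in the second data flow (multiply-reduce) the multipliers feed adders 0 and 1, whose outputs feed adder 2, leaving adder 3 idle. All names and the specific wiring are assumptions for illustration; the claim does not fix which adders are involved.

```python
from typing import Dict, List, Optional, Set

N = 4  # number of multipliers and adders in this toy example

# First data flow (multiply-accumulate): multiplier i -> adder i.
MAC_ROUTE: Dict[int, int] = {m: m for m in range(N)}

# Second data flow (multiply-reduce), hypothetical wiring:
# multipliers 0 and 1 -> adder 0, multipliers 2 and 3 -> adder 1.
REDUCE_ROUTE: Dict[int, int] = {0: 0, 1: 0, 2: 1, 3: 1}
# Adder-to-adder links in the reduce flow: adders 0 and 1 -> adder 2.
REDUCE_ADDER_LINKS: Dict[int, int] = {0: 2, 1: 2}


def adders_used(mult_route: Dict[int, int],
                adder_links: Optional[Dict[int, int]] = None) -> Set[int]:
    """Return the set of adders that receive any input in a data flow."""
    used = set(mult_route.values())
    if adder_links:
        used |= set(adder_links.values())
    return used


def rerouted_multipliers() -> List[int]:
    """Multipliers whose output goes to a different adder in each flow."""
    return [m for m in range(N) if MAC_ROUTE[m] != REDUCE_ROUTE[m]]
```

In this toy wiring, `adders_used(MAC_ROUTE)` covers all four adders, `adders_used(REDUCE_ROUTE, REDUCE_ADDER_LINKS)` covers only three, and `rerouted_multipliers()` lists the multipliers whose destination adder changes with the selected mode.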

Claim 7

Original Legal Text

7. The integrated circuit of claim 6 , wherein the first data flow uses each adder of the N adders to receive an input operand from a corresponding multiplier of the N multipliers.

Plain English Translation

The invention relates to an integrated circuit designed for efficient data processing, particularly in systems requiring parallel arithmetic operations. The circuit addresses the challenge of optimizing computational throughput and resource utilization in digital signal processing (DSP) or other high-performance computing applications. The integrated circuit includes a plurality of multipliers and adders arranged to process data flows in parallel, enhancing processing speed and efficiency. The circuit comprises N multipliers and N adders, where N is an integer greater than one. Each multiplier generates a product output, and each adder receives an input operand from a corresponding multiplier. The first data flow involves routing the output of each multiplier directly to an associated adder, enabling simultaneous arithmetic operations. This configuration minimizes latency and maximizes throughput by leveraging parallel processing. The circuit may also include additional data flows or control logic to manage data routing, ensuring flexibility in handling different computational tasks. The design is particularly useful in applications such as digital filters, matrix operations, or other computations requiring high-speed arithmetic processing. The parallel structure reduces the need for sequential operations, improving overall system performance.

Claim 8

Original Legal Text

8. The integrated circuit of claim 6 , wherein the second data flow uses the N multipliers and N−1 adders, wherein one of the N adders is not used.

Plain English Translation

This claim narrows the integrated circuit of claim 6, whose data architecture has N multipliers and N adders and can be configured by instructions into one of two selectable data flows. The first data flow (multiply-accumulate mode) uses all N multipliers and all N adders. Claim 8 specifies that the second data flow (multiply-reduce mode) uses the same N multipliers but only N−1 of the adders, leaving one adder unused. The two data flows are alternative configurations of the same hardware chosen by instruction; the claim does not describe them as running at the same time. Sharing the multipliers and most of the adders between the modes avoids dedicating separate hardware to each operation. Circuits of this kind are relevant to digital signal processing and machine learning accelerators.

Claim 9

Original Legal Text

9. The integrated circuit of claim 8 , wherein the second data flow uses the N−1 adders to receive input resulting from the N multipliers.

Plain English Translation

This claim narrows claim 8. In the integrated circuit of claim 6, instructions select between a first data flow (multiply-accumulate mode), which uses all N multipliers and all N adders, and a second data flow (multiply-reduce mode). Under claim 8 the second data flow uses the N multipliers and N−1 adders, with one adder unused. Claim 9 adds that, in the second data flow, the N−1 adders receive input resulting from the N multipliers, so the active adders combine the multiplier outputs into a reduced result. The claim does not recite a particular adder arrangement, such as a specific number of stages. Reductions of this kind are used in dot products and matrix operations, which are common in digital signal processing and machine learning accelerators.

Claim 10

Original Legal Text

10. A non-transitory computer-readable storage medium that stores a set of instructions that is executable by at least one processor of a device to cause the device to perform a method for specifying functionalities to be performed on a data architecture including N adders and N multipliers configured to receive operands, the method comprising: receiving instructions for the data architecture to operate in one of a multiply-reduce mode or a multiply-accumulate mode, wherein the N multipliers and at least some of the N adders of the data architecture are used both in the multiply-reduce mode and the multiply-accumulate mode; and selecting, based on the instructions, a data flow between the N multipliers and the at least some of the N adders of the data architecture, wherein the N multipliers includes a first multiplier of which output data is provided to a first adder among the at least some of the N adders in the multiply-reduce mode and to a second adder among the at least some of the N adders in the multiply-accumulate mode.

Plain English Translation

The invention relates to a configurable data architecture for performing multiply-reduce and multiply-accumulate operations using a shared set of N multipliers and N adders. The architecture dynamically reconfigures the data flow between these components based on the selected mode of operation. In multiply-reduce mode, the output of a first multiplier is directed to a first adder, while in multiply-accumulate mode, the same multiplier's output is routed to a second adder. The system receives instructions specifying the desired mode and adjusts the data flow accordingly, ensuring that the same hardware components are reused efficiently in both modes. This approach optimizes resource utilization by avoiding the need for separate dedicated hardware for each operation type, reducing circuit complexity and power consumption. The invention is implemented via executable instructions stored on a non-transitory computer-readable medium, which configure a processor to control the data architecture's operation. The solution addresses the challenge of efficiently performing different arithmetic operations in hardware with minimal redundancy, improving performance and energy efficiency in computing systems.

Claim 11

Original Legal Text

11. The non-transitory computer-readable storage medium of claim 10 , wherein selecting the data flow includes, in response to receiving instructions corresponding to the multiply-reduce mode, selecting a first data flow using the N multipliers and N−1 adders, wherein one of the N adders is not used.

Plain English Translation

This claim narrows the storage medium of claim 10, which stores instructions for configuring a data architecture with N multipliers and N adders to operate in either a multiply-reduce mode or a multiply-accumulate mode. Claim 11 specifies that, when the received instructions correspond to the multiply-reduce mode, the selected first data flow uses all N multipliers and N−1 of the N adders, so one of the N adders is not used. In this mode the multiplier outputs are combined by the active adders into a reduced result, as in a dot product. The claim does not assign any function to the unused adder. The stored instructions control this selection, so the same hardware can be reconfigured for the other mode rather than duplicated.

Claim 12

Original Legal Text

12. The non-transitory computer-readable storage medium of claim 11 , wherein the first data flow comprises the N−1 adders receiving input resulting from the N multipliers.

Plain English Translation

This claim narrows claim 11. The storage medium of claim 10 stores instructions that configure a data architecture with N multipliers and N adders for either a multiply-reduce mode or a multiply-accumulate mode, and under claim 11 the multiply-reduce mode selects a first data flow using the N multipliers and N−1 adders. Claim 12 adds that, in this first data flow, the N−1 adders receive input resulting from the N multipliers. Here N is the number of multipliers and adders in the data architecture. The claim is limited to where the active adders get their inputs in the multiply-reduce mode and does not specify how those adders are interconnected. As with the corresponding method claim, the result is that the products from all N multipliers are combined by the active adders, as in a dot product or one element of a matrix multiplication.

Claim 13

Original Legal Text

13. The non-transitory computer-readable storage medium of claim 10 , wherein selecting the data flow includes, in response to receiving instructions corresponding to the multiply-accumulate mode, selecting a second data flow using the N multipliers and the N adders.

Plain English Translation

This claim narrows the storage medium of claim 10, which stores instructions for configuring a data architecture with N multipliers and N adders to operate in either a multiply-reduce mode or a multiply-accumulate mode. Claim 13 specifies that, when the received instructions correspond to the multiply-accumulate mode, the selected second data flow uses all N multipliers and all N adders. This mode multiplies pairs of input data elements and accumulates the results, an operation commonly used in digital signal processing and matrix operations. Because the data flow is selected by instruction, the same multipliers and adders can be switched to the multiply-reduce configuration when needed, which avoids separate hardware for each mode.

Claim 14

Original Legal Text

14. The non-transitory computer-readable storage medium of claim 13 , wherein the second data flow comprises each adder of the N adders receiving an input operand from a corresponding multiplier of the N multipliers.

Plain English Translation

This claim narrows claim 13. The storage medium of claim 10 stores instructions that configure a data architecture with N multipliers and N adders for either a multiply-reduce mode or a multiply-accumulate mode, and under claim 13 the multiply-accumulate mode selects a second data flow using all N multipliers and all N adders. Claim 14 adds that, in this second data flow, each of the N adders receives an input operand from a corresponding multiplier, giving a one-to-one pairing of multipliers and adders. Each adder can then add its multiplier's product to another operand, typically a running total. The claim does not recite pipelining, registers, or how the adder's other operand is supplied. The pairing suits workloads such as matrix operations and convolutional neural networks, where many independent sums are updated in parallel.

Patent Metadata

Filing Date

Unknown

Publication Date

June 9, 2020

Inventors

Liang HAN
Xiaowei JIANG

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “PROGRAMMABLE MULTIPLY-ADD ARRAY HARDWARE” (10678507). https://patentable.app/patents/10678507
