Embodiments of the present disclosure include techniques for storing and retrieving data. In one embodiment, sub-matrices of data are stored as row slices and column slices. A fetch circuit determines whether particular slices of one sub-matrix, when combined with corresponding slices of another sub-matrix, produce a zero result and therefore need not be retrieved. In another embodiment, the present disclosure includes a memory circuit comprising memory banks and sub-banks. The sub-banks store slices of sub-matrices. A request moves between serially configured memory banks, and slices in different sub-banks may be retrieved at the same time.
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
2. The circuit of claim 1, wherein the fetch circuit determines row slices of the first sub-matrix of data that produce a non-zero result when multiplied by a plurality of corresponding column slices of the second sub-matrix of data while row slices of a third sub-matrix of data that produce a non-zero result when multiplied by a plurality of corresponding column slices of the fourth sub-matrix of data are being retrieved.
This claim describes a circuit for matrix multiplication that fetches only the row and column slices of sub-matrices that contribute to non-zero results. The fetch circuit determines which row slices of a first sub-matrix produce a non-zero result when multiplied by the corresponding column slices of a second sub-matrix. It makes that determination while row slices of a third sub-matrix, which produce non-zero results against column slices of a fourth sub-matrix, are being retrieved. Overlapping the determination for one pair of sub-matrices with the retrieval for another hides the cost of the zero check, and skipping slices that cannot affect the output saves memory bandwidth, computation, and power.
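The overlap in claim 2 can be pictured as a two-stage pipeline. The sketch below is only an illustration: the function name `pipeline_schedule` and the idea of numbering sub-matrix pairs 0, 1, 2, ... are assumptions made here, and the claim does not specify timing in discrete steps.

```python
def pipeline_schedule(num_pairs):
    """Return one (determine, retrieve) tuple per step; None means the stage is idle.

    Hypothetical model: at each step the fetch circuit determines the needed
    row slices for one sub-matrix pair while the previously analyzed pair
    is being retrieved from memory.
    """
    steps = []
    for t in range(num_pairs + 1):
        determine = t if t < num_pairs else None   # pair being analyzed
        retrieve = t - 1 if t >= 1 else None       # pair analyzed one step earlier
        steps.append((determine, retrieve))
    return steps

for determine, retrieve in pipeline_schedule(3):
    print("determine:", determine, "retrieve:", retrieve)
```

In every step except the first and last, both stages are busy, which is the situation the claim describes.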
3. The circuit of claim 2, wherein the first sub-matrix of data and the third sub-matrix of data are from a first matrix of data, and wherein the second sub-matrix of data and the fourth sub-matrix of data are from a second matrix of data.
This claim narrows claim 2 by specifying where the four sub-matrices come from. The first and third sub-matrices are portions of a first matrix of data, and the second and fourth sub-matrices are portions of a second matrix of data. The overlapped determination and retrieval of claim 2 therefore operates on different sub-matrix pairs drawn from the same two operand matrices, as happens, for example, when a large matrix product is computed one pair of sub-matrices at a time. This is useful in workloads that rely on large matrix computations, such as machine learning, signal processing, and scientific computing.
4. The circuit of claim 1, wherein the at least one memory circuit stores a first mask corresponding to the first sub-matrix, wherein the first mask specifies row slices having at least one non-zero value.
The invention relates to a memory circuit system designed to efficiently store and process sparse matrices, which are matrices containing a high proportion of zero values. The primary problem addressed is the inefficient use of memory and computational resources when storing and operating on sparse matrices in conventional systems, where zero values are stored and processed unnecessarily. The system includes a memory circuit that stores a sparse matrix divided into multiple sub-matrices, each containing a subset of the matrix data. The memory circuit stores a mask for each sub-matrix, where the mask identifies which row slices within the sub-matrix contain at least one non-zero value. This allows the system to skip processing rows that are entirely zero, reducing memory access and computational overhead. The mask is used to selectively retrieve only the relevant row slices, improving efficiency in operations such as matrix multiplication or other linear algebra computations. The system may also include a processing circuit that uses the stored masks to determine which row slices to process, further optimizing performance by avoiding unnecessary computations on zero-valued rows. The memory circuit may be configured to dynamically update the masks as the matrix data changes, ensuring the system remains efficient even when the sparsity pattern of the matrix evolves over time. This approach is particularly useful in applications like machine learning, signal processing, and scientific computing, where sparse matrices are common and efficient storage and processing are critical.
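A minimal sketch of such a mask follows. It assumes each row slice is one full row of the sub-matrix and that bit i of an integer stands for row slice i; the claim fixes only the meaning of the mask (row slices having at least one non-zero value), not its encoding, and the name `row_slice_mask` is hypothetical.

```python
def row_slice_mask(sub_matrix):
    """Return an int whose bit i is 1 when row slice i holds any non-zero value."""
    mask = 0
    for i, row_slice in enumerate(sub_matrix):
        if any(value != 0 for value in row_slice):
            mask |= 1 << i
    return mask

sub_matrix = [
    [0, 0, 0, 0],   # row slice 0: all zeros
    [0, 5, 0, 0],   # row slice 1: has a non-zero value
    [0, 0, 0, 0],   # row slice 2: all zeros
    [1, 0, 0, 2],   # row slice 3: has non-zero values
]
mask = row_slice_mask(sub_matrix)
print(format(mask, "04b"))  # rightmost character is row slice 0
```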
5. The circuit of claim 4, wherein the fetch circuit eliminates row slices having all zero values from being retrieved based on the first mask.
This claim narrows claim 4 by stating how the first mask is used. The mask marks which row slices of the first sub-matrix contain at least one non-zero value, so any row slice that is not marked holds only zeros. The fetch circuit consults the mask and eliminates the all-zero row slices from the set to be retrieved. Because an all-zero row slice contributes nothing to a product, skipping it leaves the result unchanged while reducing memory traffic and the power spent on data transfer.
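The mask-based elimination can be sketched as follows, under the same assumptions as before: an integer mask with one bit per row slice, and a Python list standing in for the memory circuit. The function name `fetch_nonzero_row_slices` is hypothetical.

```python
def fetch_nonzero_row_slices(memory, mask):
    """Read only the row slices whose mask bit is set; return (index, slice) pairs."""
    fetched = []
    for i in range(len(memory)):
        if (mask >> i) & 1:                  # bit set: slice has a non-zero value
            fetched.append((i, memory[i]))   # the only point where memory is read
    return fetched

memory = [[0, 0], [3, 0], [0, 0], [0, 7]]
mask = 0b1010  # row slices 1 and 3 contain non-zero values
print(fetch_nonzero_row_slices(memory, mask))
```

Row slices 0 and 2 are never read, which is where the savings in memory traffic come from.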
6. The circuit of claim 1, wherein the fetch circuit analyzes the first sub-matrix in said at least one memory to determine the row slices that produce non-zero results.
A system for efficient data processing in memory-centric computing architectures addresses the challenge of reducing energy consumption and latency in matrix operations, particularly in sparse matrix computations. The system includes a memory array organized into multiple sub-matrices, each containing data elements arranged in rows and columns. A fetch circuit is configured to analyze a first sub-matrix to identify row slices that produce non-zero results when processed. The fetch circuit selectively retrieves only these relevant row slices, avoiding unnecessary data transfers and computations. This selective fetching reduces power consumption and improves processing efficiency by minimizing access to irrelevant data. The system may also include a processing circuit that operates on the fetched row slices to perform computations, such as matrix multiplications or other linear algebra operations. The overall architecture optimizes memory bandwidth usage and computational resources by focusing on active data elements, making it suitable for applications in machine learning, signal processing, and scientific computing where sparse matrices are common. The selective row slice analysis and retrieval mechanism enhances performance in systems where memory access is a bottleneck.
7. The circuit of claim 6, wherein the fetch circuit determines, for a plurality of row slices, whether a particular row slice produces a zero or non-zero result when multiplied by a plurality of corresponding column slices.
The invention relates to a circuit for efficient matrix multiplication, specifically addressing the challenge of reducing computational overhead in sparse matrix operations. The circuit includes a fetch circuit that evaluates multiple row slices of a matrix to determine whether multiplying each row slice by corresponding column slices yields a zero or non-zero result. This pre-evaluation step allows the circuit to skip unnecessary multiplications when a row slice is known to produce a zero result, thereby optimizing performance by avoiding redundant calculations. The fetch circuit operates by analyzing the row slices in parallel, enabling rapid identification of non-zero contributions. The circuit further includes a multiplier array that performs the actual multiplications only for the row slices that produce non-zero results, reducing power consumption and processing time. The invention is particularly useful in applications involving large-scale sparse matrices, such as machine learning, signal processing, and scientific computing, where efficiency is critical. By selectively processing only relevant row slices, the circuit minimizes computational waste while maintaining accuracy.
8. The circuit of claim 6, wherein the fetch circuit receives a bit mask comprising 1 bit per row slice of the first sub-matrix, wherein particular row slices of the first sub-matrix having a first bit mask value indicating the particular row slices comprise all zeros are eliminated from the determined row slices to be retrieved.
This invention relates to a circuit for efficiently retrieving data from a memory matrix, specifically addressing the problem of reducing unnecessary data access in memory operations. The circuit includes a fetch circuit that retrieves data from a first sub-matrix of a memory matrix, where the sub-matrix is divided into multiple row slices. Each row slice contains data elements that may or may not be relevant for a given operation. To optimize performance, the fetch circuit receives a bit mask, where each bit in the mask corresponds to a row slice in the sub-matrix. The bit mask identifies which row slices contain all zeros, allowing the circuit to skip these slices during data retrieval. By eliminating row slices with all-zero values, the circuit reduces the amount of data transferred and processed, improving efficiency in memory operations. This selective retrieval mechanism is particularly useful in applications where memory bandwidth and power consumption are critical, such as in high-performance computing or embedded systems. The circuit ensures that only relevant data is fetched, minimizing unnecessary access to memory and enhancing overall system performance.
9. The circuit of claim 8, further wherein, after eliminating row slices comprising all zeros, the fetch circuit determines first row slices of the first sub-matrix of data that produce a zero result when multiplied by a plurality of corresponding column slices of the second sub-matrix of data to eliminate the first row slices from the determined row slices to be retrieved.
This invention relates to optimizing matrix multiplication operations in computing systems, particularly for reducing computational overhead when processing sparse matrices. The problem addressed is the inefficiency in conventional matrix multiplication where unnecessary computations are performed on zero-valued elements, leading to wasted processing cycles and energy. The system includes a circuit configured to handle matrix multiplication by selectively fetching only relevant portions of the matrices to minimize redundant operations. The circuit first processes a first sub-matrix of data and a second sub-matrix of data, where the first sub-matrix is divided into row slices and the second sub-matrix is divided into column slices. The circuit eliminates row slices from the first sub-matrix that contain all zeros, as these will not contribute to the final result when multiplied by any column slices of the second sub-matrix. After this initial filtering, the circuit further refines the selection by identifying row slices in the first sub-matrix that, when multiplied by corresponding column slices in the second sub-matrix, produce a zero result. These row slices are also eliminated from the set of row slices to be retrieved and processed. This two-stage elimination process ensures that only the most relevant row slices are fetched and multiplied, significantly reducing computational overhead and improving efficiency in sparse matrix operations. The approach is particularly useful in applications such as machine learning, scientific computing, and signal processing where sparse matrices are common.
10. The circuit of claim 8, further wherein the fetch circuit logically ANDs values of remaining non-all zero row slices of the first sub-matrix with corresponding values of the column slices of the second sub-matrix to produce a plurality of results and logically ORs the plurality of results to eliminate a plurality of non-all zero row slices producing a zero result from the determined row slices to be retrieved.
This invention relates to a circuit for efficiently retrieving row slices from a matrix by eliminating non-relevant rows. The problem addressed is the computational overhead in matrix operations where certain rows do not contribute to the final result, requiring unnecessary processing. The circuit includes a fetch circuit that processes two sub-matrices, a first sub-matrix containing row slices and a second sub-matrix containing column slices. The fetch circuit first identifies all-zero row slices in the first sub-matrix and excludes them from further processing. For the remaining non-all-zero row slices, the circuit performs a logical AND operation between each row slice and the corresponding column slices from the second sub-matrix, generating multiple results. These results are then combined using a logical OR operation. This step eliminates any non-all-zero row slices that produce a zero result, reducing the number of row slices that need to be retrieved. The remaining row slices are then output as the final result. This approach optimizes matrix operations by minimizing unnecessary computations and improving efficiency in data retrieval.
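One way to read the AND/OR step is as a test on non-zero patterns rather than on the values themselves. The sketch below assumes that reading: each slice is reduced to a bit pattern (bit k set when element k is non-zero), each remaining row slice pattern is ANDed with every column slice pattern, and the results are ORed together. The function names are hypothetical, and the claim does not state that patterns are packed into integers.

```python
def nonzero_bits(values):
    """Pack a slice into an int: bit k is 1 when values[k] is non-zero."""
    bits = 0
    for k, v in enumerate(values):
        if v != 0:
            bits |= 1 << k
    return bits

def row_slices_to_retrieve(a_rows, b_cols):
    """Return indices of row slices that can produce a non-zero result."""
    col_bits = [nonzero_bits(col) for col in b_cols]
    keep = []
    for i, row in enumerate(a_rows):
        row_bits = nonzero_bits(row)
        if row_bits == 0:
            continue                      # stage 1: all-zero row slice
        combined = 0
        for cb in col_bits:
            combined |= row_bits & cb     # stage 2: AND per column slice, then OR
        if combined != 0:
            keep.append(i)
    return keep

a_rows = [
    [0, 0, 0],   # all zeros: eliminated in stage 1
    [4, 0, 0],   # non-zero only at position 0
    [0, 0, 9],   # non-zero only at position 2
]
b_cols = [
    [0, 1, 0],   # column slice 0: non-zero only at position 1
    [0, 0, 2],   # column slice 1: non-zero only at position 2
]
print(row_slices_to_retrieve(a_rows, b_cols))
```

Row slice 1 is not all zeros, but its only non-zero element never lines up with a non-zero element of any column slice, so it is eliminated in the second stage. A pattern test of this kind is conservative: it never discards a row slice with a non-zero product, though it may keep one whose non-zero terms happen to cancel numerically.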
11. The circuit of claim 1, wherein the first sub-matrix is stored in row major order and the second sub-matrix is stored in column major order.
This invention relates to memory storage and access optimization for matrix computations, particularly in systems where matrices are divided into sub-matrices for efficient processing. The problem addressed is the inefficiency in memory access patterns when processing large matrices, which can lead to increased latency and reduced computational performance due to suboptimal data locality. The invention involves a circuit configured to store and process matrices divided into at least two sub-matrices. The first sub-matrix is stored in row-major order, where elements are stored sequentially by rows, optimizing access for row-wise operations. The second sub-matrix is stored in column-major order, where elements are stored sequentially by columns, optimizing access for column-wise operations. This hybrid storage approach allows the circuit to efficiently handle both row and column operations without requiring costly data reorganization during computation. The circuit may include memory modules, processing units, and control logic to manage the storage and retrieval of sub-matrices in their respective orders. The invention improves performance in applications such as linear algebra, machine learning, and scientific computing by reducing memory access bottlenecks and enhancing data locality.
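The effect of the two storage orders can be seen with flat lists standing in for memory. In this sketch, which uses hypothetical helper names and whole rows and columns as slices, a row slice of the row-major matrix and a column slice of the column-major matrix are each one contiguous run of addresses.

```python
def flatten_row_major(matrix):
    """Store elements row by row."""
    return [v for row in matrix for v in row]

def flatten_col_major(matrix):
    """Store elements column by column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(n_cols) for r in range(n_rows)]

def row_slice(flat, n_cols, i):
    """Row i of a row-major matrix is one contiguous run."""
    return flat[i * n_cols:(i + 1) * n_cols]

def col_slice(flat, n_rows, j):
    """Column j of a column-major matrix is one contiguous run."""
    return flat[j * n_rows:(j + 1) * n_rows]

a = [[1, 2, 3], [4, 5, 6]]         # 2 x 3, stored row major
b = [[7, 8], [9, 10], [11, 12]]    # 3 x 2, stored column major
a_flat = flatten_row_major(a)
b_flat = flatten_col_major(b)
print(row_slice(a_flat, 3, 1))
print(col_slice(b_flat, 3, 0))
```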
12. The circuit of claim 1, wherein the first sub-matrix is a portion of a first matrix stored in row major order and the second sub-matrix is a portion of a second matrix stored in column major order.
This invention relates to a circuit for efficiently performing matrix operations, particularly matrix multiplication, by leveraging specific memory storage formats to optimize data access patterns. The problem addressed is the inefficiency in traditional matrix multiplication circuits where data access patterns do not align with memory storage layouts, leading to increased latency and reduced computational throughput. The circuit includes a first sub-matrix and a second sub-matrix, where the first sub-matrix is a portion of a first matrix stored in row-major order and the second sub-matrix is a portion of a second matrix stored in column-major order. This arrangement allows the circuit to exploit the natural alignment of data in memory, reducing the need for costly data reordering or transposition operations. The circuit further includes processing elements configured to perform element-wise multiplication and accumulation operations between corresponding elements of the first and second sub-matrices, enabling efficient computation of the resulting matrix product. By storing the first matrix in row-major order and the second matrix in column-major order, the circuit ensures that the data required for each multiplication operation is accessed sequentially, minimizing memory access latency. The processing elements are arranged to handle these operations in parallel, further enhancing computational efficiency. This approach is particularly beneficial in applications requiring real-time or high-throughput matrix computations, such as machine learning, signal processing, and scientific computing.
13. The circuit of claim 1, wherein the fetch circuit generates at least one data structure specifying said row slices that produce a non-zero result when multiplied by corresponding column slices.
A system for optimizing matrix multiplication operations in computing hardware addresses the inefficiency of processing large matrices by identifying and skipping unnecessary computations. The system includes a fetch circuit that retrieves matrix data and a processing circuit that performs multiplication operations. The fetch circuit generates a data structure that specifies which row and column slices of the matrices will produce non-zero results when multiplied. This allows the processing circuit to skip multiplying row and column slices that would yield zero results, reducing computational overhead and improving efficiency. The data structure may include indices or identifiers for the relevant row and column slices, enabling the processing circuit to focus only on the necessary computations. By avoiding redundant operations, the system enhances performance in applications requiring frequent matrix multiplications, such as machine learning and scientific computing. The invention improves processing speed and energy efficiency by dynamically identifying and excluding zero-result computations.
14. The circuit of claim 13, wherein the fetch circuit generates a first data structure specifying addresses of a plurality of sub-matrices and a mask specifying the location of said row slices that produce a non-zero result when multiplied by corresponding column slices across within the plurality of sub-matrices.
This invention relates to a circuit for efficient sparse matrix multiplication, addressing the computational inefficiency of traditional dense matrix multiplication when applied to sparse matrices. The circuit includes a fetch circuit that generates a first data structure containing addresses of multiple sub-matrices and a mask indicating the positions of row slices that yield non-zero results when multiplied by corresponding column slices within those sub-matrices. This approach optimizes sparse matrix operations by selectively processing only the relevant sub-matrices and row/column slices, reducing unnecessary computations. The circuit further includes a multiplier array that performs the multiplication of the identified row and column slices, and an accumulator that sums the partial products to produce the final result. The fetch circuit may also generate a second data structure specifying the addresses of the column slices and a mask indicating the positions of the column slices that produce non-zero results when multiplied by the corresponding row slices. This dual-masking approach further enhances efficiency by ensuring only the most relevant computations are performed. The circuit is particularly useful in applications requiring high-performance sparse matrix operations, such as machine learning and scientific computing.
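The first data structure of claim 14 could be modeled roughly as below. Everything concrete here is an assumption made for illustration: the class name `FetchDescriptor`, the field names, the base addresses, the fixed number of row slices per sub-matrix, and the bit layout of the mask. The claim states only that the structure specifies sub-matrix addresses and a mask locating the needed row slices.

```python
from dataclasses import dataclass

@dataclass
class FetchDescriptor:
    sub_matrix_addresses: list   # hypothetical base address of each sub-matrix
    row_slice_mask: int          # bit (s * slices_per_sub + i): row slice i of sub-matrix s

def build_descriptor(addresses, needed, slices_per_sub):
    """needed[s] lists the row-slice indices of sub-matrix s with non-zero results."""
    mask = 0
    for s, indices in enumerate(needed):
        for i in indices:
            mask |= 1 << (s * slices_per_sub + i)
    return FetchDescriptor(list(addresses), mask)

def slice_addresses(desc, slices_per_sub, slice_bytes):
    """Expand the descriptor into the memory addresses that must be read."""
    out = []
    for s, base in enumerate(desc.sub_matrix_addresses):
        for i in range(slices_per_sub):
            if (desc.row_slice_mask >> (s * slices_per_sub + i)) & 1:
                out.append(base + i * slice_bytes)
    return out

desc = build_descriptor([0x1000, 0x2000], [[1, 3], [0]], slices_per_sub=4)
print([hex(a) for a in slice_addresses(desc, slices_per_sub=4, slice_bytes=16)])
```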
15. The circuit of claim 14, wherein the fetch circuit generates a second data structure storing retrieved row slices and a mask specifying the location of said row slices within the plurality of sub-matrices.
The invention relates to a memory access circuit designed to efficiently retrieve data from a memory array organized into multiple sub-matrices. The problem addressed is the inefficiency in accessing data stored in a distributed manner across these sub-matrices, particularly when only specific portions (row slices) of the data are needed. Traditional methods often retrieve entire rows or sub-matrices, leading to unnecessary data transfers and increased latency. The circuit includes a fetch circuit that retrieves only the required row slices from the sub-matrices. To manage this, the fetch circuit generates a second data structure that stores the retrieved row slices. Additionally, it produces a mask that specifies the exact locations of these row slices within the sub-matrices. This allows downstream components to accurately identify where the retrieved data belongs in the original memory structure. The mask ensures proper alignment and reconstruction of the data, enabling efficient processing without redundant data transfers. The circuit optimizes memory access by minimizing the amount of data moved and reducing latency, particularly in systems where only partial row data is needed. This approach is beneficial in applications requiring selective data retrieval, such as sparse matrix operations or partial data processing in memory-intensive systems.
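A rough sketch of the second data structure follows, simplified to a single sub-matrix with whole rows as slices. The names `pack` and `unpack` are hypothetical. The retrieved row slices are stored densely, and the mask records where each one belongs.

```python
def pack(sub_matrix, mask):
    """Keep only the row slices whose mask bit is set, in index order."""
    return [row for i, row in enumerate(sub_matrix) if (mask >> i) & 1]

def unpack(packed, mask, n_slices, width):
    """Place packed row slices back at their masked positions; fill the rest with zeros."""
    remaining = iter(packed)
    out = []
    for i in range(n_slices):
        if (mask >> i) & 1:
            out.append(list(next(remaining)))
        else:
            out.append([0] * width)
    return out

sub_matrix = [[0, 0], [3, 0], [0, 0], [0, 7]]
mask = 0b1010
packed = pack(sub_matrix, mask)
print(packed)
print(unpack(packed, mask, n_slices=4, width=2))
```

Filling skipped positions with zeros is a choice made for this illustration. A skipped row slice either was all zeros or produced a zero result against the column slices, so treating it as zero does not change the product.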
16. The circuit of claim 1, wherein the at least one memory circuit is a static random access memory.
This claim narrows claim 1 by specifying that the memory circuit storing the sub-matrices is a static random access memory (SRAM). SRAM retains data as long as power is supplied, unlike dynamic RAM (DRAM), which requires periodic refreshing. SRAM typically offers lower access latency than DRAM at a higher cost per bit, which is why it is commonly used for caches and other buffers placed close to compute logic. Using SRAM here gives the fetch circuit low-latency access to the row and column slices it selects.
17. The circuit of claim 1, wherein the determined row slices retrieved from the at least one memory circuit are loaded into a multiplier circuit.
This claim narrows claim 1 by stating where the retrieved data goes. The row slices that the fetch circuit has determined to produce non-zero results are retrieved from the memory circuit and loaded into a multiplier circuit, which multiplies them by the corresponding column slices. Because slices that would yield a zero result are never retrieved, the multiplier receives only operands that can contribute to the output. This reduces both data movement and wasted multiply operations in applications such as matrix computations, neural networks, and signal processing.
18. The circuit of claim 1, wherein the fetch circuit determines column slices of the second sub-matrix of data that produce a non-zero result when multiplied by a plurality of corresponding row slices of the first sub-matrix of data, and the determined column slices are retrieved from the at least one memory circuit.
The invention relates to a circuit for efficient matrix multiplication, particularly for identifying and retrieving specific column slices of a second sub-matrix that produce non-zero results when multiplied by corresponding row slices of a first sub-matrix. This addresses the computational inefficiency in traditional matrix multiplication where all elements are processed, even when many result in zero, wasting resources. The circuit includes a fetch circuit that identifies column slices of the second sub-matrix that, when multiplied by corresponding row slices of the first sub-matrix, yield non-zero results. Only these relevant column slices are then retrieved from memory, reducing unnecessary data access and computation. This selective retrieval optimizes performance by avoiding the processing of zero-resulting multiplications, which is particularly useful in sparse matrix operations or applications requiring real-time processing. The circuit may also include a memory circuit storing the matrices and a multiplier circuit for performing the actual multiplication. The fetch circuit dynamically determines which column slices are needed based on the row slices of the first sub-matrix, ensuring only the necessary data is fetched. This approach minimizes memory bandwidth usage and computational overhead, improving efficiency in systems handling large-scale matrix operations, such as machine learning or signal processing.
April 26, 2022
June 11, 2024