A system comprises a processor coupled to a plurality of memory units. Each of the plurality of memory units includes a request processing unit and a plurality of memory banks. Each request processing unit includes a plurality of decomposition units and a crossbar switch, the crossbar switch communicatively connecting each of the plurality of decomposition units to each of the plurality of memory banks. The processor includes a plurality of processing elements and a communication network communicatively connecting the plurality of processing elements to the plurality of memory units. At least a first processing element of the plurality of processing elements includes a control logic unit and a matrix compute engine. The control logic unit is configured to access the plurality of memory units using a dynamically programmable distribution scheme.
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
2. The system of claim 1, wherein the first request processing unit of the first memory unit is configured to receive a broadcasted memory request.
A system for managing memory requests in a distributed computing environment addresses the challenge of efficiently handling and processing memory access requests across multiple nodes. The system includes a first memory unit with a first request processing unit that receives broadcasted memory requests from other nodes in the system. These requests may involve read or write operations to shared memory locations. The first request processing unit processes these requests by determining whether the requested memory location is available or requires synchronization with other nodes. If synchronization is needed, the system ensures data consistency by coordinating with other nodes before granting access. The system also includes a second memory unit with a second request processing unit that handles similar requests, allowing for parallel processing of memory operations. The system may further include a request routing unit that directs incoming memory requests to the appropriate request processing unit based on the target memory location or request type. This distributed approach improves scalability and reduces latency in memory access operations, particularly in multi-node computing environments where shared memory is accessed concurrently by multiple processors. The system ensures efficient handling of broadcasted requests, minimizing conflicts and maintaining data integrity across the distributed memory architecture.
3. The system of claim 2, wherein the broadcasted memory request references data stored in each of the plurality of memory units.
The system involves a distributed memory architecture designed to improve data access efficiency in computing systems. The problem addressed is the latency and bandwidth limitations in traditional memory systems, particularly in multi-core or distributed computing environments where multiple processing units compete for access to shared memory resources. The system includes a plurality of memory units interconnected to form a distributed memory network. Each memory unit stores data and is capable of processing memory requests independently or in coordination with other units. The system broadcasts memory requests across the network, allowing multiple memory units to respond simultaneously, thereby reducing access latency and improving throughput. The broadcasted memory request references data stored in each of the plurality of memory units, enabling parallel data retrieval or processing. This approach leverages the collective storage and processing capabilities of the distributed memory units to enhance system performance, particularly in applications requiring high-speed data access or parallel computation. The system may also include mechanisms to manage request conflicts, prioritize requests, or optimize data distribution across the memory units to further improve efficiency.
4. The system of claim 1, wherein the dynamically programmable distribution scheme utilizes an identifier associated with a workload of the first processing element.
A system for managing workload distribution in a multi-processing environment addresses inefficiencies in task allocation, where static or inflexible distribution schemes lead to resource underutilization or bottlenecks. The system dynamically adjusts workload distribution among processing elements based on real-time conditions, optimizing performance and resource utilization. A key feature is the use of a dynamically programmable distribution scheme, which adapts to varying workload demands by leveraging identifiers associated with individual workloads. These identifiers enable the system to classify, prioritize, or route tasks to the most suitable processing elements, ensuring balanced and efficient execution. The system may also include mechanisms for monitoring workload characteristics, such as computational intensity or memory requirements, to further refine distribution decisions. By dynamically programming the distribution scheme, the system avoids the limitations of fixed allocation strategies, improving scalability and adaptability in diverse processing environments. This approach is particularly useful in high-performance computing, cloud computing, or distributed systems where workloads exhibit dynamic behavior.
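One way to picture an identifier-driven distribution scheme is as a lookup whose unit order depends on the workload identifier. The sketch below is hypothetical: the rotation rule, the 64-byte chunk size, and the names `unit_order` and `unit_for_address` are assumptions made for illustration. Only the four unit names come from the claims (claim 9).

```python
from typing import List

# Unit names follow claim 9; everything else here is an illustrative assumption.
MEMORY_UNITS: List[str] = ["north", "east", "south", "west"]
CHUNK_BYTES = 64  # assumed granularity, not taken from the claims


def unit_order(workload_id: int) -> List[str]:
    """Rotate the unit list by the workload identifier (hypothetical rule)."""
    shift = workload_id % len(MEMORY_UNITS)
    return MEMORY_UNITS[shift:] + MEMORY_UNITS[:shift]


def unit_for_address(address: int, workload_id: int) -> str:
    """Pick the memory unit holding the chunk that contains `address`."""
    order = unit_order(workload_id)
    return order[(address // CHUNK_BYTES) % len(order)]


if __name__ == "__main__":
    # Two workloads walk the same addresses but start on different units.
    for wid in (0, 1):
        print(wid, [unit_for_address(a, wid) for a in range(0, 256, CHUNK_BYTES)])
```

Under this assumed rule, processing elements that share an identifier resolve a given address to the same memory unit, while elements with different identifiers spread the same addresses differently.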
5. The system of claim 1, wherein the first request processing unit is configured to determine whether each of the plurality of partial requests corresponds to corresponding data stored in a corresponding one of the first plurality of memory banks associated with the corresponding request processing unit.
This invention relates to a distributed memory system for processing multiple partial requests in parallel. The system addresses the challenge of efficiently managing data access in high-performance computing environments where multiple processing units need to access data stored across distributed memory banks. The system includes a plurality of request processing units, each associated with a subset of memory banks. Each request processing unit is configured to determine whether a partial request corresponds to data stored in its associated memory banks. If the data is present, the request processing unit processes the request locally. If not, the request is forwarded to another processing unit that has access to the required memory banks. This approach reduces latency by minimizing inter-unit communication and ensuring that data access is handled by the most appropriate processing unit. The system optimizes performance by leveraging parallel processing and distributed memory architecture, making it suitable for applications requiring high-speed data retrieval and processing.
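The ownership check described above can be sketched as follows. This is a minimal illustration, not the patented logic: it assumes a request is split at access-unit boundaries and that chunks are assigned to memory units round-robin. The names `PartialRequest`, `decompose`, and `owned_by` are hypothetical, and the sketch covers only the check itself, not what happens to partial requests a unit does not own.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PartialRequest:
    sequence_id: int  # position of this piece within the original request
    address: int      # starting byte address of the piece
    length: int       # number of bytes in the piece


def decompose(base: int, size: int, access_unit: int) -> List[PartialRequest]:
    """Split the byte range [base, base + size) at access-unit boundaries."""
    parts: List[PartialRequest] = []
    addr, end, seq = base, base + size, 0
    while addr < end:
        boundary = (addr // access_unit + 1) * access_unit
        length = min(boundary, end) - addr
        parts.append(PartialRequest(seq, addr, length))
        addr += length
        seq += 1
    return parts


def owned_by(part: PartialRequest, unit_index: int,
             num_units: int, access_unit: int) -> bool:
    """Assumed rule: chunk k lives in memory unit k mod num_units."""
    return (part.address // access_unit) % num_units == unit_index


if __name__ == "__main__":
    parts = decompose(base=0, size=256, access_unit=64)
    mine = [p for p in parts if owned_by(p, unit_index=1, num_units=4, access_unit=64)]
    print(mine)
```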
6. The system of claim 1, wherein the first crossbar switch of the first request processing unit is configured to direct a first partial request for data stored in a corresponding one of the first plurality of memory banks to the corresponding memory bank and receive a retrieved data payload from the corresponding memory bank.
A system for processing data requests in a memory architecture involves a first request processing unit with a first crossbar switch. The crossbar switch routes a first partial request for data to a specific memory bank from a set of memory banks. The crossbar switch also receives the retrieved data payload from the addressed memory bank. The system is designed to handle data requests efficiently by distributing them across multiple memory banks, reducing bottlenecks and improving access times. The crossbar switch ensures that each partial request is directed to the correct memory bank, enabling parallel data retrieval and enhancing overall system performance. This architecture is particularly useful in high-performance computing environments where low-latency and high-throughput data access are critical. The system may also include additional request processing units and crossbar switches to further optimize data flow and manage larger datasets. The crossbar switch's ability to dynamically route requests and handle data payloads ensures efficient memory access and minimizes delays in data processing.
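As a rough software analogy, the crossbar can be modelled as an object that maps a unit-local address to a bank and an offset, forwards the partial request to that bank, and hands back the payload. The class name `Crossbar`, the contiguous bank layout, and the rule that a partial request never spans two banks are assumptions of this sketch, not details from the claims.

```python
from typing import List, Tuple


class Crossbar:
    """Toy model: any caller (decomposition unit) can reach any bank."""

    def __init__(self, num_banks: int, bank_size: int) -> None:
        self.bank_size = bank_size
        self.banks: List[bytearray] = [bytearray(bank_size) for _ in range(num_banks)]

    def locate(self, local_address: int) -> Tuple[int, int]:
        """Return (bank index, offset within bank) for a unit-local address."""
        return divmod(local_address, self.bank_size)

    def _check(self, offset: int, length: int) -> None:
        # Assumed restriction: one partial request touches exactly one bank.
        if offset + length > self.bank_size:
            raise ValueError("partial request crosses a bank boundary")

    def write(self, local_address: int, payload: bytes) -> None:
        bank, offset = self.locate(local_address)
        self._check(offset, len(payload))
        self.banks[bank][offset:offset + len(payload)] = payload

    def read(self, local_address: int, length: int) -> bytes:
        """Direct the partial request to its bank and return the payload."""
        bank, offset = self.locate(local_address)
        self._check(offset, length)
        return bytes(self.banks[bank][offset:offset + length])


if __name__ == "__main__":
    xbar = Crossbar(num_banks=4, bank_size=16)
    xbar.write(20, b"abcd")       # lands in bank 1 at offset 4
    print(xbar.read(20, 4))
```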
7. The system of claim 6, wherein the first request processing unit is configured to prepare a partial response using the retrieved data payload and provide the prepared partial response to the first processing element of the plurality of processing elements.
This invention relates to a distributed data processing system designed to improve efficiency in handling large-scale data requests. The system addresses the challenge of processing complex queries that require accessing and combining data from multiple sources, which can lead to bottlenecks and delays in traditional centralized architectures. The system includes a request processing unit that retrieves a data payload from a storage system and prepares a partial response using this data. This partial response is then provided to a processing element within a distributed network of processing elements. The processing element further processes the partial response, potentially combining it with other partial responses from different sources to generate a final response. The system is designed to distribute the workload across multiple processing elements, reducing latency and improving scalability. The request processing unit may also include a request parser to analyze incoming requests and determine the necessary data retrieval and processing steps. The overall architecture ensures that data is efficiently routed and processed in parallel, optimizing performance for high-volume or complex queries.
8. The system of claim 7, wherein the prepared partial response includes a corresponding sequence identifier ordering the partial response among a plurality of partial responses.
The invention relates to a system for processing and transmitting data in a networked environment, particularly for handling large or complex data sets that require segmentation into smaller, manageable portions. The system addresses the challenge of efficiently transmitting and reconstructing segmented data while maintaining accuracy and order. The system includes a data processing module that generates partial responses from a larger data set, ensuring each partial response is uniquely identifiable and ordered within a sequence. A transmission module then sends these partial responses over a network to a receiving system, which reassembles them into the original data structure. The system ensures that each partial response includes a sequence identifier, allowing the receiving system to correctly order the segments, even if they arrive out of sequence. This ordering mechanism prevents data corruption and ensures accurate reconstruction. The system is particularly useful in applications requiring high reliability, such as distributed computing, cloud storage, or real-time data streaming, where maintaining data integrity and sequence is critical. The inclusion of sequence identifiers in each partial response enables robust error handling and recovery, improving overall system performance and reliability.
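The ordering role of the sequence identifier can be shown with a short sketch. It assumes each partial response is a `(sequence_id, payload)` pair and that identifiers run from 0 upward without gaps. The function name `reassemble` and that numbering convention are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def reassemble(partials: Iterable[Tuple[int, bytes]]) -> bytes:
    """Order partial responses by sequence identifier and join their payloads."""
    ordered = sorted(partials, key=lambda item: item[0])
    ids = [seq for seq, _ in ordered]
    # Assumed convention: identifiers are 0, 1, 2, ... with no gaps or repeats.
    if ids != list(range(len(ids))):
        raise ValueError("missing or duplicate sequence identifier")
    return b"".join(payload for _, payload in ordered)


if __name__ == "__main__":
    # Responses may arrive in any order; the identifiers restore the order.
    print(reassemble([(2, b"c"), (0, b"a"), (1, b"b")]))
```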
9. The system of claim 1, wherein the plurality of memory units includes a north memory unit, an east memory unit, a south memory unit, and a west memory unit.
A distributed memory system is designed to improve data access efficiency in multi-directional processing environments. The system addresses challenges in conventional memory architectures where data retrieval delays occur due to centralized storage or limited directional access. The invention provides a decentralized memory structure with multiple memory units arranged in a grid-like configuration to enable simultaneous data access from different directions. Each memory unit is assigned a specific directional identifier, such as north, east, south, and west, to facilitate organized data routing and retrieval. This configuration allows parallel processing tasks to access data from the nearest memory unit, reducing latency and enhancing overall system performance. The directional memory units are interconnected to enable seamless data transfer between adjacent units, ensuring efficient load balancing and fault tolerance. The system dynamically allocates data storage and retrieval tasks based on the direction of incoming requests, optimizing resource utilization and minimizing bottlenecks. This approach is particularly useful in applications requiring high-speed data processing, such as real-time simulations, parallel computing, and distributed systems. The invention improves data access efficiency by leveraging spatial memory organization and directional routing, making it suitable for environments where multi-directional data flow is critical.
10. The system of claim 1, wherein the plurality of processing elements are arranged in a two-dimensional array and the communication network communicatively includes a corresponding two-dimensional communication network connecting the plurality of processing elements.
This invention relates to a parallel processing system designed to enhance computational efficiency by optimizing the arrangement and communication between processing elements. The system addresses the challenge of scalability and inter-element communication latency in traditional parallel processing architectures, which can limit performance in high-performance computing applications. The system comprises a plurality of processing elements organized in a two-dimensional array, where each processing element is responsible for executing a portion of a computational task. A two-dimensional communication network interconnects these processing elements, enabling efficient data exchange between adjacent and non-adjacent elements. The communication network is structured to match the two-dimensional arrangement of the processing elements, ensuring low-latency and high-bandwidth communication paths. This configuration allows for parallel processing of data with reduced communication overhead, improving overall system performance. The two-dimensional array and corresponding communication network facilitate scalable and modular expansion of the system, allowing for the addition of more processing elements without significant degradation in communication efficiency. The system is particularly suited for applications requiring high-throughput parallel processing, such as scientific simulations, machine learning, and real-time data analytics. By optimizing the physical and logical arrangement of processing elements, the invention provides a robust solution for overcoming the limitations of conventional parallel processing architectures.
11. The system of claim 10, wherein each decomposition unit of the plurality of decomposition units is configured to only receive a memory request from and only provide a response to processing elements located in a same row or column of the two-dimensional array.
This invention relates to a distributed memory system for processing elements arranged in a two-dimensional array. The system addresses the challenge of efficiently managing memory access in large-scale parallel computing architectures, where processing elements must communicate with memory units without causing bottlenecks or excessive latency. The system includes multiple decomposition units, each assigned to a specific row or column of the array. Each decomposition unit is configured to handle memory requests exclusively from processing elements located in the same row or column, ensuring localized communication and reducing contention. This spatial restriction prevents interference between different rows or columns, improving scalability and performance. The system also includes a network interface for routing requests between the decomposition units and the processing elements, ensuring that memory access remains efficient and predictable. The decomposition units may further include logic to prioritize or schedule requests, optimizing memory bandwidth usage. This design is particularly useful in high-performance computing, data centers, and other environments requiring low-latency, high-throughput memory access.
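The row-or-column restriction can be expressed as a simple predicate. The sketch below assumes a particular layout that the claims do not spell out: the north and south memory units have one decomposition unit per column of the array, and the east and west memory units have one per row. The function names are hypothetical.

```python
def decomposition_unit_index(memory_unit: str, row: int, col: int) -> int:
    """Return the decomposition unit that serves the element at (row, col).

    Assumed layout: north/south units are indexed by column,
    east/west units are indexed by row.
    """
    if memory_unit in ("north", "south"):
        return col
    if memory_unit in ("east", "west"):
        return row
    raise ValueError(f"unknown memory unit: {memory_unit}")


def may_serve(memory_unit: str, unit_index: int, row: int, col: int) -> bool:
    """True if this decomposition unit shares a row or column with the element."""
    return decomposition_unit_index(memory_unit, row, col) == unit_index


if __name__ == "__main__":
    # The element at row 2, column 5 talks to column 5 of the north unit
    # and to row 2 of the west unit.
    print(may_serve("north", 5, row=2, col=5), may_serve("west", 5, row=2, col=5))
```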
12. The system of claim 4, wherein two or more processing elements of the plurality of processing elements share the identifier.
This claim describes a distributed computing system in which processing elements share identifiers. The identifier is the workload identifier of claim 4, which the dynamically programmable distribution scheme uses. The system addresses the challenge of efficiently managing and coordinating multiple processing elements in a distributed environment, particularly where tasks or data need to be routed or synchronized across different nodes. The system includes a plurality of processing elements, each configured to execute tasks or process data. A key feature is that two or more processing elements can share a common identifier, allowing them to be grouped or treated as a single logical unit. This shared identifier enables efficient task distribution, load balancing, or fault tolerance by allowing the system to route tasks or data to any processing element within the group. The system may also include a controller or management module that assigns identifiers to processing elements and manages their operation. The shared identifier mechanism simplifies coordination between processing elements, reduces overhead in task assignment, and improves scalability by allowing dynamic grouping of elements based on workload or availability. This approach is particularly useful in large-scale distributed systems where flexibility and efficient resource utilization are critical.
13. The system of claim 1, wherein a second processing element of the plurality of processing elements is configured with a different dynamically programmable distribution scheme for accessing memory units than the first processing element.
This invention relates to a distributed processing system with multiple processing elements, each configured to access memory units using dynamically programmable distribution schemes. The system addresses the challenge of efficiently managing memory access in parallel processing environments, where different processing elements may require distinct access patterns to optimize performance. The invention ensures that a second processing element operates with a different dynamically programmable distribution scheme for accessing memory units compared to a first processing element. This allows for flexible and adaptive memory access strategies tailored to the specific needs of each processing element, improving overall system efficiency and performance. The system may include a plurality of processing elements, each capable of dynamically adjusting their memory access schemes based on workload demands or other operational parameters. The invention enables dynamic reconfiguration of memory access patterns to enhance parallel processing capabilities and reduce bottlenecks in memory-intensive applications.
14. The system of claim 1, wherein the control logic unit of the first processing element is further configured with an access unit size for distributing data across the plurality of memory units.
The invention relates to a distributed memory system for processing elements, addressing challenges in efficiently managing data access and distribution across multiple memory units. The system includes a plurality of processing elements, each with a control logic unit that manages data distribution and access to a shared memory system. The control logic unit is configured to determine an access unit size, which defines the granularity of data distribution across the plurality of memory units. This allows the system to optimize data access patterns, reduce latency, and improve overall performance by aligning data distribution with the processing requirements of the system. The access unit size can be dynamically adjusted based on workload characteristics, ensuring efficient memory utilization and minimizing bottlenecks. The system may also include mechanisms for load balancing, fault tolerance, and synchronization to enhance reliability and performance. By configuring the access unit size, the system ensures that data is distributed in a manner that maximizes parallelism and minimizes contention, leading to improved efficiency in data-intensive applications.
15. The system of claim 1, wherein data elements of a machine learning weight matrix are distributed across the plurality of memory units using the dynamically programmable distribution scheme.
A system for distributed machine learning weight matrix storage addresses the challenge of efficiently managing large-scale weight matrices in memory-constrained environments. The system distributes data elements of a machine learning weight matrix across multiple memory units using a dynamically programmable distribution scheme. This scheme allows for flexible allocation of weight matrix elements to different memory units based on factors such as memory capacity, access patterns, or computational requirements. The distribution can be adjusted in real-time to optimize performance, reduce latency, or balance memory usage. The system may also include a controller that manages the distribution process, ensuring that data elements are correctly mapped to the appropriate memory units. Additionally, the system may support parallel processing by enabling concurrent access to different memory units, improving overall computational efficiency. The dynamic nature of the distribution scheme allows the system to adapt to varying workloads and hardware configurations, making it suitable for diverse machine learning applications.
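A minimal sketch of spreading a weight matrix across memory units follows. It assumes the matrix is distributed in groups of whole rows, assigned round-robin. The function name `distribute_rows` and the `rows_per_group` parameter (standing in for an access unit size) are illustrative assumptions, not terms from the claims.

```python
from typing import Dict, List, Sequence

Row = List[float]


def distribute_rows(matrix: Sequence[Row], units: Sequence[str],
                    rows_per_group: int) -> Dict[str, List[Row]]:
    """Assign consecutive groups of rows to memory units in round-robin order."""
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    placement: Dict[str, List[Row]] = {unit: [] for unit in units}
    for start in range(0, len(matrix), rows_per_group):
        unit = units[(start // rows_per_group) % len(units)]
        placement[unit].extend(matrix[start:start + rows_per_group])
    return placement


if __name__ == "__main__":
    weights = [[float(r), float(r) + 0.5] for r in range(8)]
    units = ["north", "east", "south", "west"]
    for name, rows in distribute_rows(weights, units, rows_per_group=1).items():
        print(name, rows)
```

Changing `rows_per_group` changes which rows land together on one unit, which is the kind of adjustment a programmable scheme would allow per workload.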
17. The method of claim 16, wherein the first memory unit includes a plurality of connections communicatively connecting the first memory unit to a processor, the processor includes the first processing element among a plurality of processing elements, and the received memory request is received at a first connection of the plurality of connections.
This invention relates to a memory system architecture designed to improve data access efficiency in computing systems. The system addresses the problem of latency and bandwidth limitations in traditional memory architectures by implementing a distributed memory structure with multiple connections to a processor. The processor contains multiple processing elements, including the first processing element. The first memory unit has a plurality of connections to the processor, allowing parallel data transfers and reducing bottlenecks. The memory request is received at the memory unit over a first one of these connections. This distributed approach enhances performance by enabling concurrent memory access operations, improving throughput, and reducing contention. The architecture is particularly useful in high-performance computing environments where multiple processing elements require simultaneous access to shared memory resources. The system ensures efficient data retrieval and storage by leveraging multiple communication paths between the memory unit and the processor, optimizing overall system performance.
18. The method of claim 17, wherein the partial response is provided to the first processing element via the first connection.
This claim narrows the method of claim 17 by specifying the return path of the partial response. The partial response is provided to the first processing element via the first connection, that is, the same connection of the first memory unit at which the memory request was received. Returning the response on the connection that carried the request keeps a request and its response on a single path between the memory unit and the processing element. The first processing element may then combine the partial response with other partial responses to obtain the data it requested. This approach is relevant to distributed computing tasks such as parallel processing, where many processing elements exchange requests and responses with several memory units at the same time.
20. The method of claim 19, wherein the first memory request was broadcasted to the plurality of memory units.
The invention relates to memory systems, specifically methods for handling memory requests in a distributed memory architecture. The problem addressed is efficient and reliable data access in systems with multiple memory units, where requests must be managed to ensure correctness and performance. The method involves processing memory requests in a system with multiple memory units. A first memory request is broadcasted to all memory units, allowing each unit to evaluate whether it can service the request. Each memory unit determines if it contains the requested data and, if so, responds with the data. If no unit responds, the request is deemed invalid or unserviced. The system may also handle subsequent memory requests, where a second request is processed based on the outcome of the first request, ensuring consistency and avoiding conflicts. The method ensures that memory requests are correctly routed and serviced by the appropriate memory unit, improving reliability and performance in distributed memory systems. The broadcast mechanism allows parallel evaluation of requests, reducing latency and improving efficiency. The system may also include mechanisms to handle errors, such as invalid requests or conflicts, ensuring robust operation.
December 17, 2019
December 20, 2022