10599436

Data Processing Method and Apparatus, and System

Published: March 24, 2020
Assignee: not available in USPTO data we have

Patent Claims
20 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A data processing method, applied to a system comprising a central processing unit (CPU) pool and a storage pool, wherein the CPU pool is communicatively connected to the storage pool; and the CPU pool comprises at least two CPUs, a master node, at least one mapper node, and at least one reducer node running in the CPU pool, wherein the at least one mapper node comprises a first mapper node, the at least one reducer node comprises a first reducer node, and the first mapper node and the first reducer node run on different CPUs in the CPU pool; the storage pool comprises a remote storage area shared by the first mapper node and the first reducer node; and the method comprises: executing, by the first mapper node, a map task on a data slice, and obtaining N groups of at least one data segment according to an execution result of the map task, wherein N is a positive integer, each of the at least one data segment is to be processed by a corresponding reducer node, and the at least one data segment comprises a first data segment, and an M th group, the first data segment being a data segment to be processed by the first reducer node, and the M th group comprising an M th first data segment, wherein M is a positive integer less than or equal to N; storing, by the first mapper node, all first data segments in the N groups of at least one data segment into the remote storage area, and generating N storage messages, wherein an M th storage message comprises a storage address of the M th first data segment in the remote storage area and a data volume of the M th first data segment; and sending, by the first mapper node, the N storage messages to the master node.
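The mapper-side steps of claim 1 can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: `RemoteStorageArea` is an in-memory stand-in for the shared region of the storage pool, `StorageMessage` models the address and data volume pair, and each group is assumed to be a dictionary from reducer id to that reducer's segment bytes. Sending the messages to the master node is left out; the function simply returns them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StorageMessage:
    address: int  # start offset of the segment in the remote storage area
    volume: int   # size of the segment in bytes


class RemoteStorageArea:
    """Hypothetical in-memory stand-in for a shared storage-pool region."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def append(self, data: bytes) -> int:
        """Store data at the end of the area and return its start address."""
        address = len(self._buf)
        self._buf.extend(data)
        return address

    def read(self, address: int, volume: int) -> bytes:
        return bytes(self._buf[address:address + volume])


def store_first_segments(
    groups: List[Dict[str, bytes]], area: RemoteStorageArea, reducer_id: str
) -> List[StorageMessage]:
    """For each of the N groups, store the segment destined for `reducer_id`
    and build one storage message per stored segment."""
    messages = []
    for group in groups:
        segment = group[reducer_id]      # the M-th "first data segment"
        address = area.append(segment)
        messages.append(StorageMessage(address, len(segment)))
    return messages                      # these would be sent to the master


# Two groups (N = 2), each holding one segment per reducer.
groups = [{"r1": b"apple", "r2": b"kiwi"}, {"r1": b"fig", "r2": b"plum"}]
area = RemoteStorageArea()
messages = store_first_segments(groups, area, "r1")
```

With these inputs the sketch produces two messages, one per group, and each message is enough on its own to read the corresponding segment back out of the area.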

Claim 2

Original Legal Text

2. The method according to claim 1 , wherein a duration during which the first mapper node executes the map task on the data slice comprises N first time segments; and, wherein executing, by the first mapper node, the map task on the data slice, and obtaining N groups of at least one data segment according to an execution result of the map task specifically comprises: when an M th first time segment ends, obtaining, by the first mapper node, the M th group of at least one data segment according to an execution result obtained by executing the map task in the M th first time segment.

Plain English Translation

This invention relates to distributed data processing systems, specifically improving the efficiency of map tasks in a distributed computing framework. The problem addressed is the inefficiency in traditional map tasks where intermediate results are only generated at the end of the entire task, leading to delays in subsequent processing stages. The method involves a distributed computing system where a mapper node processes a data slice by executing a map task. The execution time is divided into N time segments, each producing intermediate results. Specifically, when the Mth time segment ends, the mapper node generates the Mth group of data segments based on the execution results from that segment. This allows for incremental processing, where intermediate results are produced as each time segment ends rather than only after the entire task completes. Partial results can therefore become available earlier to subsequent stages, such as reduce tasks, which can reduce overall latency. The claim does not specify how the time segments are chosen or how long each one lasts.
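A small Python sketch can show the idea of emitting one group per time segment. It makes several assumptions that are not in the claim: a time segment is simulated by a fixed number of processed records instead of wall-clock time, the map task is a word count, and words are partitioned across reducers by word length. The function name `map_in_time_segments` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Group = Dict[int, List[Tuple[str, int]]]  # reducer index -> data segment


def map_in_time_segments(
    records: Iterable[str], records_per_segment: int, num_reducers: int
) -> Iterator[Group]:
    """Run a word-count map task and yield one group of data segments
    each time a (simulated) time segment ends."""
    group: Group = defaultdict(list)
    count = 0
    for word in records:
        reducer = len(word) % num_reducers   # assumed partitioning rule
        group[reducer].append((word, 1))
        count += 1
        if count == records_per_segment:     # the M-th time segment ends
            yield dict(group)
            group = defaultdict(list)
            count = 0
    if count:                                # final, possibly shorter segment
        yield dict(group)


groups = list(map_in_time_segments(["a", "bb", "cc", "d", "eee"], 2, 2))
```

Because the function is a generator, a caller can act on the first group before the rest of the data slice has been mapped, which is the property the claim relies on.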

Claim 3

Original Legal Text

3. The method according to claim 2 , wherein storing, by the first mapper node, all first data segments in the N groups of the at least one data segment into the remote storage area, generating N storage messages, and sending the N storage messages to the master node specifically comprises: storing, by the first mapper node when obtaining the M th group of at least one data segment, the M th first data segment into the remote storage area, generating the M th storage message, and sending the M th storage message to the master node.

Plain English Translation

This invention relates to distributed data processing systems, specifically how a mapper node stores intermediate data segments and reports them to a master node. Building on claim 2, where the map task runs over N time segments and each time segment yields one group of data segments, this claim specifies that storage and notification happen group by group rather than once at the end of the map task. When the mapper node obtains the Mth group of data segments, it stores the Mth first data segment (the segment in that group destined for the first reducer node) into the remote storage area, generates the Mth storage message, and sends that message to the master node. Per claim 1, each storage message carries the storage address and data volume of the stored segment. Because each group is stored and reported as soon as it is obtained, the master node learns the location of each segment while the map task is still running.

Claim 4

Original Legal Text

4. The method according to claim 1 , wherein each of the at least one reducer node has a corresponding remote storage area, wherein a remote storage area corresponding to the first reducer node is configured to store a data segment to be processed by the first reducer node, the to-be processed data segments being in data segments obtained by all mapper nodes; and, wherein storing, by the first mapper node, the first data segment into the remote storage area comprises: storing, by the first mapper node, the first data segment into the remote storage area corresponding to the first reducer node.

Plain English Translation

This invention relates to distributed data processing systems, specifically improving data handling efficiency in map-reduce frameworks. The problem addressed is the inefficiency in data transfer between mapper and reducer nodes, which can bottleneck performance in large-scale data processing tasks. The system includes multiple mapper nodes that generate data segments during the map phase and at least one reducer node that processes these segments in the reduce phase. Each reducer node has a corresponding remote storage area that holds the data segments the reducer will process, drawn from the segments obtained by all mapper nodes. When a mapper node generates a data segment, it stores that segment directly into the remote storage area corresponding to the reducer node responsible for processing it. Because each segment is written straight to the correct reducer's storage area, no intermediate transfer step between mapper and reducer is needed, which can reduce latency and network traffic. This approach is particularly beneficial in distributed computing environments where data volume and processing demands are high.

Claim 5

Original Legal Text

5. The method according to claim 1 , wherein each of the at least one mapper node has a corresponding remote storage area, and a remote storage area corresponding to the first mapper node is configured to store the at least one data segment obtained by the first mapper node; and, wherein storing, by the first mapper node, the first data segment into the remote storage area comprises: storing, by the first mapper node, the first data segment into the remote storage area corresponding to the first mapper node.

Plain English Translation

This invention relates to distributed data processing systems, specifically methods for managing data segments in a distributed computing environment. The problem addressed is efficient and scalable storage of data segments generated by mapper nodes in a distributed processing framework, such as MapReduce or similar systems. The invention describes a method where each mapper node in a distributed system has a dedicated remote storage area. When a mapper node processes input data and generates at least one data segment, it stores that segment in its corresponding remote storage area. For example, a first mapper node generates one or more data segments and stores them exclusively in its assigned remote storage area, ensuring data locality and reducing network overhead. This approach improves performance by minimizing cross-node data transfers and simplifying data management. The system ensures that data segments are stored in the correct remote storage area by associating each mapper node with a specific storage location. This method enhances scalability and fault tolerance, as data segments remain accessible even if other nodes fail. The invention optimizes distributed data processing by leveraging localized storage while maintaining the flexibility of distributed computing frameworks.

Claim 6

Original Legal Text

6. The method according to claim 1 , wherein a quantity of remote storage areas is equal to a product of a quantity of mapper nodes and a quantity of reducer nodes, and each remote storage area is shared by one mapper node and one reducer node; and, wherein storing, by the first mapper node, the first data segment into the remote storage area comprises: storing, by the first mapper node, the first data segment into the remote storage area shared by the first mapper node and the first reducer node.

Plain English Translation

This invention relates to distributed data processing systems, specifically optimizing data storage and transfer between mapper and reducer nodes in a parallel processing framework. The problem addressed is inefficient data movement and storage allocation in distributed computing environments, where intermediate data generated by mapper nodes must be efficiently transferred to reducer nodes for further processing. The system involves a distributed architecture where data is processed in parallel by multiple mapper nodes, which generate intermediate data segments. These segments are stored in remote storage areas before being accessed by reducer nodes for aggregation or further processing. The key innovation is a structured allocation of remote storage areas, where the total number of storage areas equals the product of the number of mapper nodes and the number of reducer nodes. Each storage area is uniquely shared by one mapper node and one reducer node, ensuring direct and efficient data transfer without contention. For example, a first mapper node generates a data segment and stores it in a specific remote storage area that is exclusively shared with a first reducer node. This direct mapping eliminates the need for intermediate coordination or additional data movement steps, reducing latency and improving overall system throughput. The approach ensures that each mapper-reducer pair has a dedicated storage space, optimizing resource utilization and minimizing bottlenecks in large-scale data processing tasks.
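The one-area-per-pair layout can be made concrete with a short Python sketch. The row-major numbering used by `area_index` is an assumption chosen for illustration; the claim only requires that the number of areas equals the product of the two node counts and that each area is shared by one mapper node and one reducer node.

```python
def area_index(mapper: int, reducer: int, num_reducers: int) -> int:
    """Map a (mapper, reducer) pair to a unique area number
    (assumed row-major numbering, zero-based)."""
    return mapper * num_reducers + reducer


num_mappers, num_reducers = 3, 2

# One remote storage area per (mapper, reducer) pair.
areas = {
    area_index(m, r, num_reducers): (m, r)
    for m in range(num_mappers)
    for r in range(num_reducers)
}
```

With 3 mappers and 2 reducers this yields 6 distinct areas, and both the mapper (when writing) and the reducer (when reading) can compute the same index from the pair alone.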

Claim 7

Original Legal Text

7. A data processing method, applied to a system comprising a central processing unit (CPU) pool and a storage pool, wherein the CPU pool is communicatively connected to the storage pool; and the CPU pool comprises at least two CPUs, a master node, at least one mapper node, and at least one reducer node running in the CPU pool, wherein the at least one mapper node comprises a first mapper node, the at least one reducer node comprises a first reducer node, and the first mapper node and the first reducer node run on different CPUs in the CPU pool; the storage pool comprises a remote storage area shared by the first mapper node and the first reducer node; and the method comprises: receiving, by the first reducer node, a storage message sent by the master node, wherein the storage message includes a storage address of a first data segment in the remote storage area and a data volume of the first data segment, the first data segment being to be processed by the first reducer node and being in at least one data segment obtained by the first mapper node; obtaining, by the first reducer node, the first data segment with the data volume from the remote storage area according to the storage address carried in the storage message; and executing, by the first reducer node, a reduce task on the first data segment.

Plain English Translation

This invention relates to distributed data processing systems, specifically the reducer side of a map-reduce method running on a system with a CPU pool and a storage pool. The CPU pool contains at least two CPUs and runs a master node, at least one mapper node, and at least one reducer node. The storage pool provides a remote storage area shared by the first mapper node and the first reducer node, which run on different CPUs. In the claimed method, the first reducer node receives a storage message from the master node. The message specifies the storage address and data volume of a first data segment, which was obtained by the first mapper node and is to be processed by the first reducer node. The reducer node reads a segment of that data volume from the remote storage area at the given address and executes a reduce task on it. Because the segment is handed over through the shared remote storage area, the reducer needs only the address and size from the master node to locate its input and does not have to request the data from the mapper node directly.

Claim 8

Original Legal Text

8. The method according to claim 7 , wherein each of the at least one reducer node has a corresponding remote storage area, wherein a remote storage area corresponding to the first reducer node is configured to store a data segment to be processed by the first reducer node, the to-be processed data segment being in data segments obtained by all mapper nodes; and, wherein obtaining, by the first reducer node, the first data segment with the data volume from the remote storage area according to the storage address comprises: determining, by the first reducer node according to the storage address, the remote storage area corresponding to the first reducer node, determining a start address of the first data segment in the remote storage area corresponding to the first reducer node, and reading the first data segment with the data volume from the start address.

Plain English Translation

This invention relates to distributed data processing systems, specifically improving efficiency in data reduction phases. The problem addressed is the inefficiency in data retrieval during the reduce phase of distributed processing frameworks, where reducer nodes must access data segments generated by mapper nodes. The solution involves assigning each reducer node a dedicated remote storage area to store its specific data segments, eliminating the need for reducers to search across all available data segments. When a reducer node needs a data segment, it directly accesses its corresponding remote storage area using a storage address, which specifies the exact location of the segment. The reducer node determines the start address of the desired segment within its storage area and reads only the required data volume, reducing unnecessary data transfers and improving processing speed. This method ensures that reducers only access relevant data, minimizing overhead and enhancing overall system performance. The approach is particularly useful in large-scale data processing environments where efficient data retrieval is critical.
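The reducer-side read can be sketched in Python as follows. This is only an illustration under stated assumptions: the storage address is modeled as an area id plus a byte offset, the storage pool is a plain dictionary of byte buffers, and the reduce task is a word count. The names `StorageMessage`, `fetch_segment`, and `reduce_word_count` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StorageMessage:
    area_id: str  # assumed: identifies the reducer's remote storage area
    offset: int   # start address of the segment inside that area
    volume: int   # data volume of the segment in bytes


def fetch_segment(pool: Dict[str, bytearray], msg: StorageMessage) -> bytes:
    """Locate the area named in the message, then read exactly
    `volume` bytes starting at the start address."""
    area = pool[msg.area_id]
    return bytes(area[msg.offset:msg.offset + msg.volume])


def reduce_word_count(segment: bytes) -> Dict[str, int]:
    """Example reduce task: count whitespace-separated words."""
    return dict(Counter(segment.decode().split()))


# The area holds two segments back to back; the message points at the first.
pool = {"reducer-1": bytearray(b"a b a|c c")}
msg = StorageMessage(area_id="reducer-1", offset=0, volume=5)
segment = fetch_segment(pool, msg)
counts = reduce_word_count(segment)
```

Reading by start address and data volume means the reducer touches only the bytes of its own segment, even when other segments sit in the same area.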

Claim 9

Original Legal Text

9. The method according to claim 7 , wherein each of the at least one mapper node has a corresponding remote storage area, and a remote storage area corresponding to the first mapper node is configured to store the at least one data segment obtained by the first mapper node according to an execution result obtained after the mapper node executes a map task; and wherein obtaining, by the first reducer node, the first data segment with the data volume from the remote storage area according to the storage address comprises: determining, by the first reducer node according to the storage address, the remote storage area corresponding to the first mapper node, determining a start address of the first data segment in the remote storage area corresponding to the first mapper node, and reading the first data segment with the data volume from the start address.

Plain English Translation

This invention relates to distributed data processing systems, specifically improving data retrieval efficiency in map-reduce frameworks. The problem addressed is the latency and inefficiency in accessing intermediate data segments generated by mapper nodes during distributed processing tasks. The solution involves optimizing the storage and retrieval of these segments to reduce processing delays. In a distributed processing system, mapper nodes execute map tasks to generate data segments, which are then stored in corresponding remote storage areas. Each mapper node has a dedicated remote storage area where its processed data segments are stored based on execution results. When a reducer node needs to access a specific data segment, it retrieves the segment by first determining the remote storage area associated with the mapper node that generated the segment. The reducer then identifies the start address of the desired data segment within that storage area and reads the segment directly from that address. This direct access method minimizes the overhead of searching or transferring data, improving overall processing efficiency. The system ensures that data segments are stored in a structured manner, allowing reducer nodes to quickly locate and retrieve the required segments without unnecessary delays. This approach is particularly useful in large-scale data processing environments where minimizing latency is critical for performance.

Claim 10

Original Legal Text

10. The method according to claim 7 , wherein a quantity of remote storage areas is equal to a product of a quantity of mapper nodes and a quantity of reducer nodes, and each remote storage area is shared by one mapper node and one reducer node; and wherein obtaining, by the first reducer node, the first data segment with the data volume from the remote storage area according to the storage address comprises: determining, by the first reducer node according to the storage address, the remote storage area shared by the first mapper node and the first reducer node, determining a start address of the first data segment in the remote storage area shared by the first mapper node and the first reducer node, and reading the first data segment with the data volume from the start address.

Plain English Translation

This invention relates to distributed data processing systems, specifically optimizing data storage and retrieval in a MapReduce framework. The problem addressed is inefficient data handling between mapper and reducer nodes, leading to bottlenecks in large-scale data processing tasks. The system involves a distributed storage architecture where the number of remote storage areas is determined by multiplying the number of mapper nodes by the number of reducer nodes. Each storage area is uniquely shared between one mapper node and one reducer node, creating a dedicated communication channel. When a reducer node needs to access data processed by a mapper node, it first determines the specific storage area shared between them using a storage address. The reducer then identifies the start address of the required data segment within that storage area and reads the data directly from that location. This approach eliminates the need for reducers to search through multiple storage locations, reducing latency and improving processing efficiency. The direct mapping between mapper-reducer pairs and storage areas ensures that data is retrieved from the most optimal path, minimizing network traffic and improving overall system throughput. The solution is particularly valuable in big data environments where minimizing data transfer overhead is critical for performance.

Claim 11

Original Legal Text

11. A computer device, comprising at least one central processing unit (CPU), wherein the at least one CPU in the computer device is in a CPU pool, the CPU pool being communicatively connected to a storage pool; and the CPU pool running a master node, at least one mapper node, and at least one reducer node, wherein the at least one mapper node comprises a first mapper node, and the at least one reducer node comprises a first reducer node, wherein the first mapper node and the first reducer node run on different CPUs in the CPU pool, and the first mapper node runs on one or more CPUs in the computer device; and the first mapper node and the first reducer node share a remote storage area comprised in the storage pool; and wherein the computer device comprises at least one memory having a plurality of instructions stored thereon, when the instructions are executed by the one or more CPUs in the computer device to realize the first mapper node, the instructions cause the one or more CPUs to: execute a map task on a data slice, and obtain N groups of at least one data segment according to an execution result of the map task, wherein N is a positive integer, each of the at least one data segment is to be processed by a corresponding reducer node, and the at least one data segment comprises a first data segment, and an M th group, the first data segment being a data segment to be processed by the first reducer node, and the M th group comprising an M th first data segment, wherein M is a positive integer less than or equal to N; and store all first data segments in the N groups of at least one data segment into the remote storage area, generate N storage messages, and send the N storage messages to the master node, wherein an M th storage message comprises a storage address of the M th first data segment in the remote storage area and a data volume of the M th first data segment.

Plain English Translation

This invention relates to distributed computing systems, specifically a computer device configured for efficient data processing in a MapReduce framework. The system addresses the challenge of optimizing resource utilization and data management in distributed computing environments by leveraging a CPU pool and a storage pool. The CPU pool contains multiple central processing units (CPUs) that collectively run a master node, mapper nodes, and reducer nodes. The mapper nodes and reducer nodes operate on separate CPUs within the pool, ensuring parallel processing and resource isolation. A first mapper node executes a map task on a data slice, generating N groups of data segments, where each segment is designated for processing by a corresponding reducer node. The first reducer node processes a specific data segment from these groups. All data segments are stored in a shared remote storage area within the storage pool, which is accessible to both mapper and reducer nodes. The mapper node generates storage messages containing the storage addresses and data volumes of the segments and sends these messages to the master node. This approach enhances data processing efficiency by minimizing inter-node communication overhead and ensuring coordinated data management across distributed resources. The system is particularly useful in large-scale data processing applications requiring high throughput and scalability.

Claim 12

Original Legal Text

12. The computer device according to claim 11 , wherein a duration during which the first mapper node executes the map task on the data slice comprises N first time segments; and, wherein executing, by the first mapper node, the map task on the data slice, and obtaining N groups of at least one data segment according to an execution result of the map task specifically comprises: when an M th first time segment ends, obtaining, by the first mapper node, the M th group of at least one data segment according to an execution result obtained by executing the map task in the M th first time segment.

Plain English Translation

This invention relates to distributed data processing systems, specifically improving the efficiency of map tasks in a distributed computing framework. The problem addressed is the inefficiency in traditional map tasks where data processing is performed in a single continuous execution, leading to delays in intermediate results. The system includes a computer device whose CPUs realize a mapper node that processes a data slice by executing a map task. The duration of the map task is divided into N time segments. When the Mth time segment ends, the mapper node obtains the Mth group of data segments from the execution result of the map task in that time segment. This segmented approach produces intermediate results incrementally, so partial results are available before the whole map task completes and can be passed on for further processing sooner. This is the device counterpart of the method in claim 2.

Claim 13

Original Legal Text

13. The computer device according to claim 11 , wherein the instructions cause the one or more CPUs to: when obtaining the M th group of at least one data segment, store the M th first data segment into the remote storage area, generate the M th storage message, and send the M th storage message to the master node.

Plain English Translation

This invention relates to distributed data processing systems, specifically how the computer device realizing the first mapper node stores intermediate data segments and reports them to the master node. The device of claim 11 executes a map task and obtains N groups of data segments, where the Mth group contains the Mth first data segment, that is, the segment in that group to be processed by the first reducer node. Under this claim, when the device obtains the Mth group, it stores the Mth first data segment into the remote storage area, generates the Mth storage message, and sends that message to the master node. Per claim 11, the storage message carries the storage address and data volume of the stored segment. Storage and notification therefore happen group by group, so the master node learns where each segment is located as soon as that segment has been stored.

Claim 14

Original Legal Text

14. The computer device according to claim 11 , wherein each of the at least one reducer node has a corresponding remote storage area, wherein a remote storage area corresponding to the first reducer node is configured to store a data segment to be processed by the first reducer node, the to-be processed data segments being in data segments obtained by all mapper nodes; and wherein the instructions cause the one or more CPUs to store the first data segment into the remote storage area corresponding to the first reducer node.

Plain English Translation

This invention relates to distributed data processing systems, specifically optimizing data storage and retrieval in a framework where mapper and reducer nodes handle large-scale data processing tasks. The problem addressed is inefficient data distribution and storage, leading to bottlenecks in processing performance. The system includes mapper nodes that produce data segments and reducer nodes that process them. Each reducer node has a corresponding remote storage area, which stores the data segments that reducer is to process, drawn from the segments obtained by all mapper nodes. Under this claim, the computer device realizing the first mapper node stores the first data segment into the remote storage area corresponding to the first reducer node. Because every segment destined for a given reducer is placed in that reducer's own storage area, the reducer can find all of its input in one place. This is the device counterpart of the method in claim 4.

Claim 15

Original Legal Text

15. The computer device according to claim 11 , wherein each of the at least one mapper node has a corresponding remote storage area, wherein a remote storage area corresponding to the first mapper node is configured to store the at least one data segment obtained by the first mapper node; and wherein the instructions cause the one or more CPUs to store the first data segment into the remote storage area corresponding to the first mapper node.

Plain English Translation

This invention relates to distributed data processing systems, specifically improving data storage efficiency in systems using mapper nodes. The problem addressed is the inefficiency in storing intermediate data segments generated by mapper nodes during distributed processing tasks, such as in map-reduce frameworks. Traditional systems often rely on local storage or centralized storage, leading to bottlenecks, increased latency, or resource contention. The invention describes a distributed data processing system where each mapper node has a dedicated remote storage area. When a mapper node, such as the first mapper node, processes input data and generates at least one data segment, that segment is stored in the corresponding remote storage area. This remote storage area is specifically allocated for the mapper node's output, ensuring that data segments are stored in a distributed manner rather than being centralized or locally stored. The system uses one or more central processing units (CPUs) to execute instructions that direct the storage of the first data segment into the remote storage area assigned to the first mapper node. This approach reduces storage contention, improves parallelism, and enhances overall system performance by distributing storage load across multiple remote storage areas. The invention optimizes data flow in distributed processing environments by aligning storage resources with the nodes generating the data, minimizing transfer overhead and improving scalability.

Claim 16

Original Legal Text

16. The computer device according to claim 11, wherein a quantity of remote storage areas is equal to a product of a quantity of mapper nodes and a quantity of reducer nodes, and each remote storage area is shared by one mapper node and one reducer node; and wherein the instructions cause the one or more CPUs to store the first data segment into the remote storage area shared by the first mapper node and the first reducer node.

Plain English Translation

This claim concerns distributed data processing, specifically how intermediate data is handed from mapper nodes to reducer nodes. Intermediate data produced by mapper nodes has to reach the reducer nodes that process it, and that handoff is a common source of bottlenecks and delay. Here the number of remote storage areas equals the number of mapper nodes multiplied by the number of reducer nodes, so there is one area for every mapper-reducer pair, and each area is shared by exactly one mapper node and one reducer node. For example, 3 mapper nodes and 2 reducer nodes give 6 remote storage areas. When the first mapper node produces the first data segment, the instructions cause the CPUs to store it in the area shared by the first mapper node and the first reducer node. Each area therefore holds only the data one mapper produced for one reducer, so the reducer can read its input from a known location.
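
The counting rule is easy to check with a short sketch. The claim only fixes the number of areas and the one-mapper-one-reducer sharing; the row-major numbering used below, and the function names `area_count` and `area_index`, are assumptions made for illustration.

```python
def area_count(num_mappers: int, num_reducers: int) -> int:
    """Number of remote storage areas: one per mapper-reducer pair."""
    return num_mappers * num_reducers


def area_index(mapper_id: int, reducer_id: int, num_reducers: int) -> int:
    """Assumed row-major numbering of the area shared by one pair."""
    return mapper_id * num_reducers + reducer_id


NUM_MAPPERS, NUM_REDUCERS = 3, 2
total = area_count(NUM_MAPPERS, NUM_REDUCERS)

# Every (mapper, reducer) pair maps to its own distinct area.
indices = {
    (m, r): area_index(m, r, NUM_REDUCERS)
    for m in range(NUM_MAPPERS)
    for r in range(NUM_REDUCERS)
}
```

With this numbering, the area index alone identifies both the writer and the reader of the area.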

Claim 17

Original Legal Text

17. A computer device, comprising at least one central processing unit (CPU), wherein the at least one CPU in the computer device is in a CPU pool, wherein the CPU pool is communicatively connected to a storage pool; and the CPU pool runs a master node, at least one mapper node, and at least one reducer node, wherein the at least one mapper node comprises a first mapper node, the at least one reducer node comprises a first reducer node, wherein the first mapper node and the first reducer node run on different CPUs in the CPU pool, and the first reducer node runs on one or more CPUs in the computer device; and the first mapper node and the first reducer node share a remote storage area comprised in the storage pool; and wherein the computer device comprises at least one memory having a plurality of instructions stored thereon, when the instructions are executed by the one or more CPUs in the computer device to realize the first reducer node, the instructions cause the one or more CPUs to: receive a storage message sent by the master node, wherein the storage message includes a storage address of a first data segment in the remote storage area and a data volume of the first data segment, wherein the first data segment is to be processed by the first reducer node, the first data segment being in the at least one data segment obtained by the first mapper node; obtain the first data segment with the data volume from the remote storage area according to the storage address carried in the storage message; and execute a reduce task on the first data segment.

Plain English Translation

This claim describes the reducer side of a distributed computing system: a computer device whose CPUs run the first reducer node. The device's CPUs belong to a CPU pool that is communicatively connected to a storage pool. The CPU pool runs a master node, at least one mapper node, and at least one reducer node. The first mapper node and the first reducer node run on different CPUs in the pool, the first reducer node runs on one or more CPUs of this computer device, and the two nodes share a remote storage area in the storage pool. The device also has memory holding the instructions that realize the first reducer node. Processing proceeds in three steps. First, the reducer receives a storage message from the master node; the message carries the storage address of the first data segment in the remote storage area and the segment's data volume. The first data segment is one of the segments obtained by the first mapper node and is the one to be processed by the first reducer node. Second, the reducer uses the address to obtain a segment of that data volume from the remote storage area. Third, it executes a reduce task on the segment. The storage message tells the reducer where the data is and how much to read, while the data itself is exchanged through the shared remote storage area.
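
The three reducer steps can be sketched in Python as follows. This is a hedged illustration: `StorageMessage`, `fetch_segment`, `reduce_task`, the byte-string model of the shared area, and the `key,value;` encoding with a summing reduce are all assumptions, since the claim does not specify data formats or what the reduce task computes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageMessage:
    """Hypothetical message relayed by the master node."""
    address: int  # start offset of the segment in the shared area
    volume: int   # segment length in bytes


def fetch_segment(shared_area: bytes, msg: StorageMessage) -> bytes:
    """Step 2: read `volume` bytes starting at `address`."""
    return shared_area[msg.address:msg.address + msg.volume]


def reduce_task(segment: bytes) -> dict:
    """Step 3: an assumed reduce that sums values per key."""
    counts = Counter()
    for pair in segment.decode().split(";"):
        key, value = pair.split(",")
        counts[key] += int(value)
    return dict(counts)


# The dots stand for unrelated bytes surrounding the segment.
shared_area = b"....a,1;b,2;a,3...."
msg = StorageMessage(address=4, volume=11)  # step 1: message received
result = reduce_task(fetch_segment(shared_area, msg))
```

The reducer never scans the area; the address and volume in the message are enough to isolate its segment.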

Claim 18

Original Legal Text

18. The computer device according to claim 17, wherein each of the at least one reducer node has a corresponding remote storage area, wherein a remote storage area corresponding to the first reducer node is configured to store a data segment to be processed by the first reducer node, the to-be processed data segment being in data segments obtained by all mapper nodes; and wherein the instructions cause the one or more CPUs to determine, according to the storage address, the remote storage area corresponding to the first reducer node, determine a start address of the first data segment in the remote storage area corresponding to the first reducer node, and read the first data segment with the data volume from the start address.

Plain English Translation

This claim concerns how a reducer node retrieves the data segments assigned to it in a mapper-reducer system. Reducers must fetch segments produced by mappers, and locating them can add latency and overhead. Here each reducer node has its own remote storage area. The area corresponding to the first reducer node stores the data segments that reducer is to process, drawn from the segments obtained by all mapper nodes. To read the first data segment, the CPUs realizing the first reducer node use the storage address from the storage message to determine which remote storage area corresponds to the first reducer node, determine the segment's start address within that area, and read the stated data volume beginning at that start address. Since the segments a reducer needs are gathered in one area, the reducer reads only the segments assigned to it.
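
One way to picture "determine the area, then the start address, then read" is a flat address that decodes into an area number and an offset. The claim does not say how a storage address is encoded, so the fixed `AREA_SIZE`, the `divmod` decoding, and the names `locate` and `read_segment` below are purely assumptions for illustration.

```python
from typing import Tuple

AREA_SIZE = 16  # hypothetical fixed size of each reducer's area, in bytes


def locate(address: int) -> Tuple[int, int]:
    """Split a flat storage address into (area_id, start_offset)."""
    return divmod(address, AREA_SIZE)


def read_segment(pool: bytes, address: int, volume: int) -> bytes:
    """Read `volume` bytes of one segment from the area the address names."""
    area_id, start = locate(address)
    if start + volume > AREA_SIZE:
        raise ValueError("segment would cross the area boundary")
    base = area_id * AREA_SIZE
    return pool[base + start:base + start + volume]


# Two reducer areas laid out back to back; each holds segments that
# mapper 0 and mapper 1 produced for that reducer, padded with dashes.
pool = b"m0:x,1|m1:x,2---" + b"m0:y,5|m1:y,7---"

# Address 23 falls in area 1 (reducer 1) at offset 7.
segment = read_segment(pool, address=23, volume=6)
```

The boundary check reflects the idea that a segment belongs entirely to one reducer's area.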

Claim 19

Original Legal Text

19. The computer device according to claim 17, wherein each of the at least one mapper node has a corresponding remote storage area, wherein a remote storage area corresponding to the first mapper node is configured to store the at least one data segment obtained by the first mapper node according to an execution result obtained after the mapper node executes a map task; and wherein the instructions cause the one or more CPUs to determine, according to the storage address, the remote storage area corresponding to the first mapper node, determine a start address of the first data segment in the remote storage area corresponding to the first mapper node, and read the first data segment with the data volume from the start address.

Plain English Translation

This claim concerns how a reducer node retrieves intermediate data when the storage is organized by mapper. Each mapper node has its own remote storage area, and the area corresponding to the first mapper node stores the data segments that mapper obtains from the result of executing its map task. To read the first data segment, the CPUs realizing the first reducer node use the storage address from the storage message to determine the remote storage area corresponding to the first mapper node, determine the segment's start address within that area, and read the stated data volume beginning at that start address. The start address and data volume together tell the reducer which part of the mapper's area belongs to it.

Claim 20

Original Legal Text

20. The computer device according to claim 17, wherein a quantity of remote storage areas is equal to a product of a quantity of mapper nodes and a quantity of reducer nodes, and each remote storage area is shared by one mapper node and one reducer node; and wherein the instructions instruct the one or more CPUs to determine, according to the storage address, the remote storage area shared by the first mapper node and the first reducer node, determine a start address of the first data segment in the remote storage area shared by the first mapper node and the first reducer node, and read the first data segment with the data volume from the start address.

Plain English Translation

This claim concerns how a reducer node retrieves intermediate data when the storage is organized by mapper-reducer pair. Mappers and reducers must exchange intermediate data, and that exchange is a common source of bottlenecks and delay. The number of remote storage areas equals the number of mapper nodes multiplied by the number of reducer nodes, and each area is shared by exactly one mapper node and one reducer node. To read the first data segment, the CPUs realizing the first reducer node use the storage address from the storage message to determine the remote storage area shared by the first mapper node and the first reducer node, determine the segment's start address within that area, and read the stated data volume beginning at that start address. Giving each mapper-reducer pair its own area means a reducer reads data from one mapper without touching segments written by, or intended for, any other node.
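
A small round trip shows the paired layout from write to read. It is a sketch under assumptions: the `PairedStorage` class, the tuple form of the storage address `(mapper_id, reducer_id, start_offset)`, and the in-memory `bytearray` areas are hypothetical and are not taken from the claims.

```python
from typing import Dict, Tuple

PairKey = Tuple[int, int]        # (mapper_id, reducer_id)
Address = Tuple[int, int, int]   # (mapper_id, reducer_id, start_offset)


class PairedStorage:
    """Hypothetical storage pool with one area per mapper-reducer pair."""

    def __init__(self, num_mappers: int, num_reducers: int) -> None:
        self.areas: Dict[PairKey, bytearray] = {
            (m, r): bytearray()
            for m in range(num_mappers)
            for r in range(num_reducers)
        }

    def write(self, mapper_id: int, reducer_id: int,
              segment: bytes) -> Tuple[Address, int]:
        """Mapper side: append a segment, return (address, data volume)."""
        area = self.areas[(mapper_id, reducer_id)]
        start = len(area)
        area.extend(segment)
        return (mapper_id, reducer_id, start), len(segment)

    def read(self, address: Address, volume: int) -> bytes:
        """Reducer side: pick the shared area, then read from the start."""
        mapper_id, reducer_id, start = address
        area = self.areas[(mapper_id, reducer_id)]
        return bytes(area[start:start + volume])


storage = PairedStorage(num_mappers=2, num_reducers=2)
storage.write(1, 0, b"k,1;")                       # an earlier segment
address, volume = storage.write(1, 0, b"k,2;j,9")  # the segment of interest
segment = storage.read(address, volume)
```

The address and volume returned by `write` play the role of the storage message contents that the master node would pass on to the reducer.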

Patent Metadata

Filing Date

Unknown

Publication Date

March 24, 2020

Inventors

Haiyan LIU
Jun XU
Qun YU

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “DATA PROCESSING METHOD AND APPARATUS, AND SYSTEM” (10599436). https://patentable.app/patents/10599436

© 2026 Nomic Interactive Technology LLC. Machine-readable context available at /api/llm-context/10599436. See llms.txt for full attribution policy.
