System And Method For Large-Scale Data Processing Using An Application-Independent Framework

Published: June 21, 2022

Assignee: not available in the USPTO data we have

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of performing large-scale processing of data in a distributed and parallel processing environment, comprising: at a set of interconnected computing systems, each having one or more processors and memory: identifying an application-specific map operation to retrieve data and produce intermediate data values; identifying an application-specific reduce operation to combine the intermediate data values; executing a plurality of map worker processes, wherein each map worker process executes the map operation to read designated portions of input files, produce the intermediate data values in accordance with the map operation, and store the intermediate data values in intermediate data structures; executing a plurality of reduce worker processes, wherein each reduce worker process executes the reduce operation to read a respective subset of the intermediate data values from the intermediate data structures and to produce final output data by combining the respective subset of the intermediate data values in accordance with the reduce operation; and tracking a status of a plurality of tasks executed by the map worker processes and the reduce worker processes.
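The flow recited in claim 1 can be illustrated with a minimal single-process sketch. It is not the patented implementation: the word-count map and reduce functions, the names `map_words`, `reduce_sum`, and `run_job`, the byte-sum stand-in for a hash, and the plain dictionary used for status tracking are all assumptions made for illustration, and each loop iteration merely stands in for a separate worker process.

```python
from collections import defaultdict


def map_words(chunk):
    """Hypothetical application-specific map: emit (word, 1) per word."""
    return [(word, 1) for word in chunk.split()]


def reduce_sum(key, values):
    """Hypothetical application-specific reduce: combine values for one key."""
    return key, sum(values)


def run_job(chunks, map_fn, reduce_fn, num_reducers=2):
    status = {}  # task id -> "pending" or "done" (assumed status values)
    # One intermediate data structure per reduce task.
    buckets = [defaultdict(list) for _ in range(num_reducers)]

    # Map phase: each iteration stands in for one map worker process
    # reading its designated portion of the input.
    for i, chunk in enumerate(chunks):
        task = f"map-{i}"
        status[task] = "pending"
        for key, value in map_fn(chunk):
            # Byte sum modulo R: a deterministic stand-in for a real hash.
            index = sum(key.encode("utf-8")) % num_reducers
            buckets[index][key].append(value)
        status[task] = "done"

    # Reduce phase: each iteration stands in for one reduce worker process
    # reading its own subset of the intermediate values.
    output = {}
    for r, bucket in enumerate(buckets):
        task = f"reduce-{r}"
        status[task] = "pending"
        for key, values in bucket.items():
            out_key, out_value = reduce_fn(key, values)
            output[out_key] = out_value
        status[task] = "done"
    return output, status
```

Calling `run_job(["a b a", "b c"], map_words, reduce_sum)` returns the per-word counts together with the status of the two map tasks and two reduce tasks.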

2. The method of claim 1, wherein tracking the status of the plurality of tasks comprises storing the status in one or more tables.
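One way to picture the table of claim 2 is an in-memory table with one row per task. The claim does not say what a row contains, so the three states, the `worker` column, and the class names `State`, `TaskRow`, and `StatusTable` below are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    # Assumed states; the claim only requires that a status be stored.
    IDLE = "idle"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


@dataclass
class TaskRow:
    task_id: str
    kind: str  # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None  # worker currently holding the task


class StatusTable:
    """Hypothetical status table keyed by task id."""

    def __init__(self):
        self.rows = {}

    def add(self, task_id, kind):
        self.rows[task_id] = TaskRow(task_id, kind)

    def start(self, task_id, worker):
        row = self.rows[task_id]
        row.state = State.IN_PROGRESS
        row.worker = worker

    def finish(self, task_id):
        self.rows[task_id].state = State.COMPLETED

    def count(self, state):
        return sum(1 for row in self.rows.values() if row.state is state)
```

A supervising process could, for example, poll `count(State.COMPLETED)` to decide when the map phase has finished.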

3. The method of claim 1, further comprising: determining the plurality of tasks associated with the map operation and the reduce operation; and assigning the plurality of tasks to the plurality of map worker processes and the plurality of reduce worker processes; wherein: determining the plurality of tasks comprises determining, for the input files, a plurality of map tasks specifying data from the input files to be processed into the intermediate data values and a plurality of reduce tasks, each specifying a respective subset of the intermediate data values to be processed into the final output data; assigning the map tasks comprises assigning the map tasks to underutilized ones of the map worker processes; and assigning the reduce tasks comprises assigning the reduce tasks to underutilized ones of the reduce worker processes.
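The two steps of claim 3, determining tasks and assigning them to underutilized workers, might look like the following sketch. The claim does not define "underutilized" or how input files are divided, so treating a task as a fixed-size byte range and picking the worker with the fewest assigned tasks are assumptions, as are the function names.

```python
import heapq


def determine_map_tasks(file_sizes, split_size):
    """Return (filename, start, end) byte ranges covering every input file."""
    tasks = []
    for name, size in file_sizes.items():
        for start in range(0, size, split_size):
            tasks.append((name, start, min(start + split_size, size)))
    return tasks


def assign_to_least_loaded(tasks, workers):
    """Give each task to the worker that currently holds the fewest tasks."""
    # Min-heap of (current load, worker name); the lowest load pops first.
    heap = [(0, worker) for worker in workers]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in workers}
    for task in tasks:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + 1, worker))
    return assignment
```

The same assignment routine would apply to reduce tasks, with each task naming a subset of the intermediate data rather than a byte range.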

4. The method of claim 3, wherein: the set of interconnected computer systems are grouped into a plurality of datacenters; and assigning the map tasks to underutilized ones of the map worker processes comprises assigning map tasks for data stored on computer systems in a respective datacenter to map worker processes that are running on computer systems in the respective datacenter.
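The datacenter-local assignment of claim 4 can be sketched as a filter applied before the load comparison. The data shapes, the function name `assign_with_locality`, and the fallback to any worker when a datacenter has no local worker are assumptions; the claim itself does not describe a fallback.

```python
def assign_with_locality(tasks, workers):
    """Assign each map task to the least-loaded worker in its datacenter.

    tasks:   list of (task_id, datacenter holding the task's input data)
    workers: dict mapping worker name -> datacenter the worker runs in
    """
    load = {worker: 0 for worker in workers}
    assignment = {}
    for task_id, datacenter in tasks:
        local = [w for w in workers if workers[w] == datacenter]
        # Assumed fallback: use any worker if none is local to the data.
        candidates = local or list(workers)
        chosen = min(candidates, key=lambda w: load[w])
        assignment[task_id] = chosen
        load[chosen] += 1
    return assignment
```

Keeping a task in the datacenter that stores its input avoids moving that input across datacenter links before the map operation runs.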

5. The method of claim 1, further comprising applying a partition operation to the intermediate data values, wherein the partition operation specifies a respective intermediate data structure of the set of intermediate data structures in which to store each intermediate data value.
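A partition operation of the kind recited in claim 5 is often realized as a hash of the key modulo the number of intermediate data structures. The claim does not prescribe a hash, so the use of `zlib.crc32` and the names `partition` and `store` below are assumptions for illustration.

```python
import zlib


def partition(key, num_structures):
    """Pick the index of the intermediate structure that stores this key."""
    # crc32 is deterministic across runs, unlike Python's built-in hash()
    # for strings, so every worker routes a given key to the same index.
    return zlib.crc32(key.encode("utf-8")) % num_structures


def store(pairs, num_structures):
    """Place each intermediate (key, value) pair in its partition."""
    structures = [[] for _ in range(num_structures)]
    for key, value in pairs:
        structures[partition(key, num_structures)].append((key, value))
    return structures
```

Because the index depends only on the key, all values for a given key land in the same structure, and a single reduce worker can read them together.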

6. The method of claim 1, further comprising: identifying an application-specific combiner operation, distinct from the application-specific map operation and the application-specific reduce operation, for combining initial data values produced by the application-specific map operation so as to produce the intermediate data values.

7. The method of claim 6, wherein executing the plurality of map worker processes includes executing the map operation to read designated portions of input files and produce the initial data values and executing the combiner operation to combine the initial data to produce the intermediate data values and store the intermediate data values in intermediate data structures.
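The combiner of claims 6 and 7 runs inside the map worker and shrinks the initial values before they are stored. A minimal word-count sketch follows; the function names and the choice of summation as the combining step are assumptions, since the claims leave the combiner application-specific.

```python
from collections import Counter


def map_words(chunk):
    """Hypothetical map operation: emit one initial (word, 1) pair per word."""
    return [(word, 1) for word in chunk.split()]


def combine(initial_pairs):
    """Hypothetical combiner: merge pairs with the same key locally."""
    totals = Counter()
    for key, value in initial_pairs:
        totals[key] += value
    return sorted(totals.items())


def map_worker(chunk):
    """Run map, then the combiner, and return the intermediate values."""
    return combine(map_words(chunk))
```

For the chunk `"to be or not to be"`, the map operation emits six initial pairs and the combiner reduces them to four intermediate pairs, so less data is written to the intermediate data structures.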

8. The method of claim 1, wherein the intermediate data values comprise key-value pairs, and identifying the reduce operation is in addition to the identifying the map operation.

9. The method of claim 8, wherein the reduce operation combines key-value pairs having a same key, and wherein combining key-value pairs having a same key comprises, for each distinct key, forming a respective aggregated key-value pair whose key is the respective key and whose value is a sum of the values of the key-value pairs whose keys match the respective key.
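The summing reduce of claim 9 can be written as a group-by-key followed by a sum. The sketch below sorts the pairs first so that `itertools.groupby` sees equal keys as adjacent; the function name `reduce_by_key` and the sort-based grouping are assumptions, not details taken from the claim.

```python
from itertools import groupby
from operator import itemgetter


def reduce_by_key(pairs):
    """For each distinct key, form one (key, sum of its values) pair."""
    # groupby only merges adjacent items, so sort by key first.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]
```

For example, the pairs `("b", 2)`, `("a", 1)`, `("b", 3)` aggregate to `("a", 1)` and `("b", 5)`.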

10. A system for large-scale processing of data in a distributed and parallel processing environment, comprising: a set of interconnected computing systems, each having one or more processors and memory, the set of interconnected computing systems including: an application-specific map operation; an application-specific reduce operation to combine the intermediate data values in accordance with the reduce operation; a plurality of map worker processes, wherein each map worker process executes the application-specific map operation to read designated portions of input files, produce the intermediate values in accordance with the application-specific map operation, and store the intermediate data values in intermediate data structures; a plurality of reduce worker processes, wherein each reduce worker process executes the application-specific reduce operation to read a respective subset of the intermediate data values from the intermediate data structures and to produce final output data by combining the respective subset of the intermediate data values in accordance with the application-specific reduce operation; and a tracking operation that tracks a status of a plurality of tasks executed by the map worker processes and the reduce worker processes.

11. The system of claim 10, further comprising one or more tables, wherein the one or more tables are configured for the tracking of the status of the plurality of tasks.

12. The system of claim 10, further comprising: an operation that determines the plurality of tasks associated with the map operation and the reduce operation; and an assigning operation that assigns the plurality of tasks to the plurality of map worker processes and the plurality of reduce worker processes; wherein determining the plurality of tasks comprises determining, for the input files, a plurality of map tasks specifying data from the input files to be processed into the intermediate data values and a plurality of reduce tasks, each specifying a respective subset of the intermediate data values to be processed into the final output data; and wherein assigning the plurality of tasks comprises assigning the map tasks to underutilized ones of the map worker processes, and assigning the reduce tasks to underutilized ones of the reduce worker processes.

13. The system of claim 12, wherein: the set of interconnected computer systems are grouped into a plurality of datacenters; and when assigning the map tasks to underutilized ones of the map worker processes, the supervisory process preferentially assigns map tasks for data stored on computer systems in a respective datacenter to map worker processes that are running on computer systems in the respective datacenter.

14. The system of claim 10, further comprising a partition operation that operates on intermediate data values, wherein the partition operation specifies a respective intermediate data structure of the set of intermediate data structures in which to store each intermediate data value.

15. The system of claim 14, wherein the intermediate data values comprise key-value pairs, and the reduce operation is in addition to the map operation.

16. The system of claim 15, wherein the reduce operation combines key-value pairs having a same key, wherein combining key-value pairs having a same key comprises, for each distinct key, forming a respective aggregated key-value pair whose key is the respective key and whose value is a sum of the values of the key-value pairs whose keys match the respective key.

17. A non-transitory computer readable storage medium storing one or more programs configured for execution by a plurality of processors of a set of interconnected computing systems, the one or more programs comprising instructions for: identifying an application-specific map operation to retrieve data and produce intermediate data values in accordance with the operation; identifying an application-specific reduce operation to combine the intermediate data values in accordance with the reduce operation; executing a plurality of map worker processes, wherein each map worker process executes the map operation to read designated portions of input files, produce the intermediate data values in accordance with the map operation, and store the intermediate data values in intermediate data structures; executing a plurality of reduce worker processes, wherein each reduce worker process executes the reduce operation to read a respective subset of the intermediate data values from the intermediate data structures and to produce final output data by combining the respective subset of the intermediate data values in accordance with the reduce operation; and tracking a status of a plurality of tasks executed by the map worker processes and the reduce worker processes.

18. The non-transitory computer readable storage medium of claim 17, wherein tracking the status of the plurality of tasks comprises storing the status in one or more tables.

19. The non-transitory computer readable storage medium of claim 17, further comprising: determining the plurality of tasks associated with the map operation and the reduce operation; and assigning the plurality of tasks to the plurality of map worker processes and the plurality of reduce worker processes.

20. The non-transitory computer readable storage medium of claim 19, wherein: determining the plurality of tasks comprises determining, for the input files, a plurality of map tasks specifying data from the input files to be processed into the intermediate data values and a plurality of reduce tasks, each specifying a respective subset of the intermediate data values to be processed into the final output data; assigning the map tasks comprises assigning the map tasks to underutilized ones of the map worker processes; and assigning the reduce tasks comprises assigning the reduce tasks to underutilized ones of the reduce worker processes.

Patent Metadata

Filing Date

Unknown

Publication Date

June 21, 2022

Inventors

Jeffrey Dean

Sanjay Ghemawat
