Cache Management for Map-Reduce Applications

Published: September 18, 2018

Assignee: not available in the source USPTO data

Inventors: Liang Liu, Junmei Qu, ChaoQiang Zhu, Wei Zhuang

Patent Claims

21 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for optimizing a cache on a computing node for a MapReduce application on a distributed file system, the method comprising: training a first machine learning model to determine an optimal cache slice size on the computing node for processing a map request in a shortest processing time based on first parameters in historical records for previously executed map tasks on the computing node, the first parameters including a first total data size to be processed, a first size of each data record, and a number of map tasks that will execute simultaneously on the computing node; receiving, by a computer, the map request for the MapReduce application on the distributed file system that includes one or more storage medium connected to the computing node; receiving, by the computer, first parameters for processing the map request; determining, by the trained first machine learning model, the optimal cache slice size for the computing node for processing the map request corresponding to the shortest processing time of the map request, wherein the optimal cache slice size is determined based on the received first parameters for processing the map request; reading, by the computing node, based on the determined optimal cache slice size, data from the one or more storage medium of the distributed file system into the cache of the computing node; processing, by the computing node, the map request; and writing, by the computing node, a final result data of the map request processing to the one or more storage medium.
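The flow in claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: `toy_predict_time`, its constants, the candidate slice sizes, and the in-memory `io.BytesIO` stand-in for distributed storage are all assumptions made for the example. Picking the slice size by evaluating a predicted processing time over a few candidates is likewise only one possible way to apply a trained model.

```python
import io
from typing import BinaryIO, Callable, Iterator, Sequence

# Hypothetical signature: (total_size, record_size, n_tasks, slice_size) -> predicted seconds
TimeModel = Callable[[int, int, int, int], float]


def choose_slice_size(predict_time: TimeModel, candidates: Sequence[int],
                      total_size: int, record_size: int, n_tasks: int) -> int:
    """Return the candidate slice size with the shortest predicted processing time."""
    return min(candidates,
               key=lambda s: predict_time(total_size, record_size, n_tasks, s))


def read_in_slices(stream: BinaryIO, slice_size: int) -> Iterator[bytes]:
    """Read data from storage into the cache one slice at a time."""
    while True:
        chunk = stream.read(slice_size)
        if not chunk:
            break
        yield chunk


def toy_predict_time(total_size: int, record_size: int,
                     n_tasks: int, slice_size: int) -> float:
    """Stand-in for a trained model (assumed shape, not from the patent):
    each read has a fixed cost, and larger slices cost more when many
    tasks share the node's cache."""
    n_reads = -(-total_size // slice_size)  # ceiling division
    return n_reads * 5.0 + slice_size * n_tasks * 0.005


data = b"x" * 10_000  # stands in for the data held on the storage medium
best = choose_slice_size(toy_predict_time, [256, 1024, 4096],
                         total_size=len(data), record_size=100, n_tasks=4)
slices = list(read_in_slices(io.BytesIO(data), best))
```

With these toy constants the middle candidate wins, because very small slices pay for many reads while very large slices pay the per-task cache penalty.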

2. The method according to claim 1 , further comprising: responsive to the received map request processing requiring an iterative computation, performing one or more times: writing, by the computing node, an intermediate result data of the map request processing into the cache of the computing node, based on the determined optimal cache slice size; reading, by the computing node, the intermediate result data from the cache of the computing node, based on the determined optimal cache slice size; processing, by the computing node, the map request; and responsive to the map request requiring another iteration, writing, by the computing node, a second intermediate result data of the map request processing into the cache of the computing node, based on the determined optimal cache slice size, or responsive to the map request completing the iterative computation, writing, by the computing node, the final result data of the map request processing to the one or more storage medium.
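The iterative loop of claim 2 can be sketched like this. The names `run_iterative_map`, `step`, and `done` are hypothetical, the cache is modelled as a plain list of slices, and the slice size is counted in records rather than bytes to keep the example short.

```python
from typing import Callable, List

Records = List[float]


def split_into_slices(records: Records, slice_len: int) -> List[Records]:
    """Lay records out in the cache as slices of at most slice_len records."""
    return [records[i:i + slice_len] for i in range(0, len(records), slice_len)]


def run_iterative_map(records: Records,
                      step: Callable[[Records], Records],
                      done: Callable[[Records], bool],
                      slice_len: int,
                      storage: Records) -> Records:
    """Keep intermediate results in the cache until the computation completes,
    then write the final result to storage."""
    result = step(records)
    while not done(result):
        cache = split_into_slices(result, slice_len)        # write intermediate result to cache
        cached = [r for piece in cache for r in piece]      # read it back from the cache
        result = step(cached)                               # process the map request again
    storage.extend(result)                                  # final result goes to storage
    return result


storage: Records = []
final = run_iterative_map(
    [8.0, 4.0, 2.0],
    step=lambda rs: [r / 2 for r in rs],   # toy computation: halve every value
    done=lambda rs: max(rs) < 1.0,         # toy stopping rule
    slice_len=2,
    storage=storage,
)
```

Only the last result touches `storage`; every earlier pass goes through the cache, which is the point of the claim.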

3. The method according to claim 1 , wherein the first machine learning model for the optimal map request cache slice size is a first multiple linear regression model of a relationship between a map request processing time and the optimal cache slice size in which the first multiple linear regression model is established based on historical records of previously executed map tasks executed by at least one computing node, wherein the historical records of the previously executed map tasks include: a processing time of the map request, the first total data size to be processed, the first size of each data record, and the number of map tasks executing simultaneously on the computing node.
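One way to read claim 3 is as an ordinary least squares fit of processing time against the recorded parameters and the slice size. The sketch below takes that reading as an assumption; the claim does not state the exact form of the regression. The history is synthetic and noise-free, generated from the made-up coefficients in `TRUE_COEFFS`, and the units (GB, KB, MB) are illustrative only.

```python
from typing import List, Sequence


def fit_ols(rows: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Fit y = b0 + b1*x1 + ... + bk*xk by solving the normal equations
    (X^T X) b = X^T y with Gauss-Jordan elimination."""
    X = [[1.0, *row] for row in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))  # partial pivoting
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Made-up coefficients: intercept, total size, record size, task count, slice size
TRUE_COEFFS = [2.0, 3.0, 0.1, 1.5, -0.2]


def synthetic_time(features: Sequence[float]) -> float:
    """Generate a noise-free processing time for the synthetic history."""
    return TRUE_COEFFS[0] + sum(c * f for c, f in zip(TRUE_COEFFS[1:], features))


# Each record: (total_size_gb, record_size_kb, n_map_tasks, slice_size_mb)
history = [
    (1.0, 4.0, 2.0, 8.0),
    (2.0, 4.0, 2.0, 8.0),
    (1.0, 8.0, 2.0, 8.0),
    (1.0, 4.0, 4.0, 8.0),
    (1.0, 4.0, 2.0, 16.0),
    (2.0, 8.0, 4.0, 16.0),
]
times = [synthetic_time(r) for r in history]
coeffs = fit_ols(history, times)
```

Because the synthetic history is exactly linear, the fit recovers the generating coefficients up to floating-point error; real task logs would be noisy and would need many more records.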

4. The method according to claim 3, further comprising: determining, by the computer, a processing time of the map request; and correcting, by the computer, a coefficient of the first multiple linear regression model based on the determined map request processing time and the received first parameters for processing the map request.
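Claim 4 does not say how the coefficient is corrected. As one illustrative possibility, the sketch below applies a single least-mean-squares (gradient) step after a task finishes. The update rule, the learning rate, and the example numbers are assumptions, not details taken from the patent.

```python
from typing import List, Sequence


def predict(coeffs: Sequence[float], features: Sequence[float]) -> float:
    """Predicted processing time; coeffs[0] is the intercept."""
    return coeffs[0] + sum(c * f for c, f in zip(coeffs[1:], features))


def correct_coefficients(coeffs: Sequence[float], features: Sequence[float],
                         observed_time: float,
                         learning_rate: float = 1e-3) -> List[float]:
    """Nudge every coefficient in the direction that reduces the error
    between the observed and the predicted processing time."""
    x = [1.0, *features]
    error = observed_time - predict(coeffs, features)
    return [c + learning_rate * error * xi for c, xi in zip(coeffs, x)]


# Hypothetical model and one finished map task:
# features = (total_size_gb, record_size_kb, n_map_tasks, slice_size_mb)
coeffs = [2.0, 3.0, 0.1, 1.5, -0.2]
features = (1.0, 4.0, 2.0, 8.0)
observed = 7.8

before = abs(observed - predict(coeffs, features))
updated = correct_coefficients(coeffs, features, observed)
after = abs(observed - predict(updated, features))
```

The step shrinks the error on the observed task only when the learning rate is small relative to the squared magnitude of the feature vector, so features on very different scales would need normalising first.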

5. The method according to claim 1, further comprising: training a second machine learning model to determine a second optimal cache slice size on the computing node for processing a reduce request in a shortest processing time based on second parameters in historical records for previously executed reduce tasks on the computing node, the second parameters including a second total data size to be processed, a second size of each data record, and a number of reduce tasks that will execute simultaneously on the computing node; receiving, by the computer, the reduce request for the MapReduce application on the distributed file system; receiving, by the computer, second parameters for processing the reduce request; determining, by the trained second machine learning model, the second optimal cache slice size for the computing node for processing the reduce request corresponding to the shortest processing time of the reduce request, wherein the second optimal cache slice size is determined based on the received second parameters; reading, by the computing node, the final result data of the map request processing from the one or more storage medium, based on the determined second optimal cache slice size; processing, by the computing node, the reduce request; and writing, by the computing node, a final result data of the reduce request processing to the one or more storage medium.

6. The method according to claim 5 , wherein the second machine learning model for the optimal reduce request cache slice size is a second multiple linear regression model of a relationship between a reduce request processing time and the second optimal cache slice size, in which the second multiple linear regression model is established based on historical records of previously executed reduce tasks executed by at least one computing node, wherein the historical records of the previously executed reduce tasks include: a processing time of the reduce request, the second total data size to be processed, the second size of each data record, and the number of reduce tasks executing simultaneously on the computing node.

7. The method according to claim 6, further comprising: determining, by the computer, the processing time of the reduce request; and correcting, by the computer, a coefficient of the second multiple linear regression model based on the determined reduce request processing time and the received second parameters for processing the reduce request.

8. A computer program product for optimizing a cache on a computing node for a MapReduce application on a distributed file system, the computer program product comprising one or more computer readable storage medium and program instructions stored on at least one of the one or more computer readable storage medium, the program instructions comprising: program instructions to train a first machine learning model to determine an optimal cache slice size on the computing node for processing a map request in a shortest processing time based on first parameters in historical records for previously executed map tasks on the computing node, the first parameters including a first total data size to be processed, a first size of each data record, and a number of map tasks that will execute simultaneously on the computing node; program instructions to receive, by a computer, the map request for the MapReduce application on the distributed file system that includes one or more storage medium connected to the computing node; program instructions to receive, by the computer, first parameters for processing the map request; program instructions to determine, by the trained first machine learning model, the optimal cache slice size for the computing node for processing the map request corresponding to the shortest processing time of the map request, wherein the cache slice size is determined based on the received first parameters for processing the map request; program instructions to read, by the computing node, based on the determined optimal cache slice size, data from the one or more storage medium of the distributed file system into the cache of the computing node; program instructions to process, by the computing node, the map request; and program instructions to write, by the computing node, a final result data of the map request processing to the one or more storage medium.

9. The computer program product according to claim 8 , further comprising: responsive to the received map request processing requiring an iterative computation, performing one or more times: program instructions to write, by the computing node, an intermediate result data of the map request processing into the cache of the computing node, based on the determined optimal cache slice size; program instructions to read, by the computing node, the intermediate result data from the cache, based on the determined optimal cache slice size; program instructions to process, by the computing node, the map request; and program instructions, responsive to the map request requiring another iteration, to write, by the computing node, a second intermediate result data of the map request processing into the cache, based on the determined optimal cache slice size, and responsive to the map request completing the iterative computation, to write, by the computing node, the final result data of the map request processing to the one or more storage medium.

10. The computer program product according to claim 8 , wherein the machine learning model for the optimal map request cache slice size is a first multiple linear regression model of a relationship between a map request processing time and the optimal cache slice size in which the first multiple linear regression model is established based on historical records of previously executed map tasks executed by at least one computing node, wherein the historical records of the previously executed map tasks include: a processing time of the map request, the first total data size to be processed, the first size of each data record, and the number of map tasks executing simultaneously on the computing node.

11. The computer program product according to claim 10, further comprising: program instructions to determine, by the computer, a processing time of the map request; and program instructions to correct, by the computer, a coefficient of the first multiple linear regression model based on the determined map request processing time and the received parameters for processing the map request.

12. The computer program product according to claim 8 , further comprising: program instructions to train a second machine learning model to determine a second optimal cache slice size on the computing node for processing a reduce request in a shortest processing time based on second parameters in historical records for previously executed reduce tasks on the computing node, the second parameters including a second total data size to be processed, a second size of each data record, and a number of reduce tasks that will execute simultaneously on the computing node; program instructions to receive, by the computer, the reduce request for the MapReduce application on the distributed file system; program instructions to receive, by the computer, second parameters for processing the reduce request; program instructions to determine, by the trained second machine learning model, the second optimal cache slice size for the computing node for processing the reduce request corresponding to the shortest processing time of the reduce request, wherein the second optimal cache slice size is determined based on the received second parameters; program instructions to read, by the computing node, the final result data of the map request processing from the one or more storage medium, based on the determined second optimal cache slice size; program instructions to process, by the computing node, the reduce request; and program instructions to write, by the computing node, a final result data of the reduce request processing to the one or more storage medium.

13. The computer program product according to claim 12 , wherein the second machine learning model for the optimal reduce request cache slice size is a second multiple linear regression model of a relationship between a reduce request processing time and the second optimal cache slice size, in which the second multiple linear regression model is established based on historical records of previously executed reduce tasks executed by at least one computing node, wherein the historical records of the previously executed reduce tasks include: a processing time of the reduce request, the second total data size to be processed, the second size of each data record, and the number of reduce tasks executing simultaneously on the computing node.

14. The computer program product according to claim 13, further comprising: program instructions to determine, by the computer, the processing time of the reduce request; and program instructions to correct, by the computer, a coefficient of the second multiple linear regression model based on the determined reduce request processing time and the received second parameters for processing the reduce request.

15. A computer system for optimizing a cache on a computing node for a MapReduce application on a distributed file system, the computer system comprising one or more processors, one or more computer readable memories, one or more computer readable tangible storage medium, and program instructions stored on at least one of the one or more storage medium for execution by at least one of the one or more processors via at least one of the one or more memories, the program instructions comprising: program instructions to train a first machine learning model to determine an optimal cache slice size on the computing node for processing a map request in a shortest processing time based on first parameters in historical records for previously executed map tasks on the computing node, the parameters including a first total data size to be processed, a first size of each data record, and a number of map tasks that will execute simultaneously on the computing node; program instructions to receive, by a computer, the map request for the MapReduce application on the distributed file system that includes one or more storage medium connected to the computing node; program instructions to receive, by the computer, first parameters for processing the map request; program instructions to determine, by the trained first machine learning model, the optimal cache slice size for the computing node for processing the map request corresponding to the shortest processing time of the map request, wherein the cache slice size is determined based on the received first parameters for processing the map request; program instructions to read, by the computing node, based on the determined optimal cache slice size, data from the one or more storage medium of the distributed file system into the cache of the computing node; program instructions to process, by the computing node, the map request; and program instructions to write, by the computing node, a final result data of the map request processing to the one or more storage medium.

16. The computer system according to claim 15 , further comprising: responsive to the received map request processing requiring an iterative computation, performing one or more times: program instructions to write, by the computing node, an intermediate result data of the map request processing into the cache of the computing node, based on the determined optimal cache slice size; program instructions to read, by the computing node, the intermediate result data from the cache, based on the determined optimal cache slice size; program instructions to process, by the computing node, the map request; and program instructions, responsive to the map request requiring another iteration, to write, by the computing node, a second intermediate result data of the map request processing into the cache, based on the determined optimal cache slice size, and responsive to the map request completing the iterative computation, to write, by the computing node, the final result data of the map request processing to the one or more storage medium.

17. The computer system according to claim 15 , wherein the machine learning model for the optimal map request cache slice size is a first multiple linear regression model of a relationship between a map request processing time and the optimal cache slice size in which the first multiple linear regression model is established based on historical records of previously executed map tasks executed by at least one computing node, wherein the historical records of the previously executed map tasks include: a processing time of the map request, the first total data size to be processed, the first size of each data record, and the number of map tasks executing simultaneously on the computing node.

18. The computer system according to claim 17, further comprising: program instructions to determine, by the computer, a processing time of the map request; and program instructions to correct, by the computer, a coefficient of the first multiple linear regression model based on the determined map request processing time and the received parameters for processing the map request.

19. The computer system according to claim 15, further comprising: program instructions to train a second machine learning model to determine a second optimal cache slice size on the computing node for processing a reduce request in a shortest processing time based on second parameters in historical records for previously executed reduce tasks on the computing node, the second parameters including a second total data size to be processed, a second size of each data record, and a number of reduce tasks that will execute simultaneously on the computing node; program instructions to receive, by the computer, the reduce request for the MapReduce application on the distributed file system; program instructions to receive, by the computer, second parameters for processing the reduce request; program instructions to determine, by the trained second machine learning model, the second optimal cache slice size for the computing node for processing the reduce request corresponding to the shortest processing time of the reduce request, wherein the second optimal cache slice size is determined based on the received second parameters; program instructions to read, by the computing node, the final result data of the map request processing from the one or more storage medium, based on the determined second optimal cache slice size; program instructions to process, by the computing node, the reduce request; and program instructions to write, by the computing node, a final result data of the reduce request processing to the one or more storage medium.

20. The computer system according to claim 19 , wherein the second machine learning model for the optimal reduce request cache slice size is a second multiple linear regression model of a relationship between a reduce request processing time and the second optimal cache slice size, in which the second multiple linear regression model is established based on historical records of previously executed reduce tasks executed by at least one computing node, wherein the historical records of the previously executed reduce tasks include: a processing time of the reduce request, the second total data size to be processed, the second size of each data record, and the number of reduce tasks executing simultaneously on the computing node.

21. The computer system according to claim 20, further comprising: program instructions to determine, by the computer, the processing time of the reduce request; and program instructions to correct, by the computer, a coefficient of the second multiple linear regression model based on the determined reduce request processing time and the received second parameters for processing the reduce request.

Patent Metadata

Filing Date: Unknown

Publication Date: September 18, 2018

Inventors: Liang Liu, Junmei Qu, ChaoQiang Zhu, Wei Zhuang
