US-11243807

Systems, methods, and apparatuses for implementing a scheduler and workload manager with workload re-execution functionality for bad execution runs

Published: February 8, 2022

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

In accordance with disclosed embodiments, there are provided systems, methods, and apparatuses for implementing a stateless, deterministic scheduler and work discovery system with interruption recovery. For instance, according to one embodiment, there is disclosed a system to implement a stateless scheduler service, in which the system includes: a processor and a memory to execute instructions at the system; a compute resource discovery engine to identify one or more computing resources available to execute workload tasks; a workload discovery engine to identify a plurality of workload tasks to be scheduled for execution; a cache to store information on behalf of the compute resource discovery engine and the workload discovery engine; a scheduler to request information from the cache specifying the one or more computing resources available to execute workload tasks and the plurality of workload tasks to be scheduled for execution; and further in which the scheduler is to schedule at least a portion of the plurality of workload tasks for execution via the one or more computing resources based on the information requested. Other related embodiments are disclosed.

Patent Claims

21 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method performed by a system having at least a processor and a memory therein, wherein the method comprises: allocating a cache within the memory of the system; identifying, via a workload discovery engine, pending workload tasks to be scheduled for execution from one or more workload queues and updating the cache; identifying, via a compute resource discovery engine, a plurality of computing resources available to execute the workload tasks and updating the cache; identifying, via an external services monitor, a plurality of external services accessible to the workload tasks and updating the cache; executing a scheduler via the processor of the system, wherein the scheduler performs at least the following operations: scheduling the workload tasks for execution on the plurality of computing resources; identifying a failure condition for one of the plurality of external services accessible to the workload tasks; identifying any of the workload tasks potentially affected by the failure condition of the external service based on the workload tasks specifying the external service as a dependency and based further on execution of the workload tasks overlapping in time with a time frame associated with the failure condition; and scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution on the plurality of computing resources.
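The selection step in claim 1 (a task is potentially affected when it names the failed service as a dependency and its execution overlaps the failure time frame) can be illustrated with a minimal sketch. The `Outage` and `Task` types, their field names, and the use of numeric timestamps are assumptions made for illustration; the claim does not prescribe any data layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outage:
    """Hypothetical record of a failure condition for one external service."""
    service: str
    start: float
    end: float


@dataclass
class Task:
    """Hypothetical workload task record."""
    task_id: str
    dependencies: frozenset
    started_at: float
    finished_at: Optional[float] = None  # None while the task is still running


def overlaps(task, outage, now):
    """True when the task's execution interval intersects the outage window."""
    task_end = task.finished_at if task.finished_at is not None else now
    return task.started_at <= outage.end and task_end >= outage.start


def affected_tasks(tasks, outage, now):
    """Tasks that depend on the failed service and ran during the outage."""
    return [
        t for t in tasks
        if outage.service in t.dependencies and overlaps(t, outage, now)
    ]
```

Both conditions must hold: a task that ran during the outage but does not list the service as a dependency is left alone, as is a dependent task that ran entirely outside the window.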

2. The method of claim 1, wherein the external services monitor listens to and monitors the health and operational status of the plurality of external services accessible to the workload tasks and updates the information in the cache specifying the timeframe of any service degradation, failure mode, or service outage associated with any of the plurality of external services monitored.
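One way a monitor might record outage time frames in the cache, as claim 2 describes, is sketched below. The dictionary-based cache, the `outages` key, and the `record_status` function are hypothetical names chosen for this example, not structures taken from the patent.

```python
def record_status(cache, service, healthy, now):
    """Open or close an outage window for `service` in the shared cache.

    The cache layout is an assumption: cache["outages"][service] is a list
    of {"start": ..., "end": ...} windows, where end is None while the
    outage is ongoing.
    """
    windows = cache.setdefault("outages", {}).setdefault(service, [])
    has_open_window = bool(windows) and windows[-1]["end"] is None
    if not healthy and not has_open_window:
        # First unhealthy observation: start a new outage window.
        windows.append({"start": now, "end": None})
    elif healthy and has_open_window:
        # Service recovered: close the current window.
        windows[-1]["end"] = now
    return windows
```

Repeated unhealthy observations extend the open window rather than creating a new one, so each outage appears as a single time frame.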

3. The method of claim 1, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: marking results of the workload tasks as unsatisfactory in the cache; and wherein the scheduler on a subsequent scheduling heartbeat iteration schedules the workload tasks having results marked as unsatisfactory for a repeated execution.
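The mark-then-reschedule pattern of claim 3 can be sketched as follows. This assumes a plain dictionary cache with `results` and `pending` entries and a `rerun_scheduled` status value; all of these names are illustrative and not drawn from the patent.

```python
def mark_unsatisfactory(cache, task_ids):
    """Flag the results of the given tasks as unsatisfactory in the cache."""
    for task_id in task_ids:
        cache["results"].setdefault(task_id, {})["status"] = "unsatisfactory"


def heartbeat(cache):
    """One scheduling iteration: requeue every task flagged unsatisfactory.

    The scheduler keeps no state of its own; everything it needs is read
    from, and written back to, the cache.
    """
    requeued = []
    for task_id, result in cache["results"].items():
        if result.get("status") == "unsatisfactory":
            result["status"] = "rerun_scheduled"  # avoid requeueing twice
            cache["pending"].append(task_id)
            requeued.append(task_id)
    return requeued
```

Because the flag lives in the cache rather than in the scheduler, a scheduler instance that starts after the marking step would still pick up the work on its first heartbeat.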

4. The method of claim 1, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all currently executing workload tasks potentially affected by the failure condition of the external service; terminating execution of the currently executing workload tasks potentially affected by the failure condition of the external service; marking results of the workload tasks as unsatisfactory in the cache; and wherein the scheduler schedules the workload tasks having been terminated for a repeated execution.

5. The method of claim 1, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all previously completed workload tasks potentially affected by the failure condition of the external service; marking results of the previously completed workload tasks as unsatisfactory in the cache; and wherein the scheduler schedules the previously completed workload tasks for a repeated execution.

6. The method of claim 5, further comprising: saving results from the previously completed workload tasks and marked as unsatisfactory concurrently with new results generated by the repeated execution of the previously completed workload tasks; and returning both the results marked as unsatisfactory and the new results to a submitter of the workload task.
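Claim 6 keeps the suspect results alongside the results of the repeated execution and returns both to the submitter. A minimal sketch of such a record follows; the `Attempt` and `TaskResults` classes and the shape of the report are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    """One execution's output plus a flag set when it is deemed suspect."""
    output: str
    unsatisfactory: bool = False


@dataclass
class TaskResults:
    """Hypothetical per-task result history that never discards an attempt."""
    task_id: str
    attempts: list = field(default_factory=list)

    def record(self, output):
        self.attempts.append(Attempt(output))

    def mark_unsatisfactory(self):
        # Flag everything recorded so far; later reruns start unflagged.
        for attempt in self.attempts:
            attempt.unsatisfactory = True

    def report(self):
        """Return both the flagged results and the new results."""
        return {
            "task_id": self.task_id,
            "unsatisfactory": [a.output for a in self.attempts if a.unsatisfactory],
            "new": [a.output for a in self.attempts if not a.unsatisfactory],
        }
```

Returning both sets lets the submitter compare the original output with the rerun instead of silently losing the first result.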

7. The method of claim 1, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service comprises: identifying the failure condition of the external service and the time frame associated with the failure condition based on the information in the cache as updated by the external services monitor.

8. The method of claim 1, wherein scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution comprises: scheduling the repeated execution on a compute cloud which is different than a compute cloud having executed the workload tasks potentially affected by the failure condition of the external service.

9. The method of claim 1, wherein scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution comprises: scheduling the repeated execution on a different one of the plurality of computing resources having a different compute footprint than a compute resource having executed the workload tasks potentially affected by the failure condition of the external service; wherein the computing resources having the different compute footprint is selected from the group comprising: a compute footprint optimized for CPU bandwidth; a compute footprint optimized for GPU bandwidth; a compute footprint optimized for Input/Output (I/O) throughput; a compute footprint optimized for memory; a compute footprint utilizing AMD CPU architecture; a compute footprint utilizing Intel CPU architecture; compute footprints utilizing different sized Virtual Machines (VMs); compute footprints utilizing different operating systems; and compute footprints utilizing different CPU core quantities.
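Choosing a rerun target with a different compute footprint, as in claim 9, might look like the sketch below. The resource dictionaries, the `footprint` and `free_slots` keys, and the footprint labels are hypothetical; the claim lists footprint categories but does not define how they are represented.

```python
def pick_rerun_resource(resources, original):
    """Return the first resource with spare capacity whose footprint differs
    from the one that ran the task originally, or None if there is none."""
    for resource in resources:
        if resource["footprint"] != original["footprint"] and resource["free_slots"] > 0:
            return resource
    return None
```

A first-match policy is used here only for brevity; a real scheduler could rank candidates by cost, region, or load.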

10. The method of claim 1, wherein scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution comprises: scheduling the repeated execution with a datacenter in different geographical regions and having a different compute footprint than the compute resource having initially executed the workload tasks potentially affected by the failure condition of the external service.

11. The method of claim 1, wherein executing the scheduler via the processor of the system comprises the scheduler to perform at least the following additional operations: producing a list of the workload tasks to be executed based on the information retrieved from the cache; computing available compute capacity to execute workload tasks at each of the plurality of computing resources based on the information retrieved from the cache; selecting the workload tasks for execution via the plurality of computing resources based on the information retrieved from the cache; and planning execution of the workload tasks by scheduling the workload tasks for execution at the plurality of computing resources based on the computed available capacity to execute workload tasks at each of the plurality of computing resources.
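The produce, compute, select, and plan operations of claim 11 can be sketched as a single pass over the cache. The cache keys, the `priority` and `slots` task fields, and the greedy "most free capacity first" placement are assumptions introduced for this example; the claim does not specify an ordering or a placement policy.

```python
def plan(cache):
    """Build a list of (task_id, resource_name) placements from cache data."""
    # Produce: list the pending tasks, highest priority first (assumed field).
    tasks = sorted(cache["pending_tasks"], key=lambda t: -t["priority"])

    # Compute: available capacity at each resource.
    capacity = {
        name: r["total_slots"] - r["used_slots"]
        for name, r in cache["resources"].items()
    }

    # Select and plan: place each task on the resource with the most room.
    schedule = []
    for task in tasks:
        target = max(capacity, key=capacity.get, default=None)
        if target is None or capacity[target] < task["slots"]:
            continue  # no resource can host this task in this iteration
        capacity[target] -= task["slots"]
        schedule.append((task["id"], target))
    return schedule
```

The function only reads from the cache and returns a plan, so calling it twice on the same cache contents yields the same schedule.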

12. The method of claim 11, wherein the scheduler to further perform at least the following additional operations: initiating execution of the workload tasks at the plurality of computing resources pursuant to the planned execution; and removing from the list of the workload tasks to be executed as represented at the cache any of the workload tasks for which execution is initiated.
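Claim 12's follow-up step, initiating execution and removing started tasks from the cached list, is sketched below. The `launch` callback, the `pending_tasks` cache key, and the schedule format of `(task_id, resource_name)` pairs are illustrative assumptions.

```python
def execute_plan(cache, schedule, launch):
    """Start each planned task, then drop it from the cached pending list.

    `launch` is a caller-supplied function (hypothetical) that actually
    starts a task on a resource.
    """
    started = []
    for task_id, resource_name in schedule:
        launch(task_id, resource_name)
        # Remove the task only after execution has been initiated.
        cache["pending_tasks"] = [
            t for t in cache["pending_tasks"] if t["id"] != task_id
        ]
        started.append(task_id)
    return started
```

Tasks that were not placed in the schedule stay in the pending list and remain visible to the next scheduling iteration.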

13. The method of claim 1, further comprising: operating, from the system, an external cloud interface to communicatively link the system with one or more third-party private and/or public computing clouds via a public Internet.

14. The method of claim 1 , wherein identifying the plurality of computing resources available to execute workload tasks and updating the cache specifying the identified computing resources, comprises: the compute resource discovery engine to autonomously discover any one of: one or more third-party compute clouds accessible to the scheduler; one or more private on-demand compute clouds accessible to the scheduler; one or more public on-demand compute clouds accessible to the scheduler; one or more computing pods within a local host organization within which a scheduling service of the system operates when the one or more computing pods are accessible to the scheduler; one or more remote computing pods within a remote host organization separate from the local host organization within which the scheduling service operates when the one or more remote computing pods are accessible to the scheduling service through the remote host organization; an OpenStack computing cloud accessible to the scheduler; a VMWare computing cloud accessible to the scheduler; an Amazon Web Services (AWS) public computing cloud accessible to the scheduler; a Microsoft Azure public computing cloud accessible to the scheduler; an AWS Direct Connect privately leased computing space accessible to the scheduler; and an Azure ExpressRoute privately leased computing space accessible to the scheduler.

15. The method of claim 1 , wherein the system comprises a multi-tenant database system having customer data stored therein for a plurality of distinct customer organizations; wherein each customer organization is an entity selected from the group consisting of: a separate and distinct remote organization, an organizational group within the host organization, a business partner of the host organization, or a customer organization that subscribes to cloud computing services provided by the host organization; wherein the system operates at a host organization as a cloud-based service provider to the plurality of distinct customer organizations; and wherein the cloud-based service provider receives inputs from the plurality of distinct customer organizations to schedule workload tasks for execution the plurality of computing resources.

16. Non-transitory computer readable storage media having instructions stored thereupon that, when executed by a system having at least a processor and a memory therein, the instructions cause the system to perform operations including: allocating a cache within the memory of the system; identifying, via a workload discovery engine, pending workload tasks to be scheduled for execution from one or more workload queues and updating the cache; identifying, via a compute resource discovery engine, a plurality of computing resources available to execute the workload tasks and updating the cache; identifying, via an external services monitor, a plurality of external services accessible to the workload tasks and updating the cache; executing a scheduler via the processor of the system, wherein the scheduler performs at least the following operations: scheduling the workload tasks for execution on the plurality of computing resources; identifying a failure condition for one of the plurality of external services accessible to the workload tasks; identifying any of the workload tasks potentially affected by the failure condition of the external service based on the workload tasks specifying the external service as a dependency and based further on execution of the workload tasks overlapping in time with a time frame associated with the failure condition; and scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution on the plurality of computing resources.

17. The non-transitory computer readable storage media of claim 16, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all currently executing workload tasks potentially affected by the failure condition of the external service; terminating execution of the currently executing workload tasks potentially affected by the failure condition of the external service; marking results of the workload tasks as unsatisfactory in the cache; and wherein the scheduler schedules the workload tasks having been terminated for a repeated execution.

18. The non-transitory computer readable storage media of claim 16, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all previously completed workload tasks potentially affected by the failure condition of the external service; marking results of the previously completed workload tasks as unsatisfactory in the cache; and wherein the scheduler schedules the previously completed workload tasks for a repeated execution.

19. A system to implement a scheduling service, wherein the system comprises: a processor and a memory to execute instructions at the system; a cache allocated within the memory of the system to store information on behalf of a compute resource discovery engine and a workload discovery engine and an external services monitor and a scheduler; and system logic to cause the system to perform various operations including: identifying, via the compute resource discovery engine, a plurality of computing resources available to execute the workload tasks and updating the cache specifying the identified plurality of computing resources; identifying, via the workload discovery engine, pending workload tasks to be scheduled for execution from one or more workload queues and updating the cache with the identified plurality of the pending workload tasks; identifying, via the external services monitor, a plurality of external services accessible to the workload tasks and updating the cache with the identified plurality of external services; scheduling, via the scheduler, the workload tasks for execution on the plurality of computing resources; identifying, via the scheduler, a failure condition for one of the plurality of external services accessible to the workload tasks; identifying, via the scheduler, any of the workload tasks potentially affected by the failure condition of the external service based on the workload tasks specifying the external service as a dependency and based further on execution of the workload tasks overlapping in time with a time frame associated with the failure condition; and further scheduling, via the scheduler, the workload tasks potentially affected by the failure condition of the external service for a repeated execution on the plurality of computing resources.

20. The system of claim 19, further comprising: identifying, via the scheduler, all currently executing workload tasks potentially affected by the failure condition of the external service; terminating, via the scheduler, execution of the currently executing workload tasks potentially affected by the failure condition of the external service; marking, via the scheduler, results of the workload tasks as unsatisfactory in the cache; and scheduling, via the workload scheduler, the workload tasks having been terminated for a repeated execution.

21. The system of claim 19, wherein identifying, via the scheduler, any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all previously completed workload tasks potentially affected by the failure condition of the external service; marking results of the previously completed workload tasks as unsatisfactory in the cache; and scheduling, via the scheduler, the previously completed workload tasks for a repeated execution.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F

Patent Metadata

Filing Date

August 1, 2019

Publication Date

February 8, 2022
