In accordance with disclosed embodiments, there are provided systems, methods, and apparatuses for implementing a stateless, deterministic scheduler and work discovery system with interruption recovery. For instance, according to one embodiment, there is disclosed a system to implement a stateless scheduler service, in which the system includes: a processor and a memory to execute instructions at the system; a compute resource discovery engine to identify one or more computing resources available to execute workload tasks; a workload discovery engine to identify a plurality of workload tasks to be scheduled for execution; a cache to store information on behalf of the compute resource discovery engine and the workload discovery engine; a scheduler to request information from the cache specifying the one or more computing resources available to execute workload tasks and the plurality of workload tasks to be scheduled for execution; and further in which the scheduler is to schedule at least a portion of the plurality of workload tasks for execution via the one or more computing resources based on the information requested. Other related embodiments are disclosed.
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A method performed by a system having at least a processor and a memory therein, wherein the method comprises: allocating a cache within the memory of the system; identifying, via a workload discovery engine, pending workload tasks to be scheduled for execution from one or more workload queues and updating the cache; identifying, via a compute resource discovery engine, a plurality of computing resources available to execute the workload tasks and updating the cache; identifying, via an external services monitor, a plurality of external services accessible to the workload tasks and updating the cache; executing a scheduler via the processor of the system, wherein the scheduler performs at least the following operations: scheduling the workload tasks for execution on the plurality of computing resources; identifying a failure condition for one of the plurality of external services accessible to the workload tasks; identifying any of the workload tasks potentially affected by the failure condition of the external service based on the workload tasks specifying the external service as a dependency and based further on execution of the workload tasks overlapping in time with a time frame associated with the failure condition; and scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution on the plurality of computing resources.
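The identification step in claim 1 combines two tests: the task names the failed external service as a dependency, and the task's execution overlaps in time with the failure's time frame. The following is a minimal sketch of that check, not the patented implementation. The `Outage` and `TaskRun` types, their field names, and the example service names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outage:
    service: str
    start: float
    end: Optional[float] = None  # None means the failure is still ongoing


@dataclass(frozen=True)
class TaskRun:
    task_id: str
    dependencies: frozenset
    start: float
    end: Optional[float] = None  # None means the task is still executing


def overlaps(run: TaskRun, outage: Outage) -> bool:
    """True if the run's execution interval intersects the outage window."""
    run_end = float("inf") if run.end is None else run.end
    outage_end = float("inf") if outage.end is None else outage.end
    return run.start <= outage_end and outage.start <= run_end


def affected_tasks(runs, outage: Outage):
    """Tasks that depend on the failed service AND ran during the failure."""
    return [
        r.task_id
        for r in runs
        if outage.service in r.dependencies and overlaps(r, outage)
    ]


runs = [
    TaskRun("build-1", frozenset({"artifact-store"}), 0, 10),
    TaskRun("test-2", frozenset({"artifact-store"}), 20, 30),
    TaskRun("lint-3", frozenset({"license-server"}), 5, 8),
]
print(affected_tasks(runs, Outage("artifact-store", 8, 12)))  # ['build-1']
```

Only `build-1` is selected: `test-2` has the same dependency but ran after the outage ended, and `lint-3` ran during the outage but does not depend on the failed service.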
2. The method of claim 1, wherein the external services monitor listens to and monitors the health and operational status of the plurality of external services accessible to the workload tasks and updates the information in the cache specifying the timeframe of any service degradation, failure mode, or service outage associated with any of the plurality of external services monitored.
This invention relates to monitoring the health and operational status of external services that workload tasks depend on. The problem addressed is knowing when, and for how long, an external service was degraded, failing, or unavailable, so that the effect on workload tasks can be assessed. The external services monitor listens to the external services accessible to the workload tasks and tracks their health and operational status. Whenever a monitored service experiences a degradation, a failure mode, or an outage, the monitor updates the cache with the time frame of that event. Because the cache records the time frame, the scheduler of claim 1 can compare each failure window against the execution times of the workload tasks that name the service as a dependency, and reschedule the tasks whose execution overlapped the window. The cache therefore acts as the shared record of service status that the scheduler reads from when deciding which tasks to repeat.
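As an illustration of how a monitor might record failure time frames in the cache, the sketch below opens a window when a service first reports unhealthy and closes it when the service recovers. The cache layout (`"outages"` keyed by service name) and the function name `record_health` are assumptions made for this example and do not come from the patent text.

```python
def record_health(cache, service, healthy, now):
    """Update the cached outage windows for one service from one health probe."""
    windows = cache.setdefault("outages", {}).setdefault(service, [])
    # A window with end=None is an outage that has not been closed yet.
    open_window = windows[-1] if windows and windows[-1]["end"] is None else None
    if not healthy and open_window is None:
        windows.append({"start": now, "end": None})  # outage begins
    elif healthy and open_window is not None:
        open_window["end"] = now  # outage ends


cache = {}
record_health(cache, "artifact-store", False, 100.0)
record_health(cache, "artifact-store", False, 110.0)  # still down: no new window
record_health(cache, "artifact-store", True, 130.0)
print(cache["outages"]["artifact-store"])  # [{'start': 100.0, 'end': 130.0}]
```

Repeated unhealthy probes extend the same open window rather than creating new ones, so each entry corresponds to one contiguous failure time frame.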
3. The method of claim 1, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: marking results of the workload tasks as unsatisfactory in the cache; and wherein the scheduler on a subsequent scheduling heartbeat iteration schedules the workload tasks having results marked as unsatisfactory for a repeated execution.
This invention relates to a system for managing workload tasks in a computing environment where external services may experience failures. The problem addressed is ensuring reliable execution of workload tasks when an external service failure could corrupt or invalidate task results, leading to incorrect downstream processing. The system includes a scheduler that periodically checks for failure conditions in external services and identifies workload tasks that may be affected by such failures. When a failure is detected, the system marks the results of potentially affected workload tasks as unsatisfactory in a cache. The scheduler then reschedules these tasks for repeated execution during a subsequent scheduling cycle, ensuring that only valid results are processed. This approach prevents corrupted data from propagating through the system while maintaining task execution efficiency by avoiding unnecessary retries of unaffected tasks. The method involves monitoring external services for failures, analyzing task dependencies to determine which tasks may be impacted, and selectively retrying only those tasks with invalidated results. The cache-based marking system allows for efficient tracking of affected tasks without requiring extensive re-evaluation of all workloads. This solution improves system reliability by ensuring data integrity while minimizing computational overhead.
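The mark-then-reschedule flow of claim 3 can be sketched as two small functions: one marks results in the cache, and a later heartbeat pass derives the retry list from the cache alone. The cache layout and the status strings (`"unsatisfactory"`, `"rescheduled"`) are hypothetical choices for illustration only.

```python
def mark_unsatisfactory(cache, task_ids):
    """Flag the cached results of potentially affected tasks."""
    for task_id in task_ids:
        cache["results"][task_id]["status"] = "unsatisfactory"


def heartbeat(cache):
    """One scheduling pass: requeue every task whose result is flagged.

    The pass keeps no state of its own; everything it needs is in the cache.
    """
    retry = sorted(
        task_id
        for task_id, result in cache["results"].items()
        if result["status"] == "unsatisfactory"
    )
    for task_id in retry:
        cache["results"][task_id]["status"] = "rescheduled"
        cache["pending"].append(task_id)
    return retry


cache = {
    "results": {"build-1": {"status": "ok"}, "test-2": {"status": "ok"}},
    "pending": [],
}
mark_unsatisfactory(cache, ["build-1"])
print(heartbeat(cache))  # ['build-1']
print(heartbeat(cache))  # []
```

Changing the status once a task is requeued keeps a later heartbeat from scheduling the same repeat twice.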
4. The method of claim 1, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all currently executing workload tasks potentially affected by the failure condition of the external service; terminating execution of the currently executing workload tasks potentially affected by the failure condition of the external service; marking results of the workload tasks as unsatisfactory in the cache; and wherein the scheduler schedules the workload tasks having been terminated for a repeated execution.
This invention relates to managing workload tasks in a computing system when an external service experiences a failure condition. The problem addressed is preserving data integrity when an external service disruption occurs while dependent tasks are still running, which can lead to incomplete or corrupted task results. The method first identifies all currently executing workload tasks that could be affected by the external service failure. These tasks are then terminated so that they do not continue processing against a failing dependency. The results of the terminated tasks are marked as unsatisfactory in the cache to indicate they should not be relied upon. Finally, the scheduler schedules the terminated tasks for a repeated execution. The claim does not state when the repeated execution takes place; it requires only that the terminated tasks are rescheduled. The method is particularly useful in distributed computing environments where external service dependencies are common.
5. The method of claim 1, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all previously completed workload tasks potentially affected by the failure condition of the external service; marking results of the previously completed workload tasks as unsatisfactory in the cache; and wherein the scheduler schedules the previously completed workload tasks for a repeated execution.
This invention relates to workload management systems that handle failures in external services by identifying and reprocessing affected tasks. The system monitors external services for failure conditions that may impact workload execution. When a failure is detected, the system identifies all previously completed workload tasks that could be affected by the failure. These tasks are marked as unsatisfactory in the cache, and the scheduler reschedules them for repeated execution to ensure data integrity and system reliability. The approach ensures that tasks dependent on external services are re-evaluated if the service fails, preventing incorrect or incomplete results from being used. The system dynamically adjusts task execution based on real-time service status, improving fault tolerance in distributed computing environments. This method is particularly useful in systems where external service reliability is uncertain, such as cloud-based applications or microservices architectures. The invention enhances robustness by automatically detecting and correcting potential errors caused by external service failures, ensuring consistent and accurate workload processing.
6. The method of claim 5, further comprising: saving results from the previously completed workload tasks and marked as unsatisfactory concurrently with new results generated by the repeated execution of the previously completed workload tasks; and returning both the results marked as unsatisfactory and the new results to a submitter of the workload task.
This invention relates to workload task processing systems, specifically the handling of results from previously completed tasks that are later marked as unsatisfactory. Under claim 5, a completed task's results are marked as unsatisfactory because the task may have been affected by a failure condition of an external service it depends on; the marking is made by the system, not by the submitter. The problem addressed is that discarding those original results would hide information from the submitter, since the results are only potentially affected and may still be valid. The method therefore saves the results marked as unsatisfactory alongside the new results generated by the repeated execution of the task. Both sets of results are then returned to the submitter of the workload task. This gives the submitter visibility into the original run and the repeated run, so the two can be compared. This method is particularly useful in environments where task accuracy is critical, such as data processing, scientific computing, or automated workflows.
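A minimal sketch of keeping both result sets, assuming a hypothetical in-memory store that holds a list of attempts per task. The function names and the `unsatisfactory` flag are illustrative and are not taken from the patent.

```python
def flag_attempts(store, task_id):
    """Mark every existing attempt for the task as unsatisfactory."""
    for attempt in store[task_id]:
        attempt["unsatisfactory"] = True


def record_rerun(store, task_id, new_result):
    """Append the repeated execution's result without overwriting earlier ones."""
    store[task_id].append({"result": new_result, "unsatisfactory": False})


def results_for_submitter(store, task_id):
    """Return all attempts, flagged originals and new results together."""
    return [dict(attempt) for attempt in store[task_id]]


store = {"build-1": [{"result": "artifact-v1", "unsatisfactory": False}]}
flag_attempts(store, "build-1")
record_rerun(store, "build-1", "artifact-v2")
for attempt in results_for_submitter(store, "build-1"):
    print(attempt)
```

Appending rather than overwriting is the design choice that lets the submitter see the flagged original next to the new result.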
7. The method of claim 1, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service comprises: identifying the failure condition of the external service and the time frame associated with the failure condition based on the information in the cache as updated by the external services monitor.
This invention relates to managing workload tasks in a computing system when an external service experiences a failure condition. The problem addressed is the need to efficiently identify which workload tasks may have been affected by an external service failure. The external services monitor tracks the external services and updates the cache with their status. Under this claim, the scheduler learns of the failure condition and of its associated time frame from the information in the cache as updated by the monitor, rather than by probing the external service directly. With the failure and its time frame in hand, the system identifies the potentially affected workload tasks as defined in claim 1: those that specify the failed service as a dependency and whose execution overlapped in time with the failure's time frame. Those tasks are then scheduled for a repeated execution. By relying on cached data, the system can assess the impact of a failure using only information that is already available to the scheduler.
8. The method of claim 1, wherein scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution comprises: scheduling the repeated execution on a compute cloud which is different than a compute cloud having executed the workload tasks potentially affected by the failure condition of the external service.
This invention relates to fault-tolerant workload management in distributed computing environments, specifically addressing the problem of handling failures in external services that disrupt workload execution. The method involves detecting a failure condition in an external service that affects workload tasks, then rescheduling those tasks for repeated execution. A key aspect is executing the repeated tasks on a different compute cloud than the one originally used, ensuring isolation from the failure condition. This approach mitigates the risk of repeated failures by leveraging independent infrastructure. The method may also include analyzing the failure condition to determine affected tasks and prioritizing their rescheduling. The solution is particularly useful in multi-cloud environments where workloads depend on external services that may experience regional or provider-specific outages. By dynamically rerouting affected tasks to alternative compute clouds, the system improves resilience and reduces downtime. The invention assumes a distributed computing architecture with multiple available compute clouds and the capability to monitor external service health. The primary technical challenge addressed is ensuring continuous workload execution despite external service failures, which is critical for mission-critical applications.
9. The method of claim 1, wherein scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution comprises: scheduling the repeated execution on a different one of the plurality of computing resources having a different compute footprint than a compute resource having executed the workload tasks potentially affected by the failure condition of the external service; wherein the computing resources having the different compute footprint is selected from the group comprising: a compute footprint optimized for CPU bandwidth; a compute footprint optimized for GPU bandwidth; a compute footprint optimized for Input/Output (I/O) throughput; a compute footprint optimized for memory; a compute footprint utilizing AMD CPU architecture; a compute footprint utilizing Intel CPU architecture; compute footprints utilizing different sized Virtual Machines (VMs); compute footprints utilizing different operating systems; and compute footprints utilizing different CPU core quantities.
This invention relates to workload management in computing systems, specifically addressing the problem of handling failures in external services that disrupt workload execution. The method involves detecting a failure condition in an external service and identifying workload tasks potentially affected by this failure. To mitigate the impact, the affected tasks are rescheduled for repeated execution on a different computing resource with a distinct compute footprint. The compute footprint refers to the hardware and software configuration of the resource, which can vary in terms of CPU architecture (e.g., AMD or Intel), GPU bandwidth, I/O throughput, memory allocation, VM size, operating system, or CPU core quantity. By leveraging different compute footprints, the method aims to improve resilience and reliability in workload execution, ensuring that tasks are not repeatedly affected by the same external service failure. The approach allows for dynamic adaptation to resource characteristics, optimizing performance and reducing downtime. This technique is particularly useful in distributed computing environments where workloads depend on external services that may experience intermittent failures.
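Selecting a retry target with a different compute footprint might look like the sketch below. Representing a footprint as a plain string label, and the resource names themselves, are assumptions for illustration; the claim lists footprint categories but does not prescribe a data model.

```python
def pick_retry_resource(resources, original):
    """Return the first resource that is not the original and whose
    footprint differs from the original's, or None if there is none."""
    for resource in resources:
        if (
            resource["name"] != original["name"]
            and resource["footprint"] != original["footprint"]
        ):
            return resource
    return None


resources = [
    {"name": "pod-a", "footprint": "cpu-optimized"},
    {"name": "pod-b", "footprint": "cpu-optimized"},
    {"name": "pod-c", "footprint": "memory-optimized"},
]
print(pick_retry_resource(resources, resources[0])["name"])  # pod-c
```

`pod-b` is skipped because it shares the original's footprint, so the repeated execution lands on `pod-c`.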
10. The method of claim 1, wherein scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution comprises: scheduling the repeated execution with a datacenter in different geographical regions and having a different compute footprint than the compute resource having initially executed the workload tasks potentially affected by the failure condition of the external service.
This invention relates to fault-tolerant workload management in distributed computing environments, specifically addressing the problem of service disruptions caused by external service failures. The method ensures continuity by rescheduling affected workload tasks for repeated execution in a manner that mitigates the risk of recurring failures. When a failure condition in an external service is detected, the system identifies workload tasks potentially impacted by this failure. These tasks are then rescheduled for repeated execution, but with a critical enhancement: the repeated execution is performed in a datacenter located in a different geographical region and with a distinct compute footprint compared to the original compute resource. This geographical and architectural diversity reduces the likelihood of the same failure condition affecting the rescheduled tasks, thereby improving system resilience. The approach leverages distributed infrastructure to isolate and recover from external service disruptions, ensuring higher availability and reliability for dependent workloads. The method is particularly useful in cloud computing and multi-region deployments where external service dependencies are common.
11. The method of claim 1, wherein executing the scheduler via the processor of the system comprises the scheduler to perform at least the following additional operations: producing a list of the workload tasks to be executed based on the information retrieved from the cache; computing available compute capacity to execute workload tasks at each of the plurality of computing resources based on the information retrieved from the cache; selecting the workload tasks for execution via the plurality of computing resources based on the information retrieved from the cache; and planning execution of the workload tasks by scheduling the workload tasks for execution at the plurality of computing resources based on the computed available capacity to execute workload tasks at each of the plurality of computing resources.
This invention relates to a method for optimizing workload task scheduling in a distributed computing system. The system includes multiple computing resources and a scheduler that manages task execution across these resources. The problem addressed is inefficient task allocation, which can lead to underutilized resources or bottlenecks in processing. The method involves a scheduler that retrieves information from a cache to identify workload tasks requiring execution. The scheduler then generates a list of these tasks and computes the available compute capacity at each computing resource based on the cached data. Using this information, the scheduler selects tasks for execution and plans their execution by assigning them to the appropriate computing resources. The scheduling is based on the computed available capacity to ensure efficient resource utilization and balanced workload distribution. The scheduler dynamically adjusts task allocation in real-time, optimizing performance by leveraging cached data to minimize delays and improve decision-making. This approach enhances system efficiency by reducing idle time and ensuring tasks are executed where resources are available. The method is particularly useful in environments with fluctuating workloads and diverse computing resources.
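The four operations in claim 11 (produce, compute, select, plan) can be sketched as a single pass over cached data. The slot-based capacity model, the cache keys, and the "most free capacity first" selection rule are all assumptions chosen to keep the example small; the claim itself does not specify a placement policy.

```python
def plan(cache):
    """Return (task, resource) assignments derived only from the cache."""
    # Produce: list the workload tasks waiting to be executed.
    todo = list(cache["pending"])
    # Compute: available capacity at each computing resource.
    capacity = {
        name: r["total_slots"] - r["used_slots"]
        for name, r in cache["resources"].items()
    }
    assignments = []
    # Select and plan: place each task on the resource with the most free slots.
    for task in todo:
        target = max(capacity, key=capacity.get, default=None)
        if target is None or capacity[target] <= 0:
            break  # no capacity left; remaining tasks wait for a later pass
        assignments.append((task, target))
        capacity[target] -= 1
    return assignments


cache = {
    "pending": ["t1", "t2", "t3", "t4", "t5"],
    "resources": {
        "pod-a": {"total_slots": 4, "used_slots": 3},
        "cloud-b": {"total_slots": 8, "used_slots": 5},
    },
}
print(plan(cache))
# [('t1', 'cloud-b'), ('t2', 'cloud-b'), ('t3', 'pod-a'), ('t4', 'cloud-b')]
```

Task `t5` is left unplanned because both resources are full after four placements. The function does not mutate the cache, so the same inputs always yield the same plan.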
12. The method of claim 11, wherein the scheduler to further perform at least the following additional operations: initiating execution of the workload tasks at the plurality of computing resources pursuant to the planned execution; and removing from the list of the workload tasks to be executed as represented at the cache any of the workload tasks for which execution is initiated.
This invention relates to workload scheduling in distributed computing systems, addressing the challenge of keeping an accurate record of which tasks still need to run. Claim 11 has the scheduler produce a list of workload tasks to be executed and plan their execution across the computing resources based on available capacity. This claim adds the two steps that follow planning. First, the scheduler initiates execution of the workload tasks at the computing resources according to the planned execution. Second, each task for which execution is initiated is removed from the list of tasks to be executed, as that list is represented in the cache. Removing initiated tasks from the cached list keeps the list limited to work that has not yet started, so that a later scheduling pass does not start the same task again.
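A minimal sketch of the initiate-then-remove step, assuming a hypothetical `launch` callable that returns `True` when execution was successfully initiated. The cache key `"pending"` is likewise an assumption for this example.

```python
def dispatch(cache, assignments, launch):
    """Initiate each planned task and drop it from the cached pending list."""
    started = []
    for task, resource in assignments:
        if launch(task, resource):
            # Only tasks whose execution was actually initiated are removed.
            cache["pending"].remove(task)
            started.append(task)
    return started


cache = {"pending": ["t1", "t2", "t3"]}
assignments = [("t1", "pod-a"), ("t2", "cloud-b")]
# Stand-in launcher: pretend that launching on "cloud-b" fails.
started = dispatch(cache, assignments, lambda task, res: res != "cloud-b")
print(started)           # ['t1']
print(cache["pending"])  # ['t2', 't3']
```

`t2` stays in the pending list because its launch did not succeed, and `t3` stays because it was never planned, so both remain visible to the next scheduling pass.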
13. The method of claim 1, further comprising: operating, from the system, an external cloud interface to communicatively link the system with one or more third-party private and/or public computing clouds via a public Internet.
This invention relates to connecting the scheduling system of claim 1 to computing clouds outside the system. The problem addressed is that the computing resources available inside the system may not be the only resources on which workload tasks could run. The method adds one step to claim 1: the system operates an external cloud interface that communicatively links the system with one or more third-party private and/or public computing clouds via the public Internet. Through this interface, third-party clouds become reachable from the system, so that they can be among the computing resources on which workload tasks are scheduled. The claim does not specify how the link is secured or which services the third-party clouds provide; it requires only that the system operates the interface and that the link runs over the public Internet.
14. The method of claim 1, wherein identifying the plurality of computing resources available to execute workload tasks and updating the cache specifying the identified computing resources, comprises: the compute resource discovery engine to autonomously discover any one of: one or more third-party compute clouds accessible to the scheduler; one or more private on-demand compute clouds accessible to the scheduler; one or more public on-demand compute clouds accessible to the scheduler; one or more computing pods within a local host organization within which a scheduling service of the system operates when the one or more computing pods are accessible to the scheduler; one or more remote computing pods within a remote host organization separate from the local host organization within which the scheduling service operates when the one or more remote computing pods are accessible to the scheduling service through the remote host organization; an OpenStack computing cloud accessible to the scheduler; a VMWare computing cloud accessible to the scheduler; an Amazon Web Services (AWS) public computing cloud accessible to the scheduler; a Microsoft Azure public computing cloud accessible to the scheduler; an AWS Direct Connect privately leased computing space accessible to the scheduler; and an Azure ExpressRoute privately leased computing space accessible to the scheduler.
This invention relates to a system for dynamically discovering and managing computing resources across diverse environments to execute workload tasks efficiently. The system addresses the challenge of optimizing workload distribution by autonomously identifying available computing resources from various sources, including third-party, private, and public on-demand compute clouds, as well as local and remote computing pods within different host organizations. The system also supports integration with specific cloud platforms such as OpenStack, VMWare, Amazon Web Services (AWS), and Microsoft Azure, including privately leased computing spaces like AWS Direct Connect and Azure ExpressRoute. A compute resource discovery engine autonomously scans and updates a cache with the identified resources, ensuring the scheduler has real-time access to the latest available compute options. This enables efficient workload task execution by leveraging the most suitable resources from a broad range of accessible computing environments, improving resource utilization and task performance. The system dynamically adapts to changes in resource availability, ensuring optimal workload distribution across heterogeneous computing infrastructures.
15. The method of claim 1, wherein the system comprises a multi-tenant database system having customer data stored therein for a plurality of distinct customer organizations; wherein each customer organization is an entity selected from the group consisting of: a separate and distinct remote organization, an organizational group within the host organization, a business partner of the host organization, or a customer organization that subscribes to cloud computing services provided by the host organization; wherein the system operates at a host organization as a cloud-based service provider to the plurality of distinct customer organizations; and wherein the cloud-based service provider receives inputs from the plurality of distinct customer organizations to schedule workload tasks for execution on the plurality of computing resources.
This invention relates to a cloud-based multi-tenant database system that manages workload task scheduling for multiple customer organizations. The system stores customer data for distinct entities, which may include separate remote organizations, internal organizational groups, business partners, or subscribers to cloud services. Operating as a cloud service provider, the system receives inputs from these organizations to schedule and execute workload tasks across computing resources. The multi-tenant architecture ensures data isolation while allowing centralized management of tasks for diverse customer types. The system dynamically allocates resources based on inputs from different organizations, optimizing task execution in a shared cloud environment. This approach enhances scalability and efficiency by consolidating workload management for various customer relationships under a unified cloud infrastructure. The solution addresses the challenge of managing heterogeneous workloads in a multi-tenant cloud setting, ensuring secure and efficient task processing for different organizational structures.
16. Non-transitory computer readable storage media having instructions stored thereupon that, when executed by a system having at least a processor and a memory therein, the instructions cause the system to perform operations including: allocating a cache within the memory of the system; identifying, via a workload discovery engine, pending workload tasks to be scheduled for execution from one or more workload queues and updating the cache; identifying, via a compute resource discovery engine, a plurality of computing resources available to execute the workload tasks and updating the cache; identifying, via an external services monitor, a plurality of external services accessible to the workload tasks and updating the cache; executing a scheduler via the processor of the system, wherein the scheduler performs at least the following operations: scheduling the workload tasks for execution on the plurality of computing resources; identifying a failure condition for one of the plurality of external services accessible to the workload tasks; identifying any of the workload tasks potentially affected by the failure condition of the external service based on the workload tasks specifying the external service as a dependency and based further on execution of the workload tasks overlapping in time with a time frame associated with the failure condition; and scheduling the workload tasks potentially affected by the failure condition of the external service for a repeated execution on the plurality of computing resources.
This invention relates to a system for managing workload execution in a computing environment, particularly focusing on handling dependencies on external services and mitigating failures. The system includes a cache within the memory of a computing system, which is dynamically updated by multiple engines. A workload discovery engine identifies pending tasks from workload queues and updates the cache with this information. A compute resource discovery engine identifies available computing resources and updates the cache accordingly. An external services monitor tracks accessible external services and updates the cache with their status. A scheduler then uses this cached data to allocate workload tasks to available computing resources. The scheduler also monitors external services for failures and determines which tasks may be affected by these failures. If a task depends on an external service that fails during its execution window, the scheduler reschedules the affected task for repeated execution. This ensures that tasks dependent on unreliable external services are retried, improving system resilience and reliability. The system dynamically adapts to changes in workload, resources, and service availability, optimizing task execution and recovery from failures.
17. The non-transitory computer readable storage media of claim 16 , wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all currently executing workload tasks potentially affected by the failure condition of the external service; terminating execution of the currently executing workload tasks potentially affected by the failure condition of the external service; marking results of the workload tasks as unsatisfactory in the cache; and wherein the scheduler schedules the workload tasks having been terminated for a repeated execution.
This invention relates to fault-tolerant computing systems that manage workload tasks dependent on external services. The problem addressed is keeping results reliable when an external service fails while dependent tasks are still running. When a failure condition is identified, the system identifies all currently executing workload tasks potentially affected by that failure. These tasks are terminated so that they do not continue processing with potentially corrupted or incomplete data. The results of the terminated tasks are marked as unsatisfactory in the cache to avoid their use in subsequent operations. The scheduler then schedules the terminated tasks for a repeated execution. This approach ensures that affected tasks are reprocessed rather than allowing invalid results to propagate through the system.
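The terminate, mark, and reschedule sequence can be sketched as follows. This is only an illustration under stated assumptions: the cache is modeled as a plain dictionary keyed by task identifier, the `state` and `result_status` fields are hypothetical, and "termination" is represented as a state change rather than an actual process kill.

```python
from collections import deque

UNSATISFACTORY = "unsatisfactory"  # hypothetical marker value


def recover_running_tasks(cache, affected_ids, retry_queue):
    """Terminate affected running tasks, mark results, queue them for rerun.

    Returns the ids of the tasks that were terminated.
    """
    terminated = []
    for task_id in affected_ids:
        entry = cache[task_id]
        if entry["state"] != "running":
            continue  # this step covers only currently executing tasks
        entry["state"] = "terminated"
        entry["result_status"] = UNSATISFACTORY
        retry_queue.append(task_id)
        terminated.append(task_id)
    return terminated


cache = {
    "etl-1": {"state": "running", "result_status": None},
    "etl-2": {"state": "completed", "result_status": "ok"},
}
retry_queue = deque()
terminated = recover_running_tasks(cache, ["etl-1", "etl-2"], retry_queue)
```

Here only `etl-1` is terminated and queued; `etl-2` has already completed and would fall under the separate handling of previously completed tasks.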
18. The non-transitory computer readable storage media of claim 16, wherein identifying any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all previously completed workload tasks potentially affected by the failure condition of the external service; marking results of the previously completed workload tasks as unsatisfactory in the cache; and wherein the scheduler schedules the previously completed workload tasks for a repeated execution.
This invention relates to a system for managing workload tasks in a computing environment where external service failures may impact task execution. Building on claim 16, the system identifies previously completed workload tasks potentially affected by a failure condition of an external service, that is, tasks that specify the failed service as a dependency and whose execution overlapped in time with the failure. The results of these previously completed tasks are marked as unsatisfactory in the cache, and the scheduler schedules the tasks for a repeated execution. This covers the case where a task finished during the failure time frame and its results may be incorrect even though the task itself reported completion. Re-executing such tasks helps maintain data integrity and system reliability. The approach is particularly useful in environments where external service reliability is uncertain, such as cloud computing or distributed systems.
19. A system to implement a scheduling service, wherein the system comprises: a processor and a memory to execute instructions at the system; a cache allocated within the memory of the system to store information on behalf of a compute resource discovery engine and a workload discovery engine and an external services monitor and a scheduler; and system logic to cause the system to perform various operations including: identifying, via the compute resource discovery engine, a plurality of computing resources available to execute the workload tasks and updating the cache specifying the identified plurality of computing resources; identifying, via the workload discovery engine, pending workload tasks to be scheduled for execution from one or more workload queues and updating the cache with the identified plurality of the pending workload tasks; identifying, via the external services monitor, a plurality of external services accessible to the workload tasks and updating the cache with the identified plurality of external services; scheduling, via the scheduler, the workload tasks for execution on the plurality of computing resources; identifying, via the scheduler, a failure condition for one of the plurality of external services accessible to the workload tasks; identifying, via the scheduler, any of the workload tasks potentially affected by the failure condition of the external service based on the workload tasks specifying the external service as a dependency and based further on execution of the workload tasks overlapping in time with a time frame associated with the failure condition; and further scheduling, via the scheduler, the workload tasks potentially affected by the failure condition of the external service for a repeated execution on the plurality of computing resources.
This system provides a scheduling service for managing workload execution in a distributed computing environment. The system addresses the challenge of efficiently allocating computing resources while ensuring workload tasks can access required external services and handling service failures. The system includes a processor, memory, and a cache that stores information for a compute resource discovery engine, a workload discovery engine, an external services monitor, and a scheduler. The compute resource discovery engine identifies available computing resources and updates the cache with this information. The workload discovery engine identifies pending workload tasks from one or more workload queues and updates the cache with these tasks. The external services monitor identifies accessible external services and updates the cache accordingly. The scheduler then schedules the workload tasks for execution on the identified computing resources. If the scheduler detects a failure in an external service, it identifies any workload tasks that depend on that service and whose execution overlaps with the failure time frame. These affected tasks are rescheduled for repeated execution to ensure completion despite the service disruption. The system ensures efficient resource utilization and task execution reliability by dynamically monitoring and responding to changes in available resources and external service availability.
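The division of labor described above, in which discovery engines write to the cache and the scheduler only reads from it, can be sketched as follows. Everything here is an assumption made for illustration: the dictionary cache, the key names, and in particular the round-robin placement policy, which the claim does not specify.

```python
def discover_resources(cache, resources):
    """Stand-in for the compute resource discovery engine."""
    cache["resources"] = sorted(resources)


def discover_workloads(cache, queues):
    """Stand-in for the workload discovery engine; flattens all queues."""
    cache["pending"] = sorted(task for queue in queues for task in queue)


def schedule(cache):
    """Map each pending task to a resource using only the cache contents.

    The function keeps no state of its own, so the same cache snapshot
    always yields the same assignment (round-robin is an assumed policy).
    """
    resources = cache.get("resources", [])
    pending = cache.get("pending", [])
    if not resources:
        return {}
    return {
        task: resources[i % len(resources)]
        for i, task in enumerate(pending)
    }


cache = {}
discover_resources(cache, ["node-b", "node-a"])
discover_workloads(cache, [["t2"], ["t1", "t3"]])
assignments = schedule(cache)
```

Because the scheduler in this sketch derives its decisions from the cache alone, a restarted scheduler instance would reach the same assignments from the same cached data.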
20. The system of claim 19, further comprising: identifying, via the scheduler, all currently executing workload tasks potentially affected by the failure condition of the external service; terminating, via the scheduler, execution of the currently executing workload tasks potentially affected by the failure condition of the external service; marking, via the scheduler, results of the workload tasks as unsatisfactory in the cache; and scheduling, via the workload scheduler, the workload tasks having been terminated for a repeated execution.
This system extends the scheduling service of claim 19 to handle workload tasks that are still executing when an external service fails. When a failure condition is detected, the scheduler identifies all currently executing workload tasks that may be impacted by the failure. The scheduler then terminates these tasks to prevent further processing with potentially corrupted or incomplete data. The results of the terminated tasks are marked as unsatisfactory in the cache so that they are not used in subsequent operations. The terminated tasks are then scheduled for a repeated execution. This approach supports data integrity and system reliability by preventing the propagation of errors caused by external service failures.
21. The system of claim 19, wherein identifying, via the scheduler, any of the workload tasks potentially affected by the failure condition of the external service further comprises: identifying all previously completed workload tasks potentially affected by the failure condition of the external service; marking results of the previously completed workload tasks as unsatisfactory in the cache; and scheduling, via the scheduler, the previously completed workload tasks for a repeated execution.
This invention relates to a system for managing workload tasks in a computing environment where external service failures may impact task execution. The system includes a scheduler that manages task execution and a cache that stores task results. When a failure condition is detected in an external service, the scheduler identifies previously completed workload tasks that may have been affected by the failure. The scheduler marks the results of these previously completed tasks as unsatisfactory in the cache and schedules the tasks for a repeated execution. This ensures that results potentially compromised by the failure are regenerated rather than trusted, which prevents erroneous results from propagating and helps maintain data integrity. The approach is particularly useful in environments that depend heavily on external services, such as distributed computing systems or cloud-based applications.
August 1, 2019
February 8, 2022