Determining a Recovery Mechanism in a Storage System Using a Machine Learning Module

Published: October 13, 2020

Assignee: not available

Inventors: Brian A. RINALDI, Clint A. HARDY, Lokesh M. GUPTA

Patent Claims

20 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A method, comprising: in response to an occurrence of a failure in a storage controller, providing input on a plurality of attributes of the storage controller at a time of occurrence of the failure to a machine learning module; in response to receiving the input, generating, by the machine learning module, a plurality of output values corresponding to a plurality of recovery mechanisms to recover from the failure in the storage controller; and recovering from the failure in the storage controller, by applying a recovery mechanism whose output value is greatest among the plurality of output values that are generated by the machine learning module.

Plain English Translation

This invention relates to automated failure recovery in storage systems. Storage controllers are critical components that manage data storage operations, and failures in these systems can lead to downtime, data loss, or performance degradation. Traditional recovery methods often rely on predefined rules or manual intervention, which may not be optimal for all failure scenarios. The invention addresses this problem by using a machine learning module to dynamically select the most effective recovery mechanism for a given failure. When a failure occurs in a storage controller, the system collects input data on various attributes of the controller at the time of failure, such as error logs, system state, and performance metrics. This input is fed into a machine learning module, which processes the data to generate output values for multiple potential recovery mechanisms. Each output value represents the likelihood or effectiveness of a corresponding recovery mechanism in resolving the failure. The system then applies the recovery mechanism with the highest output value, ensuring the most suitable solution is deployed automatically. This approach improves recovery efficiency by leveraging machine learning to analyze failure contexts and select the best recovery strategy, reducing downtime and minimizing manual intervention. The machine learning module is trained on historical failure data to improve its accuracy over time. The invention can be applied to various storage systems, including enterprise storage arrays and cloud-based storage solutions.
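The selection step in claim 1 amounts to picking the recovery mechanism whose output value is greatest. A minimal sketch of that step follows; the mechanism names and numeric values are invented for illustration and do not come from the patent.

```python
from typing import Mapping


def select_recovery_mechanism(output_values: Mapping[str, float]) -> str:
    """Return the name of the recovery mechanism with the greatest output value."""
    if not output_values:
        raise ValueError("the machine learning module produced no output values")
    # max() scans the mechanism names and compares them by their output value.
    return max(output_values, key=output_values.get)


# Hypothetical output values for three made-up recovery mechanisms.
example_outputs = {"mechanism_a": 0.72, "mechanism_b": 0.21, "mechanism_c": 0.07}
chosen = select_recovery_mechanism(example_outputs)
```

If two mechanisms tie, this sketch returns whichever appears first in the mapping; the claim does not say how ties should be handled.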

Claim 2

Original Legal Text

2. The method of claim 1 , wherein the storage controller controls access to a plurality of storage devices for a plurality of hosts, and wherein the storage controller is comprised of: a host adapter that is an interface between the storage controller and a host computational device; a device adapter that is an interface between the storage controller and a storage device that is in a Redundant Array of Independent Disks (RAID) configuration; a cache; and a non-volatile storage (NVS).

Plain English Translation

A storage controller manages data access for multiple hosts and storage devices configured in a Redundant Array of Independent Disks (RAID) setup. The controller includes a host adapter serving as an interface between the controller and host computational devices, allowing communication and data transfer. A device adapter acts as an interface between the controller and the RAID storage devices, facilitating data storage and retrieval operations. The controller also incorporates a cache to temporarily store frequently accessed data, improving performance by reducing latency. Additionally, a non-volatile storage (NVS) component ensures data persistence during power failures or system disruptions, safeguarding critical information. The system optimizes data handling by coordinating these components to efficiently manage access, storage, and retrieval across multiple hosts and RAID devices, enhancing reliability and performance in storage environments.

Claim 3

Original Legal Text

3. The method of claim 2 , wherein the plurality of attributes includes: measures corresponding to indications and characteristics of errors and panics that have been generated in the storage controller; and a measure of a hardware part associated with the failure.

Plain English Translation

This claim narrows the method of claim 2 by naming two kinds of attributes that are supplied to the machine learning module when a storage controller fails. The first is a set of measures describing the errors and panics that the storage controller has generated, covering both indications that they occurred and their characteristics. The second is a measure of the hardware part associated with the failure. The claim gives no specific examples; error codes, frequency, or severity would be plausible error measures, and a memory module or processor would be a plausible hardware part. Supplying these attributes lets the machine learning module take the error history and the implicated hardware into account when it scores the candidate recovery mechanisms.

Claim 4

Original Legal Text

4. The method of claim 2 , wherein the plurality of attributes includes: a measure of whether the cache is queued for segments; a measure of whether the NVS is queued for segments; a measure of whether the device adapter is queued for resources; and a measure of whether a RAID rebuild is in progress.

Plain English Translation

This claim narrows the method of claim 2 by naming four attributes of the storage controller's state that are supplied to the machine learning module at the time of a failure: whether the cache is queued for segments, whether the non-volatile storage (NVS) is queued for segments, whether the device adapter is queued for resources, and whether a RAID rebuild is in progress. The first three indicate that a component is waiting on segments or resources, which presumably signals resource pressure in the controller when it failed. The fourth flags a background operation that is already placing load on the RAID array. The claim does not use these attributes for general performance monitoring or workload balancing. They are inputs that help the machine learning module score the candidate recovery mechanisms, so that the mechanism chosen suits the controller's state at the moment of failure.

Claim 5

Original Legal Text

5. The method of claim 2 , wherein the plurality of attributes includes: a measure of whether the storage controller is executing a mainline code or an error recovery code at a time of the failure; a measure of whether the device adapter is fenced; and a measure of whether the host adapter is fenced.

Plain English Translation

This claim narrows the method of claim 2 by naming three attributes that are supplied to the machine learning module: whether the storage controller was executing mainline code or error recovery code when the failure occurred, whether the device adapter is fenced, and whether the host adapter is fenced. Mainline code is the controller's normal operating path, so a failure during error recovery code means the controller was already handling an earlier problem. A fenced adapter has been isolated from the rest of the system, typically to keep a fault from spreading. These attributes are inputs that the machine learning module uses to score the candidate recovery mechanisms; the claim does not describe a separate root-cause diagnosis step.

Claim 6

Original Legal Text

6. The method of claim 2 , wherein the plurality of attributes includes: a measure of whether the storage controller is in a single server configuration or is in a dual server configuration; and a measure of previously known recovery mechanisms for errors corresponding to the failure.

Plain English Translation

This invention relates to storage systems and methods for managing error recovery in storage controllers. The problem addressed is the need for adaptive error recovery mechanisms that account for different storage controller configurations and prior recovery strategies. The invention provides a method for determining and applying appropriate recovery actions based on system attributes, including whether the storage controller operates in a single-server or dual-server configuration and the previously known recovery mechanisms for specific error types. By evaluating these attributes, the system can select the most effective recovery strategy, improving reliability and minimizing downtime. The method involves analyzing the storage controller's configuration and historical recovery data to tailor the response to detected failures, ensuring optimal performance and data integrity. This approach enhances fault tolerance by dynamically adjusting recovery procedures based on the system's current state and past recovery experiences. The invention is particularly useful in enterprise storage environments where high availability and efficient error handling are critical.
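The attributes named in claims 3 through 6 have to reach the machine learning module in some numeric form. The sketch below shows one possible encoding as a flat input vector. The field names, the integer encodings for the error measure, hardware part, and known recovery mechanism, and the ordering are all assumptions made for illustration; the claims only say that a "measure" of each attribute is provided.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass(frozen=True)
class FailureAttributes:
    # Claim 3 (hypothetical encodings)
    error_panic_count: int           # stand-in for the measures of errors and panics
    hardware_part_id: int            # stand-in for the hardware part tied to the failure
    # Claim 4
    cache_queued_for_segments: bool
    nvs_queued_for_segments: bool
    device_adapter_queued: bool
    raid_rebuild_in_progress: bool
    # Claim 5
    running_error_recovery_code: bool  # False means mainline code
    device_adapter_fenced: bool
    host_adapter_fenced: bool
    # Claim 6
    dual_server_configuration: bool    # False means single server
    known_recovery_id: int             # stand-in for a previously known recovery mechanism


def to_input_vector(attrs: FailureAttributes) -> List[float]:
    """Flatten the attributes into floats; booleans become 0.0 or 1.0."""
    return [float(value) for value in astuple(attrs)]


example = FailureAttributes(3, 17, True, False, False, True, False, False, True, True, 2)
vector = to_input_vector(example)
```

A real encoding would likely normalize counts and use a categorical encoding for identifiers; this sketch keeps raw values to stay short.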

Claim 7

Original Legal Text

7. The method of claim 1 , the method further comprising: transmitting, by the storage controller, the plurality of output values to a central computing device that generates weights and biases to be applied to machine learning modules of a plurality of storage controllers.

Plain English Translation

This claim extends the method of claim 1 to a group of storage controllers. After the machine learning module generates output values for the candidate recovery mechanisms, the storage controller transmits those output values to a central computing device. The central computing device generates weights and biases that are to be applied to the machine learning modules of a plurality of storage controllers. In effect, results observed at one controller can inform the parameters used by the machine learning modules at other controllers. The claim does not specify how the central computing device computes the weights and biases, how often they are updated, or where training data is kept.
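Claim 7 does not say how the central computing device derives weights and biases from the output values it receives. Purely as an illustration of the data flow, the sketch below assumes that each report also carries hypothetical target values (for example, 1.0 for the mechanism that actually worked) and applies one averaged gradient step to the output-layer biases only. The class name, the targets, the learning rate, and the bias-only update are all assumptions; updating weights would also require the input attributes, which the claim does not say are transmitted.

```python
from typing import List, Sequence, Tuple


class CentralDevice:
    """Hypothetical central computing device that collects controller reports."""

    def __init__(self, num_mechanisms: int, learning_rate: float = 0.1) -> None:
        self.biases: List[float] = [0.0] * num_mechanisms
        self.learning_rate = learning_rate
        self.reports: List[Tuple[List[float], List[float]]] = []

    def receive(self, output_values: Sequence[float], targets: Sequence[float]) -> None:
        """Store one controller's output values with assumed target values."""
        self.reports.append((list(output_values), list(targets)))

    def generate_biases(self) -> List[float]:
        """Apply one averaged gradient step (squared error) and return shared biases."""
        if self.reports:
            count = len(self.reports)
            for i in range(len(self.biases)):
                mean_error = sum(out[i] - tgt[i] for out, tgt in self.reports) / count
                self.biases[i] -= self.learning_rate * mean_error
            self.reports.clear()
        # The same biases would be sent to every storage controller.
        return list(self.biases)


device = CentralDevice(num_mechanisms=2, learning_rate=0.5)
device.receive([0.8, 0.2], [1.0, 0.0])  # report from one controller
device.receive([0.6, 0.4], [1.0, 0.0])  # report from another controller
shared_biases = device.generate_biases()
```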

Claim 8

Original Legal Text

8. A system, comprising: a memory; and a processor coupled to the memory, wherein the processor performs operations, the operations comprising: in response to an occurrence of a failure in a storage controller, providing input on a plurality of attributes of the storage controller at a time of occurrence of the failure to a machine learning module; in response to receiving the input, generating, by the machine learning module, a plurality of output values corresponding to a plurality of recovery mechanisms to recover from the failure in the storage controller; and recovering from the failure in the storage controller, by applying a recovery mechanism whose output value is greatest among the plurality of output values that are generated by the machine learning module.

Plain English Translation

The system addresses failures in storage controllers by using machine learning to select the most effective recovery mechanism. Storage controllers manage data storage operations, and failures can disrupt data access, leading to downtime and potential data loss. Traditional recovery methods rely on predefined rules or manual intervention, which may not always be optimal or timely. The system includes a processor and memory, where the processor executes operations to handle failures. When a failure occurs, the system collects attributes of the storage controller at the time of failure, such as error logs, system state, and performance metrics. These attributes are fed into a machine learning module, which evaluates multiple recovery mechanisms. The module generates output values representing the likelihood of success or efficiency of each mechanism. The system then applies the recovery mechanism with the highest output value, ensuring the most effective recovery path is chosen automatically. This approach improves recovery speed and reliability by leveraging data-driven decisions rather than static rules. The machine learning module can be trained on historical failure data to refine its predictions over time.
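The claim does not spell out how the machine learning module turns attributes into output values. As a minimal sketch, assuming a single-layer network with a softmax output (an assumption, as are the weights, biases, and input vector below), the computation could look like this:

```python
import math
from typing import List, Sequence


def generate_output_values(
    inputs: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """Compute one output value per recovery mechanism.

    weights[k] holds the weights for mechanism k, one per input attribute.
    """
    scores = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two made-up input attributes and two made-up recovery mechanisms.
values = generate_output_values(
    inputs=[1.0, 0.0],
    weights=[[2.0, 0.0], [0.0, 2.0]],
    biases=[0.0, 0.0],
)
best_index = values.index(max(values))
```

The index of the greatest output value identifies the recovery mechanism that the processor would then apply.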

Claim 9

Original Legal Text

9. The system of claim 8 , wherein the storage controller controls access to a plurality of storage devices for a plurality of hosts, and wherein the storage controller is comprised of: a host adapter that is an interface between the storage controller and a host computational device; a device adapter that is an interface between the storage controller and a storage device that is in a Redundant Array of Independent Disks (RAID) configuration; a cache; and a non-volatile storage (NVS).

Plain English Translation

This invention relates to a storage controller system designed to manage access to multiple storage devices in a Redundant Array of Independent Disks (RAID) configuration for multiple host computational devices. The system ensures efficient and reliable data storage and retrieval operations. The storage controller includes a host adapter that serves as an interface between the storage controller and the host computational devices, facilitating communication and data transfer. A device adapter acts as an interface between the storage controller and the RAID-configured storage devices, enabling the controller to interact with the storage hardware. The system also incorporates a cache to temporarily store frequently accessed data, improving performance by reducing latency. Additionally, a non-volatile storage (NVS) component is included to preserve critical data during power failures or system disruptions, ensuring data integrity. The storage controller is responsible for coordinating access to the storage devices for multiple hosts, managing data distribution, redundancy, and fault tolerance in the RAID configuration. This setup enhances data availability, performance, and reliability for the connected hosts. The system optimizes storage operations by leveraging the cache for faster access and the NVS for data protection, making it suitable for high-demand environments.

Claim 10

Original Legal Text

10. The system of claim 9 , wherein the plurality of attributes includes: measures corresponding to indications and characteristics of errors and panics that have been generated in the storage controller; and a measure of a hardware part associated with the failure.

Plain English Translation

This claim narrows the system of claim 9 by naming two kinds of attributes that are supplied to the machine learning module when a storage controller fails. The first is a set of measures corresponding to indications and characteristics of the errors and panics generated in the storage controller. The second is a measure of the hardware part associated with the failure. Together they let the machine learning module weigh both the software-level error history and the implicated hardware when it scores the candidate recovery mechanisms. The claim does not recite separate logging, reporting, or proactive maintenance features.

Claim 11

Original Legal Text

11. The system of claim 9 , wherein the plurality of attributes includes: a measure of whether the cache is queued for segments; a measure of whether the NVS is queued for segments; a measure of whether the device adapter is queued for resources; and a measure of whether a RAID rebuild is in progress.

Plain English Translation

This claim narrows the system of claim 9 by naming four attributes of the storage controller's state that are supplied to the machine learning module at the time of a failure: whether the cache is queued for segments, whether the non-volatile storage (NVS) is queued for segments, whether the device adapter is queued for resources, and whether a RAID rebuild is in progress. The first three indicate that a component is waiting on segments or resources, and the fourth flags a resource-intensive background operation. The claim does not describe a standalone monitoring component, load balancing, or resource reallocation. The attributes are inputs that help the machine learning module score the candidate recovery mechanisms so that the chosen mechanism suits the controller's state when it failed.

Claim 12

Original Legal Text

12. The system of claim 9 , wherein the plurality of attributes includes: a measure of whether the storage controller is executing a mainline code or an error recovery code at a time of the failure; a measure of whether the device adapter is fenced; and a measure of whether the host adapter is fenced.

Plain English Translation

This claim narrows the system of claim 9 by naming three attributes that are supplied to the machine learning module: whether the storage controller was executing mainline code or error recovery code at the time of the failure, whether the device adapter is fenced (isolated from the rest of the system), and whether the host adapter is fenced. These attributes describe the context in which the failure occurred and feed into the scoring of the candidate recovery mechanisms. The claim does not recite a separate management module or root-cause analysis step.

Claim 13

Original Legal Text

13. The system of claim 9 , wherein the plurality of attributes includes: a measure of whether the storage controller is in a single server configuration or is in a dual server configuration; and a measure of previously known recovery mechanisms for errors corresponding to the failure.

Plain English Translation

This invention relates to storage systems and methods for managing failures in storage controllers. The system addresses the challenge of efficiently handling failures in storage controllers, particularly in environments where the controller may operate in either a single-server or dual-server configuration. The system includes a storage controller with a failure detection mechanism that identifies failures and determines whether the controller is in a single-server or dual-server configuration. The system also evaluates previously known recovery mechanisms for errors corresponding to the failure, allowing it to select an appropriate recovery strategy based on the system configuration and failure type. The storage controller may include multiple attributes that influence recovery decisions, such as the system configuration and available recovery options. The system ensures reliable data access and minimizes downtime by dynamically adapting recovery processes based on the detected failure and system setup. This approach improves fault tolerance and operational efficiency in storage environments.

Claim 14

Original Legal Text

14. The system of claim 8 , the operations further comprising: transmitting, by the storage controller, the plurality of output values to a central computing device that generates weights and biases to be applied to machine learning modules of a plurality of storage controllers.

Plain English Translation

This claim extends the system of claim 8 to a group of storage controllers. After the machine learning module generates output values for the candidate recovery mechanisms, the storage controller transmits those output values to a central computing device. The central computing device generates weights and biases to be applied to the machine learning modules of a plurality of storage controllers, so that many controllers can share parameters derived from results gathered across the group. The output values here are the recovery-mechanism scores from claim 8, not general performance metrics, and the machine learning modules select recovery mechanisms rather than predict access patterns or manage data placement. The claim does not specify how the central computing device computes the weights and biases.

Claim 15

Original Legal Text

15. A computer program product, the computer program product comprising a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to perform operations in a storage controller or a computational device, the operations comprising: in response to an occurrence of a failure in the storage controller, providing input on a plurality of attributes of the storage controller at a time of occurrence of the failure to a machine learning module; in response to receiving the input, generating, by the machine learning module, a plurality of output values corresponding to a plurality of recovery mechanisms to recover from the failure in the storage controller; and recovering from the failure in the storage controller, by applying a recovery mechanism whose output value is greatest among the plurality of output values that are generated by the machine learning module.

Plain English Translation

This invention relates to storage systems and addresses the challenge of efficiently recovering from failures in storage controllers. Storage controllers manage data storage operations, and failures can disrupt system availability. Traditional recovery methods often rely on predefined rules or manual intervention, which may not be optimal for all failure scenarios. The invention describes a computer program product that uses machine learning to dynamically select the best recovery mechanism for a storage controller failure. When a failure occurs, the system collects attributes of the storage controller at the time of failure, such as error type, system state, and performance metrics. These attributes are fed into a machine learning module, which generates output values for multiple possible recovery mechanisms. The recovery mechanism with the highest output value is selected and applied to resolve the failure. This approach improves recovery efficiency by leveraging data-driven decision-making rather than static rules. The machine learning module can be trained on historical failure data to enhance its accuracy over time. The system operates within a storage controller or a computational device connected to it, ensuring rapid response to failures. This method reduces downtime and improves system reliability by adapting recovery strategies based on real-time conditions.

Claim 16

Original Legal Text

16. The computer program product of claim 15 , wherein the storage controller controls access to a plurality of storage devices for a plurality of hosts, and wherein the storage controller is comprised of: a host adapter that is an interface between the storage controller and a host computational device; a device adapter that is an interface between the storage controller and a storage device that is in a Redundant Array of Independent Disks (RAID) configuration; a cache; and a non-volatile storage (NVS).

Plain English Translation

This invention relates to a storage controller system designed to manage access to multiple storage devices in a Redundant Array of Independent Disks (RAID) configuration for multiple host computational devices. The storage controller includes several key components to facilitate efficient and reliable data storage and retrieval. A host adapter serves as the interface between the storage controller and the host computational devices, enabling communication and data transfer. A device adapter acts as the interface between the storage controller and the RAID storage devices, ensuring proper interaction with the storage array. The system also incorporates a cache to temporarily store frequently accessed data, improving performance by reducing latency. Additionally, a non-volatile storage (NVS) component is included to provide persistent storage for critical data, ensuring data integrity even in the event of power loss or system failures. The storage controller is designed to coordinate access to the storage devices for multiple hosts, managing data distribution, redundancy, and fault tolerance within the RAID configuration. This system enhances data availability, reliability, and performance in enterprise storage environments.

Claim 17

Original Legal Text

17. The computer program product of claim 16 , wherein the plurality of attributes includes: measures corresponding to indications and characteristics of errors and panics that have been generated in the storage controller; and a measure of a hardware part associated with the failure.

Plain English Translation

This claim narrows the computer program product of claim 16 by naming two kinds of attributes that are supplied to the machine learning module when a storage controller fails. The first is a set of measures corresponding to indications and characteristics of the errors and panics generated in the storage controller. The second is a measure of the hardware part associated with the failure. Together they let the machine learning module weigh both the error history and the implicated hardware when it scores the candidate recovery mechanisms. The claim does not recite failure prediction, maintenance scheduling, or additional attributes beyond these two.

Claim 18

Original Legal Text

18. The computer program product of claim 16 , wherein the plurality of attributes includes: a measure of whether the cache is queued for segments; a measure of whether the NVS is queued for segments; a measure of whether the device adapter is queued for resources; and a measure of whether a RAID rebuild is in progress.

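Plain English Translation

This claim narrows the computer program product of claim 16 by naming four attributes of the storage controller's state that are supplied to the machine learning module at the time of a failure: whether the cache is queued for segments, whether the non-volatile storage (NVS) is queued for segments, whether the device adapter is queued for resources, and whether a RAID rebuild is in progress. It mirrors claims 4 and 11, which recite the same attributes for the method and the system.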

Claim 19

Original Legal Text

19. The computer program product of claim 16 , wherein the plurality of attributes includes: a measure of whether the storage controller is executing a mainline code or an error recovery code at a time of the failure; a measure of whether the device adapter is fenced; a measure of whether the host adapter is fenced; a measure of whether the storage controller is in a single server configuration or is in a dual server configuration; and a measure of previously known recovery mechanisms for errors corresponding to the failure.

Plain English Translation

This invention relates to storage systems and methods for analyzing and recovering from failures in storage controllers. The technology addresses the challenge of efficiently diagnosing and recovering from storage controller failures by collecting and analyzing specific attributes related to the failure context. The system captures multiple attributes to determine the root cause and appropriate recovery actions. These attributes include whether the storage controller is running mainline code or error recovery code at the time of failure, the fencing status of the device adapter and host adapter, the storage controller's configuration (single or dual server), and previously known recovery mechanisms for similar errors. By evaluating these factors, the system can identify the most effective recovery strategy, improving system reliability and reducing downtime. The invention enhances failure analysis by providing a structured approach to diagnosing storage controller issues, ensuring faster and more accurate recovery processes. This method is particularly useful in enterprise storage environments where minimizing disruption is critical. The system dynamically assesses the failure context to select the best recovery path, optimizing performance and maintaining data integrity.

Claim 20

Original Legal Text

20. The computer program product of claim 15 , the operations further comprising: transmitting, by the storage controller, the plurality of output values to a central computing device that generates weights and biases to be applied to machine learning modules of a plurality of storage controllers.

Plain English Translation

This claim extends the computer program product of claim 15 to a group of storage controllers. After the machine learning module generates output values for the candidate recovery mechanisms, the storage controller transmits those output values to a central computing device. The central computing device generates weights and biases to be applied to the machine learning modules of a plurality of storage controllers. What is shared across controllers is the set of weights and biases, not the modules themselves, and the modules use those parameters to select recovery mechanisms rather than to optimize data placement or caching. The claim does not specify how the central computing device computes the weights and biases.

Patent Metadata

Filing Date

Unknown

Publication Date

October 13, 2020

Inventors

Brian A. RINALDI

Clint A. HARDY

Lokesh M. GUPTA
