In a clustered environment, compute nodes work together to provide high availability and reliability for applications and services. To keep the compute nodes synchronized, a cluster employs a witness service, which typically operates on a separate system. In addition to issues related to potential loss of the witness service, there are also issues of additional costs and overhead. In one or more embodiments, two or more compute nodes of a cluster include a data processing unit (DPU) system operating a witness service. For a compute node with a local DPU system that operates a local witness service, the local witness service handles data related to the local workload(s) and may synchronize with one or more remote witness services on remote DPUs in the cluster in the event of failure or other issues.
Legal claims defining the scope of protection, as filed with the USPTO.
. An information handling system comprising:
. The information handling system of wherein the local DPU system is further configured to cause steps to be performed comprising:
. The information handling system of:
. The information handling system of wherein the local cluster system is further configured to cause steps to be performed:
. The information handling system of:
. The information handling system of wherein the local DPU system is further configured to cause steps to be performed:
. The information handling system of wherein the local DPU system is further configured to cause steps to be performed:
. The information handling system of wherein the local cluster system is further configured to cause steps to be performed:
. The information handling system of:
. An information handling system comprising:
. The information handling system of wherein the non-transitory computer-readable medium or media of the local DPU further comprises one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising:
. The information handling system of:
. The information handling system of wherein the non-transitory computer-readable medium or media of the local DPU or the non-transitory computer-readable medium or media of the compute node further comprises one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising:
. The information handling system of wherein the non-transitory computer-readable medium or media of the local DPU or the non-transitory computer-readable medium or media of the compute node further comprises one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising:
. The information handling system of wherein the non-transitory computer-readable medium or media of the local DPU further comprises one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising:
. The information handling system of:
. A processor-implemented method comprising:
. The processor-implemented method of further comprising:
. The processor-implemented method of wherein:
. The processor-implemented method of wherein:
Complete technical specification and implementation details from the patent document.
The present disclosure relates generally to information handling systems. More particularly, the present disclosure relates to systems and methods for providing witness services in a cluster environment.
The subject matter discussed in the background section shall not be assumed to be prior art merely as a result of its mention in this background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use, such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
In a cluster environment, a set of two or more information handling systems operate or function essentially as a single entity. Typically, they share an Internet Protocol (IP) address, and they process tasks (or workloads). A cluster is often used to perform functions such as reading/writing files, printing, accessing or managing databases, and messaging services. Information handling systems are clustered to help provide high availability.
Each information handling system in a cluster may be referred to as, or may be considered to comprise, a compute node (or simply, “node”). Each information handling system has its own compute-related resources (e.g., hard drive, RAM, network connections, processing, etc.), and may be capable of running/supporting one or more virtual machines. If one compute node within the cluster fails, the workload being handled by the failed/failing compute node may quickly and easily be transferred to another compute node in the cluster, thereby providing high availability by reducing or eliminating downtime and outages.
A witness service helps with the transfer of the workload between compute nodes in a cluster. For example, a witness service tracks the status of execution of the workload. Therefore, as the workload handling is passed between nodes, the witness service helps ensure that it is properly and fully completed. One problem, however, is that there is typically a single witness service that runs on a remote information handling system. If the information handling system on which the witness service runs becomes inoperable (e.g., crashes), the witness service for the cluster is lost. Also, if the witness service becomes inoperable (e.g., it needs to be updated), then the cluster is again left without a witness service.
Accordingly, it is highly desirable to find new ways to provide a witness service.
In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the disclosure. It will be apparent, however, to one skilled in the art that the disclosure can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present disclosure, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system/device, or a method on a tangible computer-readable medium.
Components, or modules, shown in diagrams are illustrative of exemplary embodiments of the disclosure and are meant to avoid obscuring the disclosure. It shall be understood that throughout this discussion that components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including, for example, being in a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.
Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” “communicatively coupled,” “interfacing,” “interface,” or any of their derivatives shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections. It shall also be noted that any communication, such as a signal, response, reply, acknowledgement, message, query, etc., may comprise one or more exchanges of information.
Reference in the specification to “one or more embodiments,” “preferred embodiment,” “an embodiment,” “embodiments,” or the like means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the disclosure and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.
The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. The terms “include,” “including,” “comprise,” “comprising,” and any of their variants shall be understood to be open terms, and any examples or lists of items are provided by way of illustration and shall not be used to limit the scope of this disclosure.
A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated. The use of memory, database, information base, data store, tables, hardware, cache, and the like may be used herein to refer to system component or components into which information may be entered or otherwise recorded. The terms “data,” “information,” along with similar terms, may be replaced by other terminologies referring to a group of one or more bits, and may be used interchangeably. The terms “packet” or “frame” shall be understood to mean a group of one or more bits. The term “frame” shall not be interpreted as limiting embodiments of the present invention to Layer 2 networks; and, the term “packet” shall not be interpreted as limiting embodiments of the present invention to Layer 3 networks. The terms “packet,” “frame,” “data,” or “data traffic” may be replaced by other terminologies referring to a group of bits, such as “datagram” or “cell.” The words “optimal,” “optimize,” “optimization,” and the like refer to an improvement of an outcome or a process and do not require that the specified outcome or process has achieved an “optimal” or peak state.
It shall be noted that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be done concurrently.
Any headings used herein are for organizational purposes only and shall not be used to limit the scope of the description or the claims. Each reference/document mentioned in this patent document is incorporated by reference herein in its entirety.
In one or more embodiments, a stop condition may include: (1) a set number of iterations have been performed; (2) an amount of processing time has been reached; (3) convergence (e.g., the difference between consecutive iterations is less than a first threshold value); (4) divergence (e.g., the performance deteriorates); and (5) an acceptable outcome has been reached.
It shall also be noted that although embodiments described herein may be within the context of a two-node cluster, aspects of the present disclosure are not so limited. Accordingly, the aspects of the present disclosure may be applied or adapted for use in other contexts.
As noted above, a cluster environment typically comprises a set of two or more information handling systems that operate or function as a single entity. Clusters are very beneficial for a number of reasons, including but not limited to ensuring high availability, load balancing, and scaling.
If one compute node within the cluster fails, the workload being handled by the failed/failing compute node may quickly and easily be transferred to another compute node in the cluster—thereby providing high availability by reducing or eliminating downtime and outages.
To help facilitate high availability in a cluster, the cluster uses a witness service. A witness service in networking is a mechanism used in distributed systems, particularly in clusters. A witness service is an entity that observes and verifies events and actions within a cluster to ensure consistency, correctness, or fault tolerance. Given a distributed system, the witness service may be used to achieve consensus among multiple nodes by, for example, observing the status of the nodes, observing agreement or disagreement between or among nodes, and verifying consistency. Furthermore, in systems where redundancy is used, a witness service may monitor the health and status of components or nodes. The witness service may help determine the correct behavior when failures occur and ensure that the correct data exists and/or that redundant components are functioning properly. For example, a witness service may be used to track execution of a workload. Therefore, as the workload handling is passed between nodes, the witness service helps ensure that it is properly and fully completed.
A witness service may also play a role in security-related tasks, such as monitoring network traffic for suspicious activities or verifying the integrity of cryptographic operations. Overall, the specific functionality and implementation of a witness service may vary depending on the requirements and architecture of the system in which it is deployed.
However, as noted above, one issue with current implementations of witness services is that there is a single witness service for a cluster, which typically runs on a remote information handling system. If the information handling system on which the witness service runs becomes inoperable (e.g., crashes), the witness service for the cluster is lost. Also, if the witness service becomes inoperable (e.g., it is being updated), the cluster is again left without a witness service.
If a workload is transferred to another compute node while the witness service is inoperable, it can create what is sometimes referred to as the “split-brain” problem. The “split-brain” problem in a cluster refers to a scenario where a cluster of compute nodes becomes divided or partitioned in such a way that communication between the nodes is disrupted or limited. This disruption can lead to various issues, including data inconsistency, service outages, and degraded performance.
In a clustered environment, nodes typically work together to provide high availability and reliability for applications and services. They communicate with each other to synchronize data and coordinate actions. However, if the cluster becomes split into separate segments due to network issues, hardware failures, or other factors, each segment may continue operating independently without awareness of the other segments. This situation can result in conflicting updates, data corruption, or other problems when the segments attempt to reconcile their states once the partition is resolved. To address the split-brain problem, a cluster employs a witness service to act as the judge or determiner of which segment is correct in order to resolve the partition issue.
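By way of illustration only, the arbitration role described above can be sketched with a simple majority-vote model. The disclosure does not specify a voting scheme; the function name `has_quorum` and the assumption that each compute node and the witness hold one vote each are hypothetical and are introduced here only to show how a witness can break a tie between partitioned segments.

```python
def has_quorum(reachable_nodes: int, total_nodes: int, witness_reachable: bool) -> bool:
    """Return True if a partition segment may keep serving workloads.

    Hypothetical majority-vote model (an assumption, not from the disclosure):
    every compute node holds one vote and the witness holds one extra vote,
    so a two-node cluster has three votes in total.
    """
    total_votes = total_nodes + 1  # compute nodes plus the witness
    votes = reachable_nodes + (1 if witness_reachable else 0)
    return votes > total_votes // 2  # strict majority required


# A two-node cluster split into two segments of one node each.
# Only the segment that can still reach a witness keeps quorum.
segment_a = has_quorum(reachable_nodes=1, total_nodes=2, witness_reachable=True)
segment_b = has_quorum(reachable_nodes=1, total_nodes=2, witness_reachable=False)
```

Under this assumed model, at most one segment of a partitioned two-node cluster can hold a majority, which is what prevents both segments from accepting conflicting updates.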
In addition to the issues related to potential loss of the witness service, there are other issues of using a witness node, which may be on an additional sled/blade or an individual server, to arbitrate cluster resource ownership. First, this witness node represents additional equipment costs. Second, the witness node also represents additional operational overhead. For example, in addition to having to separately manage the witness node, developing and maintaining the witness node adds costs and development time to the overall solution. In addition, updating (e.g., firmware (FW) and/or software (SW)) the witness node creates an issue, as noted above, of potentially creating the split-brain problem while the witness node is updating. An alternative is to take the cluster offline. Finally, if the witness service is operating on one of the compute nodes, during routine server maintenance of that compute node, it results in a loss of the witness service since a power down of the compute node in the cluster brings down the cluster and the witness service. But, such action defeats the high availability purpose of the cluster node.
Accordingly, in one or more embodiments, each compute node of the cluster includes a data processing unit (DPU) system, which may be implemented as a blade, daughter card, or other type of subsystem of the information handling system of the compute node. Each DPU system is configured to operate a witness service rather than using an additional witness compute node. In one or more embodiments, the DPUs on the compute nodes are communicatively connected via one or more network connections to synchronize witness data. For an information handling system that operates as a compute node and includes a local DPU system operating a local witness service running on the local DPU system, the local witness service is responsible for synchronizing the data related to the workload(s) being handled by the compute node with the one or more remote witness services operating on corresponding one or more remote DPU systems in the cluster. In the event that a local compute node fails and recovers, a remote witness service operating on a remote DPU system can participate to resolve any split-brain scenario and keep the cluster alive.
In one or more embodiments, each DPU system may comprise its own power supply (e.g., can be placed on AUX (auxiliary) power) to ensure power for the DPU system and its witness service so that the witness service may continue to operate even if its corresponding compute node is powered down or crashes. For example, in one or more embodiments, the information handling system may comprise dual power supply units (PSUs) to provide redundancy with AUX power. With power coming from an AUX unit and the local DPU running a separate operating system (OS), in the event the local compute node's OS crashes or hangs, the local compute node can warm or cold reboot while the local witness service on the local DPU remains active.
The figures depict an example cluster environment, according to embodiments of the present disclosure. A cluster environment typically comprises a set of two or more information handling systems. In the depicted embodiment, an information handling system may comprise a compute node for operating in a cluster with a peer compute node. For example, if one compute node is considered a local node, its peer or remote node in the cluster is the other compute node. In one or more embodiments, the information handling system or compute node may comprise the components of a typical information handling system, examples of which are described below in Section C. These components include, but are not limited to, one or more processors and a non-transitory computer-readable medium or media communicatively coupled to the one or more processors of the compute node. The medium or media may comprise one or more sets of instructions which, when executed by at least one of the one or more processors, cause steps to be performed comprising supporting a local cluster service related to a workload, and the steps may include detecting an operational state of a local data processing unit (DPU).
Also depicted in the information handling system is a DPU system that is local to the information handling system. In one or more embodiments, the DPU system may be a card, sled, blade, or computing system, which may comprise the components of a typical information handling system, examples of which are described below in Section C. For example, among other components, the DPU system may include, but is not limited to, one or more data processing units and a non-transitory computer-readable medium or media communicatively coupled to the one or more data processing units for performing one or more services. While not depicted, each DPU system is connected to one or more power supplies (e.g., a main PSU and an auxiliary PSU) so that the DPU system and its witness service remain operational through any compute node (e.g., host) shutdown, reboot, or update. When a compute node is running, there is no data loss between it and a witness service because there is a DPU with an operational witness service, whether local, remote, or both. Thus, no additional witness node is required to maintain cluster quorum.
Also depicted in the illustrated embodiment is a first port for connecting via a witness network to a remote DPU system of the peer compute node, and a second port for connecting via a DPU workload network to the remote DPU system of the peer compute node. The local DPU system may be configured to operate a local witness service that synchronizes data related to the workload with a remote witness service operating on the remote DPU system of the peer compute node.
Embodiments move responsibility for the witness service to the DPU systems. In the depicted embodiment, the first ports of the two DPUs are connected via one or more networking information handling systems to synchronize the witness data. The second ports of the two DPUs are likewise connected via one or more networking information handling systems, and that connection is used for communicating data for the workload and other operations.
In one or more embodiments, the information handling system may further comprise a local network interface card (NIC) that includes one or more ports for connecting via a management network to a remote NIC of the peer compute node.
As will be explained in more detail below, method embodiments use one or more of the connections shown in the depicted cluster environment to eliminate the need for a third compute node as a witness.
The figures depict the example cluster environment in which a compute node becomes nonoperational, together with an example methodology for handling when a compute node becomes nonoperational, according to embodiments of the present disclosure.
In one or more embodiments, one of the compute nodes becomes nonoperational for some reason. For example, the operating system of the compute node crashes or is updating, but its local DPU is still functioning. Responsive to the local DPU system or the local witness service detecting that its cluster service is not operational, the following steps may be performed.
Handling of the workload may be migrated from the local cluster service to the peer compute node. That is, the workload of the nonoperational compute node may be transferred to the peer compute node, thereby maintaining high availability. In one or more embodiments, the local DPU may perform the migrating of the workload to the peer compute node (i.e., migrating the handling of the workload to the remote cluster service).
To ensure that there is proper monitoring and handling of the workload, the local witness service maintains synchronization of witness service data related to the workload with the remote witness service operating on the remote DPU system at the peer compute node.
After the compute node resumes operating (e.g., reboots), the compute node (or its cluster service) re-synchronizes with its local DPU witness service. In one or more embodiments, depending upon policy (e.g., a load balancing policy), the workload may remain with the peer compute node or may be returned to the original compute node.
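The sequence above (detect a nonoperational cluster service, migrate the workload, keep the witness services synchronized, and re-synchronize after recovery) may be sketched as follows. This is a minimal in-memory model offered under stated assumptions: the class names `Witness` and `Node`, the functions `handle_node_failure` and `handle_node_recovery`, and the representation of witness data as a workload-to-owner mapping are all hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Witness:
    """Hypothetical witness service on a DPU: records which node owns each workload."""
    owners: dict = field(default_factory=dict)

    def sync_to(self, peer: "Witness") -> None:
        # Push the local ownership records to the remote witness service.
        peer.owners.update(self.owners)


@dataclass
class Node:
    """Hypothetical compute node with its local DPU witness service."""
    name: str
    witness: Witness
    operational: bool = True
    workloads: set = field(default_factory=set)
    view: dict = field(default_factory=dict)  # cluster service's copy of witness data


def handle_node_failure(local: Node, peer: Node) -> None:
    """Run by the local DPU when its host's cluster service is not operational."""
    if local.operational:
        return
    for workload in sorted(local.workloads):
        peer.workloads.add(workload)                  # migrate handling to the peer
        local.witness.owners[workload] = peer.name    # record the new owner
    local.workloads.clear()
    local.witness.sync_to(peer.witness)               # keep both witnesses in step


def handle_node_recovery(local: Node) -> None:
    """After reboot, the cluster service re-synchronizes with its local witness."""
    local.operational = True
    local.view = dict(local.witness.owners)


# Example: node-a crashes while its DPU keeps running, then reboots.
a = Node("node-a", Witness(), workloads={"vm-1"})
b = Node("node-b", Witness())
a.witness.owners["vm-1"] = "node-a"
a.operational = False
handle_node_failure(a, b)
handle_node_recovery(a)
```

In this sketch the workload stays with the peer after recovery; returning it to the original node would be a separate, policy-driven step, as the text notes.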
The figures depict the example cluster environment in which a DPU system becomes nonoperational due to an update, together with an example methodology for handling when a DPU system becomes nonoperational, according to embodiments of the present disclosure.
In one or more embodiments, responsive to the initiation of an update of firmware, software, or both for a DPU system, the DPU system and its witness service are or will become nonoperational during the update process. The handling of the witness service or services related to the workload for this compute node may be migrated from the local witness service to the remote witness service operating on the remote DPU system of the peer compute node. In one or more embodiments, the migration may be performed by the local DPU system or local witness service in preparation for the update. Additionally, or alternatively, the local cluster service may coordinate the migration in preparation for the update or as a result of the cluster service not detecting the local DPU system or the local witness service being operational due to the update.
In one or more embodiments, while the local DPU system is nonoperational, a local network interface card (NIC), which may include one or more ports for connecting (e.g., via a management network) to a remote NIC of the peer compute node, may be used to communicate between the compute nodes. For example, in one or more embodiments, migration of handling of the witness service or services related to the workload from the local witness service to the remote witness service operating on the remote DPU system of the peer compute node may involve communicating via the local NIC and at least part of the management network (e.g., a link and a network information handling system) while the local DPU system and its ports are down during the update process. In one or more embodiments, the NIC may support the enablement of or use of one or more management services or workloads.
The local DPU system updates. Following completion of the update, the handling of the witness service for the workload may be migrated back to the local DPU, which is now updated, from the remote witness service of the remote DPU system of the peer compute node. Concerning the migration back following the update, in one or more embodiments, the updated local DPU system may request migration from the remote/peer compute node (e.g., request migration of the handling of the witness service for the workload from the remote DPU system or from the remote witness service). Additionally, or alternatively, the local cluster service may detect that its local DPU system is now operational (i.e., the update has completed) and may switch its witness service from the remote witness service to its local witness service. In one or more embodiments, the transfer back may include synchronizing with the remote DPU system (or its witness service).
In one or more embodiments, the preceding steps may be repeated for updating of the other DPU system. Note that the elements (e.g., cluster service, DPU system, witness service, NIC, etc.) of the other compute node are local elements relative to it, and the first compute node's elements are remote to it.
Once all updates are completed, the cluster may continue operations with both DPUs having been updated, without interruption to the cluster, including no interruption to witness services.
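The rolling update described above may be sketched as follows for a two-DPU cluster. This is an illustrative model only: the `Dpu` class, the `rolling_update` function, the version strings, and the dictionary used for witness data are hypothetical names and structures assumed for the example, and the sketch does not model the NIC and management network used to carry the migration traffic.

```python
from dataclasses import dataclass, field


@dataclass
class Dpu:
    """Hypothetical DPU system running a witness service."""
    name: str
    version: str
    witness_active: bool = True
    witness_data: dict = field(default_factory=dict)


def rolling_update(dpus: list, new_version: str) -> list:
    """Update exactly two DPUs one at a time.

    Returns a log listing which witness services were active during each
    update, so the caller can confirm one witness was always available.
    """
    log = []
    for i, local in enumerate(dpus):
        remote = dpus[1 - i]  # the peer DPU in a two-node cluster
        # 1. Migrate witness handling for the local node to the remote witness.
        remote.witness_data.update(local.witness_data)
        local.witness_active = False
        log.append([d.name for d in dpus if d.witness_active])
        # 2. Apply the update while the local witness is offline.
        local.version = new_version
        # 3. Migrate back: re-synchronize from the remote witness, then resume.
        local.witness_data = dict(remote.witness_data)
        local.witness_active = True
    return log


pair = [Dpu("dpu-a", "1.0", witness_data={"vm-1": "node-a"}), Dpu("dpu-b", "1.0")]
active_during_update = rolling_update(pair, "1.1")
```

Each entry in the returned log is non-empty, reflecting that the peer witness service covers the cluster while the other DPU is being updated.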
The figures depict the example cluster environment in which there has been an interruption in connectivity between DPUs/witness services, together with an example methodology for handling when there has been an interruption in connectivity between DPUs/witness services, according to embodiments of the present disclosure.
As illustrated, assume that there has been some interruption in the witness network between the two DPUs. The interruption may be a result of one or more of the links failing. Alternatively, or additionally, the one or more networking devices (e.g., a networking information handling system) that support the witness network communication pathway may have become nonoperational or may have malfunctioned. In any event, connectivity between the DPUs has been interrupted.
In one or more embodiments, responsive to the local witness service or the local DPU system detecting an interruption in communicating with the remote witness service via the witness network, handling of witness service data related to a workload may be migrated to another network connection. For example, in one or more embodiments, the data may be transitioned by the local cluster service, the local DPU system, and/or a witness service to use at least part of a different network or differently designated network. In the example shown, the data may be communicated via the second port of the local DPU to the DPU workload network, including via a networking information handling system and its connection.
Note that if both the first port and the second port of the local DPU had connectivity issues, the data may be communicated via the NIC and at least part of the management network. Depending upon what is active at the other compute node, the data may be communicated from the network information handling system directly to the remote DPU or may traverse the remote NIC to finally arrive at the peer witness service.
Since connectivity between the witness services has been maintained, normal data synchronization between the local witness service operating on the local DPU and the remote witness service operating on the remote DPU may continue.
In one or more embodiments, responsive to the connectivity via the witness network being restored, the handling of the witness service data may be moved back to the witness network of the local DPU. For example, the handling of data via the second port of the local DPU or via the NIC may be moved back to being communicated via the first port of the local DPU.
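The fallback order described above (witness network first, then the DPU workload network, then the management network via the NIC, with a return to the witness network once it is restored) may be sketched as a simple preference list. The path names, the `PATH_PREFERENCE` tuple, and the `select_path` function are hypothetical and are assumed only for illustration; how link state is actually detected is outside the sketch.

```python
# Hypothetical path names, ordered from most to least preferred.
PATH_PREFERENCE = ("witness_network", "dpu_workload_network", "management_network")


def select_path(link_up: dict) -> str:
    """Pick the most preferred path to the remote witness whose link is up.

    `link_up` maps a path name to True/False; missing entries count as down.
    """
    for path in PATH_PREFERENCE:
        if link_up.get(path, False):
            return path
    raise RuntimeError("no path to the remote witness service")


# Normal operation: witness data uses the dedicated witness network.
normal = select_path({"witness_network": True, "dpu_workload_network": True,
                      "management_network": True})

# Witness network interrupted: fall back to the DPU workload network.
degraded = select_path({"witness_network": False, "dpu_workload_network": True,
                        "management_network": True})

# Both DPU ports down: fall back to the NIC and the management network.
nic_only = select_path({"witness_network": False, "dpu_workload_network": False,
                        "management_network": True})

# Witness network restored: handling moves back automatically.
restored = select_path({"witness_network": True, "dpu_workload_network": True,
                        "management_network": True})
```

Because the selection is re-evaluated against the same ordered list each time, moving back to the witness network after it recovers requires no separate logic in this sketch.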
November 13, 2025