Embodiments relate to determining whether to take a resource distribution unit (RDU) of a datacenter offline when the RDU becomes faulty. RDUs in a cloud or datacenter supply a resource such as power or network connectivity to respective sets of hosts that provide computing resources to tenant units such as virtual machines (VMs). When an RDU becomes faulty, some of the hosts that it supplies may continue to function while others become unavailable for various reasons. This can make it difficult to decide whether to take the RDU offline for repair, since in some situations requirements of the datacenter may be at odds with one another. To decide whether to take an RDU offline, the potential impact on availability of tenant VMs, unused capacity of the datacenter, a number or ratio of unavailable hosts on the RDU, and other factors may be considered to make a balanced decision.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method performed by one or more control server devices participating in a cloud fabric that controls a compute cloud, the compute cloud comprised of a plurality of cloud server devices, the compute cloud further comprising power distribution units (PDUs), each PDU servicing a respective set of the cloud server devices, the cloud server devices hosting tenant components of tenants of the compute cloud, the method comprising: receiving a notification that a PDU is in a failure state, the PDU servicing a set of the cloud server devices, wherein the PDU comprises a controller, and wherein the PDU is configured to provide power to respective cloud server devices as controlled by the controller, wherein the PDU is configured to supply power while the controller is not able to respond to signals for controlling the supplying of power to the cloud server devices; and based on the notification, determining whether to place the PDU in a state of unavailability, wherein when the PDU enters the unavailable state each of the cloud server devices in the set of cloud server devices consequently becomes unavailable if not already unavailable, and wherein the determining whether to place the PDU in a state of unavailability comprises: determining, based on the set of cloud server devices: (i) a measure of availability of a population of the tenant components, and (ii) a measure or prediction of capacity for a resource of the compute cloud.
2. A method according to claim 1, wherein the determining whether to place the PDU in a state of unavailability is further based on a ratio of tenant components on the set of cloud server devices to tenant components on the set of cloud server devices that are already unavailable.
3. A method according to claim 2, wherein the ratio is computed based on a determination that making the set of cloud server devices unavailable corresponds to changing the measure or prediction of capacity within a threshold.
4. A method according to claim 3, further comprising determining to make the set of cloud server devices unavailable based on determining that the ratio satisfies a corresponding threshold.
5. A method according to claim 1, wherein (i) the determining, based on the set of cloud server devices, a measure or prediction of capacity for a resource of the compute cloud, is (ii) based on determining that the measure or prediction of capacity does not satisfy a corresponding threshold if the set of cloud server devices became unavailable.
6. A method according to claim 1, further comprising determining to make the set of cloud server devices unavailable based on determining that if the set of cloud server devices became unavailable the measure or prediction of capacity continues to satisfy a corresponding threshold.
7. A method according to claim 1, wherein the tenant components comprise virtual machines managed by the compute cloud.
8. A method performed by one or more computing devices comprising processing hardware and storage hardware, the method for determining whether to offline a power distribution unit (PDU) in a datacenter, the datacenter comprised of PDUs providing power to respective sets of datacenter hosts that service tenant components of datacenter tenants, the method comprising: receiving an indication that the PDU is in a fault state, wherein the PDU continues to supply power to the set of datacenter hosts while the PDU is in the fault state, the fault state corresponding to the PDU not responding to control signals for controlling the supplying of power by the PDU while supplying power to the set of datacenter hosts; collecting, and storing in the storage hardware, capacity information indicating net available capacity of a computing resource of the datacenter; collecting, and storing in the storage hardware, tenant availability information indicating presence of tenant components on one or more hosts receiving power from the PDU; and based on the indication that the PDU is in the fault state, determining, by the processing hardware, whether to offline the PDU, the determining comprising: (i) according to the capacity information in the storage hardware, computing a predicted net available capacity of the computing resource based on assuming offlining of the PDU, and (ii) according to the tenant availability information in the storage hardware, computing a predicted tenant availability measure based on assuming offlining of the PDU.
9. A method according to claim 8, wherein the determining whether to offline the PDU further comprises determining whether the computed predicted net available capacity is below a minimum capacity designated for the datacenter.
10. A method according to claim 9, wherein the determining whether to offline the PDU further comprises determining whether the predicted tenant availability measure is below an availability threshold.
11. A method according to claim 10, wherein the determining whether to offline the PDU further comprises determining that the predicted net available capacity is below the minimum capacity and based thereon determining whether to offline the PDU based on whether the predicted tenant availability measure is below the availability threshold.
12. A method according to claim 8, wherein the capacity information indicating net available capacity of a computing resource of the datacenter comprises a prediction of future capacity of the datacenter.
13. A method according to claim 8, wherein the datacenter comprises a compute cloud comprised of a control fabric, and wherein the capacity information and tenant availability information are obtained from the control fabric.
14. A computing device comprising: processing hardware; storage hardware storing information configured to cause the processing hardware to perform a process, the process comprising: determining that a power distribution unit (PDU) of a datacenter is in a faulty state, the PDU supplying power to computer hosts of the datacenter while in the faulty state, the PDU comprising a controller that does not respond to power-control messages while in the faulty state, and based on the determining that the PDU is in the faulty state: determining whether to make the PDU unavailable to the computer hosts supplied thereby by (i) computing, based on assuming unavailability of whichever tenant virtual machines (VMs) are on the computer hosts being supplied power by the PDU, a level of availability of tenant VMs in the datacenter, and determining whether the level of availability of tenant VMs in the datacenter is below a threshold, and by (ii) computing a level of capacity of a computing resource provided by the computer hosts of the datacenter exclusive of the computer hosts being supplied the resource by the PDU, and determining whether the level of capacity of the computing resource is below a threshold.
15. A computing device according to claim 14, wherein the level of capacity comprises an estimate of a net unused amount of the computing resource in the datacenter or of a given set of computing hosts thereof.
16. A computing device according to claim 14, wherein the determining whether to make the PDU unavailable to the computer hosts supplied the power thereby is further based on a number of VMs rendered unavailable due to the unavailability of the PDU.
17. A computing device according to claim 14, wherein the number of VMs rendered unavailable is computed based on whether VMs under the PDU comprise tenant VMs.
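The decision recited in claims 8 to 11 can be illustrated with a minimal Python sketch. It is not the patented implementation. All names (`Host`, `Decision`, `decide_offline`) and the capacity unit are hypothetical, and the claims do not say which way the availability test should resolve. The sketch assumes that the PDU is taken offline when predicted capacity stays at or above the minimum, and that otherwise it is taken offline only if predicted tenant availability stays at or above the threshold. It also assumes the datacenter-wide free capacity figure already excludes hosts that are down.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Host:
    """One host powered by the faulty PDU (hypothetical model)."""
    free_capacity: float  # unused capacity on this host, e.g. cores
    tenant_vms: int       # tenant VMs placed on this host
    available: bool       # False if the host is already unavailable


@dataclass(frozen=True)
class Decision:
    offline: bool
    predicted_capacity: float
    predicted_availability: float


def decide_offline(
    pdu_hosts: Iterable[Host],
    datacenter_free_capacity: float,
    total_tenant_vms: int,
    unavailable_tenant_vms: int,
    min_capacity: float,
    availability_threshold: float,
) -> Decision:
    # Only hosts that are still up contribute new losses if the PDU goes offline.
    live = [h for h in pdu_hosts if h.available]

    # (i) Predicted net available capacity, assuming the PDU is offlined.
    predicted_capacity = datacenter_free_capacity - sum(h.free_capacity for h in live)

    # (ii) Predicted tenant availability, assuming the PDU is offlined.
    newly_down = sum(h.tenant_vms for h in live)
    if total_tenant_vms > 0:
        still_up = total_tenant_vms - unavailable_tenant_vms - newly_down
        predicted_availability = still_up / total_tenant_vms
    else:
        predicted_availability = 1.0

    # Assumed policy: enough capacity means offline; otherwise fall back
    # to the availability test (direction is an assumption, see lead-in).
    if predicted_capacity >= min_capacity:
        offline = True
    else:
        offline = predicted_availability >= availability_threshold

    return Decision(offline, predicted_capacity, predicted_availability)
```

A caller would pass the hosts under the faulty PDU together with datacenter-wide totals, for example as obtained from the control fabric mentioned in claim 13, and act on the returned `offline` flag.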
July 20, 2018
January 26, 2021