Disclosed are embodiments for automatically resolving faults in a complex network system. Some embodiments monitor one or more of system operational parameter values and message exchanges between network components. A machine learning model detects a fault in the complex network system, and an action is selected based on a cause of the fault. After the action is applied to the complex network system, additional monitoring is performed to determine either that the fault has been resolved or that additional actions are to be applied to further resolve the fault.
Legal claims defining the scope of protection, as filed with the USPTO.
. A network management system comprising processing circuitry configured to:
. The network management system of, wherein the network management system is configured to identify the diagnostic action for diagnosing the fault based on a component associated with the determined candidate cause.
. The network management system of, wherein the network management system is configured to identify the diagnostic action for diagnosing the fault from a plurality of diagnostic actions, the identified diagnostic action having a lowest cost.
. The network management system of, wherein the network management system is configured to identify the diagnostic action for diagnosing the fault based on at least one of:
. The network management system of, wherein the network management system is configured to identify the diagnostic action for diagnosing the fault via a diagnostic action table.
. The network management system of, wherein, to configure the diagnostic threshold based at least in part on the determined candidate cause, the network management system is configured to:
. The network management system of, wherein the network management system is configured to:
. The network management system of, wherein the network management system is configured to:
. The network management system of, wherein the diagnostic action is one of a plurality of diagnostic actions comprising at least one of:
. The network management system of, wherein the plurality of rectifying actions comprises at least one of:
. A method comprising:
. The method of, wherein identifying the diagnostic action for diagnosing the fault is based on a component associated with the determined candidate cause.
. The method of, wherein identifying the diagnostic action for diagnosing the fault comprises identifying the diagnostic action for diagnosing the fault from a plurality of diagnostic actions, the identified diagnostic action having a lowest cost.
. The method of, wherein identifying the diagnostic action for diagnosing the fault is based on at least one of:
. The method of, wherein identifying the diagnostic action for diagnosing the fault comprises identifying the diagnostic action for diagnosing the fault via a diagnostic action table.
. The method of, wherein configuring the diagnostic threshold based at least in part on the determined candidate cause comprises:
. The method of, further comprising:
. The method of, further comprising:
. The method of, wherein the diagnostic action is one of a plurality of diagnostic actions comprising at least one of:
. Non-transitory, computer-readable media comprising instructions that, when executed, are configured to cause processing circuitry of a network management system to:
Complete technical specification and implementation details from the patent document.
This application is a continuation of U.S. patent application Ser. No. 18/160,868, filed Jan. 27, 2023, which is a continuation of U.S. patent application Ser. No. 16/835,757, filed Mar. 31, 2020 (now issued as U.S. Pat. No. 11,570,038 on Jan. 31, 2023), the entire contents of which are incorporated herein by reference.
This disclosure generally relates to diagnostics of network systems. In particular, the disclosed embodiments describe use of a machine learning model to automatically resolve faults in the network system.
Users of complex wireless networks, such as Wi-Fi networks, may encounter degradation of system level experience (SLE) parameters, which can result from a variety of complex factors. To ensure the complex wireless network meets the needs of its user community, it is important to quickly resolve any problems that arise with the system's operation. Resolving the problems can include identifying one or more root causes of the system level experience problem and initiating corrective measures. However, when the network comprises a large number of devices, including devices of varying type and functionality, identifying a root cause can take a substantial amount of time. If the system is inoperative or operating at a reduced capacity during this period of time, users of the system can be impacted, in some cases severely. Thus, improved methods of isolating root causes of problems associated with complex network systems are needed.
Disclosed are example embodiments that determine and perform corrective actions to a complex network system (e.g. a wireless network system) to improve system performance. Performance of the complex system is assessed based on service level experience parameters, or more generally, operational parameters. These can include parameters such as data transmission latency measurements, percentage of connection attempts that are successful, percentage of access points (APs) that are available for association, error statistics, such as errors generated via dropped connections, packet collisions, or other sources of error, system throughput measurements, or other SLE parameters.
Some embodiments also monitor messages exchanged within the complex network system. This message information is also provided to a machine learning model, which is trained to identify faults and potential root causes of said faults. A fault can include, in various embodiments, any deviation from nominal system operation which the machine learning model is trained to detect. For example, a fault includes, in some embodiments, any one or more of a latency, throughput, jitter, error count, or other operational parameter meeting a criterion. The criterion is defined so as to detect an undesirable system condition. For example, an example criterion evaluates a latency of a device, such as an access point, to determine if the latency is below a predetermined latency threshold. In some embodiments, a fault can be defined to include two or more operational parameters meeting one or more respective criteria. For example, in some embodiments, a fault can be defined to include a latency of a device meeting a first criterion and a throughput of the device meeting a second criterion (both conditions satisfied contemporaneously, in that the latency and throughput are measured within a predetermined elapsed time of each other). A root cause of a fault relates to a condition that is causing the fault. For example, root causes can include a software and/or firmware problem with a particular device, an inoperative network connection between two devices, or other root causes.
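By way of illustration only, a fault defined by two contemporaneous criteria could be evaluated as sketched below. The names `Sample`, `Criterion`, and `fault_detected`, the parameter names, the threshold values, and the comparison directions are all assumptions made for this sketch and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Sample:
    """One operational parameter measurement (hypothetical structure)."""
    name: str         # e.g. "latency_ms" or "throughput_mbps"
    value: float
    timestamp: float  # seconds


@dataclass(frozen=True)
class Criterion:
    """A test applied to the parameter called `name` (hypothetical structure)."""
    name: str
    predicate: Callable[[float], bool]


def fault_detected(samples: Iterable[Sample],
                   criteria: Iterable[Criterion],
                   max_skew_s: float) -> bool:
    """True when the latest sample of every parameter meets its criterion and
    those samples were measured within max_skew_s seconds of each other."""
    samples = list(samples)
    matched_times = []
    for criterion in criteria:
        candidates = [s for s in samples if s.name == criterion.name]
        if not candidates:
            return False
        latest = max(candidates, key=lambda s: s.timestamp)
        if not criterion.predicate(latest.value):
            return False
        matched_times.append(latest.timestamp)
    return bool(matched_times) and max(matched_times) - min(matched_times) <= max_skew_s


# Illustrative thresholds only.
criteria = [
    Criterion("latency_ms", lambda v: v > 200.0),
    Criterion("throughput_mbps", lambda v: v < 5.0),
]
samples = [
    Sample("latency_ms", 250.0, 100.0),
    Sample("throughput_mbps", 2.0, 103.0),
]
```

With these sample values, both criteria are met, so the outcome depends only on whether the two measurements fall within the chosen elapsed-time window.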
Along with root cause identification, the disclosed embodiments identify possible actions to take to either resolve the system problem or obtain additional diagnostic information which can then be applied to increase confidence of a root cause identification. These actions include one or more of initializing a specific beacon radio, restarting a radio, rebooting a device, restarting a software component, restarting a computer, changing operating parameters of a software or hardware component, querying a system component for status information, requesting a system component to perform a task, or other actions.
Each of these actions is associated with a probability, indicating a probability that the action will resolve the problem. The actions are also associated with a cost. For example, a first action resulting in closing a large number of user sessions would typically have a higher cost than a second action that is transparent to the user community.
The disclosed embodiments then select a course of action based on the identified probabilities and associated costs. Some of the disclosed embodiments operate in an iterative manner, in that a first action is applied to the system, and then the system is monitored to collect additional data. For example, if the first action is designed to resolve the problem, the disclosed embodiments monitor the system to determine if the problem is resolved (e.g. the monitored system has returned to nominal operation). If the first action is designed to provide additional diagnostic information, the system is monitored subsequent to application of the first action to collect the additional diagnostic information. In some cases, additional actions are identified based on the system behavior after application of the first action. This process can iterate until the system achieves nominal performance, at which time the diagnostic process is considered complete.
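The iterative selection described above could be sketched, under stated assumptions, as follows. The scoring rule (resolution probability divided by cost), the names `Action`, `select_action`, and `resolve`, and the `max_rounds` limit are hypothetical choices for illustration; the disclosure does not prescribe them. Costs are assumed to be positive.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    """Hypothetical action record with a probability and a cost."""
    name: str
    kind: str                   # "rectify" or "diagnose"
    resolve_probability: float  # probability the action resolves the problem
    cost: float                 # assumed to be greater than zero


def select_action(actions: List[Action]) -> Action:
    """Pick the action with the best probability-to-cost ratio (assumed rule)."""
    return max(actions, key=lambda a: a.resolve_probability / a.cost)


def resolve(actions: List[Action],
            apply_action: Callable[[Action], None],
            is_nominal: Callable[[], bool],
            max_rounds: int = 10) -> List[str]:
    """Apply actions one at a time, re-checking the system after each one,
    until it is nominal, the actions run out, or max_rounds is reached."""
    remaining = list(actions)
    applied = []
    for _ in range(max_rounds):
        if is_nominal() or not remaining:
            break
        action = select_action(remaining)
        remaining.remove(action)
        apply_action(action)  # inject the action into the monitored system
        applied.append(action.name)
    return applied
```

In this sketch, `apply_action` and `is_nominal` stand in for the injection and monitoring steps, which a real deployment would supply.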
Some embodiments utilize a cost function as defined below in Equation 1:
Some embodiments provide a user interface that is configured to accept input defining a root cause of a particular issue. For example, in some cases, a human (e.g. an IT technician) diagnoses a system problem and identifies a root cause. The user interface is configured to allow the human to identify a time period during which the problem occurred, and also to enter information regarding the root cause and corrective actions. The user interface also provides an ability, in some aspects, for the operator to associate a distribution list or alert list with the identified root cause and/or corrective actions. Based on the input provided via the user interface, training data is generated that indicates the symptomatic, diagnostic, and corrective information.
In some embodiments, a machine learning model is at least partially trained via assistance from human support staff. In this mode of operation, a technician, e.g., a field support engineer, can analyze a fault with a network system and identify a root cause. The technician is then able to enter information defining the fault and the root cause, and possible actions to take in response to the fault into a training database. This training database is then used to further train the machine learning model, which benefits from the input provided by the technician.
Some embodiments are configured to automate defect reporting. For example, some embodiments interface with a defect reporting system (e.g. Jira) via a service-oriented interface or other API made available by a provider of the defect reporting system. Some embodiments perform an automatic searching of the defect reporting system for an existing defect that defines parameters similar to those identified during automated diagnostics as described above. If a similar defect report is identified, some embodiments update the report to indicate an additional incidence of the defect based on the recent diagnosis. If no similar defect is identified within the defect database, a new defect report is generated. The new defect report is populated with information from the measured operational parameters as well as information derived from the diagnostic process as described above.
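The search-then-update-or-create behavior described above could be sketched as follows. This sketch uses an in-memory list in place of a real defect reporting system and does not call any vendor API; the names `DefectReport`, `jaccard`, and `report_defect`, the use of Jaccard similarity over parameter labels, and the 0.5 threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass
class DefectReport:
    """Hypothetical defect record kept in an in-memory store."""
    cause_id: str
    parameters: FrozenSet[str]
    occurrences: int = 1


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Similarity of two parameter sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def report_defect(store: List[DefectReport],
                  cause_id: str,
                  parameters: Iterable[str],
                  threshold: float = 0.5) -> DefectReport:
    """Record another incidence on a similar existing report, or file a new one."""
    params = frozenset(parameters)
    for report in store:
        if report.cause_id == cause_id and jaccard(report.parameters, params) >= threshold:
            report.occurrences += 1
            return report
    new_report = DefectReport(cause_id, params)
    store.append(new_report)
    return new_report
```

A production integration would replace the list with queries against the defect reporting system's own interface.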
is an overview diagram of an example system that is implemented in one or more of the disclosed embodiments. The figure shows three APs in communication with wireless terminals. One AP is in communication with a switch. An AP and the switch are in communication with a router. The router is in communication with a network, such as the Internet. A network management system is also connected to the network, and is configured so as to have network connectivity with at least the APs and the router.
The network management system is configured to monitor activity of the system. The network management system monitors activity of the system via messages that include information relating to operation of the system. For example, the messages indicate, in various embodiments, operational parameter values of various devices included in the system, message activity of messages exchanged between network components of the system, or other information. For example, the network management system collects information relating to operational parameters of one or more devices, such as any of the APs, wireless terminals, switch, or router. This information may include statistical information that is maintained by a respective device. For example, in some embodiments, one or more of the APs maintains statistical information describing, for example, a number of wireless terminals associated with the respective AP, communication latencies or throughputs, delays in establishing connections or associations with wireless terminals, communication errors detected, packet collisions, packet errors, CPU utilization, memory utilization, I/O capacity, and other metrics that characterize communication conditions at the AP. In some embodiments, the network management system is also configured to monitor individual messages passed between network components of the system. For example, the network management system is configured to monitor, in some embodiments, network messages passed between an AP and the switch, or between an AP and the router. This monitoring is achieved, in some aspects, via message summary information provided by the device (e.g. an AP) to the network management system. Examples of message summary information are provided below.
Based on the monitored activity and the operational parameters, the network management system is configured to perform one or more actions on one or more of the components of the system, at least when particular conditions are detected. For example, by monitoring operational parameters and/or individual messages passed between network components, the network management system identifies that the system is operating at a reduced level (relative to a nominal level). Further based on the monitoring of operational parameters and messages, the network management system identifies possible root causes of the reduced performance of the system and determines one or more actions to take. In some cases, the action(s) is designed to correct a problem identified by the network management system. In other cases, the action provides additional diagnostic information that allows the network management system to determine the root cause of the problem. These concepts are further elaborated below.
shows example message portions that are implemented in one or more of the disclosed embodiments. The three message portions discussed below are included, in various embodiments, in one or more of the messages discussed above. One or more fields of the example message portions are used in some of the disclosed embodiments to communicate message content information exchanged between network component devices of a network system to a network management system for processing.
shows a first message portion, a second message portion, and a third message portion. The first message portion includes a timestamp field, source device field, destination device field, type field, length field, and parameters of interest field. The timestamp field indicates a time when the message information described by the remaining fields of the first message portion was generated. The source device field identifies a source device of a message. The destination device field indicates a destination device of the message. The type field indicates a type of message. For example, the type field indicates, in some embodiments, whether the message is a data message, a connection request message, a connection establishment message, a connection reset message, or some other message type. The length field indicates a length of the message. The parameters of interest field indicates any other characteristic of the message that may be of interest. In some embodiments, the parameters of interest field includes tagged values to assist a device decoding the first message portion in interpreting the contents of the parameters of interest field. The first message portion is used in those embodiments that send information on individual messages passed between components of the system to the network management system. The first message portion generally does not aggregate data relating to multiple messages but instead represents a single message. While the first message portion provides a granular level of detail on the messages passed between components of the system, it may impose more overhead on the system than the other message portions discussed below.
The second example message portion includes a timestamp field, source device field, destination device field, type field, and count field. The timestamp field defines a time period when message information conveyed by the second message portion was generated. In some embodiments, a machine learning model employed by one or more of the disclosed embodiments relies on values stored in the timestamp field to establish a time series of message exchanges from which a diagnosis of a complex network system is derived. The source device field identifies a source device of one or more messages. The destination device field identifies a destination device of the one or more messages represented by the second message portion. The type field indicates a type of the one or more messages represented by the second message portion. The count field identifies a number of messages represented by the second message portion. Thus, while the first message portion represents a single message, and can therefore represent the message in more detail, e.g. via the parameters of interest field and the length field, the second message portion summarizes multiple messages of a particular type exchanged between a common source (e.g. the source device field) and destination (e.g. the destination device field). Some embodiments are configured to utilize both the first message portion and the second message portion. For example, some embodiments utilize the second message portion to summarize messages meeting a first criterion and the first message portion to communicate information on messages meeting a second criterion. For example, certain types of messages (e.g. error messages) are represented via the first message portion, where more detailed information is provided to the network management system, while the second message portion is used to represent other message types (e.g. data messages or other messages indicative of nominal operation).
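The relationship between the per-message portion and the summarizing portion could be sketched as below, where individual message records are aggregated into counts per time period, source, destination, and type. The class names `MessageRecord` and `MessageSummary`, the function `summarize`, and the fixed-length period are assumptions made for this sketch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MessageRecord:
    """Per-message portion: describes exactly one message (hypothetical layout)."""
    timestamp: float
    source: str
    destination: str
    msg_type: str
    length: int


@dataclass(frozen=True)
class MessageSummary:
    """Summarizing portion: a count of like messages within one time period."""
    period_start: float
    source: str
    destination: str
    msg_type: str
    count: int


def summarize(records: Iterable[MessageRecord], period_s: float) -> List[MessageSummary]:
    """Group records by (period, source, destination, type) and count each group."""
    counts = Counter()
    for record in records:
        period_start = (record.timestamp // period_s) * period_s
        key = (period_start, record.source, record.destination, record.msg_type)
        counts[key] += 1
    return [MessageSummary(*key, count) for key, count in sorted(counts.items())]
```

The summary trades the per-message detail (length, parameters of interest) for lower reporting overhead, which matches the trade-off described above.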
The third example message portion includes a timestamp field, CPU utilization field, memory utilization field, latency field, packet error count field, collisions count field, a number of connections field, and other operational parameter values field. Whereas the first and second message portions summarize or otherwise provide information on messages passed between components of a system being monitored, the third message portion is designed to communicate parameter values from a network component of the system being monitored (e.g. an AP) to the network management system. The timestamp field defines a time period for which the operational parameter values defined by the third message portion were relevant. The source device field identifies a device whose parameters are described by the third message portion. The CPU utilization field defines a CPU utilization of a device generating the third message portion. The memory utilization field defines a memory utilization of the device generating the third message portion. The latency field defines a latency imparted by the device or experienced by the device on the network. The packet error count field defines a number of packet errors detected by the device. The collisions count field defines a number of packet collisions experienced by the device. The number of connections field defines a number of connections maintained by the device. The other operational parameter values field defines one or more other operational parameter values of the device. For example, other operational parameter values indicated by the third message portion can include but are not limited to an access point name, a basic service set identifier (BSSID), a communication channel, a communication frequency band, media access control (MAC) information, a number of associated wireless terminals of a network component device (e.g. at an AP), or a service set name.
shows example data structures that are maintained by one or more of the disclosed embodiments. While the data structures are described as relational database tables, other embodiments utilize other data organization methods. For example, some embodiments utilize traditional in-memory structures such as arrays or linked lists, trees, queues, graphs, or other data structures. In other embodiments, an unstructured data storage technology is relied upon.
shows a model output table, a root cause table, an action table, an alert list table, a class table, and a diagnostic action table. The model output table includes a probability field, a cause identifier field, and a component identifier field. The probability field defines a probability that a root cause identified via the cause identifier field is a root cause of a problem identified by a model as employed in this disclosure. The cause identifier field uniquely identifies a root cause, and may be cross-referenced with the cause identifier field, discussed below, in the root cause table. The component identifier field identifies a component associated with the cause (identified via the cause identifier field). For example, the component identifier field identifies a software component or process, hardware component or process, or a device. The root cause table maps a cause (identified via its cause identifier field) to one or more actions (identified via a corresponding field). The root cause table also includes an alert list identifier field. The alert list identifier field identifies a list of addresses to alert when a particular cause is identified (the cause identified by the cause identifier field). Thus, the root cause table represents that multiple different actions (or a single action) can be appropriate for a single root cause (identified via the cause identifier field).
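The cross-reference between the model output table and the root cause table could be sketched as a simple lookup. The row classes, the function name `actions_for_most_likely_cause`, and the choice to act on the single most probable cause are assumptions made for this sketch rather than details of the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ModelOutputRow:
    """One row of the model output table (hypothetical in-memory form)."""
    probability: float
    cause_id: str
    component_id: str


@dataclass(frozen=True)
class RootCauseRow:
    """One row of the root cause table: a cause mapped to one action."""
    cause_id: str
    action_id: str
    alert_list_id: str


def actions_for_most_likely_cause(
        model_output: List[ModelOutputRow],
        root_causes: List[RootCauseRow]) -> Tuple[ModelOutputRow, List[str]]:
    """Return the most probable cause and every action mapped to that cause."""
    top = max(model_output, key=lambda row: row.probability)
    actions = [row.action_id for row in root_causes if row.cause_id == top.cause_id]
    return top, actions


model_output = [
    ModelOutputRow(0.2, "fw_bug", "ap-1"),
    ModelOutputRow(0.7, "link_down", "sw-1"),
]
root_causes = [
    RootCauseRow("link_down", "restart_link", "net-ops"),
    RootCauseRow("link_down", "check_cable", "net-ops"),
    RootCauseRow("fw_bug", "upgrade_fw", "sw-team"),
]
```

Because one cause can appear in several rows of the root cause table, the lookup returns a list of actions rather than a single action.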
The action table includes an action identifier field, action type field, action function field, cost function field, a confidence value field (e.g. resolution probability if the action is taken), and an action permitted field. The action identifier field uniquely identifies a particular action that is performed in one or more of the disclosed embodiments. The action type field indicates whether the action is designed to rectify a problem or provide additional diagnostic information as to a root cause of the problem. The action function field stores information that allows an implementation to perform the identified action. For example, the action function field may store an entry point to an API that implements the action, in some embodiments. Examples of actions include restarting a specific radio in an access point, restarting a beacon in an access point, restarting only radios with a specific frequency (e.g. 2.4 GHz and/or 5 GHz) in an access point, and restarting a device (such as an AP). Other examples of possible actions include upgrading software running on a device, upgrading driver software, upgrading application software, and upgrading software for a specific module.
The cost function field defines a cost function for the action. At least some of the disclosed embodiments utilize the cost function defined by this field to determine a cost of invoking the action. This cost information is used in some embodiments to select between multiple actions. The confidence value field indicates, for rectifying actions, a probability the action will resolve the root cause problem. Some embodiments may relate the cost of an action to a probability or confidence that the action resolves the root cause when determining whether to invoke an action. For example, some embodiments determine a cost of performing an action based on an impact of the action divided by a probability or confidence that the action fixes the identified problem. In other words, some embodiments determine a cost of an action to be inversely related to a probability or confidence that the action fixes the underlying issue. The action permitted field defines whether the action can be automatically performed in a particular implementation. For example, some embodiments provide a user interface that allows system administrators or other individuals to define which rectifying actions can be automatically performed by the disclosed embodiments. This user interface is, in various embodiments, a graphical user interface or even something simple such as a text configuration file that defines the permitted or unpermitted actions. Thus, some embodiments consult the action permitted field before performing an action to confirm such action is permitted. Otherwise, if the action is not marked as permitted, one or more alerts may still be generated to an appropriate distribution list, as described above and below with respect to the alert list identifier field and the alert list table.
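The cost relationship and the permitted-action check described above could be sketched as follows. The cost is computed as impact divided by confidence, as the text describes for some embodiments; the names `ActionRow`, `action_cost`, and `decide`, the numeric impact scale, and the rule of choosing the single lowest-cost action are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ActionRow:
    """One row of the action table (hypothetical in-memory form)."""
    action_id: str
    action_type: str   # "rectify" or "diagnose"
    impact: float      # disruption caused by the action, in arbitrary units
    confidence: float  # probability the action fixes the problem, in (0, 1]
    permitted: bool    # whether the action may be performed automatically


def action_cost(row: ActionRow) -> float:
    """Cost is impact divided by confidence, so low confidence raises the cost."""
    if not 0.0 < row.confidence <= 1.0:
        raise ValueError("confidence must be in (0, 1]")
    return row.impact / row.confidence


def decide(rows: List[ActionRow]) -> Tuple[str, ActionRow]:
    """Pick the lowest-cost action; perform it only if permitted, else alert."""
    best = min(rows, key=action_cost)
    return ("perform" if best.permitted else "alert", best)
```

Returning "alert" for an unpermitted action reflects the text above: the action is not performed automatically, but a notification can still be sent to the appropriate distribution list.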
The alert list table includes an alert list identifier field and an alert address field. The alert list identifier field uniquely identifies an alert distribution list. The alert address field identifies one address included in the alert distribution list (that is identified via the alert list identifier field). Multiple rows for a single alert list identifier value are included in the alert list table when an alert distribution list includes multiple addresses.
The class table includes a class identifier field and an alert list identifier field. The class identifier field can be cross-referenced with the class id field, discussed above with respect to the root cause table. The class table, or a similar data structure, is implemented in embodiments that prefer to associate a distribution list or alert list with a class of causes (e.g. software, hardware, driver, etc.) rather than with each individual cause (e.g. divide by zero, out of memory, etc.). Thus, some embodiments associate a distribution list with a class of a root cause instead of with each root cause itself.
The diagnostic action table includes a component type identifier field and an action identifier field. The diagnostic action table maps from component types (via the component type identifier field) to possible diagnostic actions (e.g. via the action identifier field) to take when a component of the indicated type is experiencing a problem (or may be experiencing a problem).
The injection history table includes an action identifier field, injection time field, component identifier field, and a probability improvement field. The action identifier field uniquely identifies a diagnostic action. The action identifier field can be cross-referenced with the action identifier fields of the other tables discussed above. The injection time field identifies a time at which the diagnostic action was injected. The component identifier field identifies a component upon which the injection was performed. For example, if the action is a restart, the component identifier field identifies the component that was restarted. In various embodiments, the component identifier is comprised of multiple parts. For example, a first part identifies a physical device in some aspects (e.g. station address or other unique identifier) and a second part identifies a component of the physical device (e.g. wireless chip, CPU, software component, or other hardware component). In accordance with an example embodiment, when the diagnostic action is not injected into the same component that exhibits the highest likelihood of being the root cause of the performance degradation, the table includes a first component ID that identifies the component into which the diagnostic action is injected and a second component ID (not shown in the figure) that identifies the component which exhibits the highest likelihood of being the root cause of the underlying issue. When the same diagnostic action is injected more than one time, the table also includes a probability improvement field indicating the improvement achieved in identifying the root cause by reapplying the diagnostic action.
The component table maps from a component identifier, via a component identifier field, to a component type, via a component type field. Some embodiments utilize the component table to determine a type of a component from a component identifier. For example, some embodiments of a machine learning model, discussed below, provide likely root causes and component identifiers of components potentially causing a problem. The component table is used in some embodiments to determine a type of the component identified by the machine learning model.
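The two-step lookup described above, from component identifier to component type and then from component type to diagnostic actions, could be sketched as follows. Selecting the lowest-cost candidate follows the lowest-cost selection recited in the claims; the dictionary and list representations, the identifiers, the cost values, and the function name `cheapest_diagnostic` are assumptions made for this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical table contents, for illustration only.
component_table: Dict[str, str] = {
    "ap-1/radio0": "radio",
    "sw-1": "switch",
}
diagnostic_action_table: List[Tuple[str, str]] = [
    ("radio", "restart_radio"),
    ("radio", "query_radio_status"),
    ("switch", "query_port_counters"),
]
action_costs: Dict[str, float] = {
    "restart_radio": 5.0,
    "query_radio_status": 1.0,
    "query_port_counters": 1.0,
}


def cheapest_diagnostic(component_id: str,
                        components: Dict[str, str],
                        diagnostics: List[Tuple[str, str]],
                        costs: Dict[str, float]) -> Optional[str]:
    """Resolve the component's type, then return its lowest-cost diagnostic
    action, or None when the component or its type has no entry."""
    component_type = components.get(component_id)
    candidates = [action for ctype, action in diagnostics if ctype == component_type]
    if not candidates:
        return None
    return min(candidates, key=lambda action: costs[action])
```

Keeping the type mapping separate from the action mapping means a new device of a known type needs only one new row in the component table.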
is a graph of data demonstrating an example of an action that rectifies an underlying root cause. The measured SLE parameter in this case is a counter of Ethernet errors on a specific Ethernet link. Prior to injecting an action into the system, in this case a restart of a communication link, the system experienced a high link error rate. At a given time, a restart action is invoked. The injected action proved to be a corrective action, which reduced the error rate to zero. No further action needed to be taken.
is a graph of data demonstrating an example action that does not remedy an underlying root cause. A measured SLE parameter in the example data is a counter of Ethernet errors on an Ethernet link. Prior to injecting an action, in this case a restart of a communication link, the system experienced a high error rate. At a series of times, ten restart actions are invoked. The graph shows that the injected actions do not rectify the underlying issue and the Ethernet errors continue at the same rate and are thus unaffected by the restart actions. The error counts shown at different times are recorded and stored for later addition to historical information, discussed further below.
Some of the disclosed embodiments measure SLE and system parameter values after the action is performed. For example, in the example above, an Ethernet error rate is monitored after the link is restarted. If the error rate is not reduced as a result of the link restart, a new root cause is identified. For example, in some embodiments the new root cause indicates the problem is caused by a loose Ethernet cable or a hardware issue. Some embodiments then generate an alert, via any known messaging technology, which functions to notify a human support technician to rectify the issue. In this case, the alert may indicate that the physical connection of the Ethernet link should be verified, and if all is well with the physical connection, the Ethernet hardware should be swapped out for service.
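The post-action check described above could be sketched as a comparison of error counts sampled before and after the action. The function name `action_effective`, the use of mean error counts, and the 50% reduction threshold are assumptions made for this sketch; both sample lists are assumed to be non-empty.

```python
from statistics import mean
from typing import Sequence


def action_effective(errors_before: Sequence[float],
                     errors_after: Sequence[float],
                     min_reduction: float = 0.5) -> bool:
    """True when the mean error count fell by at least min_reduction
    (as a fraction of the mean before the action)."""
    baseline = mean(errors_before)
    if baseline == 0:
        return True  # nothing to reduce: no errors before the action
    reduction = (baseline - mean(errors_after)) / baseline
    return reduction >= min_reduction
```

When the check returns False after a link restart, an implementation along these lines would move on to a new candidate root cause, such as a cabling or hardware issue, and raise an alert.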
is a graph of data demonstrating an action that does not remedy the underlying root cause. The measured SLE parameter in this case is a counter of Ethernet errors on a specific Ethernet link. Prior to performing the action (e.g., a restart of a communication link), the monitored system experienced a high error rate. At each of five times, a restart action is performed. As shown by the graph, the actions do not rectify the underlying issue and the Ethernet errors continue at the same rate, unaffected by the restart action(s). This can be seen at each of the five times. In some embodiments, the error counts are recorded and stored and are included in historical SLE measurements. These error counts may be used as training data for a machine learning model, as discussed further below.
In this specific example, the disclosed embodiments monitor the SLE measurements and system parameters (e.g., CPU utilization, memory consumption, etc.) after the action is performed (e.g., Ethernet error rate post link restart) and determine that, since the action did not resolve the problem, the problem is most likely being caused by a defect in the software or firmware of the monitored system. Some disclosed embodiments then generate an alert, via any known messaging technology, to alert a human to the problem. Some embodiments automatically initiate an update of software and/or firmware installed on the monitored system. For example, if the embodiments determine that the underlying issue is caused by software (rather than by some other component, e.g., hardware) and the existing software and/or firmware versions are below a threshold version level, an upgrade is performed. In some embodiments, known defects in the existing software and/or firmware versions are compared with the problem exhibited by the monitored system. If the exhibited problem is similar to a problem described with respect to the existing software/firmware version, the disclosed embodiments initiate a software and/or firmware upgrade to a newer version (which will likely resolve the problem).
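The version-threshold test described above could be sketched as follows. The dotted numeric version format, the cause class label "software", and the names `parse_version` and `should_upgrade` are assumptions made for this sketch; real firmware version strings may need a different parser.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as "1.10.0" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def should_upgrade(cause_class: str, installed: str, threshold: str) -> bool:
    """Upgrade only when the cause is attributed to software and the installed
    version is below the threshold version."""
    return cause_class == "software" and parse_version(installed) < parse_version(threshold)
```

Comparing tuples of integers rather than raw strings avoids ordering "1.9.3" after "1.10.0", which a plain string comparison would do.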
The figure is a flowchart of an example process for detecting and resolving a problem with a network system. In some embodiments, one or more of the functions discussed below are performed by hardware processing circuitry. For example, in some embodiments, instructions stored in an electronic memory configure the hardware processing circuitry to perform one or more of the functions discussed below with respect to the process. In some embodiments, the network management system performs one or more of these functions.
After a start operation, the process moves to a monitoring operation, which monitors operational parameter values and/or message exchanges of a network system. For example, as discussed above, operational parameter values of network component devices such as one or more of the APs, the router, the wireless terminals, or the switch are provided to a network management system. In some embodiments, each of the network component devices maintains statistical information that indicates operational parameters of the device. In other embodiments, network monitoring devices are deployed at strategic locations within the network system so as to collect this information either with or without direct involvement from the network component devices.
This statistical information includes one or more of CPU utilization, memory utilization, a number of established connections, latency measurements, throughput measurements, dropped connection counts, roaming information, packet error information, collision information, media access control (MAC) information, access point identification information such as basic service set identifiers, association identifiers, or other indicators of component health and/or network performance. In some embodiments, the monitoring operation also includes obtaining information on messages exchanged between network component devices of the monitored network system. For example, as discussed above, in some aspects, messages including one or more fields of the example message portions are provided to a network management system. The one or more fields convey information relating to the number and types of messages exchanged between components of the monitored network system. The operational parameter values and/or message exchange information are received by a network management system (e.g., a device performing the process) from one or more component devices of the network system. For example, one or more of the APs may send messages (e.g., any of the message portions) to the network management system.
The statistical information relating to operation of each network component device can be described as a time series. Thus, in some embodiments, the monitoring operation includes receiving, from each of a plurality of devices included in the network system, a time series of the respective device's operational parameter values. In some embodiments, each of these time series is provided to a machine learning model, as discussed further below.
A decision operation determines whether a fault is detected based on the monitored operational parameter values. In some aspects, the fault is detected via a machine learning model. For example, as discussed above, a machine learning model is trained in some embodiments to detect a system operating in a sub-optimal or otherwise unsatisfactory condition. In other embodiments, the detection is based on evaluating one or more operational parameter values of the monitored system against one or more criteria. In some embodiments, the fault is detected based on a probability or confidence provided by the machine learning model being above a threshold. For example, as discussed below, some embodiments of a machine learning model provide a plurality of probability or confidence indications that a corresponding plurality of root causes are responsible for a fault. If all of these probability or confidence indications are below a predetermined threshold, some embodiments interpret operation of the monitored system as normal or nominal (e.g., no fault detected). If any one of these indications is above a predetermined threshold, the decision operation determines that a fault is detected (note that each root cause may have its own predetermined threshold for detecting a fault in some embodiments). If a fault is detected, the process moves from the decision operation to a root cause prediction operation. Otherwise, if no fault is detected, the process moves from the decision operation back to the monitoring operation.
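The thresholding logic of the decision operation can be illustrated with a short sketch. The root cause names, the default threshold value, and the function name below are hypothetical; the disclosure states only that each root cause may have its own predetermined threshold.

```python
from typing import Dict, List

# Assumed fallback threshold; the source does not specify a value.
DEFAULT_THRESHOLD = 0.5


def detect_fault(confidences: Dict[str, float],
                 thresholds: Dict[str, float]) -> List[str]:
    """Return the root causes whose confidence exceeds their threshold.

    An empty list means every indication is below its threshold, which is
    interpreted as normal (nominal) operation. Causes are returned with the
    highest confidence first.
    """
    flagged = [
        cause for cause, confidence in confidences.items()
        if confidence > thresholds.get(cause, DEFAULT_THRESHOLD)
    ]
    return sorted(flagged, key=lambda cause: confidences[cause], reverse=True)
```

For example, calling `detect_fault({"link_flap": 0.8, "dhcp": 0.2}, {})` flags only `link_flap`, so the process would proceed to root cause prediction rather than returning to monitoring.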
In the root cause prediction operation, a root cause of the problematic operating condition is predicted. As discussed above, in some embodiments, a machine learning model is trained to indicate probabilities that a plurality of different root causes are occurring in the monitored system. As discussed above, the machine learning model generates, in some embodiments, a plurality of probabilities, with each probability or confidence associated with a root cause (e.g., via a field).
In the next operation, an action is selected based on the root cause. As discussed above, a root cause can be associated with multiple possible actions. The operation evaluates the possible actions with respect to their respective cost and probability or confidence of resolving the problem. This is discussed further below.
The next operation performs the selected action. The selected action can include one or more of restarting a software process or component of a network device included in the network system being monitored, resetting an entire network device (e.g., a power cycle), adjusting one or more configuration parameters of a network device or software component of a network device, or resetting a particular hardware component of a network device (e.g., resetting a network card or chip of a network device while maintaining operation of a GPU of the device). In some embodiments, performing the action includes determining a class of the cause, e.g., whether the cause is a result of hardware, software, a driver, or another technical component. In some embodiments, performing the action includes forwarding a notification to a specific distribution list based on the cause. For example, as discussed above, some embodiments associate a distribution list (e.g., via an alert list identifier field) with a cause. The distribution list is then notified, in at least some embodiments, when the cause is identified. Note that in some cases, the selected action can be null or no action. This may result in an alert being generated to a specified distribution list without any corrective action being performed.
The next operation monitors the system in response to the performed action. For example, as discussed above, system behavior after the action is performed is analyzed to determine, in some cases, whether the system has returned to normal operation. This is the case when the selected action is designed to resolve the issue. In some cases, the selected action is designed to elicit additional information for determining a root cause. For example, in some embodiments, the selected action queries a network component for status information, or requests the network component to perform a function. A result of the request can be used to determine whether a network component is functioning properly or has experienced a fault.
In some embodiments, the monitoring of the system after the action is performed by a machine learning model. The machine learning model generates an indicator of whether the system has returned to normal operation. In some embodiments, the monitored time series of operational parameter values and/or message exchanges between network component devices is processed by one or more heuristics, with the output of the heuristics (the processed time series) provided to the machine learning model. For example, in some embodiments, rather than providing specific link errors to the machine learning model, heuristics determine a rate of change of a link error rate over time. For example, the rate of change is classified, in some embodiments, as constant with time, increasing slowly with time, or increasing more rapidly with time. Some embodiments classify a timeframe of change of the link error rate. For example, the timeframe is classified as link errors that start growing n seconds after a restart, link errors that start growing immediately after the restart, or another classification. In these embodiments, heuristics map each one of these different classifications into different error growth types. The error growth type is then provided to the machine learning model.
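One way such heuristics might map a post-restart time series to an error growth type is sketched below. The slope boundaries, the "immediate" window, the label strings, and the sampling format are all assumptions chosen for illustration; the disclosure names the classifications but not how they are computed.

```python
from typing import List, Tuple

# Assumed boundaries; the source gives no numeric values.
SLOW_SLOPE = 0.1          # change in error rate per sample interval
FAST_SLOPE = 1.0
IMMEDIATE_WINDOW_S = 5.0  # onset within this window counts as "immediate"

Sample = Tuple[float, float]  # (seconds after restart, link error rate)


def classify_trend(rates: List[float]) -> str:
    """Classify how the link error rate changes over the observation window.

    Decreasing rates are folded into "constant" to keep the sketch small.
    """
    if len(rates) < 2:
        return "constant"
    slope = (rates[-1] - rates[0]) / (len(rates) - 1)
    if slope < SLOW_SLOPE:
        return "constant"
    if slope < FAST_SLOPE:
        return "increasing_slowly"
    return "increasing_rapidly"


def classify_onset(samples: List[Sample]) -> str:
    """Classify when errors first appear relative to the restart."""
    for seconds, rate in samples:
        if rate > 0:
            return "immediate" if seconds <= IMMEDIATE_WINDOW_S else "delayed"
    return "no_errors"


def error_growth_type(samples: List[Sample]) -> str:
    """Combine both classifications into one categorical model input."""
    trend = classify_trend([rate for _, rate in samples])
    return f"{classify_onset(samples)}/{trend}"
```

Feeding the model a small categorical label instead of raw counters is the design choice the passage describes: the heuristics absorb device-specific scale so the model sees only the shape of the error growth.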
A decision operation evaluates whether the system has returned to normal or nominal operation. If the system has returned to normal operation, the process returns from the decision operation to the monitoring operation and continues to monitor the system for new indications of problems. If the system has not returned to normal operation, the process moves from the decision operation back to the root cause prediction operation, where a second root cause is identified. The second root cause, identified in a second iteration of the root cause prediction operation, is generally more specific than the root cause identified during the first iteration.
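The iterative flow of the process (predict a root cause, select an action, perform it, then check for a return to nominal operation) can be sketched as a loop over caller-supplied functions. All four callables and the iteration limit are hypothetical placeholders; the disclosure does not define these interfaces or bound the number of iterations.

```python
from typing import Callable, List, Optional


def remediation_loop(
    predict_cause: Callable[[List[str]], str],
    select_action: Callable[[str], Optional[str]],
    perform: Callable[[str], None],
    is_nominal: Callable[[], bool],
    max_iterations: int = 3,  # assumed safety limit, not from the source
) -> List[str]:
    """Repeat predict, select, perform, monitor until the system is nominal.

    Returns the actions performed, in order. The history of prior actions is
    passed to predict_cause so a later iteration can return a more specific
    root cause than the first.
    """
    performed: List[str] = []
    for _ in range(max_iterations):
        cause = predict_cause(performed)
        action = select_action(cause)
        if action is not None:  # a null action performs nothing
            perform(action)
            performed.append(action)
        if is_nominal():
            break
    return performed
```

Passing the action history into `predict_cause` is one possible way to let a second pass rule out causes that the first action would already have fixed.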
The figure is a flowchart of an example process for selecting an action to invoke on a monitored system. In some embodiments, one or more of the functions discussed below are performed by hardware processing circuitry. For example, in some embodiments, instructions stored in an electronic memory configure the hardware processing circuitry to perform one or more of the functions discussed below with respect to the process. In some embodiments, the network management system performs one or more of these functions.
The process discussed below is utilized, in some embodiments, when a root cause of a problem has been identified. The root cause is associated with one or more actions that can be performed in response to the root cause. These actions have various costs associated with them. For example, in some embodiments, a first action is transparent to users and imparts no negative effects (e.g., querying a network component for status information). A second action causes users to lose connectivity or experience reduced functionality in some other way (e.g., slower data transfer, higher network jitter, etc.). Thus, in some embodiments, the first action is selected based on its lower cost. The process discussed below also considers a probability or confidence that each action will resolve the root cause problem. Thus, although some actions may impart a higher cost on the monitored system, they may be justified in some situations if they also provide a high probability or confidence of resolution relative to other, less costly actions.
After a start operation, an action is identified in a first operation. The action is associated with a root cause in at least some embodiments (e.g., via a root cause table). In the next operation, a cost associated with the action is determined. For example, as discussed above, some embodiments maintain an action table or other data structure that provides cost information for a particular action. The particular action is identified, in some embodiments, based on a determined root cause (e.g., via the root cause table discussed above). In some embodiments, the action's cost is a function of one or more parameters of the system being monitored. For example, in a system experiencing severe degradation, a cost of some actions (e.g., restarting a computer or other network component) may be relatively smaller than when the action is performed on a system experiencing only minor problems. Thus, some cost functions for actions may receive input parameters to determine the appropriate cost. In various embodiments, the input parameters could include any one or more of the operational parameters discussed above. In some embodiments, the cost of an action is based on a number of users affected by the action. This cost is dynamically determined, in some embodiments, before the cost is utilized to determine an action to perform.
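A dynamically determined cost of this kind might look like the following sketch. The linear per-user term and the severity discount are assumptions made for illustration; the disclosure says only that the cost may depend on system parameters such as the degree of degradation and the number of affected users.

```python
def action_cost(base_cost: float,
                per_user_cost: float,
                users_affected: int,
                severity: float) -> float:
    """Compute an action's cost at the time the action is being considered.

    severity is an assumed degradation score in [0, 1], where 1 means the
    system is already severely degraded. Disruption grows with the number of
    affected users and is discounted as degradation worsens, because a
    disruptive action costs relatively less on a badly degraded system.
    """
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must be between 0 and 1")
    disruption = base_cost + per_user_cost * users_affected
    return disruption * (1.0 - severity)
```

With these assumed inputs, restarting a device serving 20 users costs half as much when severity is 0.5 as when the system is healthy, which matches the qualitative behavior described above.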
In the next operation, a probability or confidence of resolution of the underlying issue by the action is determined. For example, as discussed above, some embodiments associate a resolution probability with an action via an action table.
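Weighing cost against resolution probability across the candidate actions for a root cause can be sketched as below. The scoring rule (resolution probability minus a weighted cost, with costs normalized to the range 0 to 1), the `Action` record, and the action names are assumptions introduced here; the text above does not specify how the two quantities are combined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Action:
    """One row of a hypothetical action table."""
    name: str
    cost: float                    # assumed to be normalized to [0, 1]
    resolution_probability: float  # confidence the action resolves the fault


def select_action(candidates: List[Action],
                  cost_weight: float = 1.0) -> Optional[Action]:
    """Pick the candidate with the best probability-versus-cost trade-off.

    Returns None when there are no candidates, mirroring a null action.
    """
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda a: a.resolution_probability - cost_weight * a.cost,
    )
```

Lowering `cost_weight` shifts the choice toward costlier actions with a higher chance of resolution, which reflects the observation that a high-cost action can be justified when its resolution probability is high relative to cheaper alternatives.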
May 19, 2026