Proactively detecting malicious domains using graph representation learning may be provided by extracting seed domains from a uniform resource locator (URL) feed of observed requests for access to domains; expanding the seed domains via a passive domain name service (PDNS) crawl to include additional domains with the seed domains; collecting a ground truth, including labeling a first set of the seed domains as benign and a second set of the seed domains as malicious; constructing a graph neural network (GNN) of the additional domains and the seed domains, wherein each domain of the additional domains and the seed domains is represented as a node in the GNN that includes feature values associated with that domain; training the GNN to classify unseen domains not associated with a node as either benign or malicious; and classifying, via the GNN, a queried domain as either benign or malicious.
Legal claims defining the scope of protection, as filed with the USPTO.
. A method, comprising:
. The method of, wherein constructing the GNN includes:
. The method of, wherein labeling the first set of the seed domains as benign includes:
. The method of, wherein labeling the second set of the seed domains as malicious includes:
. The method of, wherein classifying the queried domain creates a blocklist of several domains classified as malicious from the URL feed and one or more domains not seen in the URL feed that are proactively identified as malicious based on a relationship with domains identified as malicious by the GNN.
. The method of, wherein classifying the queried domain returns a real-time response that identifies the queried domain as either benign or malicious.
. The method of, wherein the queried domain is classified as benign or malicious at hosting infrastructure upstream of content delivery without analyzing content hosted by the queried domain.
. A system, comprising a processor and a memory including instructions that when executed by the processor, perform operations including:
. The system of, wherein constructing the GNN includes:
. The system of, wherein labeling the first set of the seed domains as benign includes:
. The system of, wherein labeling the second set of the seed domains as malicious includes:
. The system of, wherein classifying the queried domain creates a blocklist of several domains classified as malicious from the URL feed and one or more domains not seen in the URL feed that are proactively identified as malicious based on a relationship with domains identified as malicious by the GNN.
. The system of, wherein classifying the queried domain returns a real-time response that identifies the queried domain as either benign or malicious.
. The system of, wherein the queried domain is classified as benign or malicious at hosting infrastructure upstream of content delivery without analyzing content hosted by the queried domain.
. A memory including instructions, that when executed by a processor, perform operations including:
. The memory of, wherein constructing the GNN includes:
. The memory of, wherein labeling the first set of the seed domains as benign includes:
. The memory of, wherein labeling the second set of the seed domains as malicious includes:
. The memory of, wherein classifying the queried domain creates a blocklist of several domains classified as malicious from the URL feed and one or more domains not seen in the URL feed that are proactively identified as malicious based on a relationship with domains identified as malicious by the GNN.
. The memory of, wherein classifying the queried domain returns a real-time response that identifies the queried domain as either benign or malicious.
Complete technical specification and implementation details from the patent document.
The present disclosure claims the benefit of U.S. Provisional Patent Application No. 63/492,395 entitled “METHODS AND TECHNIQUES TO PROACTIVELY DETECT MALICIOUS DOMAINS USING GRAPH REPRESENTATION LEARNING” and filed on Mar. 27, 2023, which is incorporated herein by reference in its entirety.
Attackers increasingly use disposable domains as the primary vector to launch cyber-attacks. To prevent these cyber-attacks, numerous defense solutions have been developed. However, existing detection mechanisms are either too late to catch such malicious domains due to limited information and short life spans or are unable to catch them due to evasive techniques, including cloaking and CAPTCHA.
The present disclosure generally relates to systems, methods, and devices for detecting malicious domains.
In light of the present disclosure, and without limiting the scope of the disclosure in any way, in an aspect of the present disclosure, which may be combined with any other aspect listed herein unless specified otherwise, a method for predicting malicious domains is provided.
In an aspect of the present disclosure, which may be combined with any other aspect listed herein unless specified otherwise, a method for predicting malicious domains in a neighborhood of seed malicious domains using a semi-supervised graph neural network includes (1) constructing a semi-supervised graph neural network, and (2) training the semi-supervised graph neural network using five-fold cross-validation.
Additional features and advantages of the disclosed method and apparatus are described in, and will be apparent from, the following Detailed Description and the Figures. The features and advantages described herein are not all-inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the figures and description. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and not to limit the scope of the inventive subject matter.
The present disclosure generally relates to systems, methods, and devices for detecting malicious domains. The present disclosure provides a content-agnostic approach to detecting malicious domains early in their life cycle. We observe that attackers often reuse hosting infrastructures to launch multiple malicious domains due to increased utilization of automation and economies of scale. This reuse gives defenders the opportunity to monitor such infrastructure to identify newly hosted malicious domains. However, such infrastructures are often shared hosting environments where benign domains are also hosted, which could result in a prohibitive number of false positives. Therefore, one needs innovative mechanisms to better distinguish malicious domains from the benign ones even when they share hosting infrastructures.
The present disclosure provides systems and methods that offer real-time predictions and batched blocklist generation and updates that can be used in various cybersecurity and networking contexts to improve system reliability, reduce the severity of external threats, and reduce the odds of breach by an external party, among other benefits, including reduced computational resource usage compared to traditional approaches.
illustrates an example infection pathway in which the present application may be applied to improve the functionality of the computing systems therein. As illustrated, a malicious domain registration occurs at a first time, and the domain is made available via hosting infrastructure at a second time, is issued a TLS certificate at a third time, is provided with host content at a fourth time, and is accessed by a user device, thereby compromising or infecting the user device, at a fifth time. Each of the computing devices used in the pathway may be understood with reference to the computing device discussed in relation to.
There are a plethora of traditional and proposed solutions to detect malicious domains. While these traditional solutions assist in detecting many malicious domains, many others either go undetected or get detected only after users are compromised. A key reason is that most existing security scanners rely on host content to detect malicious domains after the malicious content reaches a security enforcement point (e.g., a firewall, a browser). While content-based detection techniques are important, such approaches have a blind spot for cloaked webpages (a technique attackers increasingly use) and require a large amount of computational resources to analyze billions of webpage contents, and by the time malicious webpage contents are available, it is difficult, if not impossible, to prevent the attack from happening. As shown in, the security system of the present disclosure is applied at the level of hosting infrastructure to detect malicious domains much earlier in the pathway, at the time of hosting, which contrasts with traditional techniques that are applied with available host content.
The security system differentiates malicious domains from benign domains with much less available information than content-based approaches. A key observation is that while the toxicity (i.e., the ratio of malicious domains to all domains) of hosting infrastructures on the Internet, in general, is very low, the same measure in the neighborhoods that previously hosted malicious domains is relatively high. Stated differently, once a given host has been found to host a malicious domain, the given host can be assumed (and in practice found) to be more likely to host malicious domains again in the near future. For example, the toxicity of a sample of domains observed from passive DNS on 2022 Jul. 1 is 0.002 whereas the toxicity of a sample of domains around the IPs previously hosting malicious domains on the same day is 0.063 (31.5 times higher).
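The toxicity comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the disclosed system; the function name `toxicity` and the sample sizes are hypothetical and were chosen only to reproduce the two ratios quoted in the text.

```python
def toxicity(malicious_count: int, total_count: int) -> float:
    """Ratio of malicious domains to all domains in a sample."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return malicious_count / total_count


# Hypothetical sample sizes (assumption), picked to match the quoted ratios.
general = toxicity(2, 1000)        # general passive DNS sample
neighborhood = toxicity(63, 1000)  # sample around previously malicious IPs

# How many times more toxic the neighborhood sample is.
lift = neighborhood / general
```

With these inputs the lift evaluates to approximately 31.5, matching the figure stated in the disclosure.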
Due to the increased automation and economies of scale, attackers reuse hosting infrastructure to launch attacks. By monitoring new domains hosted on the internet protocol addresses and domain hosts that recently hosted malicious domains, it would seem intuitive that one can identify new malicious domains. However, due to the increasing use of shared hosting, and the overall low ratio of toxicity, not all new domains hosted in a toxic infrastructure are malicious. In other words, being hosted on a malicious infrastructure is not conclusive evidence of the maliciousness of the domain. Therefore, additional innovative mechanisms are required to identify true malicious domains from false positives sharing the same hosting infrastructures.
Further, a malicious domain could itself be compromised, such as when a benign domain is exploited by the attacker, or could be attacker-created or otherwise registered by the attacker. Because compromised domains are originally benign, compromised domains tend to be hosted on infrastructures where many other benign domains are hosted. To minimize the weak associations and reduce the conflicting labels, the security system uses a practical rule-based approach to filter out public domains and keep attack domains. Further, for missing features in network graphs, it is often the case that all related features are missing, and, therefore, existing imputation techniques do not work as these techniques assume at least some features are available. Thus, a different approach is required to impute maliciousness.
The security system is designed to support two key use cases: batch-mode blocklist generation/updates (e.g., daily/weekly blocklist generation) and real-time prediction. On a batch-mode basis, the security system first compiles a list of seed malicious domains first seen on a given day and identifies other recent domains hosted on the same infrastructure where the seed malicious domains are hosted. Based on these resolutions, the security system builds a graph consisting of domains and IP addresses. Then, the security system collects lexical and hosting features and ground truth domains to train a machine learning model, such as a Graph Neural Network (GNN) model. Based on the trained model, the security system detects a number of unseen malicious domains per batch period. An ensemble of batch-period trained models is used to predict in real time the malicious domains not present in the training graph to further reduce the false positives.
GNNs are a class of deep learning models for learning from data represented as graphs. GNNs learn representations of nodes, edges, or whole graphs. GNNs combine node feature information with the graph structure by recursively passing neural messages along the edges of the input graph. GNNs can be broadly categorized into two groups: those that work on homogeneous graphs having one type of node and edge, and those that work on heterogeneous graphs having different node types and/or edge types. Graph Convolution Network (GCN), Graph Attention Networks (GAT), and GraphSAGE are examples of the former category whereas Relational GCN (RGCN) and Heterogeneous Graph Transformer (HGT) are examples of the latter.
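To make the idea of passing messages along edges concrete, the following is a minimal sketch of one round of neighbor aggregation in plain Python. It is an illustration under stated assumptions, not the disclosed model: the function `mean_aggregate` is hypothetical, it uses a simple unweighted mean, and it omits the learned weight matrices and nonlinearities that layers such as GCN or GraphSAGE apply. The node names and feature values are made up.

```python
from typing import Dict, List


def mean_aggregate(features: Dict[str, List[float]],
                   edges: Dict[str, List[str]]) -> Dict[str, List[float]]:
    """One message-passing round: each node averages its own feature
    vector with the feature vectors of its neighbors (no learned weights)."""
    updated = {}
    for node, own in features.items():
        vectors = [own] + [features[n] for n in edges.get(node, [])]
        updated[node] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(len(own))
        ]
    return updated


# Toy graph (assumption): two domains resolving to one shared IP address.
features = {
    "a.example": [1.0, 0.0],
    "b.example": [0.0, 0.0],
    "198.51.100.7": [0.0, 1.0],
}
edges = {
    "a.example": ["198.51.100.7"],
    "b.example": ["198.51.100.7"],
    "198.51.100.7": ["a.example", "b.example"],
}
updated = mean_aggregate(features, edges)
```

After one round, each domain's vector already carries information from the shared IP node; repeating the round would propagate information between the two domains through that IP.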
In an example, a system for detecting malicious domains may identify a first set of daily malicious seed domains using the daily blocklisted Uniform Resource Locators (URLs). Once identified, the security system may identify a second set of daily malicious domains as a portion of the first set of daily malicious seed domains. A domain of the first set may fall within the second set of daily malicious seed domains if the domain was marked as “malicious” by at least five VirusTotal (or similar) URL Feeds.
In various embodiments, to evaluate the various hosting services, the security system uses several data sources. For example, passive DNS (PDNS) captures traffic through the cooperative deployment of sensors in various locations of the Domain Name Service (DNS) hierarchy. For example, Farsight PDNS data uses sensors deployed behind DNS resolvers and provides aggregate information about domain resolutions and publicly available zone file updates. The security system can use such data to extract domains related to seed malicious domains as well as various domain/IP features.
In an example data source, a consensus intelligence feed, such as the VIRUSTOTAL URL Feed (VT), provides a public querying platform to obtain URL intelligence by analyzing more than 90 third-party scanners and URL/domain blocklisting services. As a non-limiting example of a consensus intelligence feed, VT provides an Application Program Interface (API) to check the status of URLs. Additionally, VT publishes an hourly feed of URLs along with aggregated intelligence for the URLs queried by Internet users all around the world. The security system can use a threshold of X (e.g., five) scanners as the cutoff to identify malicious domains, such that, if a domain is reported as malicious by X or more scanners, the domain is included in the ground truth of malicious domains.
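The X-scanner cutoff described above amounts to a simple threshold filter. The sketch below assumes the per-domain count of positive scanners has already been obtained; the function name `malicious_ground_truth`, the input dictionary, and the domain names are hypothetical, and no real feed API is called.

```python
from typing import Dict, Set


def malicious_ground_truth(positive_counts: Dict[str, int],
                           threshold_x: int = 5) -> Set[str]:
    """Keep domains reported as malicious by at least threshold_x scanners."""
    return {
        domain
        for domain, positives in positive_counts.items()
        if positives >= threshold_x
    }


# Hypothetical scanner counts (assumption), for illustration only.
counts = {"bad.example": 7, "edge.example": 5, "unsure.example": 2}
labeled = malicious_ground_truth(counts)
```

A domain reported by exactly X scanners is included, consistent with the "X or more" wording of the text.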
In another example, the security system can use a popularity intelligence feed, such as the ALEXA Top 1 Million, as AMAZON ALEXA compiles the most popular 1 million domains each day. The daily popularity of a domain does not directly correlate with whether the domain is benign; however, domains consistently appearing in a popularity intelligence feed list over a period are highly likely to be benign, as attackers use a domain for a short time period and the popularity of these malicious domains is likely to last only a few days. Based on these observations, the security system can compile a popularity intelligence feed top 30-day list, which includes the domains consistently appearing in the popularity intelligence feed for the 30-day window, as one source of benign domains.
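One way to read "consistently appearing" is as membership in every daily list of the window, which reduces to a set intersection. The sketch below takes that reading as an assumption; the function name `consistent_top_list` and the three-day toy window are hypothetical.

```python
from typing import Iterable, List, Set


def consistent_top_list(daily_lists: List[Iterable[str]]) -> Set[str]:
    """Domains that appear in every daily popularity list of the window."""
    daily_sets = [set(day) for day in daily_lists]
    if not daily_sets:
        return set()
    return set.intersection(*daily_sets)


# Toy window of three days (assumption); a real window would span 30 days.
window = [
    ["stable.example", "news.example", "flash.example"],
    ["stable.example", "news.example"],
    ["news.example", "stable.example", "other.example"],
]
benign_candidates = consistent_top_list(window)
```

A short-lived domain such as `flash.example`, popular on a single day, drops out of the result.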
The security system then executes a PDNS crawl, using the second set of daily malicious domains. The PDNS crawl is executed to further identify domains with the same IP address as the malicious domains of the second set. Because multiple malicious domains are usually hosted on the same set of IPs, there is an intrinsic association among such domains. Thus, after the PDNS crawl is executed, the security system expands a graph in the neighborhood of seed malicious domains to discover additional, likely malicious domains that were not identified in the first step. However, while domains having the same IP address as the malicious domains identified are more likely to also be malicious domains compared to random domains, there are still many benign domains in these neighborhoods. Therefore, there is a need for systems, methods, and devices that detect malicious domains from an identified neighborhood of a domain marked as “malicious.”
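The neighborhood expansion can be sketched as a one-hop lookup over domain-to-IP resolutions. This is a minimal illustration assuming the PDNS data is already available as `(domain, ip)` pairs; the function `expand_seeds` and the records are hypothetical, and no real PDNS service is queried.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def expand_seeds(seeds: Set[str],
                 pdns_records: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the seeds plus every domain sharing an IP with any seed."""
    ip_to_domains = defaultdict(set)
    domain_to_ips = defaultdict(set)
    for domain, ip in pdns_records:
        ip_to_domains[ip].add(domain)
        domain_to_ips[domain].add(ip)

    expanded = set(seeds)
    for seed in seeds:
        for ip in domain_to_ips.get(seed, ()):
            expanded |= ip_to_domains[ip]
    return expanded


# Hypothetical resolution records (assumption), using documentation IP ranges.
records = [
    ("seed.example", "192.0.2.10"),
    ("sibling.example", "192.0.2.10"),
    ("unrelated.example", "192.0.2.99"),
]
neighborhood = expand_seeds({"seed.example"}, records)
```

The expanded set is only a candidate pool: as the text notes, shared hosting means many of these neighbors are benign, which is why a classifier is applied afterward.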
is a flowchart of an example method for the overall pipeline of detecting malicious domains for a given batch period (e.g., day), according to embodiments of the present disclosure.
At block, the security system receives a URL feed for the blocklisted URLs for the batch period.
At block, the security system extracts a set of batch-period malicious seed domains using the daily blocklisted URLs. To extract malicious seed domains, the security system starts with URLs that are marked as malicious by at least X consensus intelligence feed scanners and whose corresponding domains are marked as malicious by at least Y consensus intelligence feed scanners in an active consensus intelligence feed scan (e.g., where X>Y; such as X=5 and Y=3). The security system extracts only those malicious domains that are highly likely to be created by attackers, as the goal is to detect domains created by attackers as early in the domain life cycle as possible. The malicious seed extraction process is discussed in greater detail in relation to.
To evade detection, attackers deploy malicious domains with dynamic behavior by frequently changing the IP resolutions thereof or creating new domains. While doing so, attackers tend to reuse infrastructure resources. Further, attackers are increasingly using automation, and host malicious domains in a similar pool of IPs. Following this observation, at block, the security system executes a PDNS crawl of recently hosted domains and expands the graph in the neighborhood of seed malicious domains to discover other likely malicious domains. While the toxicity of the neighborhood is relatively high compared to random neighborhoods, there are still many benign domains in these neighborhoods, mainly due to shared hosting on public infrastructures.
At block, the security system constructs a heterogeneous graph based on the PDNS records. The heterogeneous graph consists of apex domains (i.e., e2LDs), fully-qualified domain names (FQDNs), IP addresses, subnets, and Autonomous System Numbers (ASNs). To supplement the nodes with various feature data, the security system collects various features at block. These node features include lexical features of domain names and a set of novel hosting features for both domains and IPs, including those set forth in Table 1.
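The five node types named above can be assembled from resolution records as typed nodes and typed edges. The sketch below is illustrative only and rests on several assumptions that are not in the source: the edge types (`belongs_to`, `resolves_to`, `in_subnet`, `announced_by`), the /24 subnet granularity, the record layout `(fqdn, ip, asn)`, and a naive apex extraction that takes the last two labels (a production system would consult the Public Suffix List instead).

```python
import ipaddress
from collections import defaultdict
from typing import Iterable, Tuple


def naive_apex(fqdn: str) -> str:
    """Assumption: the apex is the last two labels; wrong for e.g. co.uk."""
    return ".".join(fqdn.rstrip(".").split(".")[-2:])


def build_graph(records: Iterable[Tuple[str, str, str]]):
    """Build typed node sets and typed edges from (fqdn, ip, asn) records."""
    nodes = defaultdict(set)
    edges = set()
    for fqdn, ip, asn in records:
        apex = naive_apex(fqdn)
        # Assumption: group IPv4 addresses into /24 subnets.
        subnet = str(ipaddress.ip_network(f"{ip}/24", strict=False))
        nodes["fqdn"].add(fqdn)
        nodes["apex"].add(apex)
        nodes["ip"].add(ip)
        nodes["subnet"].add(subnet)
        nodes["asn"].add(asn)
        edges.add((("fqdn", fqdn), "belongs_to", ("apex", apex)))
        edges.add((("fqdn", fqdn), "resolves_to", ("ip", ip)))
        edges.add((("ip", ip), "in_subnet", ("subnet", subnet)))
        edges.add((("ip", ip), "announced_by", ("asn", asn)))
    return nodes, edges


# Hypothetical records (assumption), using documentation IPs and a private ASN.
nodes, edges = build_graph([
    ("login.shop.example", "203.0.113.9", "AS64500"),
    ("www.shop.example", "203.0.113.77", "AS64500"),
])
```

Here the two FQDNs collapse onto one apex, one subnet, and one ASN, which is the kind of shared structure a heterogeneous GNN can exploit.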
At block, the security system collects a ground truth. For malicious ground truth, the security system uses malicious seed nodes as well as additional labeled nodes generated with heuristics described with respect to. Traditional approaches often use a top x list or popularity feed as the benign ground truth, and although those domains are likely to be benign, they represent a biased set of benign domains for several reasons, such as the exclusion of benign domains with low web traffic. This bias inevitably results in models with high false positive rates in practice. In contrast, the security system uses a pragmatic approach to compile a representative benign ground truth by considering multiple sources, as is described with respect to.
At block, with the constructed heterogeneous graph and the ground truth, the security system trains a semi-supervised GNN, as discussed with respect to, to predict unseen malicious domains in the neighborhoods of seed malicious domains.
At block, once the GNN is trained, the security system is able to classify domains as benign or malicious for threat detection, mitigation, and quarantining. The classified domains may be provided individually to a user in response to a query related to a particular domain or as a blocklist to a user to aid the user in avoiding accessing (or blocking access to or communication from) domains identified as being malicious.
illustrates how the security system is able to overcome the challenges faced by traditional systems, according to embodiments of the present disclosure. Traditional solutions are often unable to predict the maliciousness of a domain that is absent from the training dataset. Retraining a graph model is computationally expensive, and it is therefore not practical to retrain the model whenever a graph is updated. Thus, an inductive approach, which trains a model on one graph and can then apply the model to a totally different graph without retraining, is much desired in a practical system. This inductive approach allows the present security system to perform real-time detection of domains unseen in the domain resolution graph.
The security system uses an ensemble classifier to further boost the classification performance of the GNN. During construction of the graph (e.g., block of method, discussed in regard to), the security system uses a model stack of semi-supervised GNN encoders and a meta-learner to fuse the embeddings from the model stack to make the final classification of malicious or benign for a domain. For a new domain, the security system constructs the passive DNS graph around the neighborhood of this domain (e.g., the target domain computational graph) and performs only forward passes to obtain the embeddings from these stacked models. Although illustrated with three GNN encoders, the present disclosure contemplates that any number of GNN encoders may be used in the model stack.
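The stacking arrangement can be sketched as follows: each encoder produces an embedding with a forward pass, the embeddings are concatenated, and a meta-learner scores the fused vector. Everything in this sketch is an assumption made for illustration: the toy encoder functions stand in for trained GNN encoders, the meta-learner is a logistic scorer with hand-picked weights, and the names `fuse_and_classify`, `meta_weights`, and `cutoff` are hypothetical. The source does not specify the meta-learner's form.

```python
import math
from typing import Callable, List, Sequence, Tuple

Encoder = Callable[[Sequence[float]], List[float]]


def encoder_a(features: Sequence[float]) -> List[float]:
    """Stand-in for a trained encoder (assumption)."""
    return [features[0]]


def encoder_b(features: Sequence[float]) -> List[float]:
    """Stand-in for a second trained encoder (assumption)."""
    return [features[1]]


def encoder_c(features: Sequence[float]) -> List[float]:
    """Stand-in for a third trained encoder (assumption)."""
    return [features[0] + features[1]]


def fuse_and_classify(features: Sequence[float],
                      encoders: Sequence[Encoder],
                      meta_weights: Sequence[float],
                      meta_bias: float = 0.0,
                      cutoff: float = 0.5) -> Tuple[str, float]:
    """Concatenate encoder embeddings and score them with a logistic meta-learner."""
    fused: List[float] = []
    for encode in encoders:
        fused.extend(encode(features))  # forward pass only, no retraining
    if len(fused) != len(meta_weights):
        raise ValueError("meta_weights must match the fused embedding length")
    logit = sum(w * x for w, x in zip(meta_weights, fused)) + meta_bias
    score = 1.0 / (1.0 + math.exp(-logit))
    return ("malicious" if score >= cutoff else "benign"), score


stack = [encoder_a, encoder_b, encoder_c]
label_hi, score_hi = fuse_and_classify([2.0, 1.0], stack, [1.0, 1.0, 1.0])
label_lo, score_lo = fuse_and_classify([-2.0, -1.0], stack, [1.0, 1.0, 1.0])
```

Because inference needs only forward passes through already trained encoders, a new domain can be scored without retraining, which is the property the inductive design relies on.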
is a flowchart of a method for generating likely malicious seed domains for a batch period, according to embodiments of the present disclosure. Aside from being used to initiate the expansion process, a subset of these domains is also included in the malicious ground truth. This subset includes highly likely attack domains (e.g., having reported consensus intelligence feed≥threshold X) from seed malicious domains, and is discussed with respect to method from. In various embodiments, the batch period is daily, but other periods of time that are longer or shorter than one day are also contemplated.
At block, the security system selects the URLs seen for the first time in the URL feed (e.g., from VT) within the past batch period (e.g., 24 hours). The URL feed contains all the URLs queried by users all over the world, and because the goal of the security system is to preemptively identify the latest malicious domains, these newly seen domains are given priority for further analysis. URLs that have been seen previously may be discarded or ignored in further operations of method.
At block, the security system selects those URLs identified in block that have been marked as malicious (or potentially malicious) by at least X consensus scanners. URLs identified as malicious by fewer than X consensus scanners may be discarded or ignored in further operations of method.
At block, the security system extracts the apex domains from the URLs identified as malicious by at least X consensus scanners in block. Even though a URL may be marked as malicious, the apex domain of that URL is not necessarily malicious, which is the case with compromised domains.
At block, the security system removes domains identified as “safe” or likely benign based on additional heuristics. For example, based on the webhosting list identified, the security system can exclude likely compromised domains by removing those present in a popularity feed (e.g., the ALEXA top 30-day list). The security system can also exclude apexes belonging to web hosting services such as 000webhostapp.com, github.io, and godaddysites.com, as these hosting services exhibit benign behavior.
In some embodiments, the security system uses a long short-term memory (LSTM) based model to identify Domain Generation Algorithm (DGA) domains and filter those URLs out as generally not of concern, even if not benign. Usually, DGA domains are created by the thousands and hosted on a limited set of IP addresses. Such malicious domains are quite different from other attack domains, and having such domains included in the analysis set reduces the detection efficacy for non-DGA malicious domains. Hence, to detect attack domains with high efficacy, the security system excludes DGA domains.
At block, the security system identifies URLs with phishing keywords, such as popular brand impersonating keywords, that are more likely to be malicious. Examples include intentional misspellings of a popular brand and URLs that contain the name of a legitimate website in an otherwise unrelated URL. One of skill in the art will be familiar with methods for identifying phishing keywords and what qualifies as “similar” to a brand name. The security system adds these identified URLs to the batch period seed domains in block, and for those likely attack domains that do not have popular brand or phishing keywords, the security system performs additional filtering.
At block and block, the security system checks the WHOIS data, PDNS data, or other registration data of the suspected URL. Recently registered and short-lived domains are more likely to be attack domains. To this end, the security system identifies those domains that are registered within a threshold time R (e.g., one year) of the day method is executed, and adds the newly registered domains to the batch period seed domains in block. If the WHOIS record is not available (per block), the security system checks the PDNS records (per block) to obtain the domain's footprint. If the PDNS record is available and the footprint duration is less than a threshold time D (e.g., one year), the security system adds the newly registered domains to the batch period seed domains in block. If the PDNS records are not available, other heuristics may be applied. Otherwise, if the PDNS records are not available (and no other filtering heuristics are applied) or the length of registration or footprint is greater than R or D, the security system may discard or ignore that URL.
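The keyword check and the WHOIS-then-PDNS fallback described in the last two steps form a small decision procedure. The sketch below is one possible reading, assuming the inputs have already been looked up; the function name `keep_as_seed`, the use of `None` for an unavailable record, and the 365-day defaults for R and D are assumptions. The optional "other heuristics" mentioned in the text are not modeled, so a domain with neither record is discarded.

```python
from typing import Optional


def keep_as_seed(has_phishing_keyword: bool,
                 whois_age_days: Optional[int],
                 pdns_footprint_days: Optional[int],
                 r_days: int = 365,
                 d_days: int = 365) -> bool:
    """Decide whether a candidate domain joins the batch-period seed list."""
    # Brand-impersonating or phishing keywords: added directly.
    if has_phishing_keyword:
        return True
    # WHOIS available: keep only recently registered domains.
    if whois_age_days is not None:
        return whois_age_days <= r_days
    # WHOIS missing: fall back to the PDNS footprint duration.
    if pdns_footprint_days is not None:
        return pdns_footprint_days < d_days
    # Neither record available and no other heuristic applied: discard.
    return False


decisions = {
    "keyword": keep_as_seed(True, None, None),
    "new_whois": keep_as_seed(False, 30, None),
    "old_whois": keep_as_seed(False, 2000, 10),
    "short_pdns": keep_as_seed(False, None, 12),
    "no_data": keep_as_seed(False, None, None),
}
```

Note that when a WHOIS record exists it takes precedence, so an old registration is discarded even if its PDNS footprint happens to be short.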
is a flowchart of a method for malicious ground truth generation, according to embodiments of the present disclosure. In addition to the output of the seed selection pipeline (e.g., per method discussed in relation to), the security system at block also actively queries a sample set of newly observed domains (e.g., a randomly or otherwise selected subset thereof) to enrich and diversify the batch-period list of malicious domains. In both cases, the security system may use one of the most conservative thresholds of X positive consensus scanners to construct the malicious ground truth. Also, based on the observation that attack domains are short-lived, the security system selects the domains that are registered within a threshold time (e.g., within the last year).
At block, the security system selects a random set of newly seen domains on a given day.
At block, the security system performs an active consensus score lookup to identify whether at least X consensus scanners have identified the domain as malicious. If fewer than X consensus scanners have identified the domain as malicious, the security system may discard or ignore that domain for the rest of method.
At block, the security system checks the WHOIS or other registration data for the domain. If the domain is new (e.g., registered or tracked for less than R or D days), or registration data are unavailable, the security system labels the domain as malicious as part of the batch-period seed domains. Otherwise, the security system may discard or ignore that domain for the rest of method.
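The labeling rule for sampled domains combines the scanner cutoff with the registration-age check. A minimal sketch, assuming the scanner count and registration age are already known; the function name `label_malicious`, the `None` convention for unavailable registration data, and the default values of X and R are assumptions.

```python
from typing import Optional


def label_malicious(positive_scanners: int,
                    registration_age_days: Optional[int],
                    threshold_x: int = 5,
                    r_days: int = 365) -> bool:
    """Label a sampled domain as malicious ground truth, or discard it."""
    # Fewer than X positive scanners: discard.
    if positive_scanners < threshold_x:
        return False
    # New domain, or registration data unavailable: label as malicious.
    return registration_age_days is None or registration_age_days < r_days


samples = {
    "fresh.example": label_malicious(6, 20),
    "norecord.example": label_malicious(5, None),
    "aged.example": label_malicious(9, 4000),
    "weak.example": label_malicious(2, 20),
}
```

Unlike the seed pipeline, missing registration data here counts toward a malicious label rather than a discard, as the text states.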
is a flowchart of an example method to generate the batch-period benign domain ground truth, according to embodiments of the present disclosure.
At block, the security system selects a set of newly seen domains for the batch period (e.g., a given day) from the consensus feed. Focusing on newly seen domains reduces the bias in the benign ground truth, as popularity feeds usually contain popular long-established domains.
At block, the security system filters out those domains that resolve to a known list of sinkhole IP addresses. The rationale is that sinkholed domains are known to be malicious.
At block, the security system filters out those domains that have invalid or expired certificates. A benign domain is likely to have a valid unexpired certificate, whereas malicious domains are likely to be used for a short time and hence attackers have little or no incentive to renew their certificates.
At block, the security system filters out DGA domains identified by the LSTM-based DGA detection tool, because benign domains are more likely to have proper names in a natural human language (e.g., English, Arabic, Chinese).
At block, the security system filters out domains impersonating popular brand names, because benign domains are less likely to imitate popular brands. In various embodiments, other phishing heuristics can be used in block to filter out domains attempting to impersonate other domains.
At block, the security system excludes the TLDs managed by Freenom (.gq, .ml, .cf, .ga, and .tk) because these TLDs have a very low reputation in general and hence are less likely to be benign. Other TLDs associated with various security policies or known by a third party to host malicious domains may also be excluded per block.
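The benign ground truth steps above form a chain of exclusion filters. The sketch below assumes each signal (sinkhole resolution, certificate validity, DGA flag, brand impersonation) has been computed upstream and is supplied as a boolean; the `DomainRecord` type, its field names, and the function `is_benign_candidate` are hypothetical. The DGA detector and certificate checks themselves are not implemented here.

```python
from dataclasses import dataclass

# TLDs named in the text as managed by Freenom.
FREENOM_TLDS = {"gq", "ml", "cf", "ga", "tk"}


@dataclass
class DomainRecord:
    name: str
    resolves_to_sinkhole: bool
    has_valid_certificate: bool
    flagged_as_dga: bool
    impersonates_brand: bool


def is_benign_candidate(record: DomainRecord) -> bool:
    """Apply the exclusion filters in order; survivors are benign candidates."""
    if record.resolves_to_sinkhole:
        return False
    if not record.has_valid_certificate:
        return False
    if record.flagged_as_dga:
        return False
    if record.impersonates_brand:
        return False
    tld = record.name.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld not in FREENOM_TLDS


# Hypothetical newly seen domains (assumption), for illustration only.
candidates = [
    DomainRecord("bakery.example", False, True, False, False),
    DomainRecord("promo.tk", False, True, False, False),
    DomainRecord("xkqzjv.example", False, True, True, False),
    DomainRecord("expired.example", False, False, False, False),
]
benign = [r.name for r in candidates if is_benign_candidate(r)]
```

Only the first record passes every filter; each of the others is removed by exactly one rule, which makes the chain easy to audit.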
Unknown
May 26, 2026