Detection of Malicious Domains Using Recurring Patterns in Domain Names

Published: January 8, 2019

Assignee: not available in the USPTO data we have

Inventors: Jiri Havelka, Michal Sofka, Martin Rehák

Patent Claims

25 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method, comprising: identifying, by a security device from monitored network traffic of one or more users, one or more suspicious domain names as candidate domains, the one or more suspicious domain names identified as suspicious based on a co-occurrence of domain words used in discovered domain names within the monitored network traffic; segmenting, by the security device, the discovered domain names into one or more linguistic constituent parts using natural language processing, the one or more linguistic constituent parts corresponding to human-recognizable words; determining, by the security device, sets of domain names sharing one or more common linguistic units; identifying, by the security device, the suspicious domain names based on one or more characteristics selected from: domains sharing a common linguistic unit being a substantial proportion of domains in the monitored network traffic; traffic to domains sharing a common linguistic unit being a substantial proportion of traffic in the monitored network traffic; a set of domains sharing a common linguistic unit being an unlikely co-occurrence across a set of the one or more users; and a set of domains sharing a common linguistic unit that exhibits a relationship between linguistic units in the discovered domain names; determining, by the security device, one or more features of the candidate domains; and confirming, by the security device, certain domains of the candidate domains as malicious domains using a parameterized classifier against the one or more features.
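
As a rough illustration of the grouping step and the "substantial proportion of domains" characteristic in claim 1, the Python sketch below indexes domains by linguistic unit and flags sets that cover a large share of the observed domains. The function names, the thresholds `min_domains` and `min_share`, and the sample domains are assumptions made for this example; the claim does not specify them. Segmentation is taken as already done.

```python
from collections import defaultdict

def group_by_shared_unit(segmented):
    """Map each linguistic unit to the set of domains that contain it."""
    index = defaultdict(set)
    for domain, units in segmented.items():
        for unit in units:
            index[unit].add(domain)
    return index

def candidate_domains(segmented, min_domains=3, min_share=0.3):
    """Return domains whose shared unit covers a large share of observed domains.

    min_domains and min_share are illustrative thresholds, not values from the claims.
    """
    total = len(segmented)
    candidates = set()
    for unit, domains in group_by_shared_unit(segmented).items():
        if len(domains) >= min_domains and len(domains) / total >= min_share:
            candidates |= domains
    return candidates

# Hypothetical, already-segmented domains observed in monitored traffic.
segmented = {
    "securepaylogin.example": ["secure", "pay", "login"],
    "paysecureverify.example": ["pay", "secure", "verify"],
    "verifysecureaccount.example": ["verify", "secure", "account"],
    "weatherreport.example": ["weather", "report"],
    "gardenblog.example": ["garden", "blog"],
}
candidates = candidate_domains(segmented)
print(sorted(candidates))
# ['paysecureverify.example', 'securepaylogin.example', 'verifysecureaccount.example']
```

Only the unit "secure" passes both thresholds here (3 of 5 domains), so the three domains that contain it become candidates for the later feature and classifier steps.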

2. The method as in claim 1, wherein segmenting comprises: using a segmentation dictionary having a set of human-recognizable words in one or more languages.
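
One way to picture dictionary-based segmentation is a small dynamic program that splits a domain label into the fewest dictionary words. This is only a sketch under assumptions: the claim does not name a segmentation algorithm, and the `segment` function and the tiny dictionary below are hypothetical.

```python
def segment(label, dictionary):
    """Split label into dictionary words, preferring the fewest words.

    Returns None when the label cannot be fully covered by dictionary words.
    """
    n = len(label)
    best = [None] * (n + 1)  # best[i] holds the shortest word list covering label[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = label[start:end]
            if best[start] is not None and piece in dictionary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

# Illustrative dictionary; a real one could hold words from several languages.
dictionary = {"secure", "pay", "login", "log", "in", "bank"}
print(segment("securepaylogin", dictionary))  # ['secure', 'pay', 'login']
print(segment("xyzbank", dictionary))         # None
```

Preferring fewer words keeps "login" whole instead of splitting it into "log" and "in"; other tie-breaking rules, such as word frequency, would be equally plausible.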

3. The method as in claim 1, further comprising: determining a common linguistic unit across the discovered domain names, wherein the determined common linguistic unit is unrecognized in a segmentation dictionary; and adding the determined common linguistic unit to the segmentation dictionary.
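
The dictionary-extension idea in claim 3 could be sketched as follows: look for the longest substring shared by several domain labels that the dictionary does not yet contain, then add it. The substring search, the `min_len` and `min_domains` parameters, and the made-up token "zorvex" are assumptions for illustration, not details from the claim.

```python
from collections import Counter

def common_unknown_unit(labels, dictionary, min_len=4, min_domains=3):
    """Return the longest substring shared by at least min_domains labels
    that is not already in the dictionary, or None if there is none."""
    counts = Counter()
    for label in labels:
        seen = set()
        for i in range(len(label)):
            for j in range(i + min_len, len(label) + 1):
                seen.add(label[i:j])
        counts.update(seen)  # each substring counts once per label
    shared = [s for s, c in counts.items()
              if c >= min_domains and s not in dictionary]
    if not shared:
        return None
    return max(shared, key=lambda s: (len(s), s))

dictionary = {"pay", "login", "shop"}
labels = ["zorvexpay", "loginzorvex", "zorvexshop", "weather"]

unit = common_unknown_unit(labels, dictionary)
if unit is not None:
    dictionary.add(unit)  # later segmentation passes can now recognize it
print(unit)  # zorvex
```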

4. The method as in claim 1, further comprising: monitoring the monitored network traffic for a given time interval.

5. The method as in claim 1, further comprising: removing known non-malicious domains from consideration as a candidate domain.

6. The method as in claim 1, wherein determining one or more features comprises: determining one or both of a number or proportion of domains sharing a common linguistic unit within the candidate domains.

7. The method as in claim 1, wherein determining one or more features comprises: determining correlated domain registration information of the candidate domains.

8. The method as in claim 7, wherein correlated domain registration information is selected from a group consisting of: similar domain creation dates; a maliciousness likelihood of a registration country; a shared registration country; a maliciousness likelihood of a registrant; a shared registrant; a maliciousness likelihood of a domain name server (DNS); and a shared DNS.
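
To make the "shared" and "similar creation date" items in claim 8 concrete, the sketch below summarizes a set of registration records into a few numeric features. The record layout, field names, and feature names are hypothetical. The "maliciousness likelihood" items would need prior reputation data and are not covered here.

```python
from collections import Counter
from datetime import date

def registration_features(records):
    """Summarize how strongly a candidate set's registration data is correlated."""
    n = len(records)

    def top_share(key):
        # Fraction of records that hold the most common value for this field.
        return Counter(r[key] for r in records).most_common(1)[0][1] / n

    days = [r["created"].toordinal() for r in records]
    return {
        "creation_span_days": max(days) - min(days),
        "shared_registrant": top_share("registrant"),
        "shared_country": top_share("country"),
        "shared_dns": top_share("dns"),
    }

# Hypothetical registration records for four candidate domains.
records = [
    {"created": date(2018, 3, 1), "registrant": "r1", "country": "XX", "dns": "ns1.example.net"},
    {"created": date(2018, 3, 2), "registrant": "r1", "country": "XX", "dns": "ns1.example.net"},
    {"created": date(2018, 3, 4), "registrant": "r2", "country": "XX", "dns": "ns2.example.net"},
    {"created": date(2018, 3, 3), "registrant": "r1", "country": "YY", "dns": "ns1.example.net"},
]
features = registration_features(records)
print(features["creation_span_days"], features["shared_registrant"])  # 3 0.75
```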

9. The method as in claim 1, wherein determining one or more features comprises: correlating domain uniform resource locator (URL) requests to other suspicious domains.

10. The method as in claim 9, wherein correlating comprises: determining existence of one or more characteristics selected from: a particular domain sharing a uniform resource locator (URL) pattern with other candidate domains; a particular domain exhibiting a URL pattern known to be malicious; and a particular domain sharing one or more traffic pattern characteristics with other candidate domains.
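
A shared URL pattern, the first characteristic in claim 10, might be detected by reducing each request to a coarse shape and grouping hosts by that shape. The normalization rules below (digit runs replaced by `<n>`, query keys kept without values) and the sample URLs are assumptions chosen for this sketch.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

def url_pattern(url):
    """Reduce a URL to a coarse path/query shape, e.g. /gate.php?id&k."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "<n>", parts.path)
    keys = sorted(k.split("=", 1)[0] for k in parts.query.split("&") if k)
    return path + ("?" + "&".join(keys) if keys else "")

def domains_by_pattern(urls):
    """Group host names by the URL pattern their requests exhibit."""
    groups = defaultdict(set)
    for url in urls:
        groups[url_pattern(url)].add(urlsplit(url).hostname)
    return groups

# Hypothetical requests seen in monitored traffic.
urls = [
    "http://securepaylogin.example/gate.php?id=17&k=ab",
    "http://paysecureverify.example/gate.php?k=zz&id=204",
    "http://weatherreport.example/forecast/2018/03",
]
groups = domains_by_pattern(urls)
print(sorted(groups["/gate.php?id&k"]))
# ['paysecureverify.example', 'securepaylogin.example']
```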

11. The method as in claim 1, wherein determining one or more features comprises: correlating user behaviors across the monitored traffic of the one or more users.

12. The method as in claim 11, wherein correlating comprises: determining existence of one or more characteristics selected from: a likelihood of appearance of particular domain across the one or more users; and an amount of intersection of candidate domains for each user across candidate domains of all of the one or more users.
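
The two user-behavior characteristics in claim 12 can be read as simple set statistics. In the sketch below, `appearance_likelihood` is the fraction of users who contacted each domain, and `overlap_with_others` is the share of a user's candidate domains also seen for other users. Both definitions, the function names, and the sample data are one possible interpretation rather than the claimed formulas.

```python
def appearance_likelihood(visits):
    """Fraction of users whose traffic contains each domain."""
    n_users = len(visits)
    domains = set().union(*visits.values())
    return {d: sum(d in seen for seen in visits.values()) / n_users
            for d in domains}

def overlap_with_others(visits):
    """Share of each user's candidate domains that other users also contacted."""
    result = {}
    for user, seen in visits.items():
        others = set().union(*(s for u, s in visits.items() if u != user))
        result[user] = len(seen & others) / len(seen) if seen else 0.0
    return result

# Hypothetical per-user candidate domains.
visits = {
    "u1": {"a.example", "b.example"},
    "u2": {"a.example", "c.example"},
    "u3": {"a.example"},
}
likelihood = appearance_likelihood(visits)
overlap = overlap_with_others(visits)
print(likelihood["a.example"])  # 1.0
print(overlap["u1"])            # 0.5
```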

13. The method as in claim 1, wherein the parameterized classifier is one of either a linear classifier or a non-linear classifier, and is trained using an objective function with an optimization.
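
As one example of a linear classifier trained by optimizing an objective function, the sketch below fits logistic regression by batch gradient descent on mean log-loss. The claim does not name a specific model, objective, or optimizer; logistic regression, the learning rate, the epoch count, and the two toy features are all assumptions made here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Minimize mean log-loss with batch gradient descent; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """True means the feature vector is classified as malicious."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5

# Toy features per candidate: [shared_unit_proportion, shared_registrant_share].
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
```

On this small, linearly separable toy set the trained model reproduces the training labels; real traffic features would need held-out evaluation.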

14. The method as in claim 1, wherein the reoccurrence of linguistic units comprises one or more linguistic units with visually representative character replacements.
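
Visually representative character replacements, such as a digit standing in for a similar-looking letter, could be handled by normalizing labels before segmentation. The mapping below is a small illustrative assumption, not a list taken from the patent. Ambiguous characters (for example "1", which may stand for "l" or "i") would need more than a one-to-one table.

```python
# Illustrative lookalike table: digit -> letter it commonly imitates.
LOOKALIKES = str.maketrans({"0": "o", "3": "e", "4": "a", "5": "s", "7": "t"})

def normalize(label):
    """Lowercase a domain label and undo common digit-for-letter swaps."""
    return label.lower().translate(LOOKALIKES)

print(normalize("S3cur3L0gin"))  # securelogin
print(normalize("b4nk"))         # bank
```

After normalization, "s3cur3l0gin" and "securelogin" yield the same linguistic units and therefore fall into the same set of domains.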

15. An apparatus, comprising: one or more network interfaces to communicate with computer network; a processor coupled to the network interfaces and adapted to execute one or more processes; and a memory configured to store a process executable by the processor, the process when executed operable to: identify, from monitored network traffic of one or more users, one or more suspicious domain names as candidate domains, the one or more suspicious domain names identified as suspicious based on a co-occurrence of domain words used in discovered domain names within the monitored network traffic; segment the discovered domain names into one or more linguistic constituent parts using natural language processing, the one or more linguistic constituent parts corresponding to human-recognizable words; determine sets of domain names sharing one or more common linguistic units; identify the suspicious domain names based on one or more characteristics selected from: domains sharing a common linguistic unit being a substantial proportion of domains in the monitored network traffic; traffic to domains sharing a common linguistic unit being a substantial proportion of traffic in the monitored network traffic; a set of domains sharing a common linguistic unit being an unlikely co-occurrence across a set of the one or more users; and a set of domains sharing a common linguistic unit that exhibits a relationship between linguistic units in the discovered domain names; determine one or more features of the candidate domains; and confirm certain domains of the candidate domains as malicious domains using a parameterized classifier against the one or more features.

16. The apparatus as in claim 15, wherein the human-recognizable words are identified according to a segmentation dictionary having a set of human-recognizable words in one or more languages.

17. The apparatus as in claim 15, wherein the process when executed is further operable to: determine a common linguistic unit across the discovered domain names, wherein the determined common linguistic unit is unrecognized in a segmentation dictionary; and add the determined common linguistic unit to the segmentation dictionary.

18. The apparatus as in claim 15, wherein the process when executed to determine one or more features is further operable to: determine one or both of a number or proportion of domains sharing a common linguistic unit within the candidate domains.

19. The apparatus as in claim 15, wherein the process when executed to determine one or more features is further operable to: determine correlated domain registration information of the candidate domains.

20. The apparatus as in claim 15, wherein the process when executed to determine one or more features is further operable to: correlate domain uniform resource locator (URL) requests to other suspicious domains.

21. The apparatus as in claim 15, wherein the process when executed to determine one or more features is further operable to: correlate user behaviors across the monitored traffic of the one or more users.

22. A tangible, non-transitory, computer-readable media having software encoded thereon, the software when executed by a processor operable to: identify, from monitored network traffic of one or more users, one or more suspicious domain names as candidate domains, the one or more suspicious domain names identified as suspicious based on a co-occurrence of domain words used in discovered domain names within the monitored network traffic; segment the discovered domain names into one or more linguistic constituent parts using natural language processing, the one or more linguistic constituent parts corresponding to human-recognizable words; determine sets of domain names sharing one or more common linguistic units; identify the suspicious domain names based on one or more characteristics selected from: domains sharing a common linguistic unit being a substantial proportion of domains in the monitored network traffic; traffic to domains sharing a common linguistic unit being a substantial proportion of traffic in the monitored network traffic; a set of domains sharing a common linguistic unit being an unlikely co-occurrence across a set of the one or more users; and a set of domains sharing a common linguistic unit that exhibits a relationship between linguistic units in the discovered domain names; determine one or more features of the candidate domains; and confirm certain domains of the candidate domains as malicious domains using a parameterized classifier against the one or more features.

23. The tangible, non-transitory, computer-readable media as in claim 22, wherein the process when executed is further operable to: determine a common linguistic unit across the discovered domain names, wherein the determined common linguistic unit is unrecognized in a segmentation dictionary; and add the determined common linguistic unit to the segmentation dictionary.

24. The tangible, non-transitory, computer-readable media as in claim 22, wherein the process when executed to determine one or more features is further operable to: determine one or both of a number or proportion of domains sharing a common linguistic unit within the candidate domains.

25. The tangible, non-transitory, computer-readable media as in claim 22, wherein the process when executed to determine one or more features is further operable to: determine correlated domain registration information of the candidate domains.

Patent Metadata

Filing Date: Unknown

Publication Date: January 8, 2019

Inventors: Jiri Havelka, Michal Sofka, Martin Rehák
