A similarity fingerprint for a data object such as a file can be automatically determined using one or more anchor values. The anchor values can be provided or determined. For each anchor value, a set of distances is determined, where each distance is measured between an instance of the anchor value in the data object and a previous instance of the same anchor value. The set of distances for that anchor value is aggregated into a single value, and the single value is added as a component of the similarity fingerprint. Thus, if there are N anchor values, the similarity fingerprint can have N components. The similarity fingerprints of different data objects can be compared, and the results of the comparison can be used to determine how similar the data objects are.
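The procedure described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes byte-string anchors, distances measured in bytes between consecutive instances (overlapping matches allowed), the mean as the aggregate, a placeholder value of 0.0 when an anchor yields no distances, and Euclidean distance as the comparison. The function names `anchor_distances`, `similarity_fingerprint`, and `fingerprint_distance` are hypothetical.

```python
from math import sqrt
from statistics import mean


def anchor_distances(data: bytes, anchor: bytes) -> list[int]:
    """Return the distance from each instance of `anchor` to the previous one."""
    distances = []
    prev = data.find(anchor)  # offset of the first instance from the beginning
    if prev == -1:
        return distances  # anchor never occurs
    pos = data.find(anchor, prev + 1)
    while pos != -1:
        distances.append(pos - prev)
        prev = pos
        pos = data.find(anchor, prev + 1)
    return distances


def similarity_fingerprint(data: bytes, anchors: list[bytes], aggregate=mean) -> list[float]:
    """Build one coordinate per anchor value by aggregating its distances."""
    fingerprint = []
    for anchor in anchors:
        distances = anchor_distances(data, anchor)
        # Assumption: an anchor with fewer than two instances contributes 0.0.
        fingerprint.append(float(aggregate(distances)) if distances else 0.0)
    return fingerprint


def fingerprint_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two fingerprints (one possible comparison)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With two anchors, `similarity_fingerprint(b"a..a....a", [b"a", b"z"])` yields a two-component vector: the first component is the mean gap between instances of `a`, and the second is the placeholder for the absent anchor `z`. A smaller `fingerprint_distance` between two objects suggests they are more similar under this sketch.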
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method for automatically determining in a processor a similarity fingerprint for a data object, the method comprising: determining a plurality of anchor values in the data of the data object, wherein the plurality of anchor values are different values; and for each anchor value in the plurality of different anchor values, the processor automatically determining a set of distances associated with the anchor value, wherein determining the set of distances comprises: automatically locating a first instance of the anchor value in the data object, for each remaining instance of one or more remaining instances of the anchor value in the data object, the processor automatically determining a distance between the remaining instance and a previous instance of the anchor value in the data object, and including the distance in the set of distances; the processor automatically aggregating the set of distances to a single value, wherein the single value is a coordinate for the anchor value, and the processor automatically adding the coordinate to a vector representing the similarity fingerprint.
2. The method of claim 1, wherein determining the plurality of anchor values includes determining the plurality of anchor values based, at least in part, on a type of the data object.
3. The method of claim 1, wherein determining the plurality of anchor values includes determining the plurality of anchor values according to a property of a rolling hash function applied to the data object.
4. The method of claim 1, wherein aggregating the set of distances comprises automatically determining one of a median and a mean distance of the set of distances.
5. The method of claim 1 wherein locating the first instance of the anchor value includes automatically determining an offset from a beginning of the data object.
6. The method of claim 1 further comprising comparing the similarity fingerprint to a fingerprint associated with known malware to determine if the data object contains malware.
7. The method of claim 1 wherein the aggregating the set of distances comprises automatically determining one of a Shannon entropy and a Gini index of the set of distances.
8. The method of claim 1 wherein determining the plurality of anchor values includes using at least one of properties of the data object and results of a function applied to the object data to perform the determination.
9. A system for automatically determining in a processor a similarity fingerprint for a data object, the system comprising: one or more processors; and a non-transitory machine-readable medium having instructions stored thereon, that when executed, cause the one or more processors to: determine a plurality of anchor values in the data of the data object, wherein the plurality of anchor values are different values; and for each anchor value in the plurality of different anchor values, determine a set of distances associated with the anchor value, wherein said determining the set of distances comprises: automatically locating a first instance of the anchor value in the data object, for each remaining instance of one or more remaining instances of the anchor value in the data object, automatically determining a distance between the remaining instance and a previous instance of the anchor value in the data object, and including the distance in the set of distances; automatically aggregating the set of distances to a single value, wherein the single value is a coordinate for the anchor value, and automatically adding the coordinate to a vector representing the similarity fingerprint.
10. The system of claim 9, wherein the instructions stored on the non-transitory machine-readable medium that cause the one or more processors to determine the plurality of anchor values include instructions stored on the non-transitory machine-readable medium, that when executed, cause the one or more processors to determine the plurality of anchor values based, at least in part, on a type of the data object.
11. The system of claim 9, wherein the instructions stored on the non-transitory machine-readable medium that cause the one or more processors to determine the plurality of anchor values include instructions stored on the non-transitory machine-readable medium, that when executed, cause the one or more processors to determine the plurality of anchor values according to a property of a rolling hash function applied to the data object.
12. The system of claim 9, wherein said aggregating the set of distances comprises automatically determining one of a median and a mean distance of the set of distances.
13. The system of claim 9, wherein said locating the first instance of the anchor value includes automatically determining an offset from a beginning of the data object.
14. The system of claim 9, wherein the instructions stored on the non-transitory machine-readable medium further comprise instructions stored on the non-transitory machine-readable medium, that when executed, compare the similarity fingerprint to a fingerprint associated with known malware to determine if the data object contains malware.
15. The system of claim 9, wherein said aggregating the set of distances comprises automatically determining one of a Shannon entropy and a Gini index of the set of distances.
16. The system of claim 9, wherein the instructions stored on the non-transitory machine-readable medium that cause the one or more processors to determine the plurality of anchor values include instructions stored on the non-transitory machine-readable medium, that when executed, cause the one or more processors to use at least one of properties of the data object and results of a function applied to the object data to perform the determination.
17. A non-transitory machine-readable storage medium having a program stored thereon, the program causing a processor to execute steps for automatically determining a similarity fingerprint for a data object, said steps comprising: determining a plurality of anchor values in the data of the data object, wherein the plurality of anchor values are different values; and for each anchor value in the plurality of different anchor values, the processor automatically determining a set of distances associated with the anchor value, wherein determining the set of distances comprises: automatically locating a first instance of the anchor value in the data object, for each remaining instance of one or more remaining instances of the anchor value in the data object, the processor automatically determining a distance between the remaining instance and a previous instance of the anchor value in the data object, and including the distance in the set of distances; the processor automatically aggregating the set of distances to a single value, wherein the single value is a coordinate for the anchor value, and the processor automatically adding the coordinate to a vector representing the similarity fingerprint.
18. The non-transitory machine-readable storage medium of claim 17, wherein determining the plurality of anchor values includes determining the plurality of anchor values based, at least in part, on a type of the data object.
19. The non-transitory machine-readable storage medium of claim 17, wherein determining the plurality of anchor values includes determining the plurality of anchor values according to a property of a rolling hash function applied to the data object.
20. The non-transitory machine-readable storage medium of claim 17, said steps further comprising comparing the similarity fingerprint to a fingerprint associated with known malware to determine if the data object contains malware.
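The claims name four possible aggregations of a set of distances: median, mean, Shannon entropy, and Gini index. The sketch below shows one way each could be computed in Python. It is illustrative only. The claims do not define "Gini index", so this sketch assumes the Gini impurity (one minus the sum of squared relative frequencies) taken over the distinct distance values, and it assumes the entropy is taken over the same frequency distribution, in bits. The names `shannon_entropy`, `gini_impurity`, and `AGGREGATORS` are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import mean, median


def shannon_entropy(distances: list[int]) -> float:
    """Entropy, in bits, of the distribution of distinct distance values."""
    if not distances:
        return 0.0  # assumption: an empty set carries no information
    n = len(distances)
    counts = Counter(distances)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def gini_impurity(distances: list[int]) -> float:
    """Gini impurity of the distribution of distinct distance values."""
    if not distances:
        return 0.0  # assumption: treat an empty set as perfectly uniform
    n = len(distances)
    counts = Counter(distances)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical lookup table so a caller can select an aggregation by name.
AGGREGATORS = {
    "mean": mean,
    "median": median,
    "entropy": shannon_entropy,
    "gini": gini_impurity,
}
```

Mean and median summarize how far apart the instances of an anchor typically are, whereas the entropy and impurity measures summarize how varied the gaps are. Under these assumptions, a set of identical distances gives zero for both of the latter.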
March 24, 2017
June 16, 2020