US-8180756

Similarity-based searching

Published: May 15, 2012

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

Pairs of similar objects in a population of objects can be found using a process that includes identifying a comparison vector x in a set of vectors having non-zero features, determining an estimated similarity contribution of a subset of features of the comparison vector x to a similarity between the comparison vector x and each vector in the set of vectors, generating an index that includes features based on a comparison of the similarity contribution with a similarity threshold, and identifying another vector in the set that is similar to the vector x using the index.
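The index-generation step in the abstract can be illustrated with a minimal sketch. It assumes sparse vectors stored as feature-to-weight dicts and a dot-product similarity; the names `index_vector`, `max_weight`, and `unindexed` are hypothetical and do not come from the patent text. A feature is added to the index only once the accumulated estimate of its similarity contribution reaches the threshold.

```python
def index_vector(vec_id, x, index, max_weight, threshold):
    """Index only the features of x whose accumulated estimated
    contribution reaches the threshold (an assumed reading of the abstract).

    x          : dict feature -> weight (non-zero features only)
    index      : dict feature -> list of (vec_id, weight), updated in place
    max_weight : dict feature -> largest weight of that feature in the set
    """
    bound = 0.0      # estimated similarity contribution of the features seen so far
    unindexed = {}   # features left out of the index
    for feature, weight in x.items():
        bound += weight * max_weight.get(feature, 0.0)
        if bound >= threshold:
            index.setdefault(feature, []).append((vec_id, weight))
        else:
            unindexed[feature] = weight
    return unindexed


index = {}
left_out = index_vector("u1", {"a": 0.6, "b": 0.8},
                        index, {"a": 0.5, "b": 1.0}, threshold=0.7)
print(index)     # feature "b" is indexed
print(left_out)  # feature "a" stays unindexed
```

Features that stay unindexed cannot, on their own, lift any pair to the threshold under this estimate, which is why leaving them out shrinks the index without dropping qualifying pairs.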

Patent Claims

24 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computer-implemented method for identifying users colluding with a user represented by vector x in a first plurality of vectors, wherein each of the vectors represents a corresponding user, wherein each feature of each vector represents the corresponding user's click-behavior with regard to a content item, the method comprising: determining, using one or more computers, a score for each vector y, wherein the score represents a partial similarity between the vector x and each vector y, wherein each vector y is a vector in the first plurality of vectors, and each vector y includes at least one feature indexed in an index, wherein the partial similarity represents a degree of similarity between features of the vector x and corresponding features of each vector y, and wherein the partial similarity score is determined using features of vector x and only the indexed features of each vector y; determining, using one or more computers, an upper bound, the upper bound being an estimate of the maximum similarity between non-processed features of the vector x and non-processed features of the other vectors y, the non-processed features being features that have not been used to calculate the partial similarity scores; as long as the upper bound is greater than or equal to a similarity threshold, repeating the operations of determining a partial similarity score and determining an upper bound; determining that the upper bound is lower than the similarity threshold; based on determining that the upper bound is lower than the similarity threshold, updating scores only for vectors y having a non-zero partial similarity score; and identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x using the score for the vector y.
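The scoring loop recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: similarity is taken to be a dot product over sparse vectors, the upper bound is computed from a per-feature maximum weight, and the names `partial_scores`, `index`, and `max_weight` are hypothetical.

```python
from collections import defaultdict


def partial_scores(x, index, max_weight, threshold):
    """Accumulate partial similarity scores between x and indexed vectors y.

    x          : dict feature -> weight
    index      : dict feature -> list of (vec_id, weight)
    max_weight : dict feature -> largest weight of that feature in the set
    """
    scores = defaultdict(float)
    # Upper bound on the similarity still obtainable from non-processed features of x.
    upper_bound = sum(w * max_weight.get(f, 0.0) for f, w in x.items())
    for feature, weight in x.items():
        for vec_id, y_weight in index.get(feature, ()):
            # While the bound is at or above the threshold, new candidates are admitted.
            # Once it drops below, only vectors with a non-zero score are updated.
            if upper_bound >= threshold or vec_id in scores:
                scores[vec_id] += weight * y_weight
        upper_bound -= weight * max_weight.get(feature, 0.0)
    return dict(scores)


x = {"a": 0.8, "b": 0.6}
index = {"a": [("u1", 0.8)], "b": [("u1", 0.6), ("u2", 1.0)]}
max_weight = {"a": 0.8, "b": 1.0}
print(partial_scores(x, index, max_weight, threshold=0.7))
```

In this example the bound falls below 0.7 after feature "a" is processed, so "u2", which shares only feature "b" with x, never becomes a candidate.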

2. The method of claim 1, wherein identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x comprises computing a similarity score between the vector x and the vector y using the score for the vector y and identifying the user represented by the vector y as a user colluding with the user represented by the vector x if the similarity score satisfies the similarity threshold.

3. The method of claim 1, further comprising: computing a similarity estimate between the vector x and the vector y using the score for the vector y; if the similarity estimate satisfies a similarity threshold, computing a similarity score between the vector x and the vector y; and determining that the similarity score satisfies the similarity threshold, wherein identifying a user represented by the vector y as a user colluding with the user represented by the vector x is further based on determining that the similarity score satisfies the similarity threshold.

4. The method of claim 3, wherein the similarity estimate is based on a sum of the score for the vector y and a product of (i) a minimum size of at least one of the vector x or a set of the non-processed features of the vector y, (ii) a maximum weight of the vector x, and (iii) a maximum weight of the vector y.
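A minimal sketch of the estimate in claim 4, combined with the two-stage check of claim 3, is given here. It assumes dot-product similarity and reads "minimum size" as min(|x|, |y'|), where y' is the set of non-processed features of y; the function names and the `y_unprocessed` argument are hypothetical.

```python
def dot(x, y):
    """Dot product of two sparse vectors stored as feature -> weight dicts."""
    return sum(w * y[f] for f, w in x.items() if f in y)


def similarity_estimate(score, x, y_unprocessed, y_max_weight):
    """score + min(|x|, |y'|) * maxweight(x) * maxweight(y)."""
    overlap_cap = min(len(x), len(y_unprocessed))
    return score + overlap_cap * max(x.values(), default=0.0) * y_max_weight


def is_match(score, x, y_unprocessed, y_max_weight, threshold):
    """Compute the exact similarity only if the cheap estimate passes."""
    if similarity_estimate(score, x, y_unprocessed, y_max_weight) < threshold:
        return False
    return score + dot(x, y_unprocessed) >= threshold


x = {"a": 0.8, "b": 0.6}
y_unprocessed = {"b": 0.5, "c": 0.7}   # features of y not yet used in the score
score = 0.4                            # partial score from the processed feature "a"
print(similarity_estimate(score, x, y_unprocessed, 0.7))
print(is_match(score, x, y_unprocessed, 0.7, threshold=0.6))
```

Because each remaining shared feature contributes at most maxweight(x) times maxweight(y), the estimate is never below the exact score, so skipping pairs whose estimate is under the threshold discards no true match.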

5. The method of claim 3, wherein the similarity estimate between the vector x and any vector y is greater than or equal to the similarity score between the vector x and the vector y.

6. The method of claim 3, further comprising: computing the similarity score between the vector x and the vector y only if the similarity estimate satisfies the similarity threshold.

7. The method of claim 1, further comprising: determining a size threshold minsize, wherein minsize is a function of the vector x and the similarity threshold; and for each feature of the vector x, removing a corresponding indexed feature of vector y from the index if a size of the vector y is not at least equal to the size threshold.
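The size-based pruning of claim 7 might look like the sketch below. The claim only says that minsize is a function of the vector x and the similarity threshold; the formula used here, threshold divided by the maximum weight of x, is an assumption that holds when all weights are at most 1 and similarity is a dot product. The names `prune_index` and `sizes` are hypothetical.

```python
def prune_index(x, index, sizes, threshold):
    """Drop postings of vectors too small to reach the threshold with x.

    x     : dict feature -> weight (non-empty)
    index : dict feature -> list of (vec_id, weight), modified in place
    sizes : dict vec_id -> number of non-zero features in that vector
    """
    # Assumed form of minsize; the claim does not give a formula.
    min_size = threshold / max(x.values())
    for feature in x:
        postings = index.get(feature)
        if postings is not None:
            index[feature] = [(vid, w) for vid, w in postings
                              if sizes[vid] >= min_size]
    return min_size


index = {"a": [("u1", 1.0), ("u2", 0.6)], "c": [("u2", 0.8)]}
sizes = {"u1": 1, "u2": 3}
prune_index({"a": 0.6, "b": 0.8}, index, sizes, threshold=0.9)
print(index)  # the posting for "u1" under feature "a" has been removed
```

Removing postings permanently, as the claim describes, is only safe if later comparison vectors produce a minsize at least as large, for example when vectors are processed in decreasing order of maximum weight; that ordering is also an assumption of this sketch.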

8. The method of claim 1, wherein the content item is a markup language document.

9. A system for identifying users colluding with a user represented by vector x in a first plurality of vectors, wherein each of the vectors represents a corresponding user, wherein each feature of each vector represents the corresponding user's click-behavior with regard to a content item, the system comprising: one or more computers; and a computer-readable storage device having stored thereon instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: determining, using one or more computers, a score for each vector y, wherein the score represents a partial similarity between the vector x and each vector y, wherein each vector y is a vector in the first plurality of vectors, and each vector y includes at least one feature indexed in an index, wherein the partial similarity represents a degree of similarity between features of the vector x and corresponding features of each vector y, and wherein the partial similarity score is determined using features of vector x and only the indexed features of each vector y; determining, using one or more computers, an upper bound, the upper bound being an estimate of the maximum similarity between non-processed features of the vector x and non-processed features of the other vectors y, the non-processed features being features that have not been used to calculate the partial similarity scores; as long as the upper bound is greater than or equal to a similarity threshold, repeating the operations of determining a partial similarity score and determining an upper bound; determining that the upper bound is lower than the similarity threshold; based on determining that the upper bound is lower than the similarity threshold, updating scores only for vectors y having a non-zero partial similarity score; and identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x using the score for the vector y.

10. The system of claim 9, wherein identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x comprises computing a similarity score between the vector x and the vector y using the score for the vector y and identifying the user represented by the vector y as a user colluding with the user represented by the vector x if the similarity score satisfies the similarity threshold.

11. The system of claim 9, wherein the operations further comprise: computing a similarity estimate between the vector x and the vector y using the score for the vector y; if the similarity estimate satisfies a similarity threshold, computing a similarity score between the vector x and the vector y; and determining that the similarity score satisfies the similarity threshold, wherein identifying a user represented by the vector y as a user colluding with the user represented by the vector x is further based on determining that the similarity score satisfies the similarity threshold.

12. The system of claim 11, wherein the similarity estimate is based on a sum of the score for the vector y and a product of (i) a minimum size of at least one of the vector x or a set of the non-processed features of the vector y, (ii) a maximum weight of the vector x, and (iii) a maximum weight of the vector y.

13. The system of claim 11, wherein the similarity estimate between the vector x and any vector y is greater than or equal to the similarity score between the vector x and the vector y.

14. The system of claim 11, wherein the operations further comprise: computing the similarity score between the vector x and the vector y only if the similarity estimate satisfies the similarity threshold.

15. The system of claim 9, wherein the operations further comprise: determining a size threshold minsize, wherein minsize is a function of the vector x and the similarity threshold; and for each feature of the vector x, removing a corresponding indexed feature of vector y from the index if a size of the vector y is not at least equal to the size threshold.

16. The system of claim 9, wherein the content item is a markup language document.

17. A computer-readable storage device having stored thereon instructions for identifying users colluding with a user represented by vector x in a first plurality of vectors, wherein each of the vectors represents a corresponding user, wherein each feature of each vector represents the corresponding user's click-behavior with regard to a content item, wherein the instructions, when executed by a computer, cause the computer to perform operations comprising: determining, using one or more computers, a score for each vector y, wherein the score represents a partial similarity between the vector x and each vector y, wherein each vector y is a vector in the first plurality of vectors, and each vector y includes at least one feature indexed in an index, wherein the partial similarity represents a degree of similarity between features of the vector x and corresponding features of each vector y, and wherein the partial similarity score is determined using features of vector x and only the indexed features of each vector y; determining, using one or more computers, an upper bound, the upper bound being an estimate of the maximum similarity between non-processed features of the vector x and non-processed features of the other vectors y, the non-processed features being features that have not been used to calculate the partial similarity scores; as long as the upper bound is greater than or equal to a similarity threshold, repeating the operations of determining a partial similarity score and determining an upper bound; determining that the upper bound is lower than the similarity threshold; based on determining that the upper bound is lower than the similarity threshold, updating scores only for vectors y having a non-zero partial similarity score; and identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x using the score for the vector y.

18. The storage device of claim 17, wherein identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x comprises computing a similarity score between the vector x and the vector y using the score for the vector y and identifying the user represented by the vector y as a user colluding with the user represented by the vector x if the similarity score satisfies the similarity threshold.

19. The storage device of claim 17, wherein the operations further comprise: computing a similarity estimate between the vector x and the vector y using the score for the vector y; if the similarity estimate satisfies a similarity threshold, computing a similarity score between the vector x and the vector y; and determining that the similarity score satisfies the similarity threshold, wherein identifying a user represented by the vector y as a user colluding with the user represented by the vector x is further based on determining that the similarity score satisfies the similarity threshold.

20. The storage device of claim 19, wherein the similarity estimate is based on a sum of the score for the vector y and a product of (i) a minimum size of at least one of the vector x or a set of the non-processed features of the vector y, (ii) a maximum weight of the vector x, and (iii) a maximum weight of the vector y.

21. The storage device of claim 19, wherein the similarity estimate between the vector x and any vector y is greater than or equal to the similarity score between the vector x and the vector y.

22. The storage device of claim 19, wherein the operations further comprise: computing the similarity score between the vector x and the vector y only if the similarity estimate satisfies the similarity threshold.

23. The storage device of claim 17, wherein the operations further comprise: determining a size threshold minsize, wherein minsize is a function of the vector x and the similarity threshold; and for each feature of the vector x, removing a corresponding indexed feature of vector y from the index if a size of the vector y is not at least equal to the size threshold.

24. The storage device of claim 17, wherein the content item is a markup language document.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F

Patent Metadata

Filing Date

August 19, 2011

Publication Date

May 15, 2012
