Pairs of similar objects in a population of objects can be found using a process that includes identifying a comparison vector x in a set of vectors having non-zero features, determining an estimated similarity contribution of a subset of features of the comparison vector x to a similarity between the comparison vector x and each vector in the set of vectors, generating an index that includes features based on a comparison of the similarity contribution with a similarity threshold, and identifying another vector in the set that is similar to the vector x using the index.
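As an illustration only, the index-generation step summarized above can be sketched in Python. This is one possible reading of the abstract, not the patented implementation: sparse vectors are assumed to be dicts mapping feature names to positive weights, similarity is assumed to be a dot product, features are assumed to be visited in sorted order, and the names `build_partial_index`, `max_weight`, and `unindexed` are hypothetical. A feature is left out of the index for as long as the largest similarity the vector's features seen so far could contribute stays below the threshold.

```python
from collections import defaultdict


def build_partial_index(vectors, threshold):
    """Index only the features whose accumulated contribution can reach the threshold.

    vectors: dict mapping a vector id to a sparse vector (dict feature -> weight).
    Returns (index, unindexed), where index maps a feature to a list of
    (vector id, weight) pairs and unindexed maps a vector id to the features
    that were left out of the index.
    """
    # Largest weight seen for each feature across the whole set (assumed statistic).
    max_weight = defaultdict(float)
    for vec in vectors.values():
        for feature, weight in vec.items():
            max_weight[feature] = max(max_weight[feature], weight)

    index = defaultdict(list)
    unindexed = {}
    for vec_id, vec in vectors.items():
        bound = 0.0  # estimated similarity contribution of the features seen so far
        skipped = {}
        for feature in sorted(vec):
            weight = vec[feature]
            bound += weight * max_weight[feature]
            if bound >= threshold:
                index[feature].append((vec_id, weight))
            else:
                skipped[feature] = weight
        unindexed[vec_id] = skipped
    return dict(index), unindexed
```

Under these assumptions, two vectors that share only unindexed features cannot reach the threshold, so they never need to be compared.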
Legal claims defining the scope of protection, as filed with the USPTO.
1. A computer-implemented method for identifying users colluding with a user represented by vector x in a first plurality of vectors, wherein each of the vectors represents a corresponding user, wherein each feature of each vector represents the corresponding user's click-behavior with regard to a content item, the method comprising: determining, using one or more computers, a score for each vector y, wherein the score represents a partial similarity between the vector x and each vector y, wherein each vector y is a vector in the first plurality of vectors, and each vector y includes at least one feature indexed in an index, wherein the partial similarity represents a degree of similarity between features of the vector x and corresponding features of each vector y, and wherein the partial similarity score is determined using features of vector x and only the indexed features of each vector y; determining, using one or more computers, an upper bound, the upper bound being an estimate of the maximum similarity between non-processed features of the vector x and non-processed features of the other vectors y, the non-processed features being features that have not been used to calculate the partial similarity scores; as long as the upper bound is greater than or equal to a similarity threshold, repeating the operations of determining a partial similarity score and determining an upper bound; determining that the upper bound is lower than the similarity threshold; based on determining that the upper bound is lower than the similarity threshold, updating scores only for vectors y having a non-zero partial similarity score; and identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x using the score for the vector y.
2. The method of claim 1, wherein identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x comprises computing a similarity score between the vector x and the vector y using the score for the vector y and identifying the user represented by the vector y as a user colluding with the user represented by the vector x if the similarity score satisfies the similarity threshold.
3. The method of claim 1, further comprising: computing a similarity estimate between the vector x and the vector y using the score for the vector y; if the similarity estimate satisfies a similarity threshold, computing a similarity score between the vector x and the vector y; and determining that the similarity score satisfies the similarity threshold, wherein identifying a user represented by the vector y as a user colluding with the user represented by the vector x is further based on determining that the similarity score satisfies the similarity threshold.
4. The method of claim 3, wherein the similarity estimate is based on a sum of the score for the vector y and a product of (i) a minimum size of at least one of the vector x or a set of the non-processed features of the vector y, (ii) a maximum weight of the vector x, and (iii) a maximum weight of the vector y.
5. The method of claim 3, wherein the similarity estimate between the vector x and any vector y is greater than or equal to the similarity score between the vector x and the vector y.
6. The method of claim 3, further comprising: computing the similarity score between the vector x and the vector y only if the similarity estimate satisfies the similarity threshold.
7. The method of claim 1, further comprising: determining a size threshold minsize, wherein minsize is a function of the vector x and the similarity threshold; and for each feature of the vector x, removing a corresponding indexed feature of vector y from the index if a size of the vector y is not at least equal to the size threshold.
8. The method of claim 1, wherein the content item is a markup language document.
9. A system for identifying users colluding with a user represented by vector x in a first plurality of vectors, wherein each of the vectors represents a corresponding user, wherein each feature of each vector represents the corresponding user's click-behavior with regard to a content item, the system comprising: one or more computers; and a computer-readable storage device having stored thereon instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: determining, using one or more computers, a score for each vector y, wherein the score represents a partial similarity between the vector x and each vector y, wherein each vector y is a vector in the first plurality of vectors, and each vector y includes at least one feature indexed in an index, wherein the partial similarity represents a degree of similarity between features of the vector x and corresponding features of each vector y, and wherein the partial similarity score is determined using features of vector x and only the indexed features of each vector y; determining, using one or more computers, an upper bound, the upper bound being an estimate of the maximum similarity between non-processed features of the vector x and non-processed features of the other vectors y, the non-processed features being features that have not been used to calculate the partial similarity scores; as long as the upper bound is greater than or equal to a similarity threshold, repeating the operations of determining a partial similarity score and determining an upper bound; determining that the upper bound is lower than the similarity threshold; based on determining that the upper bound is lower than the similarity threshold, updating scores only for vectors y having a non-zero partial similarity score; and identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x using the score for the vector y.
10. The system of claim 9, wherein identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x comprises computing a similarity score between the vector x and the vector y using the score for the vector y and identifying the user represented by the vector y as a user colluding with the user represented by the vector x if the similarity score satisfies the similarity threshold.
11. The system of claim 9, wherein the operations further comprise: computing a similarity estimate between the vector x and the vector y using the score for the vector y; if the similarity estimate satisfies a similarity threshold, computing a similarity score between the vector x and the vector y; and determining that the similarity score satisfies the similarity threshold, wherein identifying a user represented by the vector y as a user colluding with the user represented by the vector x is further based on determining that the similarity score satisfies the similarity threshold.
12. The system of claim 11, wherein the similarity estimate is based on a sum of the score for the vector y and a product of (i) a minimum size of at least one of the vector x or a set of the non-processed features of the vector y, (ii) a maximum weight of the vector x, and (iii) a maximum weight of the vector y.
13. The system of claim 11, wherein the similarity estimate between the vector x and any vector y is greater than or equal to the similarity score between the vector x and the vector y.
14. The system of claim 11, wherein the operations further comprise: computing the similarity score between the vector x and the vector y only if the similarity estimate satisfies the similarity threshold.
15. The system of claim 9, wherein the operations further comprise: determining a size threshold minsize, wherein minsize is a function of the vector x and the similarity threshold; and for each feature of the vector x, removing a corresponding indexed feature of vector y from the index if a size of the vector y is not at least equal to the size threshold.
16. The system of claim 9, wherein the content item is a markup language document.
17. A computer-readable storage device having stored thereon instructions for identifying users colluding with a user represented by vector x in a first plurality of vectors, wherein each of the vectors represents a corresponding user, wherein each feature of each vector represents the corresponding user's click-behavior with regard to a content item, wherein the instructions, when executed by a computer, cause the computer to perform operations comprising: determining, using one or more computers, a score for each vector y, wherein the score represents a partial similarity between the vector x and each vector y, wherein each vector y is a vector in the first plurality of vectors, and each vector y includes at least one feature indexed in an index, wherein the partial similarity represents a degree of similarity between features of the vector x and corresponding features of each vector y, and wherein the partial similarity score is determined using features of vector x and only the indexed features of each vector y; determining, using one or more computers, an upper bound, the upper bound being an estimate of the maximum similarity between non-processed features of the vector x and non-processed features of the other vectors y, the non-processed features being features that have not been used to calculate the partial similarity scores; as long as the upper bound is greater than or equal to a similarity threshold, repeating the operations of determining a partial similarity score and determining an upper bound; determining that the upper bound is lower than the similarity threshold; based on determining that the upper bound is lower than the similarity threshold, updating scores only for vectors y having a non-zero partial similarity score; and identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x using the score for the vector y.
18. The storage device of claim 17, wherein identifying a user represented by a vector y with a non-zero score as a user colluding with the user represented by the vector x comprises computing a similarity score between the vector x and the vector y using the score for the vector y and identifying the user represented by the vector y as a user colluding with the user represented by the vector x if the similarity score satisfies the similarity threshold.
19. The storage device of claim 17, wherein the operations further comprise: computing a similarity estimate between the vector x and the vector y using the score for the vector y; if the similarity estimate satisfies a similarity threshold, computing a similarity score between the vector x and the vector y; and determining that the similarity score satisfies the similarity threshold, wherein identifying a user represented by the vector y as a user colluding with the user represented by the vector x is further based on determining that the similarity score satisfies the similarity threshold.
20. The storage device of claim 19, wherein the similarity estimate is based on a sum of the score for the vector y and a product of (i) a minimum size of at least one of the vector x or a set of the non-processed features of the vector y, (ii) a maximum weight of the vector x, and (iii) a maximum weight of the vector y.
21. The storage device of claim 19, wherein the similarity estimate between the vector x and any vector y is greater than or equal to the similarity score between the vector x and the vector y.
22. The storage device of claim 19, wherein the operations further comprise: computing the similarity score between the vector x and the vector y only if the similarity estimate satisfies the similarity threshold.
23. The storage device of claim 17, wherein the operations further comprise: determining a size threshold minsize, wherein minsize is a function of the vector x and the similarity threshold; and for each feature of the vector x, removing a corresponding indexed feature of vector y from the index if a size of the vector y is not at least equal to the size threshold.
24. The storage device of claim 17, wherein the content item is a markup language document.
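For readers who find the claim language dense, the scoring loop of claim 1 and the similarity estimate of claims 3 and 4 can be sketched in Python. This is a hedged illustration under stated assumptions, not the claimed implementation: sparse vectors are dicts of positive weights, similarity is a dot product, features of x are visited in reverse sorted order, each vector's non-indexed features are assumed to sort before its indexed ones, and every function and parameter name here (`find_candidates`, `similarity_estimate`, `max_weight`, `y_unindexed`) is hypothetical.

```python
from collections import defaultdict


def dot(x, y):
    """Exact similarity score: dot product of two sparse vectors (dicts)."""
    return sum(w * y[f] for f, w in x.items() if f in y)


def find_candidates(x, index, max_weight, threshold):
    """Accumulate partial similarity scores using only the indexed features.

    index maps a feature to a list of (vector id, weight) pairs.
    max_weight maps a feature to its largest weight in the set (assumed input).
    """
    scores = defaultdict(float)
    # Upper bound on the similarity still obtainable from non-processed features.
    upper_bound = sum(w * max_weight.get(f, 0.0) for f, w in x.items())
    for f in sorted(x, reverse=True):
        for y_id, y_w in index.get(f, ()):
            # While the bound can still reach the threshold, new vectors may be
            # scored; afterwards only vectors that already have a score are updated.
            if upper_bound >= threshold or y_id in scores:
                scores[y_id] += x[f] * y_w
        upper_bound -= x[f] * max_weight.get(f, 0.0)
    return dict(scores)


def similarity_estimate(score, x, y, y_unindexed):
    """Estimate in the style of claim 4.

    score + min(|x|, |y'|) * maxweight(x) * maxweight(y), where y' is the set
    of non-processed (non-indexed) features of y.
    """
    overlap = min(len(x), len(y_unindexed))
    return score + overlap * max(x.values()) * max(y.values())
```

In such a sketch, a caller would compute `dot(x, y)` only for vectors whose estimate satisfies the threshold, since under the stated assumptions the estimate is never smaller than the exact score.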
August 19, 2011
May 15, 2012