Legal claims defining the scope of protection, as filed with the USPTO.
1. A system comprising: one or more devices to: identify sets of related documents; selectively classify the sets of related documents as bounce pads based on redirects associated with documents in the sets of related documents, the one or more devices, when selectively classifying the sets of related documents being to: determine a first value representing a quantity of documents, in the sets of related documents, that are sources of the redirects, determine a second value representing a quantity of the documents, in the sets of related documents, that are not sources of the redirects, compute a redirect score based on the first value and the second value, and use the computed redirect score to selectively classify the sets of related documents as the bounce pads; compile a list of bounce pads based on one or more of the sets of related documents that are classified as bounce pads; identify a cluster of duplicate documents; determine whether a particular document in the cluster of duplicate documents corresponds to a bounce pad in the list of bounce pads; select one of the documents in the cluster of duplicate documents as representative of the cluster without considering the particular document when the particular document corresponds to a bounce pad in the list of bounce pads; and index the selected document.
2. The system of claim 1, where, when selectively classifying the sets of related documents, the one or more devices are further to: select one of the sets of related documents, identify documents in the selected set of related documents that are sources of the redirects, identify organizations that are targets of the redirects, determine a redirect score based on a quantity of the identified documents that are sources of the redirects, determine a spam score based on a quantity of the organizations that are targets of the redirects, and selectively classify the selected set of related documents as a bounce pad based on the redirect score and the spam score.
3. The system of claim 1, where, when identifying the sets of related documents, the one or more devices are further to: identify documents associated with a common web site, a common directory, a common subdirectory, a common host, a common domain name, or a common organization as the set of related documents.
4. The system of claim 1, where, when selecting the one of the documents in the cluster of duplicate documents as representative of the cluster, the one or more devices are further to: create a ranked list of the documents in the cluster of duplicate documents, and move the particular document toward a bottom of the ranked list when the particular document corresponds to a bounce pad in the list of bounce pads.
5. The system of claim 1, where the one or more devices are further to: determine a measure of quality associated with each document in the cluster of duplicate documents; and rank the documents in the cluster of duplicate documents based on the measure of quality associated with each of the documents.
6. The system of claim 1, where, when selectively classifying the sets of related documents as bounce pads, the one or more devices are further to: determine whether the documents, in the sets of related documents, are associated with redirects based on at least one of: an address of each document in the sets of related documents, content of the each document in the sets of related documents, or metadata of the each document in the sets of related documents.
7. A method comprising: identifying, by a server device, a cluster of duplicate documents; determining, by the server device, a measure of quality associated with each document in the cluster of duplicate documents; ranking, by the server device, the documents in the cluster of duplicate documents based on the measure of quality associated with each of the documents; determining, by the server device, that a particular document in the cluster of duplicate documents corresponds to a bounce pad, in a list of bounce pads, based on redirects, the bounce pad redirecting users away from the particular document to another document, determining that the particular document in the cluster of duplicate documents corresponds to a bounce pad in a list of bounce pads including: determining a first value representing a quantity of documents, in the cluster of duplicate documents, that are sources of the redirects, determining a second value representing a quantity of the documents, in the cluster of duplicate documents, that are not sources of the redirects, computing a redirect score based on the first value and the second value, and using the computed redirect score to determine that the particular document, in the cluster of duplicate documents, corresponds to a bounce pad; selecting, by the server device, one of the documents in the cluster of duplicate documents as representative of the cluster without considering the particular document; and indexing, by the server device, the selected document.
8. The method of claim 7, further comprising: compiling the list of bounce pads, where compiling the list of bounce pads includes: identifying a set of related documents, identifying documents in the set of related documents that are sources of redirects, identifying organizations that are targets of the redirects, determining a redirect score based on a quantity of the identified documents that are sources of the redirects, determining a spam score based on a quantity of the organizations that are targets of the redirects, selectively classifying a document, in the set of related documents, as a bounce pad based on at least one of the redirect score or the spam score, and adding information relating to the document to the list of bounce pads when the document is classified as a bounce pad.
9. The system of claim 8, where, when identifying the documents in the selected set of documents that are sources of the redirects, the one or more devices are further to: analyze information associated with a document, in the set of related documents, to determine whether the document is a source of a redirect, where the analyzed information includes an address of the document, a content of the document, or metadata associated with the document.
10. The method of claim 7, where determining the measure of quality includes: determining a link-based score associated with each document in the cluster of duplicate documents as the measure of quality associated with the document.
11. The method of claim 7, further comprising: identifying sets of related documents; selectively classifying the sets of related documents as bounce pads based on redirects associated with documents in the sets of related documents; and compiling the list of bounce pads based on one or more of the sets of related documents that are classified as bounce pads.
12. The method of claim 11, where identifying sets of related documents includes: identifying documents associated with a common web site, a common directory, a common subdirectory, a common host, a common domain name, or a common organization as one of the sets of related documents.
13. The method of claim 7, where selecting one of the documents in the cluster of duplicate documents as representative of the cluster includes: creating a ranked list of the documents in the cluster of duplicate documents, and ranking the particular document toward a bottom of the ranked list when the particular document corresponds to a bounce pad in the list of bounce pads.
14. The method of claim 7, where determining that the particular document in the cluster of duplicate documents corresponds to the bounce pad includes: determining whether the particular document, in the cluster of duplicate documents, is associated with redirects based on at least one of: an address of the particular document, content of the particular document, or metadata of the particular document.
15. A non-transitory computer-readable memory device storing instructions executable by one or more processors, the instructions comprising: one or more instructions to identify sets of related documents; one or more instructions to classify the sets of related documents as bounce pads based on redirects associated with documents in the sets of related documents, the one or more instructions to classify the sets of related documents as bounce pads including: one or more instructions to determine a first value representing a quantity of documents, in the sets of related documents, that are sources of the redirects, one or more instructions to determine a second value representing a quantity of the documents, in the sets of related documents, that are not sources of the redirects, one or more instructions to compute a redirect score based on the first value and the second value, and one or more instructions to use the computed redirect score to selectively classify the sets of related documents as the bounce pads; one or more instructions to compile a list of bounce pads based on one or more of the sets of related documents that are classified as bounce pads; one or more instructions to identify a cluster of duplicate documents; one or more instructions to determine that a particular document in the cluster of duplicate documents corresponds to a bounce pad in the list of bounce pads; one or more instructions to select one of the documents in the cluster of duplicate documents as representative of the cluster without considering the particular document when the particular document corresponds to a bounce pad in the list of bounce pads; and one or more instructions to index the selected document.
16. The computer-readable memory device of claim 15, where the one or more instructions to classify include: one or more instructions to select one of the sets of related documents, one or more instructions to identify documents in the selected set of related documents that are sources of the redirects, one or more instructions to identify organizations that are targets of the redirects, one or more instructions to determine a redirect score based on a quantity of the identified documents that are sources of the redirects, one or more instructions to determine a spam score based on a quantity of the organizations that are targets of the redirects, and one or more instructions to selectively classify the selected set of related documents as a bounce pad based on the redirect score and the spam score.
17. The computer-readable memory device of claim 15, where the one or more instructions to identify the sets of related documents include: one or more instructions to identify documents associated with a particular web site, a particular directory, a particular subdirectory, a particular host, a particular domain name, or a particular organization as one of the sets of related documents.
18. The computer-readable memory device of claim 15, where the one or more instructions to select the one of the documents in the cluster of duplicate documents as representative of the cluster include: one or more instructions to create a ranked list of the documents in the cluster of duplicate documents, and one or more instructions to rank the particular document toward a bottom of the ranked list when the particular document corresponds to a bounce pad in the list of bounce pads.
19. The computer-readable memory device of claim 15, further comprising: one or more instructions to determine a measure of quality associated with each document in the cluster of duplicate documents; and one or more instructions to rank the documents in the cluster of duplicate documents based on the measure of quality associated with each of the documents.
20. The computer-readable memory device of claim 15, where the one or more instructions to classify the sets of related documents as bounce pads include: one or more instructions to determine whether the documents, in the sets of related documents, are associated with redirects based on at least one of: an address of each document in the sets of related documents, content of the each document in the sets of related documents, or metadata of the each document in the sets of related documents.
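The flow recited in the claims (group related documents, score each group by its redirects, compile a list of bounce pads, then pick a representative from a duplicate cluster) can be illustrated with a minimal Python sketch. It is not taken from the patent and is not a definitive implementation. The claims say only that the redirect score is "based on" the two counts, so the fraction used here is an assumption, as are the use of the host as both the "set of related documents" and the "organization", the thresholds `min_redirect` and `min_targets`, and every name in the code (`Document`, `host_of`, `compile_bounce_pads`, `select_representative`).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Document:
    """Hypothetical record for one crawled document."""
    url: str
    quality: float = 0.0                   # e.g. a link-based score (claim 10)
    redirect_target: Optional[str] = None  # URL this document redirects to, if any


def host_of(url: str) -> str:
    """Assumption: documents sharing a host form one set of related documents."""
    return urlparse(url).hostname or ""


def redirect_score(docs) -> float:
    """Assumed formula: fraction of documents that are sources of redirects."""
    sources = sum(1 for d in docs if d.redirect_target is not None)  # first value
    non_sources = len(docs) - sources                                # second value
    total = sources + non_sources
    return sources / total if total else 0.0


def spam_score(docs) -> int:
    """Assumed formula: number of distinct target hosts, standing in for
    the quantity of organizations that are targets of the redirects."""
    return len({host_of(d.redirect_target)
                for d in docs if d.redirect_target is not None})


def compile_bounce_pads(docs, min_redirect=0.5, min_targets=3):
    """Return the hosts whose document sets are classified as bounce pads.
    Both thresholds are illustrative, not values from the claims."""
    by_host = defaultdict(list)
    for d in docs:
        by_host[host_of(d.url)].append(d)
    return {host for host, group in by_host.items()
            if redirect_score(group) >= min_redirect
            and spam_score(group) >= min_targets}


def select_representative(cluster, bounce_pads):
    """Rank a non-empty duplicate cluster by quality, with documents on
    bounce-pad hosts moved to the bottom, and return the top document.
    Assumption: if every document is on a bounce pad, the best of them wins."""
    ranked = sorted(cluster,
                    key=lambda d: (host_of(d.url) in bounce_pads, -d.quality))
    return ranked[0]
```

Sorting on the pair (is on a bounce pad, negative quality) covers both variants in the claims: a bounce-pad document is never chosen while an alternative exists, and it also ends up toward the bottom of the ranked list. The document returned by `select_representative` is the one that would then be indexed.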
August 27, 2013