US-8886648

System and method for computation of document similarity

Published: November 11, 2014

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

A method for searching for documents may include receiving information for a source document, the information including at least one topic and a weight for each topic, where the topic relates to content of the source document, and the weight represents how strongly the topic is associated with the source document, accessing an index containing topics and references to documents containing those topics and selecting a set of documents, where each document in the set is associated with at least one of the topics in the source document, generating similarity scores based on the weight of a topic in the source document and a weight of the same topic in each document within the set of documents having that topic, and selecting a subset of documents from the set of documents based on the similarity scores.

Patent Claims

22 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computer-implemented method for searching for documents, comprising: enabling a user to select a source document; receiving information for the source document, the information including at least one topic and a weight for each topic, where the topic relates to content of the source document, and the weight represents how strongly the topic is associated with the source document; accessing an index, stored in a memory, containing a plurality of topics and references to documents, each entry in the index storing a topic of the plurality of topics with a list of each of the documents containing the topic; selecting a set of documents by comparing the at least one topic of the source document to the entries of the index to obtain each list in the index that is stored with one or more entries in the index matching the at least one topic of the source document, each document of the set of documents having a respective document signature formed of at least one topic relating to content of the document and a weight for each topic; generating similarity scores based on the weight of a topic in the source document and a weight of the same topic in each document within the set of documents having that topic as indicated by the document signature; selecting a subset of documents from the set of documents based on the similarity scores; and outputting an identity of the subset of documents for display to the user.
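The steps of claim 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the document names, the topic weights, the function names `build_index` and `find_similar`, and the scoring rule (sum of products of shared topic weights) are all hypothetical choices made for the example.

```python
from collections import defaultdict

# Hypothetical document signatures: topic -> weight for each document.
signatures = {
    "doc_a": {"solar": 0.6, "storage": 0.4},
    "doc_b": {"solar": 0.5, "policy": 0.5},
    "doc_c": {"cooking": 1.0},
}

def build_index(signatures):
    """Map each topic to the list of documents containing that topic."""
    index = defaultdict(list)
    for doc_id, signature in signatures.items():
        for topic in signature:
            index[topic].append(doc_id)
    return index

def find_similar(source, signatures, index, max_results=10):
    """Return (doc_id, score) pairs for documents sharing a topic with the source."""
    # Candidate set: every document listed under any topic of the source document.
    candidates = {doc_id for topic in source for doc_id in index.get(topic, [])}
    scores = {}
    for doc_id in candidates:
        signature = signatures[doc_id]
        # Assumed scoring rule: sum of products of the weights of shared topics.
        scores[doc_id] = sum(
            weight * signature[topic]
            for topic, weight in source.items()
            if topic in signature
        )
    # Best scores first; ties broken by document id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_results]

index = build_index(signatures)
source = {"solar": 0.7, "storage": 0.3}
results = find_similar(source, signatures, index)
```

In this sketch `doc_c` is never scored, because the index lookup only returns documents that share at least one topic with the source document.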

2. The method of claim 1, wherein a sum of weights for all topics in one document equals 1.0 or 100%.
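One way to obtain weights that sum to 1.0, as claim 2 requires, is to divide each raw topic weight by the document's total. The helper name `normalize_weights` and the raw values below are assumptions for illustration; the listing does not say how the weights are produced.

```python
def normalize_weights(raw_weights):
    """Scale topic weights so that they sum to 1.0 for one document."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("topic weights must have a positive sum")
    return {topic: weight / total for topic, weight in raw_weights.items()}

# Hypothetical raw topic strengths for a single document.
signature = normalize_weights({"solar": 3.0, "storage": 1.0})
```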

3. The method of claim 1, wherein the step of generating includes computing a normalized cosine similarity of weights for each topic.
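Claim 3 names cosine similarity but this listing does not give the exact formula, so the following is the textbook cosine similarity applied to sparse topic-weight signatures, offered as a sketch. The function name `cosine_similarity` and the sample signatures are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two topic -> weight dictionaries."""
    # Only topics present in both signatures contribute to the dot product.
    shared = set(a) & set(b)
    dot = sum(a[topic] * b[topic] for topic in shared)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty signature is treated as similar to nothing
    return dot / (norm_a * norm_b)

source = {"solar": 0.6, "storage": 0.4}
same_topics = cosine_similarity(source, {"solar": 0.6, "storage": 0.4})
no_overlap = cosine_similarity(source, {"cooking": 1.0})
```

Dividing by the vector norms keeps the score between 0 and 1 for non-negative weights, so documents with many topics are not favored merely for having more entries.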

4. The method of claim 1, wherein the step of generating is repeated for each topic in the source document.

5. The method of claim 1, wherein outputting the identity of the subset comprises outputting a graphical representation of the subset of documents.

6. The method of claim 1, wherein the subset comprises a cluster of documents associated with one or more best similarity scores.

7. The method of claim 1, further comprising outputting a representation of corresponding similarity scores for display to the user with the identity of the subset of documents.

8. The method of claim 7, wherein the representation comprises a graphical representation of corresponding similarity scores.

9. The method of claim 1, further comprising outputting the identity of one or more highest weighted overlapping topics to a user.
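Claim 9 does not define how overlapping topics are ranked. As one possible reading, the sketch below ranks the topics shared by two signatures by the product of their weights; both that ranking rule and the name `top_overlapping_topics` are assumptions.

```python
def top_overlapping_topics(source, candidate, limit=3):
    """Return up to `limit` shared topics, strongest overlap first."""
    shared = set(source) & set(candidate)
    # Assumed ranking: product of the two weights, ties broken alphabetically.
    ranked = sorted(shared, key=lambda topic: (-source[topic] * candidate[topic], topic))
    return ranked[:limit]

source = {"solar": 0.5, "storage": 0.3, "policy": 0.2}
candidate = {"solar": 0.2, "storage": 0.7, "cooking": 0.1}
overlap = top_overlapping_topics(source, candidate)
```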

10. The method of claim 1, wherein the subset of documents is limited to a maximum predetermined number of documents.

11. The method of claim 1, wherein the subset of documents is selected based on a predetermined threshold similarity score.
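The two selection rules in claims 10 and 11 (a maximum number of documents and a threshold score) can be combined in one small helper. This is a minimal sketch; the function name `select_subset`, its parameters, and the sample scores are hypothetical.

```python
def select_subset(scores, max_documents=None, min_score=None):
    """Pick the best-scoring documents, optionally capped and thresholded."""
    # Highest score first; ties broken by document id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    if min_score is not None:
        ranked = [(doc_id, score) for doc_id, score in ranked if score >= min_score]
    if max_documents is not None:
        ranked = ranked[:max_documents]
    return ranked

# Hypothetical similarity scores for three candidate documents.
scores = {"doc_a": 0.9, "doc_b": 0.4, "doc_c": 0.7}
top_two = select_subset(scores, max_documents=2)
above_threshold = select_subset(scores, min_score=0.8)
```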

12. The method of claim 1, wherein the similarity scores comprise normalized similarity scores.

13. A computer-implemented system for searching for documents, comprising: a memory storing an index containing a plurality of topics and references to documents, each entry in the index storing a topic of the plurality of topics with a reference to each of the documents containing the topic; and a processor programmed to: enable a user to select a source document; receive information for the source document, the information including at least one topic and a weight for each topic, where the topic relates to content of the source document, and the weight represents how strongly the topic is associated with the source document; access the index; select a set of documents in the index by comparing the at least one topic of the source document to the entries of the index to obtain each reference in the index that is stored with one or more entries in the index matching the at least one topic of the source document, each of the set of documents having a respective document signature formed of at least one topic relating to content of the document and a weight for each topic; generate similarity scores based on the weight of a topic in the source document and a weight of the same topic in each document within the set of documents having that topic as indicated by the document signature; select a subset of documents from the set of documents based on the similarity scores; and output an identity of the subset of documents for display to the user.

14. The system of claim 13, wherein to generate similarity scores the processor is programmed to compute a normalized cosine similarity of weights for each topic.

15. The system of claim 13, wherein to output the identity of the subset the processor is programmed to output a graphical representation of the subset of documents.

16. The system of claim 13 wherein the processor is programmed to: output a representation of corresponding similarity scores for display to the user with the identity of the subset of documents.

17. The system of claim 16, wherein the representation comprises a graphical representation of corresponding similarity scores.

18. The system of claim 13 wherein the processor is programmed to: output an identity of one or more highest weighted overlapping topics to a user.

19. The system of claim 13, wherein the subset of documents is selected based on a predetermined threshold similarity score.

20. The system of claim 13, wherein the similarity scores comprise normalized similarity scores.

21. A non-transitory computer storage medium having computer executable instructions which when executed by a computer cause the computer to perform operations comprising: enabling a user to select a source document; receiving information for the source document, the information including at least one topic and a weight for each topic, where the topic relates to content of the source document, and the weight represents how strongly the topic is associated with the source document; accessing an index, stored in a memory, containing a plurality of topics and references to documents, each entry in the index storing a topic of the plurality of topics with a reference to each of the documents containing the topic; selecting a set of documents by comparing the at least one topic of the source document to the entries of the index to obtain each reference in the index that is stored with one or more entries in the index matching the at least one topic of the source document, each of the set of documents having a respective document signature formed of at least one topic relating to content of the document and a weight for each topic; generating similarity scores based on the weight of a topic in the source document and a weight of the same topic in each document within the set of documents having that topic as indicated by the document signature; selecting a subset of documents from the set of documents based on the similarity scores; and outputting an identity of the subset of documents for display to the user.

22. The computer storage medium of claim 21, wherein the step of generating includes computing a normalized cosine similarity of weights for each topic.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F

Patent Metadata

Filing Date

January 31, 2012

Publication Date

November 11, 2014
