Topic Word Generation Method and System

Published: December 18, 2012

Assignee: not available in the USPTO data

Inventors: Fraser Shein, Tom Nantais, Dan Li

Patent Claims

28 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of generating topic words from at least one seed word and a collection of electronic documents comprising the steps of: a. identifying keywords in each document that are indicative of the topic of the document; b. evaluating the relevance of each of the documents to the at least one seed word; c. identifying at least one key topic document that is relevant to the at least one seed word; d. selecting a subset of the documents, referred to as topic documents, by an iterative process starting with the selection of the at least one key topic document and then selecting other documents if their keywords are sufficiently similar to the keywords contained in the previously selected topic documents; and e. extracting a set of topic words from the topic documents, wherein the steps of the method are performed by a computer processor running software.
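As an illustration only, steps (a) through (e) of claim 1 can be sketched end to end. The claim does not fix how keywords are identified, how relevance is scored, or what "sufficiently similar" means, so every concrete choice below (the `STOPWORDS` list, counting seed-word occurrences as relevance, the `min_overlap` fraction, and the function names) is an assumption made for this sketch, not the patented implementation.

```python
import re
from collections import Counter

# Hypothetical list of structural words; the claim does not define one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "from"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def keywords(text, top_n=5):
    """Step (a): treat the most frequent non-structural words as keywords."""
    words = [w for w in tokenize(text) if w not in STOPWORDS]
    return {w for w, _ in Counter(words).most_common(top_n)}

def generate_topic_words(seed_words, documents, min_overlap=0.5):
    """documents maps a document id to its text."""
    seeds = {s.lower() for s in seed_words}
    kw = {doc_id: keywords(text) for doc_id, text in documents.items()}

    # Step (b): relevance is assumed to be the count of seed-word occurrences.
    def score(doc_id):
        return sum(1 for w in tokenize(documents[doc_id]) if w in seeds)

    ranked = sorted(documents, key=score, reverse=True)

    # Step (c): the top-ranked document is the key topic document,
    # provided it mentions a seed word at all.
    if not ranked or score(ranked[0]) == 0:
        return []
    topic_docs = [ranked[0]]
    pool = set(kw[ranked[0]])

    # Step (d): add documents whose keywords overlap enough with the
    # keywords of the documents selected so far.
    for doc_id in ranked[1:]:
        if kw[doc_id] and len(kw[doc_id] & pool) / len(kw[doc_id]) >= min_overlap:
            topic_docs.append(doc_id)
            pool |= kw[doc_id]

    # Step (e): topic words are the keywords of the topic documents,
    # most widely shared first, with the seed words themselves left out.
    counts = Counter(w for d in topic_docs for w in kw[d])
    return [w for w, _ in counts.most_common() if w not in seeds]
```

With a toy collection such as `{"d1": "volcano lava magma volcano", "d2": "lava magma crater", "d3": "stock market bond"}` and the seed word `volcano`, the sketch selects `d1` and `d2` as topic documents and leaves `d3` out.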

2. The method of claim 1 wherein each document comprises an index, and the evaluation step comprises producing a sorted list of document indices that include the at least one seed word based on relevance to the at least one seed word, and to identify at least one key topic document index that is highly relevant to at least one seed word.

3. The method of claim 2 wherein the relevance of a document index is evaluated by comparison of the at least one seed word to the title of each document and keywords contained within each document index.

4. The method of claim 1 wherein in the step of identifying keywords, words contained in a pre-defined dictionary are excluded from consideration as keywords.

5. The method of claim 1 wherein in the step of identifying keywords, words that serve structural purposes are excluded from consideration as keywords.
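The exclusions in claims 4 and 5 can be pictured as two filters applied before keywords are chosen. The following is a minimal sketch under stated assumptions: the contents of `COMMON_DICTIONARY` and `STRUCTURAL_WORDS`, the frequency-based ranking, and the `top_n` cutoff are all hypothetical, since the claims do not specify them.

```python
import re
from collections import Counter

# Hypothetical exclusion lists; the claims do not give their contents.
COMMON_DICTIONARY = {"good", "make", "thing", "people", "time"}
STRUCTURAL_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "that"}

def identify_keywords(text, top_n=5):
    """Return up to top_n frequent words that survive both exclusion lists."""
    words = re.findall(r"[a-z]+", text.lower())
    candidates = [
        w for w in words
        if w not in COMMON_DICTIONARY and w not in STRUCTURAL_WORDS
    ]
    return [w for w, _ in Counter(candidates).most_common(top_n)]
```

Ranking the surviving words by frequency is only one plausible way to decide which of them are "indicative of the topic"; the claims leave that choice open.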

6. The method of claim 1 wherein in the step of evaluating the relevance of each of the documents, documents determined to be unlikely to pertain to a single topic are eliminated prior to identifying the at least one key topic document.

7. The method of claim 1 wherein the relevance of a document to the at least one seed word is determined based on the frequency of occurrence of the seed words in the title of the document and within the document.
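Claim 7 bases relevance on how often the seed words occur in the title and in the document body, without saying how the two are combined. One possible combination is sketched below; the `title_weight` parameter and the use of a normalized body frequency are assumptions made for illustration.

```python
import re

def relevance(seed_words, title, body, title_weight=3.0):
    """Score a document by seed-word hits in its title and body.

    title_weight is a hypothetical parameter that makes a title hit
    count for more than a hit in the body.
    """
    seeds = {w.lower() for w in seed_words}
    title_tokens = re.findall(r"[a-z]+", title.lower())
    body_tokens = re.findall(r"[a-z]+", body.lower())
    title_hits = sum(1 for t in title_tokens if t in seeds)
    body_hits = sum(1 for t in body_tokens if t in seeds)
    # Normalize body hits by length so long documents are not favored.
    body_freq = body_hits / len(body_tokens) if body_tokens else 0.0
    return title_weight * title_hits + body_freq
```

Under these assumptions, a document titled "Volcano" whose body mentions the seed word scores higher than one that mentions it only in passing, and a document with no seed words scores zero.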

8. The method of claim 1 wherein the at least one key topic document is chosen to be the document that is most relevant to the at least one seed word if that document is sufficiently relevant to the at least one seed word and otherwise all documents with at least a pre-defined level of relevance to the at least one seed word are chosen to be key topic documents.
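The two-branch rule in claim 8 can be written as a short function. This is a sketch only: the threshold names and values (`strong_threshold`, `min_threshold`) are hypothetical stand-ins for "sufficiently relevant" and the "pre-defined level of relevance", which the claim does not quantify.

```python
def choose_key_topic_documents(scores, strong_threshold=0.8, min_threshold=0.3):
    """scores maps a document id to its relevance to the seed words."""
    if not scores:
        return []
    best = max(scores, key=scores.get)
    # Branch 1: a single document is relevant enough to stand alone.
    if scores[best] >= strong_threshold:
        return [best]
    # Branch 2: otherwise fall back to every document above the lower bar.
    return [doc_id for doc_id, s in scores.items() if s >= min_threshold]
```

For example, scores of `{"a": 0.9, "b": 0.5}` yield only `a`, while `{"a": 0.5, "b": 0.4, "c": 0.1}` yield both `a` and `b`.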

9. The method of claim 1 wherein the topic documents include the at least one key topic document and the other topic documents are selected by an algorithm that considers each document, one at a time, in declining order of relevance to the at least one seed word, and selects a document as a topic document if it contains at least a predefined percentage of keywords that occur as keywords of the previously selected topic documents.
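The selection loop of claim 9 can be sketched as follows. The input shape (a list of `(doc_id, keywords)` pairs already sorted by declining relevance) and the `min_overlap` value standing in for the "predefined percentage" are assumptions made for this example.

```python
def select_topic_documents(ranked_docs, key_ids, min_overlap=0.5):
    """Grow the set of topic documents from the key topic documents.

    ranked_docs: list of (doc_id, keywords) in declining order of relevance.
    key_ids: ids of the key topic documents, which are always selected.
    """
    keywords_by_id = {doc_id: set(kws) for doc_id, kws in ranked_docs}
    selected = list(key_ids)
    pool = set()
    for key_id in key_ids:
        pool |= keywords_by_id[key_id]

    for doc_id, _ in ranked_docs:
        kws = keywords_by_id[doc_id]
        if doc_id in selected or not kws:
            continue
        # Fraction of this document's keywords already seen as keywords
        # of previously selected topic documents.
        if len(kws & pool) / len(kws) >= min_overlap:
            selected.append(doc_id)
            pool |= kws  # later documents are compared against the larger pool
    return selected
```

Because the keyword pool grows with each selection, a document that shares nothing with the key topic document can still be selected if it overlaps with a document selected earlier in the pass.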

10. The method of claim 1 wherein at least one document in the collection of documents is obtained from a specified source.

11. The method of claim 1 wherein the extracted topic words are further processed to eliminate redundant topic words having common morphological roots.
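Claim 11 removes topic words that share a morphological root. A rough sketch of that idea is shown below, using a deliberately crude suffix stripper (`crude_root`, a hypothetical helper) in place of a real stemmer or lemmatizer; the claim does not say which technique is used.

```python
def crude_root(word):
    """Strip a few common English suffixes. Illustrative only, not a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def dedupe_by_root(topic_words):
    """Keep the first topic word seen for each root, preserving order."""
    seen = set()
    kept = []
    for word in topic_words:
        root = crude_root(word.lower())
        if root not in seen:
            seen.add(root)
            kept.append(word)
    return kept
```

Given `["erupt", "erupts", "erupted", "erupting", "lava"]`, the sketch keeps `erupt` and `lava`. A production system would likely use a proper stemming or lemmatization library, since suffix stripping mishandles many irregular forms.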

12. The method of claim 1, wherein the collection of documents is pre-processed such that each document in the collection of documents shares a commonality.

13. A memory having recorded thereon statements and instructions for execution by a computer to carry out the method of claim 1.

14. A method comprising transmitting over a communications medium computer-executable instructions for causing a computer system programmed thereby to perform the method of claim 1.

15. The method of claim 1 wherein the at least one seed word is obtained by analyzing user-entered text.

16. A computer system for extracting topic words from electronic documents based on at least one seed word comprising: a. a programmable computer processor; b. a memory readable by the processor; and c. software stored in the memory for execution by the processor, the software comprising: i. a keyword identification module for identifying keywords in each document that are indicative of the topic of the document; ii. an evaluation module for evaluating the relevance of each of the documents to the at least one seed word; iii. a key topic document identification module for identifying at least one key topic document that is relevant to the at least one seed word; iv. a selection module for selecting a subset of the documents, referred to as topic documents, by an iterative process starting with the at least one key topic document and then selecting other documents if their keywords are sufficiently similar to the keywords contained in the previously selected topic documents; and v. an extraction module for extracting a set of topic words from the topic documents.

17. The system of claim 16 wherein each document comprises an index, and the evaluation module produces a sorted list of document indices that include the at least one seed word based on relevance to the at least one seed word, and identifies at least one key topic document index that is highly relevant to at least one seed word.

18. The system of claim 17 wherein the relevance of a document index is evaluated by comparison of the at least one seed word to the title of each document and keywords contained within each document index.

19. The system of claim 16 wherein in the step of identifying keywords, the keyword identification module excludes words contained in a pre-defined dictionary from consideration.

20. The system of claim 16 wherein in the step of identifying keywords, the keyword identification module excludes words that serve structural purposes.

21. The system of claim 16 wherein the system further comprises a filtering module for eliminating documents determined to be unlikely to pertain to a single topic prior to the execution of the keyword identification module.

22. The system of claim 16 wherein the relevance of a document to the at least one seed word is determined by the evaluation module based on the frequency of occurrence of the seed words in the title of the document and within the document.

23. The system of claim 16 wherein the at least one key topic document is chosen by the key topic document identification module to be the document that is most relevant to the at least one seed word if that document is sufficiently relevant to the at least one seed word and otherwise all documents with at least a pre-defined level of relevance to the at least one seed word are chosen to be key topic documents.

24. The system of claim 16 wherein the topic documents are selected by the selection module to include the at least one key topic document and the other topic documents are selected by an algorithm that considers each document, one at a time, in declining order of relevance to the at least one seed word, and selects a document as a topic document if it contains at least a predefined percentage of keywords that occur as keywords of the previously selected topic documents.

25. The system of claim 16 wherein at least one document in the collection of documents is obtained from a specified source.

26. The system of claim 16 wherein the extracted topic words are further processed to eliminate redundant topic words having common morphological roots.

27. The system of claim 16, wherein the collection of documents is pre-processed such that each document in the collection of documents shares a commonality.

28. The system of claim 16 wherein the at least one seed word is obtained by analyzing user-entered text.

Patent Metadata

Filing Date: Unknown

Publication Date: December 18, 2012

Inventors: Fraser Shein, Tom Nantais, Dan Li
