A system and method for performing discovery of digital information in a subject area is provided. Each of topics in a subject area, training material for the topics, and a corpus comprising digital information are designated. Topic models for each of the topics are built. The topic models are evaluated against the training material. The digital information from the corpus is organized by the topics using the topic models into an evergreen index.
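The build-and-evaluate loop summarized above can be illustrated with a minimal sketch. It is not the patented implementation. It assumes patterns are plain word conjunctions, that articles are whitespace-separated text, and that each match or miss shifts the score by one point, following the higher/lower scoring rule recited in the method claim. The names `generate_candidates`, `matches`, `rate`, and `build_index` are hypothetical.

```python
from itertools import combinations


def generate_candidates(seed_words):
    """Assumed pattern generator: single seed words plus two-word conjunctions."""
    singles = [frozenset([w]) for w in seed_words]
    pairs = [frozenset(p) for p in combinations(seed_words, 2)]
    return singles + pairs


def matches(pattern, article):
    """A pattern matches when every one of its words occurs in the article."""
    return pattern <= set(article.lower().split())


def rate(pattern, on_topic, off_topic):
    """Score a candidate: reward on-topic matches and off-topic non-matches,
    penalize on-topic misses and off-topic matches (unit weights are assumed)."""
    score = 0
    for article in on_topic:
        score += 1 if matches(pattern, article) else -1
    for article in off_topic:
        score += -1 if matches(pattern, article) else 1
    return score


def build_index(topics):
    """Pair each topic with its best-rated candidate pattern.

    `topics` maps a topic label to (seed_words, on_topic, off_topic).
    """
    index = {}
    for topic, (seeds, on_topic, off_topic) in topics.items():
        candidates = generate_candidates(seeds)
        index[topic] = max(candidates, key=lambda p: rate(p, on_topic, off_topic))
    return index
```

Calling `build_index` with a dictionary of topics and their training examples returns one pattern per topic, which could then be applied to new articles in the corpus with `matches`.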
Legal claims defining the scope of protection, as filed with the USPTO.
1. A system for performing discovery of digital information in a subject area, comprising: an information collection maintained in a storage device; and a computer comprising a processor and memory within which code for execution by the processor is stored, comprising: a user interface of the computer configured to designate each of topics in a subject area, training material for the topics, and a corpus comprising electronically-stored digital information; a topic modeler configured to build candidate topic models on the computer, comprising: a seed word selector configured to select seed words for each of the topics, and a pattern generator configured to generate patterns from the seed words for each topic as candidate topic models for that topic; an index trainer to evaluate the topic models against the training material comprising: a pattern tester configured to match the patterns in each candidate topic model to the training material and to rate the candidate topic model based on topical prediction; and an index builder configured to build an evergreen index comprising topic models for each of the topics by pairing each topic to the candidate topic model that was best rated.
2. A system according to claim 1, further comprising: topic models further comprising: a performance evaluator configured to favor the candidate topic models for the selected topic based on how well each candidate topic model performs by assigning a higher score to those candidate topic models, which match all of the on topic information for the selected topic and do not match all of the off topic information for the selected topic.
3. A system according to claim 1, further comprising: a finite state topic modeler configured to form each topic model as a predicate in a finite state language, and to apply each predicate to the corpus as a query that returns those places in the corpus matched by the predicate.
4. A system according to claim 1, further comprising: a classifier configured to classify articles in the corpus against the topic models in the evergreen index; and a display configured to present through the user interface the classified articles in the evergreen index.
5. A system according to claim 1, wherein one or more subtopics under at least one of the topics are hierarchically structured.
6. A system according to claim 1, further comprising: a set of positive training examples maintained in the storage device and comprising articles in the corpus corresponding to correct citations for each topic model.
7. A system according to claim 6, further comprises: a set of negative training examples maintained in the storage device and comprising articles in the corpus corresponding to incorrect citations for each topic model.
8. A system according to claim 6, further comprising: a characteristic word selector configured to identify basis words that are characteristic of each topic and based on the articles in the positive training examples set, and to combine one or more of the basis words as seed words into each of the topic models for the topic.
9. A system according to claim 8, wherein additional words are included with the seed words comprising at least one of words in the topic and words proximate to labels in the corpus referenced by the citations of that topic.
10. A system according to claim 1, wherein the digital information comprises one or more of printed documents, Web pages, and material written in a digital media.
11. A system according to claim 1, further comprising: a structure evaluator configured to favor the candidate topic model for the selected topic comprising a simpler structure over those of the candidate topic for the selected topic comprising a complex structure by assigning a higher score to the simpler candidate topic models.
12. A system according to claim 1, further comprising: a bias evaluator configured to bias the candidate topic models for the selected topic that have term overlap with topic labels associated with the selected topic by assigning a higher score to the overlapping candidate topic models.
13. A method for performing discovery of digital information in a subject area, comprising: designating through a user interface of a computer a corpus comprising electronically-stored digital information, which are maintained in a storage device; selecting one or more topics and training material for the selected topics comprising on topic information and off topic information; building candidate topic models on the computer comprising: selecting seed words for each of the selected topics; and generating patterns from the seed words for each topic as candidate topic models for that topic; evaluating the candidate topic models for each selected topic against the training material comprising: matching the patterns in each candidate topic model to the training material; rating each candidate topic model for the selected topic comprising: assigning a higher score to each candidate topic model that matches the on topic information for the selected topic; assigning a lower score to each candidate topic model that does not match the on topic information for the selected topic; assigning a higher score to each candidate topic model that does not match the off topic information for the selected topic; and assigning a lower score to each candidate topic model that matches the off topic information for the selected topic; and building an evergreen index comprising topic models for each of the selected topics by pairing each topic to the candidate topic model that has the best overall score.
14. A method according to claim 13, further comprising: favoring the candidate topic models for the selected topic based on how well each candidate topic model performs by assigning a higher score to those candidate topic models, which match all of the on topic information for the selected topic and do not match all of the off topic information for the selected topic.
15. A method according to claim 13, further comprising: forming each topic model as a predicate in a finite state language; and applying each predicate to the corpus as a query that returns those places in the corpus matched by the predicate.
16. A method according to claim 13, further comprising: classifying articles in the corpus against the topic models in the evergreen index; and presenting the classified articles in the evergreen index.
17. A method according to claim 13, further comprising: hierarchically structuring one or more subtopics under at least one of the selected topics.
18. A method according to claim 13, further comprising: specifying the on topic information by defining a set of positive training examples maintained in the storage device and comprising articles in the corpus corresponding to correct citations for each topic model.
19. A method according to claim 18, further comprises: specifying the off topic information by defining a set of negative training examples maintained in the storage device and comprising articles in the corpus corresponding to incorrect citations for each topic model.
20. A method according to claim 18, further comprising: identifying basis words that are characteristic of each topic and based on the articles in the positive training examples set; and combining one or more of the basis words as seed words into each of the topic models for the topic.
21. A method according to claim 20, further comprising: including additional words with the seed words comprising at least one of words in the topic and words proximate to labels in the corpus referenced by the citations of that topic.
22. A method according to claim 13, wherein the digital information comprises one or more of printed documents, Web pages, and material written in a digital media.
23. A method according to claim 13, further comprising: favoring the candidate topic model for the selected topic comprising a simpler structure over those of the candidate topic for the selected topic comprising a complex structure by assigning a higher score to the simpler candidate topic models.
24. A method according to claim 13, further comprising: biasing the candidate topic models for the selected topic that have term overlap with topic labels associated with the selected topic by assigning a higher score to the overlapping candidate topic models.
25. An apparatus for performing discovery of digital information in a subject area, comprising: means for designating through a user interface of a computer a corpus comprising electronically-stored digital information, which are maintained in a storage device; means for selecting one or more topics and training material for the selected topics comprising on topic information and off topic information; means for building candidate topic models on the computer comprising: means for selecting seed words for each of the selected topics; and means for generating patterns from the seed words for each topic as candidate topic models for that topic; means for evaluating the candidate topic models for each selected topic against the training material comprising: means for matching the patterns in each candidate topic model to the training material; means for rating each candidate topic model for the selected topic comprising: means for assigning a higher score to each candidate topic model that matches the on topic information for the selected topic; means for assigning a lower score to each candidate topic model that does not match the on topic information for the selected topic; means for assigning a higher score to each candidate topic model that does not match the off topic information for the selected topic; and means for assigning a lower score to each candidate topic model that matches the off topic information for the selected topic; and means for building an evergreen index comprising topic models for each of the selected topics by pairing each topic to the candidate topic model that has the best overall score.
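Claims 3 and 15 describe forming each topic model as a predicate in a finite state language and applying it to the corpus as a query that returns the matched places. The following is a hedged sketch of that idea only. The predicate vocabulary (`Word`, `Phrase`, `And`, `Or`), the whitespace tokenization, and the `(article_index, token_position)` result format are all assumptions made for illustration and do not come from the claims.

```python
from dataclasses import dataclass
from typing import List, Tuple


class Predicate:
    """Base class: a predicate reports the token positions it matches."""

    def hits(self, tokens: List[str]) -> List[int]:
        raise NotImplementedError


@dataclass(frozen=True)
class Word(Predicate):
    word: str

    def hits(self, tokens):
        return [i for i, t in enumerate(tokens) if t == self.word]


@dataclass(frozen=True)
class Phrase(Predicate):
    """Two adjacent words; reports the position of the first word."""
    first: str
    second: str

    def hits(self, tokens):
        return [i for i in range(len(tokens) - 1)
                if tokens[i] == self.first and tokens[i + 1] == self.second]


@dataclass(frozen=True)
class And(Predicate):
    """Matches only when both sides match somewhere in the same article."""
    left: Predicate
    right: Predicate

    def hits(self, tokens):
        l, r = self.left.hits(tokens), self.right.hits(tokens)
        return sorted(set(l + r)) if l and r else []


@dataclass(frozen=True)
class Or(Predicate):
    left: Predicate
    right: Predicate

    def hits(self, tokens):
        return sorted(set(self.left.hits(tokens) + self.right.hits(tokens)))


def query(predicate: Predicate, corpus: List[str]) -> List[Tuple[int, int]]:
    """Apply a predicate to every article and collect the matched places
    as (article_index, token_position) pairs."""
    results = []
    for n, article in enumerate(corpus):
        tokens = article.lower().split()
        results.extend((n, i) for i in predicate.hits(tokens))
    return results
```

Because each predicate is a small composable object, a topic model such as `And(Word("mortgage"), Word("loan"))` can be stored in an index and re-run against new articles as they arrive.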
August 12, 2008
April 24, 2012