Systems and Methods for Determining Relevant Information Based on Document Structure

Published: June 15, 2010

Assignee: not available

Inventors: Martin H. Van Den Berg, Giovanni L. Thione, Livia Polanyi, Eleanor G. Rieffel, Patrick Chiu, Bee Yian Liew

Patent Claims

14 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of determining information relevant to a location within a first document, the method comprising: receiving a selection of the first document, the first document being received through an input and output interface of a computer; identifying at least two structural elements in the first document having a dominance relationship, the identifying being performed by one or more processors of the computer; receiving a selection of a first location in the first document from a user through the input and output interface; determining surrounding structural elements surrounding the first location, the determining comprising selecting from the at least two structural elements; characterizing the surrounding structural elements by the one or more processors; characterizing one or more non-surrounding structural elements from among the at least two structural elements not determined to be the surrounding structural elements by the one or more processors; characterizing surrounding phrase for frequency of occurrence of a plurality of first terms by the one or more processors; characterizing non-surrounding phrases in the first document for the occurrence of the plurality of the first terms by the one or more processors, the non-surrounding phrases being phrases in the first document other than the surrounding phrase; associating one or more second documents with the surrounding structural elements based on the characterization of the surrounding structural elements and the one or more non-surrounding structural elements by the one or more processors, wherein the one or more second documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements; creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of: Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterizing of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; and removing a second group of the one or more second documents from among first groups of the one or more second documents to obtain a third group of the one or more documents, wherein the removing is based on the characterizing the surrounding structure elements.
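As an illustration of the "determining surrounding structural elements" step in claim 1, the following is a minimal sketch, not the patented implementation. It assumes that a dominance relationship can be modeled as a tree in which each element covers a character range, and that the selected location is a character offset. The `Element` class, the `split_by_location` function, and the sample outline are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    """Hypothetical structural element covering characters [start, end)."""
    label: str
    start: int
    end: int
    children: List["Element"] = field(default_factory=list)


def split_by_location(root: Element, location: int) -> Tuple[List[str], List[str]]:
    """Partition elements into those that contain the location and those that do not."""
    surrounding: List[str] = []
    non_surrounding: List[str] = []

    def visit(node: Element) -> None:
        # An element "surrounds" the location if its range contains the offset.
        if node.start <= location < node.end:
            surrounding.append(node.label)
        else:
            non_surrounding.append(node.label)
        for child in node.children:
            visit(child)

    visit(root)
    return surrounding, non_surrounding


# Assumed sample outline: a document dominating two sections and their paragraphs.
doc = Element("document", 0, 300, [
    Element("section 1", 0, 120, [
        Element("paragraph 1.1", 0, 60),
        Element("paragraph 1.2", 60, 120),
    ]),
    Element("section 2", 120, 300, [
        Element("paragraph 2.1", 120, 300),
    ]),
])

surrounding, non_surrounding = split_by_location(doc, 75)
print(surrounding)      # elements on the path from the root down to offset 75
print(non_surrounding)  # every other element in the hierarchy
```

In this sketch the surrounding elements are the chain of nested elements containing the selected offset, and everything else is treated as non-surrounding. The claim also allows structure derived from discourse theories such as LDM or RST, which this example does not attempt to model.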

2. The method of claim 1, in which the first location is selected based on at least one of: manually and programmatic control.

3. The method of claim 2, in which the manual selection of the first location is based on at least one of: implicit and explicit user input.

4. The method of claim 1, in which the second documents comprise human sensible information.

5. The method of claim 4, in which the human sensible information is at least one of textual, audio and video information.

6. The method of claim 1, in which the first document comprises at least one of textual, audio and video information.

7. The method of claim 1, wherein the associating second documents with the surrounding structural element comprises: determining third documents being similar to the surrounding structural element; and removing, from among the third documents, fourth documents being similar to the non-surrounding structural elements to obtain the second documents.
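The two-stage selection in claim 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: plain term-frequency vectors compared by cosine similarity stand in for the representative vectors, the LSA step named in claim 1 is not performed, and the thresholds `keep_at` and `drop_at`, the function names, and the sample texts are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_vector(text: str) -> Counter:
    """Term-frequency vector: count of each lowercased whitespace token."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def associate(surrounding: str, non_surrounding: str, candidates: Dict[str, str],
              keep_at: float = 0.3, drop_at: float = 0.3) -> List[str]:
    """Keep candidates similar to the surrounding text, then drop those also
    similar to the non-surrounding text (thresholds are assumed values)."""
    s_vec, n_vec = tf_vector(surrounding), tf_vector(non_surrounding)
    # Stage 1: "third documents" are similar to the surrounding element.
    third = {name: text for name, text in candidates.items()
             if cosine(tf_vector(text), s_vec) >= keep_at}
    # Stage 2: "fourth documents" are also similar to the non-surrounding elements.
    fourth = {name for name, text in third.items()
              if cosine(tf_vector(text), n_vec) >= drop_at}
    return [name for name in third if name not in fourth]


# Assumed sample data: the selected passage is about the animal, the rest about the car.
candidates = {
    "wildlife_guide": "rainforest cat habitat and prey of the jaguar",
    "dealer_review": "jaguar cat habitat near a car dealer with engine speed tests",
    "market_report": "stock market report",
}
result = associate("jaguar cat habitat rainforest prey",
                   "jaguar car engine speed dealer",
                   candidates)
print(result)
```

Here `dealer_review` passes the first stage because it shares terms with the surrounding text, but it is removed in the second stage because it overlaps even more with the non-surrounding text. A production system would likely apply dimensionality reduction such as LSA before comparing vectors.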

8. An apparatus for determining relevant information comprising: one or more processors; an input/output circuit that retrieves a first document from a document repository responsive to a user selection; a document structure manager that identifies at least two structural elements in the first document having a dominance relationship; input and output interface that receives a selection of a first location in the first document; a structural element manager identifies surrounding structural elements surrounding the selected first location and one or more non-surrounding structural elements from among the at least two structural elements; a characterization manager characterizes the surrounding structural elements and the one or more non-surrounding structural elements from among the at least two structural elements that is not determined to be the surrounding structural elements; the characterization manager characterizes surrounding phrase for frequency of occurrence of a plurality of first terms; the characterization manager further characterizes non-surrounding phrases in the first document for the occurrence of the plurality of the first terms, the non-surrounding phrases being phrases in the first document other than the surrounding phrase; and a readable program code for: associating one or more second documents with the surrounding structural elements based on the characterization of the surrounding structural elements and the one or more non-surrounding structural elements, wherein the one or more second documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements; creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of: Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterizing of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; and removing a second group of the one or more second documents from among first groups of the one or more second documents to obtain a third group of the one or more documents, wherein the removing is based on the characterizing the surrounding structure elements.

9. The apparatus of claim 8, in which the first location is selected based on at least one of: manually and programmatically.

10. The apparatus of claim 8, in which additional documents comprise human sensible information.

11. The apparatus of claim 10, in which the human sensible information is at least one of textual, audio and video information.

12. The apparatus of claim 8, in which the determined document comprises at least one of textual, audio and video information.

13. A computer readable storage medium comprising computer readable program code embodied on the computer readable storage medium, the computer readable program code useable to program a computer for performing the steps of: receiving a selection of a first document, the first document being received through an input and output interface of a computer; identifying at least two structural elements in the first document having a dominance relationship, the identifying being performed by one or more processors of the computer; receiving a selection of a first location in the first document from a user through the input and output interface; determining surrounding structural elements surrounding the first location, the determining comprising selecting from the at least two structural elements; characterizing the surrounding structural elements by the one or more processors; characterizing one or more non-surrounding structural elements from among the at least two structural elements not determined to be the surrounding structural elements by the one or more processors; characterizing surrounding phrase for frequency of occurrence of a plurality of first terms by the one or more processors; characterizing non-surrounding phrases in the first document for the occurrence of the plurality of the first terms by the one or more processors, the non-surrounding phrases being phrases in the first document other than the surrounding phrase; associating one or more second documents with the surrounding structural elements based on the characterization of the surrounding structural elements and the one or more non-surrounding structural elements by the one or more processors, wherein the one or more second documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements; creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of: Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterizing of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; and removing a second group of the one or more second documents from among first groups of the one or more second documents to obtain a third group of the one or more documents, wherein the removing is based on the characterizing the surrounding structure elements.

14. A method for retrieving information relevant to a word within a document, the method comprising: retrieving, responsive to a first user input, a first document from a document repository saved on a database coupled to a computer, the first document including a plurality of phrases; determining the plurality of phrases in the first document by one or more processors of the computer, the determining comprising selecting from at least two structural documents; selecting, responsive to a second user input, a first word within the first document by the one or more processors, the first user input and the second user input being received through an input and output interface of the computer; determining a first phrase that includes the first word as a surrounding phrase by the one or more processors; characterizing the surrounding phrase for frequency of occurrence of a plurality of first terms by the one or more processors; characterizing non-surrounding phrases in the first document for the occurrence of the plurality of the first terms by the one or more processors, the non-surrounding phrases being phrases in the first document other than the surrounding phrase; finding a first group of one or more documents being similar to the surrounding phrase based on the characterization of the surrounding phrase; finding within the first group of the one or more documents, a second group of the one or more documents being similar to the non-surrounding phrases, the second group of one or more documents being similar to both the surrounding phrase and the non-surrounding phrases; associating the one or more documents with surrounding structural elements based on characterization of the surrounding structural elements and one or more non-surrounding structural elements by the one or more processors, wherein the one or more documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements; creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of: Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterization of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; removing a second group of the one or more documents from among first groups of the one or more documents to obtain a third group of the one or more documents, wherein the removing is based on the characterization of the surrounding structure elements; and outputting the third group of the one or more documents to the user on the input and output interface.

Patent Metadata

Filing Date

Unknown

Publication Date

June 15, 2010

Inventors

Martin H. Van Den Berg

Giovanni L. Thione

Livia Polanyi

Eleanor G. Rieffel

Patrick Chiu

Bee Yian Liew
