US-8914720

Method and system for constructing a document redundancy graph

Published: December 16, 2014

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

A system and method for constructing a document redundancy graph with respect to a document set. The redundancy graph can be constructed with a node for each paragraph associated with the document set such that each node in the redundancy graph represents a unique cluster of information. The nodes can be linked in an order with respect to the information provided in the document set and bundles of redundant information from the document set can be mapped to individual nodes. A data structure (e.g., a hash table) of a paragraph identifier associated with a probability value can be constructed for eliminating inconsistencies with respect to node redundancy. Additionally, a sequence of unique nodes can also be integrated into the graph construction process. The nodes can be connected to the paragraphs associated with the document set via a hyperlink and/or via a label with respect to each node.
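
As an illustration of the starting point the abstract describes, the following is a minimal sketch that creates one node per paragraph, links the nodes in reading order within each document, and keeps a hash table of paragraph-identifier pairs with a probability value. The names `Node`, `RedundancyGraph`, `build_initial_graph`, and `match_table`, the `(document, index)` identifier format, and the 40-character label are assumptions made for this example and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    paragraph_ids: list   # paragraphs bundled into this node
    label: str = ""       # hypothetical label, e.g. a summary or the text itself


@dataclass
class RedundancyGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: set = field(default_factory=set)    # (from_node_id, to_node_id)


def build_initial_graph(documents):
    """Create one node per paragraph and link nodes in document order.

    documents: dict mapping a document name to its list of paragraph strings.
    """
    graph = RedundancyGraph()
    next_id = 0
    for doc_name, paragraphs in documents.items():
        previous = None
        for index, text in enumerate(paragraphs):
            paragraph_id = (doc_name, index)  # assumed identifier format
            graph.nodes[next_id] = Node(next_id, [paragraph_id], label=text[:40])
            if previous is not None:
                graph.edges.add((previous, next_id))
            previous = next_id
            next_id += 1
    return graph


# Hash table keyed by a pair of paragraph identifiers; the value is the
# probability that the two paragraphs share content (values are made up).
match_table = {
    (("a.txt", 0), ("b.txt", 1)): 0.92,
    (("a.txt", 1), ("b.txt", 2)): 0.55,
}

graph = build_initial_graph({
    "a.txt": ["Install the toner.", "Close the cover."],
    "b.txt": ["Open the cover.", "Install the toner.", "Shut the cover."],
})
```

How the probability values are computed is not specified in the text above, so the sketch treats them as given input.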

Patent Claims

18 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for constructing a document redundancy graph, said method comprising: representing each paragraph associated with a document set as a node among a plurality of nodes, wherein each node among said plurality of nodes with respect to said redundancy graph represents a unique cluster of information related to said each paragraph; providing said each paragraph with a unique paragraph identifier; constructing a hash table of all paragraph identifiers comprising identifiers of all paragraphs reachable from said each paragraph; merging said plurality of nodes associated with redundant information by configuring said hash table with respect to a pair of paragraph identifiers in association with a probability value, wherein said probability value sorts a plurality of information matches in an order of decreasing certainty of common content, wherein a pair of said paragraph identifiers associated with an increased certainty of common content are selected to merge; and combining said plurality of nodes unique to a single document by expressing a pair of nodes with overlapping common content as a combined node, wherein said combined node comprises an empty intersection of said pair of nodes and comparing each paragraph identifier among said pair of paragraph identifiers to a probability value associated with an entry in said hash table in an order wherein said hash table eliminates inconsistency associated with said plurality of information matches.

2. The method of claim 1 further comprising configuring at least one paragraph identifier among said pair of paragraph identifiers to include a list of identifiers associated with at least one information element.

3. The method of claim 1 wherein merging said plurality of nodes associated with said redundant information further comprises: combining said plurality of nodes into a single node if an intersection of said document set reachable from each node is empty.

4. The method of claim 1 wherein merging said plurality of nodes associated with said redundant information further comprises: updating said hash table that describes information combinations after combining a pair of nodes.

5. The method of claim 1 wherein combining said plurality of nodes unique to said single document further comprises: setting a flag to indicate said node is a combined node if said hash table comprises said node.

6. The method of claim 1 wherein combining said plurality of nodes unique to said single document further comprises: initiating a chain node if said node follows said combined node by checking said flag in order to thereafter clear said flag.

7. The method of claim 6 wherein combining said plurality of nodes unique to said single document further comprises: adding said node to said chain node if said paragraph does not follow said combined node.

8. The method of claim 6 further comprising adding an edge to said redundant graph for every transition from said chain node to said combined node and vice versa.

9. The method of claim 1 further comprising linking said plurality of nodes with respect to said at least one paragraph via a hyperlink.

10. The method of claim 1 further comprising linking said plurality of nodes with respect to said at least one paragraph via a label.

11. The method of claim 10 wherein said label comprises at least one of the following types of data: a cryptic paragraph identifier; a summary associated with said paragraph; or a paragraph content.

12. A system for constructing a document redundancy graph, said system comprising: a processor; a data bus coupled to said processor; and a computer-usable mass storage device embodying computer code, said computer-usable medium being coupled to said data bus, said computer program code comprising instructions executable by said processor and configured for: representing each paragraph associated with a document set as a node among a plurality of nodes, wherein each node among said plurality of nodes with respect to said redundancy graph represents a unique cluster of information related to said each paragraph; providing said each paragraph with a unique paragraph identifier; constructing a hash table of all paragraph identifiers comprising identifiers of all paragraphs reachable from said each paragraph; merging said plurality of nodes associated with redundant information by configuring said hash table with respect to a pair of paragraph identifiers in association with a probability value, wherein said probability value sorts a plurality of information matches in an order of decreasing certainty of common content, wherein a pair of said paragraph identifiers associated with an increased certainty of common content are selected to merge; and combining said plurality of nodes unique to a single document by expressing a pair of nodes with overlapping common content as a combined node, wherein said combined node comprises an empty intersection of said pair of nodes and comparing each paragraph identifier among said pair of paragraph identifiers to a probability value associated with an entry in said hash table in an order wherein said hash table eliminates inconsistency associated with said plurality of information matches.

13. The system of claim 12 wherein said instructions are further configured for modifying at least one paragraph identifier among said pair of paragraph identifiers to include a list of identifiers associated with at least one information element.

14. The system of claim 12 wherein said instructions are further configured for adding an edge to said redundant graph for every transition from said chain node to said combined node and vice versa.

15. The system of claim 12 wherein said instructions are further configured for linking said plurality of nodes with respect to said at least one paragraph via a hyperlink.

16. The system of claim 12 wherein said instructions are further configured for linking said plurality of nodes with respect to said at least one paragraph via a label.

17. The system of claim 16 wherein said label comprises at least one of the following types of data: a cryptic paragraph identifier; a summary associated with said paragraph; or a paragraph content.

18. A computer-usable mass storage for constructing a document redundancy graph, said computer-usable mass storage storing computer program code, said computer program code comprising program instructions executable by a processor, said program instructions comprising: program instructions to represent each paragraph associated with a document set as a node among a plurality of nodes, wherein each node among said plurality of nodes with respect to said redundancy graph represents a unique cluster of information related to said each paragraph; program instructions to provide said each paragraph with a unique paragraph identifier; program instructions to construct a hash table of all paragraph identifiers comprising identifiers of all paragraphs reachable from said each paragraph; program instructions to merge said plurality of nodes associated with redundant information by configuring said hash table with respect to a pair of paragraph identifiers in association with a probability value, wherein said probability value sorts a plurality of information matches in an order of decreasing certainty of common content, wherein a pair of said paragraph identifiers associated with an increased certainty of common content are selected to merge; and program instructions to combine said plurality of nodes unique to a single document by expressing a pair of nodes with overlapping common content as a combined node, wherein said combined node comprises an empty intersection of said pair of nodes and comparing each paragraph identifier among said pair of paragraph identifiers to a probability value associated with an entry in said hash table in an order wherein said hash table eliminates inconsistency associated with said plurality of information matches.
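
The merging step recited in claims 1 and 3 can be read as a greedy procedure: visit the information matches in order of decreasing probability, and combine two nodes only when the sets of documents reachable from them do not intersect. The sketch below follows that reading. It is an interpretation, not the patented implementation, and the function name `merge_redundant` and both input shapes are assumptions made for this example.

```python
def merge_redundant(paragraph_docs, matches):
    """Greedily merge paragraphs that are likely to share content.

    paragraph_docs: dict mapping paragraph id -> name of its document.
    matches: dict mapping (paragraph id, paragraph id) -> probability.
    Returns a dict mapping each paragraph id to the set of paragraph ids
    that ended up in the same node.
    """
    # Every paragraph starts out as its own node.
    node_of = {pid: {pid} for pid in paragraph_docs}

    # Most certain matches first, so they win over weaker conflicting ones.
    ordered = sorted(matches.items(), key=lambda item: item[1], reverse=True)
    for (a, b), _probability in ordered:
        node_a, node_b = node_of[a], node_of[b]
        if node_a is node_b:
            continue  # already merged through an earlier match
        docs_a = {paragraph_docs[p] for p in node_a}
        docs_b = {paragraph_docs[p] for p in node_b}
        if docs_a & docs_b:
            # Merging would place two paragraphs of one document in the
            # same node, which contradicts a more certain earlier match.
            continue
        merged = node_a | node_b
        for p in merged:
            node_of[p] = merged  # update the table after combining
    return node_of


paragraph_docs = {"A1": "A", "A2": "A", "B1": "B", "C1": "C"}
matches = {("A1", "B1"): 0.9, ("A2", "C1"): 0.8, ("B1", "A2"): 0.6}
result = merge_redundant(paragraph_docs, matches)
# The 0.6 match is rejected: both nodes already reach document "A".
```

Processing matches in sorted order is what lets the table resolve inconsistencies: a weak match can never undo a stronger one, it can only be skipped.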

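Claims 5 through 8 describe walking a single document, setting a flag at each combined node, starting a chain node for the first unique paragraph after a combined node, and adding edges at each transition. A minimal sketch of one possible reading follows. The name `link_document`, the `chain:` naming scheme, and the `combined_of` lookup table are hypothetical, and treating the start of the document like the position after a combined node is an assumption of this example.

```python
def link_document(paragraph_ids, combined_of):
    """Group a document's unique paragraphs into chain nodes and add edges.

    paragraph_ids: paragraph ids of one document, in reading order.
    combined_of: dict mapping a merged paragraph id -> its combined node name.
    Returns (chains, edges): chain node name -> paragraph ids, and a list
    of (from_node, to_node) transitions.
    """
    chains = {}
    edges = []
    after_combined = True  # flag; assumed set at the start of the document
    current = None         # node visited for the previous paragraph
    chain = None           # chain node currently being extended

    for pid in paragraph_ids:
        if pid in combined_of:
            node = combined_of[pid]
            after_combined = True          # set the flag
        elif after_combined:
            chain = f"chain:{pid}"         # initiate a new chain node
            chains[chain] = [pid]
            node = chain
            after_combined = False         # clear the flag
        else:
            chains[chain].append(pid)      # extend the current chain node
            node = chain
        # One edge per transition between distinct nodes. Note that this
        # sketch also links two different combined nodes that are adjacent.
        if current is not None and node != current:
            edges.append((current, node))
        current = node
    return chains, edges


chains, edges = link_document(
    ["p0", "p1", "p2", "p3", "p4"],
    {"p2": "combined:X"},
)
# chains == {"chain:p0": ["p0", "p1"], "chain:p3": ["p3", "p4"]}
# edges  == [("chain:p0", "combined:X"), ("combined:X", "chain:p3")]
```
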
Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F

Patent Metadata

Filing Date: July 31, 2009

Publication Date: December 16, 2014
