A document processor includes a parser that parses a document using a grammar having a set of terminal elements for labeling leaves, a set of non-terminal elements for labeling nodes, and a set of transformation rules. The parsing generates a parsed document structure including terminal element labels for fragments of the document and a nodes tree linking the terminal element labels and conforming with the transformation rules. An annotator annotates the document with structural information based on the parsed document structure.
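As an illustration only, the grammar and parsed document structure described above might be represented as in the following Python sketch. The label names (`title`, `author`, `para`, `Doc`, `Head`, `Body`), the `Node` class, and the `conforms` helper are hypothetical and are not taken from the patent; the sketch only shows what it means for a tree of nodes to conform with a set of transformation rules.

```python
from dataclasses import dataclass, field

# Hypothetical label sets and rules, chosen only for illustration.
TERMINALS = {"title", "author", "para"}      # labels for leaves (fragments)
NON_TERMINALS = {"Doc", "Head", "Body"}      # labels for interior nodes
RULES = {                                    # transformation rules: parent -> children
    ("Doc", ("Head", "Body")),
    ("Head", ("title", "author")),
    ("Body", ("para", "Body")),
    ("Body", ("para",)),
}

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # empty for leaves
    text: str = ""                                # fragment text, leaves only

def conforms(node):
    """True if every leaf is a terminal and every interior node matches a rule."""
    if not node.children:
        return node.label in TERMINALS
    rule = (node.label, tuple(child.label for child in node.children))
    return rule in RULES and all(conforms(child) for child in node.children)

# A small parsed document structure: leaves carry fragments, nodes link them.
tree = Node("Doc", [
    Node("Head", [Node("title", text="A Title"), Node("author", text="A. Writer")]),
    Node("Body", [Node("para", text="Only paragraph.")]),
])
```

Here `conforms(tree)` holds because each parent and its children match one of the assumed rules; a tree such as `Doc` with a single `title` child would not.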
Legal claims defining the scope of protection, as filed with the USPTO.
1. A document processor stored in a non-transitory medium comprising: a probabilistic classifier that classifies fragments of an input document respective to a set of terminal elements by assigning probability values for the fragments corresponding to elements of the set of terminal elements; a parser that defines a parsed document structure associating the input document fragments with terminal elements connected by links of non-terminal elements conforming with a probabilistic grammar defining transformation rules operating on elements selected from the set of terminal elements and a set of non-terminal elements, the parsed document structure being used to organize the input document, the parser including a joint probability optimizer that optimizes the parsed document structure respective to a joint probability of (i) the probability values of the associated terminal elements and (ii) probabilities of the connecting links of non-terminal elements derived from the probabilistic grammar; a classifier trainer that trains the probabilistic classifier respective to a set of training documents having pre-classified fragments; and a grammar derivation module that derives the probabilistic grammar from the set of training documents, each training document having a pre-assigned parsed document structure associating fragments of the training document with terminal elements connected by links of non-terminal elements.
2. The document processor as set forth in claim 1, wherein the probabilistic grammar is a probabilistic context-free grammar and the joint probability optimizer employs a modified inside/outside optimization.
3. The document processor as set forth in claim 1, wherein the computer is further programmed to implement: an XML document converter that converts the input document to an XML document having an XML structure generated in accordance with the parsed document structure.
4. The document processor as set forth in claim 3, wherein the XML document includes a DTD based on the probabilistic grammar.
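The joint optimization recited in claim 1 and the XML conversion of claim 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it uses a Viterbi-style CYK search (taking the maximum over derivations) rather than the modified inside/outside optimization of claim 2, it assumes the grammar is given in binary (Chomsky normal) form, and the rule probabilities, label names, classifier outputs, and the functions `best_parse` and `to_xml` are all hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical binary PCFG: (parent, left child, right child) -> rule probability.
BINARY_RULES = {
    ("Doc", "title", "Body"): 1.0,
    ("Body", "para", "Body"): 0.6,
    ("Body", "para", "para"): 0.4,
}

def best_parse(fragments, leaf_probs, root="Doc"):
    """Return (log_prob, tree) maximising classifier probs times rule probs."""
    n = len(fragments)
    chart = {}  # chart[(i, j)][label] = (log_prob, tree) covering fragments[i:j]
    for i, probs in enumerate(leaf_probs):
        # Leaves: terminal labels scored by the (assumed) classifier output.
        chart[(i, i + 1)] = {
            label: (math.log(p), (label, fragments[i]))
            for label, p in probs.items() if p > 0
        }
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):  # split point between left and right child
                for (parent, left, right), p in BINARY_RULES.items():
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        lp_l, t_l = chart[(i, k)][left]
                        lp_r, t_r = chart[(k, j)][right]
                        score = math.log(p) + lp_l + lp_r  # joint log-probability
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (parent, t_l, t_r))
            chart[(i, j)] = cell
    return chart.get((0, n), {}).get(root)

def to_xml(tree):
    """Convert a parse tree into an XML element mirroring its structure."""
    elem = ET.Element(tree[0])
    if len(tree) == 2:            # leaf: (label, fragment text)
        elem.text = tree[1]
    else:                         # interior node: (label, left tree, right tree)
        for child in tree[1:]:
            elem.append(to_xml(child))
    return elem

# Made-up fragments and classifier probabilities over terminal elements.
fragments = ["Parsing Documents", "First paragraph.", "Second paragraph."]
leaf_probs = [
    {"title": 0.7, "para": 0.3},
    {"title": 0.4, "para": 0.6},
    {"title": 0.1, "para": 0.9},
]
log_prob, parse = best_parse(fragments, leaf_probs)
xml_text = ET.tostring(to_xml(parse), encoding="unicode")
```

With these made-up numbers, the best structure labels the first fragment `title` and the other two `para`, with joint probability 0.7 × 0.6 × 0.9 × 0.4 × 1.0 = 0.1512. The point of the joint score is that the grammar can overrule a fragment's locally most probable label when the resulting structure as a whole is more probable.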
June 29, 2005
September 24, 2013