A system and method for converting legacy and proprietary documents into extensible mark-up language (XML) format, treating the conversion as a transformation of ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the conversion task into three components: path re-labeling, structural composition, and input tree traversal, each of which involves learning approaches. Transforming an input tree into an output tree may involve labeling components of the input tree with valid labels or paths from a particular output schema, composing the labeled elements into an output tree with a valid structure, and finding a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.
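The three components named in the abstract can be illustrated with a minimal Python sketch. It is not the patented learning method: the path mapping is hand-written here rather than learned, and `PATH_MAP`, `leaf_paths`, `compose`, and `transform` are hypothetical names chosen for illustration. The sketch assumes the source is already well-formed markup, that every output path shares the same root tag, and it ignores tail text for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written re-labeling function from input paths to
# output-schema paths. In the patent this mapping is learned; here it is fixed.
PATH_MAP = {
    "html/body/h1": "article/title",
    "html/body/p/b": "article/author",
    "html/body/p": "article/section/para",
}


def leaf_paths(node, prefix=""):
    """Traverse the input tree in document order.

    Yields (path, text) for every node that carries non-empty text.
    Tail text is ignored in this sketch.
    """
    path = f"{prefix}/{node.tag}" if prefix else node.tag
    text = (node.text or "").strip()
    if text:
        yield path, text
    for child in node:
        yield from leaf_paths(child, path)


def compose(labeled):
    """Compose labeled leaves into an output tree.

    Intermediate elements are created on demand and reused when present.
    Assumes all output paths start with the same root tag.
    """
    root = None
    for out_path, text in labeled:
        tags = out_path.split("/")
        if root is None:
            root = ET.Element(tags[0])
        node = root
        for tag in tags[1:-1]:
            found = node.find(tag)
            node = found if found is not None else ET.SubElement(node, tag)
        ET.SubElement(node, tags[-1]).text = text
    return root


def transform(source_xml):
    """Traverse, re-label, and compose; unmapped paths are dropped."""
    src = ET.fromstring(source_xml)
    labeled = [(PATH_MAP[p], t) for p, t in leaf_paths(src) if p in PATH_MAP]
    return compose(labeled)
```

Calling `transform` on a small HTML-like string and serializing the result with `ET.tostring(result, encoding="unicode")` shows how re-labeled leaves are grouped and nested under the output schema's elements.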
1. A method for converting a legacy or proprietary document into extensible mark-up language format, comprising: inputting a document having a proprietary format; converting the proprietary format document to a document having a standard representation format; preparing a structured representation of the standard representation format document with an exemplary source document schema; determining an exemplary target document schema; preparing a structured representation conforming with the exemplary target document schema; annotating a subset of the standard representation format document to define exemplary target structured representations that satisfy the exemplary target document schema; decomposing a two-dimensional representation of the source document schema using multiple one-dimensional methods; developing translation rules to instruct a parser to visit the two-dimensional representation of the source document schema to group and/or nest labeled elements in an exemplary output document schema; preparing a source document schema structured representation of the standard representation format source document; applying the translation rules to visit the two-dimensional representation of the source document schema and group and/or nest labeled elements in an output document in a target structured representation format; and converting the target structured representation format document to an output target representation document format.
2. The method of claim 1, wherein the step of decomposing the two-dimensional structured representation of the source document schema using multiple one-dimensional methods comprises: deriving a set of valid input paths in the 2-D structured representation of the input schema and a set of valid output paths in the 2-D structured representation of the output schema; determining a mapping and/or re-labeling function from the input paths into the output paths, applying suitable classification rules; and determining a mapping and/or re-labeling function from the aforementioned classification rules to structural actions.
3. The method of claim 2, wherein the step of determining a mapping and/or re-labeling function from the aforementioned classification rules to structural actions, comprises: establishing leaf extraction rules (triples) that include one dimensional (1-D) simple path leaf delimiter candidates, associated classification labels, and associated confidence levels; storing path candidates in a candidate index, such as, for example, in the form of a tree data structure, that can store all simple path (1-D) delimiter candidates, including reverse path delimiters, and information sufficient to determine the most discriminative delimiters, and that allows incremental accommodation of new annotated samples; and determining, for each leaf, which leaf delimiter candidate has the highest confidence level; determining a classification label for the leaf content based on the leaf delimiter candidate having the highest confidence level; and providing a classification label for each leaf with respect to which leaf delimiter has the highest confidence level to achieve a valid 2-D tree according to the output schema.
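The leaf extraction rules of claim 3 can be pictured with a short Python sketch. Several choices here are assumptions rather than anything the claims specify: the candidate index is a plain dictionary instead of a tree data structure, delimiter candidates are taken to be the suffixes of a leaf's root-to-leaf path, and confidence is the relative frequency of a label among the annotated samples that share a delimiter. `CandidateIndex` and its methods are hypothetical names.

```python
from collections import Counter, defaultdict


class CandidateIndex:
    """Stores delimiter candidates and label counts from annotated samples."""

    def __init__(self):
        # delimiter candidate -> Counter of labels observed with it
        self.counts = defaultdict(Counter)

    @staticmethod
    def candidates(path):
        """Return all suffixes of a path, shortest first ('b', 'p/b', ...)."""
        tags = path.split("/")
        return ["/".join(tags[i:]) for i in range(len(tags) - 1, -1, -1)]

    def add_sample(self, path, label):
        """Incrementally accommodate one new annotated leaf."""
        for cand in self.candidates(path):
            self.counts[cand][label] += 1

    def rules(self):
        """Yield (delimiter, label, confidence) triples."""
        for cand, labels in self.counts.items():
            total = sum(labels.values())
            for label, n in labels.items():
                yield cand, label, n / total

    def classify(self, path):
        """Return the highest-confidence triple for a leaf, or None.

        On ties the shorter delimiter wins, because candidates are
        visited shortest first and only a strictly higher score replaces it.
        """
        best = None
        for cand in self.candidates(path):
            labels = self.counts.get(cand)
            if not labels:
                continue
            label, n = labels.most_common(1)[0]
            rule = (cand, label, n / sum(labels.values()))
            if best is None or rule[2] > best[2]:
                best = rule
        return best
```

With samples such as `("html/body/p", "para")` and `("html/body/div/p", "footnote")`, the bare delimiter `p` is ambiguous, so `classify` falls back to the longer and more discriminative candidates `body/p` and `div/p`.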
January 14, 2004
January 16, 2007