A system for improving retrieval augmentation for information extraction systems deployed over multiple contexts or fields. Representative documents for the context are obtained and used to modify the vector embedding. The documents may be used to generate a frequency term related to the frequency of a word term within the context and an inverse prevalence term related to how unique it is for a document to include the word term. The frequency term and the inverse prevalence term are combined into a weight for the respective word term. The weights are used to generate a vector text embedding for portions of the document by calculating a weighted average of the word terms in the portion of the document. Retrieval of relevant documents is based on the semantic comparison of the vector text embeddings and an embedding for the information to be extracted and is tailored to the context using the weights.
Legal claims defining the scope of protection, as filed with the USPTO.
. A system for information extraction using a retrieval augmentation architecture, the system comprising one or more processing circuits configured to:
. The system of, wherein the one or more processing circuits are configured to determine each weight of the plurality of weights based on a frequency element related to a number of times the corresponding word term is present in the one or more documents associated with the extraction group.
. The system of, wherein the one or more processing circuits are configured to determine each weight of the plurality of weights based on an information element related to a quantity of the one or more documents associated with the extraction group that include the corresponding word term.
. The system of, wherein the one or more processing circuits are configured to determine each weight of the plurality of weights based on at least one of:
. The system of, wherein the one or more processing circuits are configured to determine each weight of the plurality of weights based on at least one of:
. The system of, wherein the one or more processing circuits are configured to determine each weight of the plurality of weights based on a magnitude of an embedding vector for the corresponding word term from the plurality of embedding vectors.
. The system of, wherein the one or more processing circuits are configured to normalize an embedding vector for the corresponding word term from the plurality of embedding vectors.
. The system of, wherein the one or more processing circuits are configured to retrieve the one or more relevant chunks of the one or more chunks by performing a keyword search, wherein a quantity of matches in the keyword search is scaled by the weight for the corresponding word term.
. The system of, wherein the extraction data field is one of a plurality of extraction data fields of the extraction group related by a common purpose.
. The system of, wherein the one or more processing circuits are configured to:
. A method for information extraction using a retrieval augmentation architecture, the method comprising:
. The method of, wherein each weight of the plurality of weights is based on a frequency element related to a number of times the corresponding word term is present in the one or more documents associated with the extraction group.
. The method of, wherein each weight of the plurality of weights is based on an information element related to a quantity of the one or more documents associated with the extraction group that include the corresponding word term.
. The method of, wherein each weight of the plurality of weights is based on at least one of:
. The method of, wherein each weight of the plurality of weights is based on at least one of:
. The method of, wherein each weight of the plurality of weights is based on a magnitude of an embedding vector, from the plurality of embedding vectors, for the corresponding word term.
. The method of, further comprising retrieving, by the one or more processors, the one or more relevant chunks of the one or more chunks by performing a keyword search, wherein a quantity of matches in the keyword search is scaled by the weight for the corresponding word term.
. The method of, further comprising:
. A system for information extraction using a retrieval augmentation architecture, the system comprising one or more processing circuits configured to:
. The system of, wherein the one or more processing circuits are configured to determine each weight of the plurality of weights based on at least one of:
Complete technical specification and implementation details from the patent document.
This disclosure generally relates to using language models to extract information.
Retrieval augmented generation (RAG) systems can be integrated with large language models (LLMs) to provide context to the LLM. RAG systems provide documents to a large language model by finding documents having similar semantic meaning as the prompt.
The semantic meaning of words may change based on the context within which they are used. Some technological fields include a number of words that have a common meaning that is significantly different than the meaning of the word when used within the context of the technological field.
An embodiment relates to a system for information extraction using a retrieval augmentation architecture. The system includes one or more processing circuits configured to determine a plurality of weights based on one or more documents of a plurality of stored first documents associated with an extraction group, wherein each weight of the plurality of weights indicates an importance of a corresponding word term in the extraction group. The one or more processing circuits are also configured to generate, for each chunk of one or more chunks from one or more second documents, a vector text embedding based on a plurality of embedding vectors from an embedding model for a plurality of respective word terms of the chunk, wherein each one of the plurality of embedding vectors is multiplied by a weight of the plurality of weights for the corresponding word term. The one or more processing circuits are also configured to retrieve one or more relevant chunks of the one or more chunks by comparing each vector text embedding to an extraction embedding for an extraction data field of the extraction group, wherein the extraction embedding is generated using the embedding model and the plurality of weights, and to store a response from a language model to a prompt that includes a request to extract the extraction data field and the one or more relevant chunks. This summary is illustrative only and not intended to be limiting.
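The weighted embedding and comparison described in this summary can be illustrated with a minimal sketch. It assumes a toy two-dimensional embedding table, a default weight of 1.0 for terms without a weight, and cosine similarity as the comparison; the names `weighted_chunk_embedding`, `cosine`, `word_vectors`, and `weights` are hypothetical and are not taken from the disclosure.

```python
import math


def weighted_chunk_embedding(terms, word_vectors, weights, default_weight=1.0):
    """Weighted average of per-term embedding vectors for one chunk.

    Assumption: terms missing from `weights` get `default_weight`, and terms
    missing from `word_vectors` are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term in terms:
        vec = word_vectors.get(term)
        if vec is None:
            continue  # term unknown to the toy embedding table
        w = weights.get(term, default_weight)
        for i, x in enumerate(vec):
            total[i] += w * x  # each embedding vector is multiplied by its weight
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data: both the vectors and the weights are made up for illustration.
word_vectors = {"premium": [1.0, 0.0], "quality": [0.0, 1.0], "payment": [1.0, 0.0]}
weights = {"premium": 3.0, "quality": 1.0}

chunk_vec = weighted_chunk_embedding(["premium", "quality"], word_vectors, weights)
field_vec = weighted_chunk_embedding(["premium", "payment"], word_vectors, weights)
score = cosine(chunk_vec, field_vec)
```

The same weights are applied to the chunk and to the extraction field so that both embeddings live in the same tailored space before they are compared.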
Different types of businesses often carefully curate and extract a large volume of documents. For example, a large set of insurance documents or accounting documents may be sent to an insurance broker or a tax preparer, who then has the task of identifying and extracting relevant information from the documents. To improve efficiency, businesses have tried to automate this workflow by incorporating template-based information extraction or using rigid, rule-based methods that assume a specific structure within a document. For example, businesses often perform optical character recognition that uses the expected positioning of text on a document to both identify the document type and to further extract and annotate data from that document. As another example, businesses may assume that a document has text that is in a name-value pair format.
Template-based approaches may use trained humans to create each template. A human with detailed knowledge of the optical character recognition (OCR) system and/or document variability must review every document to specifically create sets of rules detailing exactly how to extract data from each of the documents. Template-based approaches also usually require trained humans to maintain each template. However, templates degrade in performance as documents change. While some variability can be explicitly declared in the template, any unaccounted-for changes usually require humans to modify a template to account for the differences or to create a new template.
Moreover, template-based approaches may require a multiplicity of templates to support multiple fields of use, groups of tasks, industries, sectors, projects, lines of business, etc. For example, each project may use different information that is to be extracted and/or identified. Supporting multiple templates for similar information can lead to significant development and hardware resource inefficiencies. For example, it may be necessary to develop, test, store, execute, and maintain rules for extracting the information from each template (e.g., for each project, etc.), leading to long development times and large, complex architectures with significant storage requirements.
An enhanced system may use a retrieval-augmented generation (RAG) architecture in order to extract data using language models (LMs). Documents from which data is to be extracted are ingested into the RAG architecture and searched semantically for portions of the documents that may relate to the information to be extracted. The portions (e.g., chunks) identified as relevant to the information may be provided to the LM for final extraction. RAG-based approaches often make use of two models (e.g., an embedding model and a generative model). The embedding model is used to generate an index for retrieval of portions of the documents. The index may include a vector embedding for each chunk of the documents that can be compared to a vector embedding of a prompt (e.g., a request to extract a particular data element) for retrieval. The prompt and any relevant chunks retrieved may be provided to the generative model, for example, a large language model (LLM), which extracts the information requested in the prompt.
Several technological challenges may arise from RAG-based information extraction. First, in order to generate an index for retrieval of the portions of the documents, it is important that the portions include the relevant information for information extraction. Documents including tables and/or other visual information can cause issues or induce failures. For example, if a document with a table is presented to an OCR system to generate the chunks, the text and the table may be intertwined in a single text-based output (e.g., in markdown language). Chunks generated directly from such text may lose semantic meaning as the textual and table-based information may refer to different topics and/or the sentence structure may be broken. Additionally, in a multiple-select question, a question may be posed at the top of a page along with several potential answers. If the possible selections are long enough, the chunking strategy may split some of the answers from the question, producing multiple chunks for a single question. A chunk may be retrieved based on the semantic analysis of the question but not include the answer. In some scenarios, the LLM may realize the information is not provided and extraction may fail, or the LLM may generate a result based on its pre-training that is inaccurate and not supported by any of the documents (i.e., the LLM may hallucinate a response). Additionally, the chunk may include a potential response that was not selected. Without additional context, the LLM may extract an inaccurate response from the potential answer that was not selected.
Second, when a RAG-based information extraction system supports multiple fields of use, groups of tasks, industries, sectors, projects, lines of business, etc., embedding models may become inefficient. For example, a word may have a first meaning in one industry and a second meaning in another industry or some words may be highly relevant to one industry and less relevant to another industry. The word “premium,” for example, may mean high quality in a typical context, but may refer to a cost or payment in certain financial contexts. The word premium may additionally be more relevant for data extraction in some industries (e.g., in the insurance industry or other financial industries).
The present disclosure improves the technological field of RAG-based generative artificial intelligence (AI) systems with a hybrid RAG approach. Documents may be provided to an OCR system and may be broken into chunks. Each chunk may also be stored with a mapping to the page, document, image-based file, etc. that includes the text from the chunk. When a chunk is retrieved (e.g., by semantic search and/or keyword search) the mapping can be used to retrieve the image or image-based file having the text from the chunk (e.g., the whole page, etc.). The image may then be provided to the MMLM for data extraction.
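As a rough illustration of the chunk-to-page mapping, the sketch below stores a document identifier and page number with each chunk and resolves retrieved chunks to page images. The `Chunk` fields, the `page_images` lookup table, and the file paths are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A text chunk stored together with a mapping back to its source page."""
    chunk_id: str
    text: str
    document_id: str
    page_number: int


def page_images_for(chunks, page_images):
    """Return the unique page images behind the retrieved chunks, in order.

    `page_images` is a hypothetical lookup from (document_id, page_number)
    to the path of the image-based file for that page.
    """
    seen = set()
    result = []
    for chunk in chunks:
        key = (chunk.document_id, chunk.page_number)
        if key in seen:
            continue  # several chunks may come from the same page
        seen.add(key)
        if key in page_images:
            result.append(page_images[key])
    return result


page_images = {
    ("app-1", 2): "app-1/page-2.png",
    ("app-1", 3): "app-1/page-3.png",
}
retrieved = [
    Chunk("c7", "Do you own the building?", "app-1", 2),
    Chunk("c8", "Yes / No", "app-1", 2),
    Chunk("c9", "Number of employees", "app-1", 3),
]
images = page_images_for(retrieved, page_images)
```

Deduplicating by page means a page is sent to the MMLM once even when several of its chunks are retrieved.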
Improvements using the hybrid approach are two-fold. First, the hybrid approach improves traditional RAG approaches by using an MMLM to extract information. The MMLM can extract information from context related to the layout of the text and/or images on the screen. In addition, the MMLM can derive information from non-characters within the document. For example, the MMLM may determine the selection from a set of possible predefined answers or responses (e.g., based on markings on the document) and extract information included in the selected response. The hybrid approach using an MMLM can significantly reduce the number of inaccurate extractions (e.g., hallucinations, incorrect data, etc.). Second, the hybrid approach avoids the computationally demanding need to generate an index using an MMLM in enhanced RAG systems using MMLMs. In particular, pages, images, forms, etc. may only be presented to the MMLM if they are retrieved, potentially eliminating the need to process each document using an MMLM, and saving significant computational effort.
To further lessen the amount of data that is processed by the MMLM and thus reduce the computational effort of the system, the documents and/or the chunks can be flagged (e.g., indicated, marked, etc.) for processing by an MMLM or an LLM. The flag or other indication for MMLM processing may be based on a document type, for example. Chunks from first document types (e.g., specifications, emails, etc.) may indicate LLM processing, whereas chunks from second document types (e.g., tables, questionnaires, bar charts, etc.) may indicate MMLM processing. In some embodiments, documents with tables are converted to a markdown language to facilitate separating the table information from the text-based information. Separation of the text from the tables can improve extraction results from an LLM, improving upon traditional approaches even when the MMLM is not used during extraction.
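One simple way to realize such a flag is a lookup on document type, sketched below. The `Processor` enum, the `MMLM_DOCUMENT_TYPES` set, and the `has_special_structure` parameter are hypothetical; an actual criterion could differ.

```python
from enum import Enum


class Processor(Enum):
    LLM = "llm"
    MMLM = "mmlm"


# Assumed set of document types that benefit from image-based processing,
# loosely following the examples in the text (tables, questionnaires, bar charts).
MMLM_DOCUMENT_TYPES = {"table", "questionnaire", "bar_chart"}


def flag_chunk(document_type, has_special_structure=False):
    """Choose the language model type that should process a chunk."""
    if document_type in MMLM_DOCUMENT_TYPES or has_special_structure:
        return Processor.MMLM
    return Processor.LLM


email_flag = flag_chunk("email")
questionnaire_flag = flag_chunk("questionnaire")
```

The flag would be stored with the chunk at ingestion so that the more expensive model is only invoked for chunks that need it.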
Additional technological solutions are disclosed for systems that support multiple fields of use, groups of tasks, industries, sectors, projects, lines of business, etc. Systems and methods described herein utilize embeddings tailored to the industries, etc. supported. In some embodiments, extraction groups are maintained that include a customized chunking strategy, an embedding strategy, word weights, a retrieval strategy, an extraction strategy, and/or other configuration parameters that adjust the functioning of the RAG architecture to a particular industry. Advantageously, each industry may ingest documents using a different tailored word embedding based on the relative importance of words to that particular industry, thereby creating a searchable index for chunk retrieval for the industry. The same tailored word embedding may be used to embed the prompt and perform a semantic comparison during retrieval of the document or portion thereof. Documents indexed and retrieved using a tailored embedding that takes into account word meaning or importance are more likely to be relevant, leading to improved accuracy. Further, with improved accuracy, it may be possible to reduce the number of documents or portions thereof provided to the LLM and/or the MMLM while still maintaining high extraction accuracy, thereby reducing the computations performed by the language models.
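The word weights for an extraction group could be derived from representative documents along the lines of the sketch below. The exact formula is an assumption: it multiplies a relative term frequency by a logarithmic inverse prevalence term, in the style of TF-IDF, and the disclosure may combine the two terms differently. The function name `term_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def term_weights(documents):
    """Compute one weight per word term for an extraction group.

    `documents` is a list of token lists. Assumed formula:
    weight = (term count / total terms) * log(number of docs / docs containing term).
    """
    n_docs = len(documents)
    term_counts = Counter()
    doc_counts = Counter()
    for tokens in documents:
        term_counts.update(tokens)      # how often the term appears overall
        doc_counts.update(set(tokens))  # how many documents include the term
    total_terms = sum(term_counts.values())
    weights = {}
    for term, count in term_counts.items():
        frequency = count / total_terms
        inverse_prevalence = math.log(n_docs / doc_counts[term])
        weights[term] = frequency * inverse_prevalence
    return weights


docs = [
    "the premium is due monthly".split(),
    "the premium covers the policy".split(),
    "the broker sent the form".split(),
]
w = term_weights(docs)
```

With this formula a term that appears in every document, such as "the" in the sample, receives a weight of zero, while terms concentrated in fewer documents receive positive weights.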
As a result of the improvements to the RAG-based generative AI systems and methods described herein, a larger portion of the data to be populated can be accurately determined and extracted from the documents, leading to a reduction in labor associated with data correction. The present disclosure leads to an improvement in the functioning of the computer hardware executing the LLM in the form of enhanced accuracy that reduces the need for reprocessing of prompts and/or retrieval of additional documents thereby reducing computational effort of the LLM.
Data Extraction/Population System
The figure shows a data extraction and population system configured to leverage language models (LMs), for example, one or more large language models (LLMs), one or more multi-modal language models (MMLMs), etc., to extract data from documents and populate data elements (e.g., of a data model, ontological data store, etc.) according to some embodiments. The data extraction and population system is shown to include one or more UI clients, one or more data sources, an OCR system, one or more LLMs, one or more MMLMs, one or more text embedders, and a data extraction manager system communicably connected via a network. The figure shows a non-limiting example of a possible configuration of the data extraction and population system. It is contemplated that the various components of the data extraction and population system may be distributed across discrete systems and/or hardware in different ways. For example, a large language model and a text embedder may be configured within the same hardware or same node in a computer cluster, or the data extraction manager system may be distributed across multiple elements of computer hardware.
In some embodiments, the general operation of the data extraction and population system is to extract data from documents and populate various data elements. The data extraction manager system may gather documents from the one or more data sources and generate a searchable index of documents or portions thereof from the one or more data sources using the text embedder. The index generation may be based on the semantic meaning of the documents from the one or more data sources, allowing comparison between the entries of the index and a prompt for data (e.g., the prompt also embedded by the text embedder). To populate the data elements, the data extraction manager system may generate prompts for the data, identify relevant portions of the documents by searching the index, and provide both the prompt and the relevant portions of the documents to an LM (e.g., the one or more LLMs and/or the one or more MMLMs). The LM may then process the prompt with the provided portions of the document to extract (e.g., identify, parse, summarize, combine, generate, etc.) the data requested by the prompt so that the data extraction manager system can store the data (e.g., in an object, a data model, ontological model, an ontological data store, etc.).
In some embodiments, the index is created (e.g., documents from the one or more data sources are ingested) using the OCR system and the text embedder. These documents, however, may have significant information included within the context of the text. For example, information may be included in the text layout, the relationship between the text and figures, markings, or other visual data, tabular data, etc. After retrieval, the data extraction and population system may be configured to prompt an MMLM of the one or more MMLMs with the document or portion thereof that was determined to include relevant text. In some embodiments, the data extraction and population system stores an indication (e.g., flag, etc.) with the text used to generate the index that indicates if the text is to be processed by an LLM of the one or more LLMs or by an MMLM. Indicating certain text to be processed by the one or more MMLMs or the one or more LLMs provides additional efficiency for the hybrid RAG approach by using the more computationally expensive MMLM only when required.
In some embodiments, the data extraction and population system gathers large amounts of data from the one or more data sources. The one or more data sources may be internal (e.g., on the company intranet) or external (e.g., stored on another company's web server). The one or more data sources may include dedicated databases for particular types of data or webpages from which documents may be compiled, scraped, etc. The one or more data sources may include documents (e.g., files, records, reports, articles, forms, data, etc.). The documents in the database may contain text, tables, columns, rows, charts, graphics, images, and/or other content. The documents may include PDF files or other image-based files for which the text of the document is not readily available for searching, copying, etc. Such image-based files may be processed by the OCR system prior to processing by other components of the data extraction and population system. The documents may include a variety of content such as, for example, in the insurance industry, applications, broker correspondence, financials, summary of claims, historical claims filed under business insurance policies (“Loss Run”), questionnaires, forms, applications, and historical claim losses.
The one or more data sources may include image-based documents. Image-based documents may include text, tables, columns, rows, charts, graphics, images, and/or other content. The content of an image-based document may include location information. The location information may relate to a layout indicating the visual appearance of the document and the respective content. For example, image-based documents may include document images (e.g., photographs of documents, scans of documents, bitmap images, portable network graphics, screenshots, etc.), digital documents that include visual content (e.g., PDFs, word-processing documents, webpages, tables, spreadsheets, etc.), and/or digital documents that are entirely or mostly text but include layouts that convey information (e.g., multi-column formatted documents, technical manuals, resumes, profiles, legal documents, contracts, computer, agendas, transcripts, poems, multiple choice questionnaires, etc.). In some embodiments, the documents are processed a portion at a time (e.g., a paragraph, a column, a page, etc.).
In some embodiments, the one or more data sources may include documents that have been filled in (e.g., completed, etc.) by a person digitally or by hand. For example, the one or more data sources may include surveys, applications, forms, questionnaires, registrations, etc. Such documents may include a request for information and a location for a response. The documents may include a request for information along with a list of predefined and/or selectable answers. The document may include one or more multiple choice questions. For example, the document may include questions with selectable answers on the Likert scale, true/false questions, or selectable numerical ranges. In some embodiments, the document includes a predefined space (e.g., location, area, etc.) within which the respondent is to enter a response.
A respondent may be sent the document (with requests for information) from the one or more data sources. The document may be sent via a postal service, electronic mail, a website, a facsimile machine, etc. The respondent may supply answers to the requests for information in the document electronically and/or in writing. Responses may be provided by entering a response in the predefined space (e.g., digitally or handwritten). In some embodiments, requests with selectable answers (e.g., multiple choice questions) may include responses for which the respondent has marked (e.g., digitally or by hand) the response to the request. For example, the respondent may add a mark proximate the selected response, encircle the selected response, fill in a bubble (e.g., any closed shape such as oval, square, etc.) near the selected response, etc.
In some embodiments, the one or more data sources are configured to receive completed documents (e.g., documents in which the response has been provided) from the respondents. For example, the one or more data sources may include an automated email system in which a received email is automatically processed by the data extraction manager system. Additionally or alternatively, the one or more data sources may include an API to which the respondent can upload a scan, an image, and/or a file of completed documents. In some embodiments, the one or more data sources may notify (e.g., inform, communicate, update, etc.) the data extraction manager system that a new document has been received. For example, the data extraction manager system may subscribe to notifications from the one or more data sources. Additionally or alternatively, the data extraction manager system may periodically poll the one or more data sources to determine if new documents have been received.
The OCR system may be configured to convert the contents of the document to plain text. The OCR system may include, for example, any commercially available OCR system. Additionally or alternatively, the OCR system may be a component of the data extraction manager system (e.g., using available OCR software). The system may use this type of private OCR system for increased security. The text extraction tool may convert an image-based document (e.g., PDF file, PostScript, tagged image file format (TIFF), etc.) to plain text that can be processed by a computer (e.g., the American Standard Code for Information Interchange (ASCII)). In some embodiments, the plain text is stored in a plain text file format for later processing. For example, the plain text may be stored in plain text file formats such as TXT or in markup languages or other text-based formats such as hypertext markup language (HTML), JavaScript Object Notation (JSON), extensible markup language (XML), tau epsilon chi (TeX), etc. JSON is a text format that is completely language independent, but uses conventions that are familiar to programmers. JSON output may also be preferable to plain OCR text because JSON can retain positional relationships in the text (positional encoding).
The documents processed by the OCR system may include non-text-based information (e.g., charts, graphs, trend lines, flow charts, or other graphical elements) and/or special text structures (e.g., tables, rows, columns, etc.). The OCR system may recognize this information as different from the text of the body of the document and may indicate the presence of special structures (e.g., non-text-based information and/or special text structures) in the output.
The OCR system may return output in the JSON text format. The output may include an object for any special structures in the document with a key-value pair for the location of the special structure within the original document. The key-value pair for the location may include, for example, the X-Y position of each of the four corners for each of the tables in the document or the X-Y position of each cell in the tables, or the key-value pair for the location may include the two X limits of the table and the two Y limits of the table. Each PDF analyzed by a text extraction tool may have the same orientation and coordinates. The X-Y positions may describe a table, row structure, column structure, and/or cell structure.
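To make the two location encodings concrete, the sketch below reads a made-up JSON payload that stores the four corners of a table and converts it to the two X limits and two Y limits. The key names (`special_structures`, `location`, `corners`) and the coordinate values are hypothetical and do not reflect the schema of any particular OCR product.

```python
import json

# Hypothetical OCR output: one table described by the X-Y position of its four corners.
ocr_output = json.loads("""
{
  "text": "Summary of claims",
  "special_structures": [
    {"type": "table",
     "location": {"corners": [[72, 100], [540, 100], [540, 260], [72, 260]]}}
  ]
}
""")


def table_limits(structure):
    """Convert a four-corner location into two X limits and two Y limits."""
    corners = structure["location"]["corners"]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return {"x_min": min(xs), "x_max": max(xs), "y_min": min(ys), "y_max": max(ys)}


limits = [
    table_limits(s)
    for s in ocr_output["special_structures"]
    if s["type"] == "table"
]
```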
In some embodiments, the OCR system returns an output with tables inline with the text using a markdown language. The system may use the same markdown symbols to indicate different locations or different markdown symbols to indicate different locations. For example, the first appearance of the markdown symbol indicates the start (or top) of a table and a second appearance of the same markdown symbol indicates the end (or bottom) of the table. The markdown symbols may also indicate a first (e.g., left) side of the table and a second (e.g., right) side of the table. Markdown symbols (e.g., within text) may provide characteristics of the table. The markdown symbols may provide enough information for the system to render the table. For example, the vertical bar or pipe character, ‘|’, may be used to mark the start of a new column within a row of the table, and the vertical bar followed by a newline character (e.g., ‘|\n’) may be used to represent a new row. The markdown language may also use hyphen characters, ‘-’, to separate a header row from a content row within a table. When analyzing the position of each cell, the system may consider each cell as having a single row of text, regardless of the number of lines of text in each cell. For more information about markdown symbols, see www.markdownguide.org/extended-syntax/.
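The pipe and hyphen conventions described above are enough to separate inline tables from body text. The following sketch assumes that every table row starts with a pipe character and that each table has at least a header row; `split_tables` and `parse_table` are hypothetical helper names.

```python
def split_tables(markdown):
    """Separate pipe-delimited table rows from ordinary text lines."""
    text_lines, tables, current = [], [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|"):
            current.append(stripped)  # still inside a table
            continue
        if current:                   # a non-table line closes the table
            tables.append(current)
            current = []
        if stripped:
            text_lines.append(stripped)
    if current:
        tables.append(current)
    return text_lines, tables


def parse_table(rows):
    """Split rows into cells and drop the hyphen row under the header."""
    parsed = []
    for row in rows:
        inner = row[1:] if row.startswith("|") else row
        inner = inner[:-1] if inner.endswith("|") else inner
        cells = [cell.strip() for cell in inner.split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # hyphen row separating the header from the content
        parsed.append(cells)
    return parsed[0], parsed[1:]


markdown = (
    "Loss history follows.\n"
    "| Year | Losses |\n"
    "| --- | --- |\n"
    "| 2022 | 3 |\n"
    "| 2023 | 1 |\n"
    "No open claims."
)
text_lines, tables = split_tables(markdown)
header, rows = parse_table(tables[0])
```

Keeping the table rows apart from the text lines lets each be chunked and embedded separately.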
In some embodiments, the OCR system returns an output in a first format, and the data extraction manager system may convert the text into a second format (e.g., a common format) prior to processing by other components of the data extraction and population system. For example, the data extraction manager system may convert the JSON output (e.g., with location data) to markdown language that includes markdown symbols. The JSON output may be translated to markdown text indicating one or more boundaries of the table. Modularity is provided by converting to a common text format (e.g., the markdown language), allowing the data extraction and population system to substitute various other OCR systems if there is a cost advantage, computational advantage, or an improvement by one provider of OCR technology.
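A conversion from a JSON table object to markdown text might look like the sketch below. The input shape, a dictionary with `header` and `rows` keys, is an assumption chosen for illustration; a real OCR provider's JSON would need its own adapter in front of this step.

```python
def table_to_markdown(table):
    """Render a hypothetical {"header": [...], "rows": [[...], ...]} table as markdown."""
    header = [str(cell) for cell in table["header"]]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",  # hyphen row under the header
    ]
    for row in table["rows"]:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


markdown_table = table_to_markdown(
    {"header": ["Year", "Losses"], "rows": [[2022, 3], [2023, 1]]}
)
```

Because downstream components only see the common markdown form, swapping OCR providers would only require replacing the adapter that produces the input dictionary.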
In some embodiments, the OCR system is configured to recognize a layout of a document being processed (e.g., ingested, etc.). For example, the document may have more than one column and/or switch between different layout types (e.g., one column to two columns). Recognizing the layout of the document may allow the OCR system to recognize characters and convert them to text in reading order. The OCR system may maintain the semantic content included in word ordering by recognizing such layouts and adjusting appropriately. The OCR system may be configured to recognize figures. The OCR system may not extract any text from figures. For example, text from within a figure may not share semantic meaning with nearby text. Retrieval could be compromised because the text from the figure may be incorrectly included in determining a vector embedding for the text. Additionally or alternatively, the text from figures may be included. In some embodiments, the data extraction manager system can select if text from figures should or should not be included in the output from the OCR system. For example, the data extraction manager system may determine if text from figures is to be included in the output from the OCR system based on document type and/or downstream processing selections (e.g., if the document will be processed by an MMLM).
In some embodiments, the OCR system is able to distinguish between handwriting (e.g., handwritten characters) and typeset (e.g., printed characters). The OCR system may output the handwritten characters and the typeset (e.g., from a computer or scan from a printed document) in a format that allows the data extraction manager system to have knowledge of what information was typeset and what information was handwritten. For example, the OCR system may include multiple outputs, use markup, and/or generate an output using any other suitable method for providing information to the data extraction manager system related to which text was typeset and which text was converted from handwritten characters.
The OCR system may be configured to recognize whether the document would benefit from being processed by the one or more MMLMs. For example, the OCR system may detect figures, tables, annotations, and/or other content that may benefit from image-based (e.g., visual, etc.) processing. The OCR system may communicate the existence of such indicators to the data extraction manager system so that the data extraction manager system can determine whether the document is to be processed by the one or more MMLMs (e.g., based on a criterion), or the OCR system may indicate to the data extraction manager system that the document would benefit from processing by the one or more MMLMs directly. In some embodiments, the OCR system or data therefrom is used to determine if the one or more MMLMs are to be used during ingestion (e.g., index generation, vector embedding) and/or if the one or more MMLMs are to perform data extraction (e.g., after an appropriate document or portion thereof is retrieved).
In some embodiments, the data extraction manager system is configured to perform some or all of the features of the OCR system. The data extraction manager system may be configured to recognize the layout of the document, to recognize figures, and/or to recognize handwritten characters as described previously. The data extraction manager system may communicate such information to the OCR system to facilitate more efficient character recognition (e.g., text generation, conversion, text extraction, etc.). For example, the OCR system may be configured to translate only certain areas of a document or page, thus allowing the data extraction manager system to provide certain layout information to the OCR system.
The data extraction manager system may be configured to coordinate the operations of the data extraction and population system. For example, the data extraction manager system may initiate (e.g., at the request of a user of the one or more UI clients) document gathering from the one or more data sources. The data extraction manager system may communicate (e.g., send, deliver, transmit, etc.) the PDFs or other image-based documents to the OCR system for conversion to plain text. The data extraction manager system may separate the document text from the tabular information before chunking (e.g., splitting text into segments of a length suitable for retrieval augmentation, for example, 500 words, 1000 words, 1000 characters, etc.). The data extraction manager system may communicate the chunks (both tabular chunks and text chunks) to the text embedder to build an index for semantic search.
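The chunking step described above may be sketched as a fixed-size word split. The function name `chunk_words` and the whitespace tokenization are assumptions made for illustration; a deployed system may instead split on characters, sentences, or section boundaries.

```python
def chunk_words(text, max_words=500):
    """Split text into consecutive chunks of at most max_words words.

    Words are delimited by whitespace, which is a simplifying assumption.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# A 1200-word document yields chunks of 500, 500, and 200 words.
document_text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(document_text, max_words=500)
print([len(chunk.split()) for chunk in chunks])
```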
Upon receiving a request from a user of the one or more UI clients, the data extraction manager system may generate several prompts for data extraction (e.g., identification, summarization, generation, etc.) for processing by LMs (e.g., one or more LLMs and/or one or more MMLMs). In some embodiments, the data extraction manager system is configured to embed each prompt (e.g., using the text embedder or a similar embedding model) and compare the prompt vector embedding to the vector embeddings of the index to identify and retrieve potentially related or relevant chunks (e.g., portions of the documents). The prompts, along with the identified relevant chunks, may be communicated to the LMs by the data extraction manager system. In some embodiments, the data extraction manager system is also configured to store the results of a prompt from the LMs. Thereby, the data extraction manager system manages the population of the particular data elements by retrieving both structured and unstructured data, text, tables, etc. from various sources across the local intranet or the internet.
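The comparison of a prompt vector embedding against the index may be sketched as a cosine-similarity ranking. The three-dimensional vectors, the chunk identifiers, and the function names below are invented for illustration; cosine similarity is one suitable distance metric among others, and an actual text embedder would produce vectors of much higher dimension.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def retrieve(prompt_vector, index, top_k=2):
    """Return the identifiers of the top_k chunks most similar to the prompt."""
    scored = [
        (cosine_similarity(prompt_vector, vector), chunk_id)
        for chunk_id, vector in index.items()
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Toy index mapping chunk identifiers to made-up embeddings.
index = {
    "chunk-market-cap": [0.9, 0.1, 0.0],
    "chunk-board-members": [0.1, 0.9, 0.1],
    "chunk-site-locations": [0.0, 0.2, 0.9],
}
prompt_vector = [0.8, 0.2, 0.1]
print(retrieve(prompt_vector, index, top_k=2))
```

The retrieved identifiers can then be used to look up the chunk text that accompanies the prompt sent to the LMs.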
The data extraction manager system may also generate user interfaces for the data extraction and population system. For example, the data extraction manager system may communicate instructions (e.g., JavaScript, Cascading Style Sheets, etc.) to generate a user interface to the one or more UI clients. The user interface may provide interactive capability with the systems of the data extraction and population system. For example, the user interface may provide the ability to initiate data population, configure the data to populate or extract, view results, trace errors, view source material, and/or other interactions that may be appropriate for a particular use case.
The text embedder may be configured to generate a vector embedding for a chunk of text. The vector embedding may refer to a vector representation of the semantic content of the chunk of text. Vectorization gives text numerical values that can be searched, with computational efficiency, for similarity (e.g., using a distance metric); thereby, text with similar semantic content can be identified for retrieval. Words with similar meanings would have similar numerical values, whereas words with dissimilar meanings would not. For example, "hot" and "cold" may have vectors pointing in different directions. A keyword search may not find the literal word "cat" in a chunk, but with vectors, the system can determine that "lion" is similar to "cat" or to "big" + "cat". The text embedder may be trained to capture relationships between the meanings of words (e.g., "female" + "king" ≈ "queen").
After the vectors are created, the text embedder may communicate the vector embeddings of the text chunks to the data extraction manager system for storage in an object (e.g., a vector store). In some embodiments, the text embedder may be included as a component of the data extraction manager system.
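The abstract describes combining a frequency term and an inverse prevalence term into a weight for each word term and embedding a portion of a document as a weighted average of its word terms. The following minimal sketch shows one way this could look. The specific formulas (relative frequency multiplied by a smoothed logarithmic inverse document frequency), the function names, and the two-dimensional word vectors are assumptions for illustration only; the disclosure does not fix these choices.

```python
import math
from collections import Counter


def term_weights(documents):
    """Compute one weight per word term from representative documents.

    Assumed formulas: frequency = share of all term occurrences;
    inverse prevalence = log((1 + N) / (1 + docs containing term)) + 1.
    """
    tokenized = [doc.lower().split() for doc in documents]
    term_counts = Counter(term for doc in tokenized for term in doc)
    doc_counts = Counter(term for doc in tokenized for term in set(doc))
    total_terms = sum(term_counts.values())
    n_docs = len(tokenized)
    weights = {}
    for term, count in term_counts.items():
        frequency = count / total_terms
        inverse_prevalence = math.log((1 + n_docs) / (1 + doc_counts[term])) + 1.0
        weights[term] = frequency * inverse_prevalence
    return weights


def embed_chunk(chunk, word_vectors, weights):
    """Weighted average of the word vectors of the terms in the chunk.

    Terms lacking a vector or a weight are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term in chunk.lower().split():
        if term in word_vectors and term in weights:
            w = weights[term]
            for i, value in enumerate(word_vectors[term]):
                total[i] += w * value
            weight_sum += w
    if weight_sum == 0.0:
        return total  # no known terms: return the zero vector
    return [value / weight_sum for value in total]


# Hypothetical representative documents and toy word vectors.
documents = ["the premium covers the insured", "the policy premium", "the claim"]
word_vectors = {"premium": [1.0, 0.0], "the": [0.0, 1.0]}
weights = term_weights(documents)
print(embed_chunk("the premium", word_vectors, weights))
```

Because the weights are computed from documents representative of a context, the same chunk may receive a different embedding in a different context, which is how retrieval can be tailored to the context.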
The LLM may be any type of artificial intelligence (AI) configuration. For example, the LLM may include generative pre-trained transformers (GPT), bidirectional encoder representations from transformers (BERT), text-to-text transfer transformers (T5), recurrent neural networks (RNN), or any other AI architecture suitable for a large language model. The LLM may be configured to output a text response from a textual prompt. For example, the LLM may convert text of a prompt into tokens representing a unit of information (e.g., a character, word, prefix, punctuation, etc.) and use the input sequence tokens to predict each output word (or token) consecutively. The prompt communicated to the LLM may include chunks from the documents gathered from the one or more data sources so that the LLM is able to use that information to generate its response. For example, the LLM may be provided a prompt including a request to determine the range of the market capitalization of a company over the last 6 months and one or more table chunks or text chunks that include information that may be relevant for such a question.
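A prompt that carries retrieved chunks alongside the request may be assembled as in the following minimal sketch. The `build_prompt` function, the instruction wording, the bracketed source numbering, and the sample chunk text are hypothetical; no particular prompt format is required by this disclosure.

```python
def build_prompt(question, chunks):
    """Combine a request with retrieved chunks into a single text prompt.

    Each chunk is numbered so that a response can point back to its source.
    """
    context = "\n\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunk text invented for this example.
retrieved_chunks = [
    "Table: month-end market capitalization by month (placeholder rows).",
    "Text: management discussion of share price movement (placeholder).",
]
prompt = build_prompt(
    "What was the range of the company's market capitalization over the last 6 months?",
    retrieved_chunks,
)
print(prompt)
```

Numbering the chunks is one way to support citation of the source of extracted information.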
The LLM may be a publicly available LLM such as Claude. The LLM may be pre-trained on massive corpora of text data, allowing it to learn the statistical properties of language and predict output text based on the prompt. In some embodiments, the LLM may be fine-tuned, for example, to extract specific data from tabular and/or textual input. Fine-tuning an LLM may refer to the process of taking a pre-trained model and further training it on a specific dataset to adapt it to a particular task or domain. Fine-tuning may allow the LLM to leverage its existing knowledge while improving its performance on the new, specialized data, for example, by focusing on the correlations found in the particular task or domain.
The one or more MMLMs may be designed to process and/or integrate information from various modalities of input (e.g., text, images, audio, video, etc.). In some embodiments, the input layer of the one or more MMLMs includes a channel for each available modality. For example, there may be an audio channel and an image channel. The image channel may also support text represented visually in the document (e.g., on a page, etc.). The one or more MMLMs may encode the different modalities into a common format that can be processed by one or more hidden layers within the one or more MMLMs. For example, the one or more MMLMs may include convolutional layers for image-based data and/or transformer layers or other attention mechanisms to process textual data. The one or more MMLMs may also include layers that combine (e.g., fuse, integrate, etc.) information across different input modes to generate an output. The output may include similar modalities as the input data. For example, the output may include text, images, audio, video, and/or other relevant formats based on the task and/or the prompt to the one or more MMLMs.
The one or more MMLMs may be configured to use the image-based input modality to better understand the context of any text on the page. For example, image-based input to the one or more MMLMs may allow the one or more MMLMs to understand the flow (e.g., reading order) of the text within a document. The image-based input may also allow the one or more MMLMs to recognize relationships between figures and/or tables and text within a document. Using the image-based input, the one or more MMLMs may be configured to segment various areas of the document or a page within the document based on relationships between the text, figures, and/or other visual cues. For example, the one or more MMLMs may distinguish handwritten characters from typeset characters.
In some embodiments, the documents processed by the data extraction and population system include forms, applications, surveys, etc. for which the document or portion thereof (e.g., page, section, etc.) includes a request for information. The document or portion thereof may also include one or more predefined responses. For example, the document or portion thereof may include multiple-choice, multiple-select, and/or ranking type questions. The one or more MMLMs may be configured to recognize the selections of predefined responses from the respondent to the request for information. For example, the one or more MMLMs may recognize circles around text, check marks, and/or filled-in boxes or bubbles as a selection of the related text. In some embodiments, the one or more MMLMs are configured (e.g., trained, fine-tuned, etc.) to determine the portion of the text that represents the request for information (e.g., the question, survey directions, etc.) and determine the text that represents the predefined responses. The one or more MMLMs may be configured or prompted to process (e.g., consider) this information separately when generating a response.
In some embodiments, the one or more MMLMs are used during document ingestion. The data extraction manager system and/or the OCR system may be configured to recognize that the document includes images, figures, layouts, tables, and/or other content that may benefit from processing by the one or more MMLMs. For example, the data extraction manager system may weigh the added cost and computation of using the one or more MMLMs against the potential for improved retrieval (and therefore extraction) accuracy if the one or more MMLMs are used. In some embodiments, the data extraction manager system may request the one or more MMLMs to create a vector embedding of the document or portion thereof (e.g., page, paragraph, section, etc.). Additionally or alternatively, the data extraction manager system may request the one or more MMLMs to generate a summary (e.g., a text-based summary) of the document or portion thereof. After a summary of the document or portion thereof is generated, the one or more LLMs may be used to create a vector embedding for the index.
The one or more UI clients may provide users, administrators, and/or developers of the data extraction and population system access to its features. In some embodiments, the one or more UI clients are used to generate a user interface that allows for interaction with the components of the data extraction and population system. For example, the one or more UI clients may be used to initiate data population, configure the data to populate or extract, view results, trace errors, view source material, and/or other interactions that may be appropriate for a particular use case. The one or more UI clients provide various inputs (e.g., selecting user interface objects, entering text into fields, etc.) and various outputs (e.g., display, print, email, or transmission to another system) to/from the data extraction and population system.
The network can include routers, switches, antennas, computers, and any other hardware required to communicate information between the components of the data extraction and population system (e.g., from the data extraction manager system to the one or more LLMs or the one or more MMLMs). A portion of the network can be wireless and/or a portion of the network can be wired. The network can include one or more networks with routers to facilitate data transfer between the different networks.
One use case in which the data extraction and population system is particularly useful is extracting data for the underwriting process of insurance policies. For example, directors and officers liability insurance and/or environmental insurance require extracting large amounts of information for which there is no central repository. The information may be collected about the company, the directors and officers, and/or any business locations. Manually searching for this information is error-prone and requires a large time investment from the underwriters. Moreover, much of the data that is to be extracted for insurance underwriting may be found in financial tables of image-based documents (e.g., PDFs), making the systems and methods of separating tabular information and text information described herein particularly useful in such scenarios.
Continuing with the example of insurance underwriting, the user of the data extraction and population system may be an insurance underwriter. The underwriter may have a specially curated set of data elements required to perform the underwriting process for different types of insurance policies. A type of insurance policy may be considered a task for which the data extraction and population system is configured to populate the data elements of an ontological data store related to that type of insurance policy. The insurance policy may be associated with a subject (e.g., a company, a person, a building, etc.) for which the insurance policy is to be underwritten. After data is populated, the underwriter may review the information and/or generate a report. For regulatory purposes, the data used to generate the report may require citation to the source of the information. Systems and methods described herein may allow for such traceability.
A block diagram of the data extraction manager system is shown, according to some embodiments. In some embodiments, the data extraction manager system is configured to coordinate the processes performed by the data extraction and population system during the data extraction and population. The data extraction manager system is shown as a single entity (e.g., hardware). However, it is contemplated that the components and/or instruction sets included in the data extraction manager system could be distributed over any number of computer hardware devices and in any manner of architecture (e.g., local network, cloud-based, etc.).
The data extraction manager system is shown to include a communications interface and one or more processing circuits having one or more processors and memory.
May 26, 2026