An illustrative embodiment provides a computer-implemented method. The method comprises using a processor set to receive a number of documents from a plurality of data sources. The processor set annotates entities in a first subset of documents from the number of documents. The processor set splits the number of documents into a number of chunks. The processor set indexes the number of chunks according to an index structure to generate a number of indexed chunks. The processor set generates a number of prompt examples for each document in a second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents. The processor set generates a prompt for extracting entities using a large language model based on the number of prompt examples.
Legal claims defining the scope of protection, as filed with the USPTO.
. A computer implemented method, comprising:
. The computer implemented method of, further comprising:
. The computer implemented method of, further comprising:
. The computer implemented method of, wherein the prompt further comprises general instructions for the large language model, and definitions and entity schema for the number of prompt examples for each document in the second subset of documents.
. The computer implemented method of, further comprising:
. The computer implemented method of, wherein the number of indexed chunks is stored in a vector database.
. A computer system, comprising:
. The computer system of, wherein the operations further comprise:
. The computer system of, wherein the operations further comprise:
. The computer system of, wherein the operations further comprise:
. The computer system of, wherein the number of indexed chunks is stored in a vector database.
. A computer program product comprising:
. The computer program product of, wherein the operations further comprise:
. The computer program product of, wherein the operations further comprise:
. The computer program product of, wherein the prompt further comprises general instructions for the large language model, and definitions and entity schema for the number of prompt examples for each document in the second subset of documents.
. The computer program product of, wherein the operations further comprise:
. The computer system of, wherein the prompt further comprises general instructions for the large language model, and definitions and entity schema for the number of prompt examples for each document in the second subset of documents.
Complete technical specification and implementation details from the patent document.
The present disclosure relates generally to generating prompt examples, and more specifically to generating prompt examples for extracting entities from documents.
Prompts refer to the input text or questions provided to a deep learning model such as a large language model to generate a response. A prompt serves as a starting point for the model's processing and can be as simple as a word, a phrase, or as complex as detailed instructions or questions.
In this case, prompts are critical in guiding how deep learning models such as large language models generate content because the prompts set the context and define the scope of the response. In other words, large language models rely heavily on the prompts to interpret user intent and users can ensure that the large language models produce results that are useful by clearly framing questions or instructions in the prompts.
An illustrative embodiment provides a computer-implemented method. The method comprises using a processor set to receive a number of documents from a plurality of data sources. The number of documents comprises a first subset of documents and a second subset of documents. The processor set annotates entities in the first subset of documents from the number of documents. The processor set splits the number of documents into a number of chunks. Each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents. The processor set indexes the number of chunks according to an index structure to generate a number of indexed chunks. Chunks with annotated text from the first subset of documents are indexed according to the index structure. The processor set generates a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents. The processor set generates a prompt for extracting entities using a large language model based on the number of prompt examples.
Another illustrative embodiment provides a computer system. The system comprises a processor set, a set of one or more computer-readable storage media, and program instructions stored on the set of one or more storage media to cause the processor set to perform operations. The operations comprise receiving a number of documents from a plurality of data sources. The number of documents comprises a first subset of documents and a second subset of documents. The processor set annotates entities in the first subset of documents from the number of documents. The processor set splits the number of documents into a number of chunks. Each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents. The processor set indexes the number of chunks according to an index structure to generate a number of indexed chunks. Chunks with annotated text from the first subset of documents are indexed according to the index structure. The processor set generates a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents. The processor set generates a prompt for extracting entities using a large language model based on the number of prompt examples.
Another illustrative embodiment provides a computer program product. The computer program product comprises a set of one or more computer-readable storage media, and program instructions stored in the set of one or more storage media to perform operations. The operations comprise using a processor set to receive a number of documents from a plurality of data sources. The number of documents comprises a first subset of documents and a second subset of documents. The processor set annotates entities in the first subset of documents from the number of documents. The processor set splits the number of documents into a number of chunks. Each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents. The processor set indexes the number of chunks according to an index structure to generate a number of indexed chunks. Chunks with annotated text from the first subset of documents are indexed according to the index structure. The processor set generates a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents. The processor set generates a prompt for extracting entities using a large language model based on the number of prompt examples.
The features and functions can be achieved independently in various embodiments of the present disclosure or may be combined in yet other embodiments in which further details can be seen with reference to the following description and drawings.
The illustrative embodiments recognize and take into account a number of considerations. For example, the illustrative embodiments recognize and take into account that named entity recognition is a task of extracting information that seeks to locate and classify named entities. The illustrative embodiments recognize and take into account that named entity recognition is a challenging task that traditionally requires a large amount of labeled training data to achieve high performance.
The illustrative embodiments recognize and take into account that few-shot learning is a type of machine learning that has the ability to train a machine learning model to recognize named entities with only a few examples for each entity.
The illustrative embodiments also recognize and take into account that Large Language Model-Retrieval Augmented Generation (LLM-RAG) is a technique that combines the power of large language models with retrieval-based methods to generate high-quality text. LLM-RAG models can be fine-tuned on a small amount of data and can generate text that is coherent and contextually relevant.
Thus, illustrative embodiments of the present invention provide a computer implemented method, computer system, and computer program product for generating a model for extracting entities from a document. The method comprises using a processor set to receive a number of documents from a plurality of data sources. The number of documents comprises a first subset of documents and a second subset of documents. The processor set annotates entities in the first subset of documents from the number of documents. The processor set splits the number of documents into a number of chunks. Each chunk from the number of chunks comprises a portion of textual information from a document in the number of documents. The processor set indexes the number of chunks according to an index structure to generate a number of indexed chunks. Chunks with annotated text from the first subset of documents are indexed according to the index structure. The processor set generates a number of prompt examples for each document in the second subset of documents based on the number of indexed chunks and annotated entities from the first subset of documents. The processor set generates a prompt for extracting entities using a large language model based on the number of prompt examples.
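The final step, generating a prompt from the prompt examples, can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the `PromptExample` class, the `build_prompt` function, the prompt layout, and the sample loan text are all assumptions introduced here. The sketch combines general instructions, entity definitions, and few-shot examples into a single prompt string.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PromptExample:
    """One few-shot example: a text chunk and the entities annotated in it."""
    text: str
    entities: dict[str, str]  # entity name -> value found in the text


def build_prompt(instructions: str,
                 schema: dict[str, str],
                 examples: list[PromptExample],
                 target_text: str) -> str:
    """Assemble instructions, entity definitions, examples, and the target text."""
    lines = [instructions, "", "Entity definitions:"]
    for name, definition in schema.items():
        lines.append(f"- {name}: {definition}")
    for i, ex in enumerate(examples, start=1):
        lines += ["", f"Example {i}:", f"Text: {ex.text}", "Entities:"]
        lines += [f"- {k}: {v}" for k, v in ex.entities.items()]
    # The prompt ends with the unannotated text and an open "Entities:" slot.
    lines += ["", "Text: " + target_text, "Entities:"]
    return "\n".join(lines)


# Hypothetical sample data, for illustration only.
prompt = build_prompt(
    instructions="Extract the entities defined below from the text.",
    schema={"borrower": "Party receiving the loan", "date": "Agreement date"},
    examples=[PromptExample("Loan to Acme Corp dated 2020-01-05.",
                            {"borrower": "Acme Corp", "date": "2020-01-05"})],
    target_text="Loan to Beta LLC dated 2021-03-09.",
)
```

Ending the prompt with an open "Entities:" slot is one common few-shot layout; other layouts, such as a JSON schema, would serve equally well.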
With reference to the drawings, a pictorial representation of a network of data processing systems is depicted in which illustrative embodiments may be implemented. Network data processing system is a network of computers in which the illustrative embodiments may be implemented. Network data processing system contains network, which is the medium used to provide communications links between various devices and computers connected together within network data processing system. Network might include connections, such as wire, wireless communication links, or fiber optic cables.
In the depicted example, two server computers connect to network along with storage unit. In addition, client devices connect to network. In the depicted example, one of the server computers provides information, such as boot files, operating system images, and applications, to client devices. Client devices can be, for example, computers, workstations, or network computers. As depicted, client devices include client computers. Client devices can also include other types of client devices such as a mobile phone, a tablet, and smart glasses.
In this illustrative example, the server computers, storage unit, and client devices are network devices that connect to network, in which network is the communications media for these network devices. Some or all of client devices may form an Internet of things (IoT) in which these physical devices can connect to network and exchange information with each other over network.
Client devices are clients to one of the server computers in this example. Network data processing system may include additional server computers, client computers, and other devices not shown. Client devices connect to network utilizing at least one of wired, optical fiber, or wireless connections.
Program code located in network data processing system can be stored on a computer-recordable storage medium and downloaded to a data processing system or other device for use. For example, the program code can be stored on a computer-recordable storage medium on a server computer and downloaded to client devices over network for use on client devices.
In the depicted example, network data processing system is the Internet with network representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages. Of course, network data processing system also may be implemented using a number of different types of networks. For example, network can be comprised of at least one of the Internet, an intranet, a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN). This depiction is intended as an example, and not as an architectural limitation for the different illustrative embodiments.
With reference now to the drawings, an illustration of a block diagram of an entity extraction environment is depicted in accordance with an illustrative embodiment. In this illustrative example, entity extraction environment includes components that can be implemented in hardware such as the hardware shown in network data processing system.
In this illustrative example, entity extraction system in entity extraction environment extracts named entities from documents using machine intelligence. In this illustrative example, entity extraction system includes computer system, which includes entity extractor. Entity extractor is located in computer system.
Entity extractor can be implemented in software, hardware, firmware, or a combination thereof. When software is used, the operations performed by entity extractor can be implemented in program instructions configured to run on hardware, such as a processor unit. When firmware is used, the operations performed by entity extractor can be implemented in program instructions and data and stored in persistent memory to run on a processor unit. When hardware is employed, the hardware can include circuits that operate to perform the operations in entity extractor.
In the illustrative examples, the hardware can take a form selected from at least one of a circuit system, an integrated circuit, an application specific integrated circuit (ASIC), a programmable logic device, or some other suitable type of hardware configured to perform a number of operations. With a programmable logic device, the device can be configured to perform the number of operations. The device can be reconfigured at a later time or can be permanently configured to perform the number of operations. Programmable logic devices include, for example, a programmable logic array, a programmable array logic, a field programmable logic array, a field programmable gate array, and other suitable hardware devices. Additionally, the processes can be implemented in organic components integrated with inorganic components and can be comprised entirely of organic components excluding a human being. For example, the processes can be implemented as circuits in organic semiconductors.
As used herein, “a number of” when used with reference to items, means one or more items. For example, “a number of operations” is one or more operations.
Further, the phrase “at least one of,” when used with a list of items, means different combinations of one or more of the listed items can be used, and only one of each item in the list may be needed. In other words, “at least one of” means any combination of items and number of items may be used from the list, but not all of the items in the list are required. The item can be a particular object, a thing, or a category.
For example, without limitation, “at least one of item A, item B, or item C,” may include item A, item A and item B, or item B. This example also may include item A, item B, and item C, or item B and item C. Of course, any combination of these items can be present. In some illustrative examples, “at least one of” can be, for example, without limitation, two of item A; one of item B; and ten of item C; four of item B and seven of item C; or other suitable combinations.
Computer system is a physical hardware system and includes one or more data processing systems. When more than one data processing system is present in computer system, those data processing systems are in communication with each other using a communications medium. The communications medium can be a network. The data processing systems can be selected from at least one of a computer, a server computer, a tablet computer, or some other suitable data processing system.
As depicted, computer system includes processor set that is capable of executing program instructions implementing processes in the illustrative examples. In other words, program instructions are computer-readable program instructions.
As used herein, a processor unit in processor set is a hardware device and is comprised of hardware circuits such as those on an integrated circuit that respond to and process instructions and program code that operate a computer. A processor unit can be implemented using processor set. When processor set executes program instructions for a process, processor set can be one or more processor units that are in the same computer or in different computers. In other words, the process can be distributed between processor units of processor set on the same or different computers in computer system.
Further, processor set can be of the same type or different types of processor units. For example, processor set can be selected from at least one of a single core processor, a dual-core processor, a multi-processor core, a general-purpose central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), or some other type of processor unit.
As depicted, computer system includes machine intelligence. Machine intelligence can include machine learning models and machine learning algorithms. Machine learning is a branch of artificial intelligence (AI) that enables computers to detect patterns and improve performance without direct programming commands. Rather than relying on direct input commands to complete a task, machine learning relies on input data. The data is fed into the machine, one of machine learning algorithms is selected, parameters for the data are configured, and the machine is instructed to find patterns in the input data through optimization algorithms. The data model formed from analyzing the data is then used to predict future values.
Machine intelligence is continuously refined over time through trial and error. Equivalence of assets or products can be effectively determined by supervised machine learning so that products or assets that do not match descriptively can nevertheless be matched. Over time, the data model from machine learning can provide a greater degree of flexibility in matching machine intelligence.
Machine intelligence can be implemented using one or more systems such as an artificial intelligence system, a neural network, a generative neural network, a Bayesian network, an expert system, a fuzzy logic system, a genetic algorithm, or other suitable types of systems. Machine learning models and machine learning algorithms may make computer system a special purpose computer for extracting entities from documents.
Machine learning involves using machine learning algorithms to build computation models based on samples of data. The samples of data used for training are referred to as training data or training datasets. Machine intelligence can make predictions without being explicitly programmed to make these predictions. Machine intelligence can be used for training and retraining computation models for a number of different types of applications. These applications include, for example, medicine, financial services, healthcare, speech recognition, computer vision, or other types of applications.
In this illustrative example, machine learning models can include a number of models. For example, machine learning models can include a deep learning model such as large language model. In this illustrative example, large language model is a type of machine learning model designed to understand, generate, and manipulate human language.
In this illustrative example, machine learning algorithms can include supervised machine learning algorithms and unsupervised machine learning algorithms. Supervised machine learning can train machine learning models using data containing both the inputs and desired outputs. Examples of machine learning algorithms include XGBoost, K-means clustering, and random forest.
In this illustrative example, computer system includes documents received from data sources. In this illustrative example, data sources are locations or platforms from which data is collected, gathered, or generated for analysis, processing, or storage. For example, data sources can be external databases that are used for storing documents. In this illustrative example, optical character recognition is used on documents to convert images or scanned documents into text.
In this illustrative example, documents further includes first subset of documents and second subset of documents. Documents include a number of entities. Entities are pieces of information that can be extracted from documents. For example, entities can be tables, sections, people, organizations, locations, addresses, dates, or quantities. In this illustrative example, entities can also be referred to as named entities.
In this illustrative example, documents in documents can also have document types. In this illustrative example, entity extractor creates a new document type when a new business requirement is received. For example, document types can include W8bene, W9, Loan Prospectus, or Credit Agreements. In other words, document types are classifications of documents that are associated with specific entities and structures to present those specific entities.
In this illustrative example, entities in first subset of documents are annotated to generate annotated entities and annotated texts. In this illustrative example, annotation of entities in first subset of documents can be achieved in a number of ways. For example, annotated entities and annotated texts can be generated by manually labelling entities in first subset of documents using special symbols, syntax, or characters. In this illustrative example, annotated texts are portions of text that include annotated information such as annotated entities.
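As an illustration of labelling with special symbols, the sketch below parses a hypothetical inline markup of the form `[[label|value]]`. The markup, the `parse_annotations` function, and the sample sentence are assumptions made for this example; the disclosure does not specify an annotation syntax. The function returns the plain text together with each annotated entity and its character offsets.

```python
import re

# Hypothetical inline markup: [[label|value]] marks an entity in the text.
ANNOTATION = re.compile(r"\[\[(?P<label>[^|\]]+)\|(?P<value>[^\]]+)\]\]")


def parse_annotations(annotated_text: str):
    """Return the plain text and the annotated entities with character offsets."""
    plain_parts, entities = [], []
    cursor = 0   # position in the annotated text
    offset = 0   # position in the plain text being rebuilt
    for m in ANNOTATION.finditer(annotated_text):
        before = annotated_text[cursor:m.start()]
        plain_parts.append(before)
        offset += len(before)
        value = m.group("value")
        entities.append({"label": m.group("label"), "value": value,
                         "start": offset, "end": offset + len(value)})
        plain_parts.append(value)
        offset += len(value)
        cursor = m.end()
    plain_parts.append(annotated_text[cursor:])
    return "".join(plain_parts), entities


text, ents = parse_annotations(
    "Loan to [[borrower|Acme Corp]] dated [[date|2020-01-05]].")
```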
Entity extractor splits all documents in documents into a number of chunks. In this illustrative example, each chunk in chunks includes a portion of textual information from a document in documents. For example, chunk includes textual information from document in documents. It should be understood that each document in documents is split into a number of chunks. In other words, chunks other than chunk may also correspond to other textual information in document. In this illustrative example, chunks can further be converted into embeddings using a pre-trained language model from machine learning models.
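One simple way to split a document into chunks is a sliding window over words. The sketch below assumes fixed-size word windows with overlap; the disclosure does not state a chunking strategy, and the `split_into_chunks` function and its `size` and `overlap` parameters are hypothetical.

```python
def split_into_chunks(text: str, size: int = 50, overlap: int = 10) -> list:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_into_chunks("one two three four five six seven", size=4, overlap=1)
```

The overlap keeps an entity that straddles a window boundary intact in at least one chunk, at the cost of some duplicated text.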
In this illustrative example, entity extractor indexes chunks or embeddings of chunks according to index structure to generate indexed chunks. Index structure is an organization of data such that data indexed according to index structure can be stored to facilitate fast search, retrieval, and access in a database or other data systems.
In addition, chunks in chunks that include annotated texts from annotated texts are also indexed according to index structure. In this illustrative example, indexed chunks can be stored in a vector database and chunks with annotated texts can be flagged to indicate that those chunks can be used as examples for few-shot learning for machine learning models.
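The indexing and flagging described above can be sketched with an in-memory list standing in for a vector database. Everything named here is an assumption for illustration: the `IndexedChunk` and `ChunkIndex` classes, the document identifiers, and the toy term-count vector, which stands in for an embedding from a pre-trained language model.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IndexedChunk:
    doc_id: str
    chunk_id: int
    text: str
    vector: Counter          # toy embedding: lowercase term counts
    annotated: bool = False  # flag: usable as a few-shot example


@dataclass
class ChunkIndex:
    """In-memory stand-in for a vector database."""
    records: list = field(default_factory=list)

    def add(self, doc_id: str, chunks: list,
            annotated_ids: frozenset = frozenset()) -> None:
        """Index every chunk of a document, flagging the annotated ones."""
        for i, text in enumerate(chunks):
            self.records.append(IndexedChunk(
                doc_id, i, text, Counter(text.lower().split()),
                i in annotated_ids))

    def few_shot_candidates(self) -> list:
        """Return only the chunks flagged as containing annotated text."""
        return [r for r in self.records if r.annotated]


index = ChunkIndex()
index.add("doc-A", ["Borrower: Acme Corp", "Signed in Boston"],
          annotated_ids=frozenset({0}))
index.add("doc-B", ["Borrower: Beta LLC"])
```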
As depicted, indexed chunks can be used as training data for training machine learning models such as large language model in machine intelligence. In this illustrative example, machine intelligence can utilize few-shot learning to learn how to identify entities from documents based on indexed chunks from indexed chunks that include annotated texts, which include information associated with annotated entities and annotated texts from first subset of documents. As a result, large language model can be used for identifying entities for documents after training.
In this illustrative example, entity extractor can generate prompt examples for each document in second subset of documents based on indexed chunks and annotated entities from first subset of documents. In this illustrative example, a number of prompt examples are generated for each document in second subset of documents.
Prompt examples can be generated in a number of ways. For example, entity extractor can fetch a number of first indexed chunks for first document from second subset of documents. In other words, the number of first indexed chunks includes unannotated textual information from first document in second subset of documents.
In this illustrative example, entity extractor can use prompt examples to generate synthetic data by replacing entities with different values. For example, entity extractor can utilize an existing dictionary-based mapping of entities, which can be replaced in existing indexes to create variations. Such synthetic data can be used to enrich an existing dataset and potentially to fine-tune machine learning models such as large language model for better accuracy and efficiency.
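The dictionary-based replacement can be sketched as follows. The `make_variations` function, the structure of the replacement dictionary, and the sample values are assumptions for illustration only. Each variation swaps one entity value in the example text and updates the matching annotation.

```python
def make_variations(example_text: str, entities: dict, replacements: dict) -> list:
    """Create one synthetic (text, entities) pair per replacement value."""
    variations = []
    for label, original in entities.items():
        for new_value in replacements.get(label, []):
            # Skip no-op swaps and entities whose value is not in the text.
            if new_value == original or original not in example_text:
                continue
            new_text = example_text.replace(original, new_value)
            new_entities = {**entities, label: new_value}
            variations.append((new_text, new_entities))
    return variations


# Hypothetical example and replacement dictionary.
variants = make_variations(
    "Loan to Acme Corp dated 2020-01-05.",
    {"borrower": "Acme Corp", "date": "2020-01-05"},
    {"borrower": ["Beta LLC", "Gamma Inc"], "date": ["2021-03-09"]},
)
```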
In this illustrative example, entity extractor can identify third subset of documents from first subset of documents based on first indexed chunks using first vector searching technique. In this illustrative example, first vector searching technique is a method used to search data in a vector space.
For example, first vector searching technique can be a similarity analysis such as cosine similarity or Jaccard similarity that can be used for identifying documents with similar chunks. In other words, third subset of documents are documents in first subset of documents that have corresponding chunks that are similar to first indexed chunks.
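The two similarity measures named above, and their use for selecting similar annotated documents, can be sketched as follows. This is a minimal illustration: the term-count vectors stand in for real embeddings, and the `similar_documents` function, its `threshold` parameter, and the sample documents are assumptions not taken from the disclosure.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_documents(query_chunks: list, annotated_docs: dict,
                      threshold: float = 0.5) -> list:
    """Return ids of annotated documents with a chunk similar to a query chunk."""
    query_vecs = [Counter(c.lower().split()) for c in query_chunks]
    hits = []
    for doc_id, chunks in annotated_docs.items():
        doc_vecs = [Counter(c.lower().split()) for c in chunks]
        if any(cosine_similarity(q, d) >= threshold
               for q in query_vecs for d in doc_vecs):
            hits.append(doc_id)
    return hits


third_subset = similar_documents(
    ["the borrower is Beta LLC"],
    {"doc-A": ["the borrower is Acme Corp"],
     "doc-C": ["weather report for Tuesday"]},
)
```

In this sample, only `doc-A` shares enough terms with the query chunk to pass the threshold.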
In this illustrative example, annotated chunks are identified for third subset of documents for entities included in first indexed chunks. In other words, annotated chunks are chunks that include entities from third subset of documents.
Entity extractor can identify a number of second indexed chunks from first indexed chunks based on second vector searching technique that is performed between first indexed chunks and annotated chunks. As depicted, second vector searching technique is a method used to search data in a vector space. For example, second vector searching technique can be a similarity analysis such as cosine similarity or Jaccard similarity that can be used for identifying documents with similar chunks.
In this illustrative example, entity extractor can further rank second indexed chunks based on scores obtained from second vector searching technique. In this example, entity extractor can take the top indexed chunks from second indexed chunks for further processing.
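The scoring and ranking step can be sketched as follows, assuming Jaccard similarity over word sets as the second vector searching technique. The `rank_top_chunks` function, the cutoff `k`, and the sample chunks are hypothetical. Each candidate chunk is scored by its best match against any annotated chunk, and the highest-scoring candidates are kept.

```python
def rank_top_chunks(candidates: list, annotated: list, k: int = 2) -> list:
    """Score each candidate by its best Jaccard match; keep the top k."""
    def tokens(text: str) -> set:
        return set(text.lower().split())

    def jaccard(a: set, b: set) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    scored = [
        (max((jaccard(tokens(c), tokens(a)) for a in annotated), default=0.0), c)
        for c in candidates
    ]
    # The sort is stable, so candidates with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


top = rank_top_chunks(
    candidates=["borrower is Beta LLC", "page 3 of 9",
                "dated 2021-03-09 in Boston"],
    annotated=["borrower is Acme Corp", "dated 2020-01-05 in Boston"],
    k=2,
)
```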
May 26, 2026