US-12619819-B2

Identifying provenance information of a data item generated by a generative machine learning model

Published: May 5, 2026
Assignee: not available in USPTO data we have
Inventors: not available in USPTO data we have
Technical Abstract

Metadata may be identified for text generated by a generative machine learning model. A text is obtained and a weighting scheme is determined for performing similarity analysis. Different similarity analysis techniques are performed that compare the text with representations of texts in the training data set for the generative machine learning model. Final similarity scores are generated that combine the different similarity analysis techniques according to the weighting scheme and are used to select metadata to provide that is relevant to the text.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A system, comprising:

2. The system of, wherein the metadata identification system is configured to:

3. The system of, wherein the metadata identification system is configured to:

4. The system of, wherein the metadata identification system is implemented as part of a code development service of a provider network, wherein the code was generated to perform a refactoring task for an input code provided to the code development service.

5. A method, comprising:

6. The method of, further comprising:

7. The method of, further comprising:

8. The method of, wherein the weighting scheme is determined based, at least in part, on one or more similarity parameters received at the metadata identification system for performing a similarity search for relevant metadata for the data item.

9. The method of, wherein the respective final similarity scores are generated according to a weighted average of the similarities.

10. The method of, wherein one of the different similarity techniques is token-based similarity technique that generates tokens of the data item for comparison with token representations of the plurality of data items.

11. The method of, wherein one of the different similarity techniques is semantic similarity technique that generates an embedding of text for comparison with embeddings of the plurality of data items.

12. The method of, wherein further metadata for another one of the one or more representations of the plurality of data items is provided based on the respective final similarity scores, and wherein the metadata and the further metadata are ordered in a display for the data item according to the respective final similarity scores.

13. The method of, wherein the metadata identification system is implemented as part of a provider network service for text generated by the provider network service.

14. One or more non-transitory, computer-readable storage media, storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement:

15. The one or more non-transitory, computer-readable storage media of, storing further programming instructions that when executed, cause the one or more computing devices to further implement:

16. The one or more non-transitory, computer-readable storage media of, wherein one of the different similarity techniques is a structure-based similarity technique that generates graph structure of the text for comparison with graph structure representations of the plurality of texts.

17. The one or more non-transitory, computer-readable storage media of, wherein further metadata for another one of the one or more representations of the plurality of texts is provided based on the respective final similarity scores, and wherein the metadata and the further metadata are refined according to one or more relevancy parameters.

18. The one or more non-transitory, computer-readable storage media of, wherein the weighting scheme is determined based, at least in part, on one or more similarity parameters received at a metadata identification system for performing a similarity search for relevant metadata for the text.

19. The one or more non-transitory, computer-readable storage media of, wherein the text is code and wherein one of the similarity techniques is a version control similarity technique that compares a summary generated of the code with descriptions of committed code changes.

20. The one or more non-transitory, computer-readable storage media of, wherein the one or more computing devices are implemented as part of a code development service of a provider network and wherein the text is code generated by the code development service.

Detailed Description

Complete technical specification and implementation details from the patent document.

Large language models (LLMs) and other generative machine learning models expand the capabilities of different systems to interact with and respond to text and other data items across a wide variety of subjects. For instance, to provide competency across a number of subjects, LLMs are trained using large amounts of text data. Accordingly, text generated by LLMs may draw upon many different sources.

While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the embodiments are not limited to the embodiments or drawings described. It should be understood, that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.

Various techniques for identifying provenance information of a data item generated by a generative machine learning model are described herein. Generative machine learning models refer to machine learning techniques that model different types of data in order to perform various data generative tasks given a prompt. For example, language models, such as large language models (LLMs), are one type of generative machine learning model and refer to machine learning techniques applied to model language, which may include natural language (e.g., human speech) and machine-readable language (e.g., programming languages, scripts, code representations, etc.). A language model is a type of artificial intelligence (AI) model that is trained on textual data to generate coherent and contextually relevant text. A “large” language model refers to a language model that has been trained on an extensive dataset and has a high number of parameters, enabling it to capture complex language patterns and perform a wider range of tasks. Large language models are designed to handle a wide range of natural language processing tasks, such as text completion, translation, summarization, and even conversation. The specific parameter count required for a model to be considered a “large” language model can vary depending on context and technological advancements. However, traditionally, large language models have millions to billions of parameters.

Language models may take inputs of language prompts (potentially with additional relevant data) and generate corresponding language outputs. Language models are widely adaptable to many different language processing scenarios. For example, a language model can be trained to translate a given input text from one language to another. In another example, a language model could be trained to summarize, analyze, or perform other language processing tasks that generate output language based on given input language, such as chatting or following instructions. Some language models can generate a large amount of new text given a prompt with broad parameters, such as a prompt to generate a story given a brief description of a scenario, characters, or facts.

Language models are a form of machine learning that provides language processing capabilities with wide applicability to a number of different systems, services, or applications. Machine learning refers to a discipline by which computer systems can be trained to recognize patterns through repeated exposure to training data. In unsupervised learning, a self-organizing algorithm learns previously unknown patterns in a data set without any provided labels. In supervised learning, this training data includes an input that is labeled (either automatically, or by a human annotator) with a “ground truth” of the output that corresponds to the input. A portion of the training data set is typically held out of the training process for purposes of evaluating/validating performance of the trained model. The use of a trained model in production is often referred to as “inference,” during which the model receives new data that was not in its training data set and provides an output based on its learned parameters. The training and validation process may be repeated periodically or intermittently, by using new training data to refine previously learned parameters of a production model and deploy a new production model for inference, in order to mitigate degradation of model accuracy over time.

There have been many developments in large-scale machine learning and deep learning models. For example, GPT-3 is trained on 570 GB of text and consists of 175 billion parameters. While large models may have state-of-the-art performance, in various scenarios it may be desirable to deploy a smaller model. Knowledge distillation is a technique that transfers knowledge from a complex neural network (the “teacher model”) to a simpler one (the “student model”). The teacher model is trained on labeled data, and the student model is trained to mimic the teacher's behavior using unlabeled data of “soft targets”, which are probability distributions indicating the teacher's confidence in its predictions. By minimizing the difference between the student's predictions and the teacher's soft targets, the student model can learn from the teacher's knowledge and achieve better performance, even with fewer parameters. In some embodiments, a generative machine learning model may be a student model (or a teacher model).
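The soft-target objective described above can be sketched in a few lines. This is a minimal illustration, not the method of this disclosure: it assumes the common formulation in which the "difference" is a Kullback-Leibler divergence between temperature-softened teacher and student distributions. The function names and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions (assumed formulation)."""
    teacher = softmax(teacher_logits, temperature)  # the "soft targets"
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


# A student whose logits match the teacher's incurs zero loss;
# a student that ranks the classes differently incurs a positive loss.
matching = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
mismatched = distillation_loss([0.1, 1.0, 2.0], [2.0, 1.0, 0.1])
```

Training the student would then adjust its parameters to reduce this loss over many inputs.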

For language models, “inference” may refer to the output that the language model predicts in response to new data given as a language prompt. A prompt may be an instruction and/or input text in one (or more) languages. Different language models may be trained to handle varying types of prompts. Some language models may be generally trained across a wide variety of subjects and then later fine-tuned for use in specific applications and subject areas. Fine-tuning refers to further training performed on a given machine learning model that may adapt the parameters of the machine learning model toward specific knowledge areas or tasks through the use of additional training data. For example, a language model may be trained to recognize patterns in text and generate text predictions across many different scientific areas, literature, transcribed human conversations, and other academic disciplines and then later fine-tuned to be optimized to perform language tasks in a specific area, such as code-based tasks, like code suggestion, code refactoring, or other code generation scenarios, as discussed in detail below.

Because language models may draw upon a wide variety of data sources when generating output text, it may be difficult for an end user of the language model's output to determine a source (or influence) of the generated output text in the original (or fine-tuning) training data. There are some scenarios where information associated with the source (or provenance) of the generated output text may be applicable to uses of the output text generated by a language model. In one example, in which a language model is trained for code generation tasks such as refactoring, users may desire to be alerted to any outputs that are potentially similar or verbatim matches to open source training code, so that they can review the training code to see whether it is helpful for their use case and/or to determine the requirements of its open source license. Generally, code may be subject to one of various types of software licenses which may implicate the ability to use it in different contexts or impose subsequent obligations as a result of its use. In another example, users may desire to see provenance data so that they can fact check the output, as language models may sometimes “hallucinate”—that is, generate plausible sounding but factually incorrect responses.

The present disclosure addresses the above concerns, among others, by providing the ability to robustly identify outputs which potentially match training data (even where the output may be in a different format or language than the similar training data) and surfacing provenance information to users. Metadata descriptive of the source (e.g., identifying licenses for using the source of the code and thus potentially the generated code) of the output text may be applicable to the generated output text. In these and other scenarios, it may be highly desirable to identify metadata descriptive of text in training data for a language model that may also be applicable to and descriptive of output text generated by language models. In this way, end users can make informed decisions when using output text generated by language models.

While similarity techniques exist that compare text and could be used to help identify the source(s) of output text generated by a language model, different similarity techniques have different strengths and weaknesses. Moreover, different categories of language upon which output text predictions are based may have category-specific concerns that may make it desirable to provide similarity analyses that are adaptable to these category-specific concerns without sacrificing accuracy (which in turn affects the quality of metadata identified as relevant to output text predictions).

As an example, consider code generated using a language model. During language model inference (or after), a relevant metadata search for the generated code may include analyzing the generated code snippets to determine their origins (including the author, organization, and license) at either the function or line level. To identify potential matches between the generated code and the code base, similarity analysis techniques such as token-based, Abstract Syntax Tree (AST)-based structure, or embedding-based semantic code similarity analysis can be performed against a database of known code bases. By doing so, the origin of the code may be determined, including various metadata descriptive of the origin, such as author and license information.

There are different factors to consider when performing different similarity analyses on code. For instance, the use of token-based and AST-based techniques (or other graph-structure comparison techniques) to determine generated code similarity with source code may be subject to the following considerations. Language dependency is one area of consideration. Token-based and AST-based methods may be limited to a specific programming language, whereas an embedding-based approach is language-agnostic. When the same part of the code logic is implemented in a different language, it may be hard to detect the similarity and thus identify relevant metadata. Another area of consideration is false positives. A false positive may be a similarity that is identified because of similar design patterns and coding styles, even though the logic of the generated code does not actually match the logic of the source code to which it is being compared. Another area of consideration for similarity techniques is obfuscation. For example, some code may be intentionally obfuscated by developers (e.g., according to naming conventions or lack of code comments). Such code may be difficult to match for similarity analyses that use grammar-based tokens or AST analysis. Another area of consideration may be lack of context. Token-based and AST-based similarity techniques may focus on the code structure and syntax, but not consider the context in which the code was written or the purpose it serves. This can make it difficult to detect similar code accurately, especially in cases where code has been modified or repurposed from its original intended use.
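The contrast between token-based and AST-based comparison can be made concrete with a small sketch. It uses Python's standard `tokenize` and `ast` modules, so it also illustrates the language dependency noted above, since it only works on Python source. The two metrics (Jaccard overlap of token sets and overlap of AST node-type counts) are illustrative assumptions, not the measures used by the disclosed system.

```python
import ast
import io
import tokenize
from collections import Counter

# Token types that carry layout or commentary rather than code content.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
         tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def token_similarity(a, b):
    """Jaccard overlap of the distinct lexical tokens in two Python sources."""
    def tokens(src):
        stream = tokenize.generate_tokens(io.StringIO(src).readline)
        return {t.string for t in stream if t.type not in _SKIP}
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 1.0


def structure_similarity(a, b):
    """Overlap of AST node-type counts, which ignores identifier names."""
    def shape(src):
        return Counter(type(n).__name__ for n in ast.walk(ast.parse(src)))
    sa, sb = shape(a), shape(b)
    union = sum((sa | sb).values())
    return sum((sa & sb).values()) / union if union else 1.0


# Same logic, every identifier renamed (hypothetical snippets).
generated = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
training = "def sum_all(values):\n    result = 0\n    for v in values:\n        result += v\n    return result\n"

token_score = token_similarity(generated, training)
structure_score = structure_similarity(generated, training)
```

Renaming every identifier lowers the token score while the structural score stays at its maximum, which is one reason a single technique may miss a match that another detects.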

In the example of code generation tasks, other similarity analysis techniques, such as embedding-based techniques and version control-based techniques, may be subject to different considerations. For example, one area of consideration may be vocabulary mismatch. Embedding-based approaches may rely on a pre-trained language model that has been trained on a large corpus of code. If the code uses domain-specific terminology that is not well-represented in the training data, the embedding-based approach may not be able to capture the nuances of the code, leading to inaccurate similarity scores. Another area of consideration may be the lack of structural information. While embedding-based approaches can capture semantic information about the code, they may not be able to capture structural information, such as the order and relationship of tokens in the code. This can lead to false positives, where code that has a similar meaning but a different structure is flagged as similar. Another area of consideration is model bias. The quality of the embedding-based approach is heavily dependent on the quality of the pre-trained model. If the model has biases towards certain programming languages, coding styles, or programming paradigms, the similarity scores may not be accurate. Another area of consideration is lack of transparency. While embedding-based approaches can provide useful similarity scores, it can be difficult to understand how the scores were generated and why certain code fragments were flagged as similar. This lack of transparency can make it difficult to troubleshoot false positives or assess the accuracy of the similarity scores.
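An embedding-based comparison reduces to measuring the angle between two vectors. The sketch below shows that comparison step only. The `embed` function is a loudly hypothetical stand-in (character n-gram counts) for a trained encoder, chosen so the example runs without any model. A real system would replace it with embeddings from a pre-trained model, with the limitations described above.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Hypothetical stand-in for a trained encoder: character n-gram counts."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def nearest(query, corpus):
    """Return (score, item_id) for the corpus entry closest to the query."""
    q = embed(query)
    return max((cosine(q, embed(text)), item_id)
               for item_id, text in corpus.items())


# Hypothetical training-item descriptions keyed by item id.
corpus = {
    "a": "return the sum of a list of numbers",
    "b": "open a network socket and send bytes",
}
score, item_id = nearest("return sum of the numbers in a list", corpus)
```

Because the result is a single number per pair, it offers little insight into why two items were judged similar, which matches the transparency concern raised above.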

In various embodiments, techniques for identifying metadata descriptive of a data item generated by a generative machine learning model can address the technical challenges of providing relevant, accurate metadata for text generated by a language model by using a combination of similarity techniques that accounts for the different strengths and weaknesses of individual similarity techniques. In this way, end users (e.g., computer systems or humans) can make informed decisions regarding the use of text generated by a language model when performing different tasks. It may be apparent that such techniques improve the performance of computer-related technologies that incorporate the use of language models to perform a variety of different tasks by providing relevant metadata for the output of language models. This relevant metadata may be determined according to the increased accuracy of combining similarity techniques discussed below. Such techniques may also provide similar performance improvements to systems that incorporate other generative machine learning models that generate non-text data items.

Consider the code generation scenario again. Techniques for identifying metadata descriptive of text generated by a large language model can offer several improvements to challenges of recognizing similar code and thus determining relevant metadata for code. For example, the techniques that follow can provide more accurate results than any single approach alone. Each approach has its own strengths and weaknesses, and by leveraging the strengths of each, the combined approach can provide a more complete picture of code similarity. False positives can be reduced, for instance. Combining similarity techniques can provide broader coverage across different programming languages, coding styles, and domains. This can improve the effectiveness of code similarity across a wider range of code bases. Obfuscation and plagiarism can make it difficult to accurately identify similar code using any single approach. The combined approach can help address these challenges by leveraging multiple sources of information and detecting patterns that may be hidden in any one source. Version control analysis can provide valuable context about the history of code changes, which can be used to improve code attribution. By combining version control analysis with code similarity analysis, the combined approach can provide a more complete understanding of the code and its origins.
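The version control analysis mentioned above compares a summary of the generated code with descriptions of committed changes. A minimal sketch of that comparison follows. It assumes the summary has already been produced elsewhere, and it uses `difflib.SequenceMatcher` from the Python standard library as a simple stand-in for whatever text comparison a real system would use. The commit identifiers and messages are hypothetical.

```python
from difflib import SequenceMatcher


def rank_commits(summary, commits):
    """Rank commit descriptions by textual similarity to a code summary.

    Returns (ratio, commit_id) pairs, best match first.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    target = normalize(summary)
    scored = [
        (SequenceMatcher(None, target, normalize(message)).ratio(), commit_id)
        for commit_id, message in commits.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical commit log for a repository in the training data.
commits = {
    "c1": "Add retry with exponential backoff to HTTP client",
    "c2": "Fix typo in README",
}
ranked = rank_commits("adds exponential backoff retry to the http client", commits)
```

The best-ranked commit can then point to the repository, author, and license recorded for that change, supplying context that code-level comparison alone does not.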

The accompanying figure is a logical block diagram illustrating identifying metadata descriptive of text generated by a generative machine learning model (e.g., an LLM), according to some embodiments. Metadata identification system may be implemented as a stand-alone system, service, or application that identifies relevant metadata information describing a given data item (e.g., text) generated by a generative machine learning model. Generative machine learning model (e.g., an LLM or other language model) may have been trained using training data set, which may include a number of data items (e.g., texts). Metadata descriptive of data items (e.g., texts) may be maintained and may include various information descriptive of the source, use limitations, requirements, or guidelines, or any other information that may inform how, where, and when a generated data item (e.g., text) based on a similar data item (e.g., text) can be used. Metadata may be static (e.g., fixed description information, such as source, authorship, or license information) or may be mutable (e.g., description information may change, such as the commentary explaining the relevance of a text, like a code comment, which could change over time). Training data set can include texts for a variety of use cases or scenarios in addition to or instead of code generation. For example, the training data set may be TV/movie scripts, books, or song lyrics. Accordingly, metadata could be used to describe the source (e.g., authors, creators, writers, etc.) of texts; for example, it may be required to identify similar training data so that authors, creators, or writers can be compensated.

Metadata identification system may implement multiple different similarity analyses, which may perform different types of similarity analysis (e.g., using the example techniques discussed below) using the generated data item and representations of data items (e.g., comparing directly with the texts or with embeddings, profiles, or other intermediate representations of texts). Similarity scoring may combine the various results of the similarity analyses using a weighting scheme that indicates how to combine the different individual similarity scores generated from the different similarity analyses into a final similarity score between the generated text and representations of data items, as discussed in detail below. As also discussed in detail below, similarity analysis may be informed by search parameters or other information that may be included in a metadata search request. In some embodiments, a requirement, request, or feature may be enforced for generated text such that metadata identification system does not find any metadata describing the generated text, implying that the generated text is unique.

The final combined scores determined at similarity scoring may then be provided to metadata lookup, which may use a similarity score threshold or other criteria to obtain the metadata for one (or more) data items similar to the generated data item and to provide it as metadata describing the generated data item, in some embodiments.
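The scoring and lookup steps above can be sketched as a weighted average followed by a threshold filter. This is an illustrative sketch under stated assumptions: per-technique scores are taken to lie in [0, 1], and the `Candidate` type, the technique names, the weights, the threshold value, and the license strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    item_id: str
    scores: Dict[str, float]    # per-technique similarity, assumed in [0, 1]
    metadata: Dict[str, str]    # e.g., source, author, license


def final_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the individual scores; unweighted techniques are ignored."""
    total = sum(weights.get(name, 0.0) for name in scores)
    if total == 0:
        return 0.0
    return sum(weights.get(name, 0.0) * s for name, s in scores.items()) / total


def lookup_metadata(candidates: List[Candidate], weights: Dict[str, float],
                    threshold: float = 0.8) -> List[Tuple[float, str, Dict[str, str]]]:
    """Return (score, item_id, metadata) for candidates at or above the threshold."""
    scored = [(final_score(c.scores, weights), c) for c in candidates]
    hits = [(s, c) for s, c in scored if s >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [(s, c.item_id, c.metadata) for s, c in hits]


weights = {"token": 0.5, "structure": 0.25, "semantic": 0.25}
candidates = [
    Candidate("a", {"token": 1.0, "structure": 0.8, "semantic": 0.8}, {"license": "MIT"}),
    Candidate("b", {"token": 0.2, "structure": 0.4, "semantic": 0.4}, {"license": "GPL-3.0"}),
]
results = lookup_metadata(candidates, weights)
```

With these example weights, only candidate `a` clears the threshold, so only its metadata would be surfaced. Changing the weights per request is one way a weighting scheme could adapt to category-specific concerns.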

Please note that previous descriptions are not intended to be limiting, but are merely provided as an example of a metadata identification system, generative machine learning model, training data set, or metadata. Various other embodiments may also implement these techniques, as discussed in detail below.

The specification next includes a general description of a provider network, which may implement a code development service that may implement identifying metadata descriptive of code generated by a large language model. Then various examples of a code development service are discussed, including different components/modules, or arrangements of components/modules, that may be employed as part of implementing a code development service in the provider network. A number of different methods and techniques for identifying metadata descriptive of text generated by a large language model are then discussed, some of which are illustrated in accompanying flowcharts. Finally, a description of an example computing system upon which the various components, modules, systems, devices, and/or nodes may be implemented is provided. Various examples are provided throughout the specification.

A further accompanying figure is a logical block diagram illustrating a provider network that implements different services including a code development service that may implement the disclosed techniques for identifying metadata descriptive of code generated by a large language model, according to some embodiments. A provider network (which may, in some implementations, be referred to as a “cloud provider network” or simply as a “cloud”) refers to a pool of network-accessible computing resources (such as compute, storage, and networking resources, applications, and services), which may be virtualized or bare-metal. The provider network can provide convenient, on-demand network access to a shared pool of configurable computing resources that can be programmatically provisioned and released in response to customer commands. These resources can be dynamically provisioned and reconfigured to adjust to variable load.

The provider network can be formed as a number of regions, where a region is a separate geographical area in which the cloud provider clusters data centers. Each region can include two or more availability zones connected to one another via a private high speed network, for example a fiber communication connection. An availability zone (also known as an availability domain, or simply a “zone”) refers to an isolated failure domain including one or more data center facilities with separate power, separate networking, and separate cooling from those in another availability zone. Preferably, availability zones within a region are positioned far enough away from one another that the same natural disaster should not take more than one availability zone offline at the same time. Customers can connect to availability zones of the provider network via a publicly accessible network (e.g., the Internet, a cellular communication network). Regions are connected to a global network which includes private networking infrastructure (e.g., fiber connections controlled by the cloud provider) connecting each region to at least one other region. The provider network may deliver content from points of presence outside of, but networked with, these regions by way of edge locations and regional edge cache servers. This compartmentalization and geographic distribution of computing hardware enables the provider network to provide low-latency resource access to customers on a global scale with a high degree of fault tolerance and stability.

As noted above, provider network may implement various computing resources or services, such as code development service and other service(s), which may be any other type of network-based services, including various other types of storage (e.g., database service or an object storage service), compute, data processing, machine learning, analysis, communication, event handling, visualization, and security services (not illustrated).

In various embodiments, the components illustrated in the figure may be implemented directly within computer hardware, as instructions directly or indirectly executable by computer hardware (e.g., a microprocessor or computer system), or using a combination of these techniques. For example, the components may be implemented by a system that includes a number of computing nodes (or simply, nodes), each of which may be similar to the computer system embodiment described below. In various embodiments, the functionality of a given system or service component (e.g., a component of code development service) may be implemented by a particular node or may be distributed across several nodes. In some embodiments, a given node may implement the functionality of more than one service system component (e.g., more than one data store component).

Code development service may be implemented by provider network, in some embodiments. Code development service may implement various features for writing code for different systems, applications, or devices, providing features to recommend, identify, review, build, and deploy code. For example, code development service may implement a development environment. The development environment may offer various code entry tools (e.g., text, diagram/graphics based application development) to specify, invoke, or otherwise write (or cause to be written) code for different hardware or software applications.

Code development service may implement code suggestion, which may implement various computing resources to host and implement LLM code generation in a scalable fashion to deliver on-demand code suggestions across large numbers of clients using high-powered machine learning models, such as LLMs or other generative machine learning language models, for high-quality code suggestion results. For example, code suggestion may implement workload balancing and request management features to handle and return code suggestions in a timely manner to provide real-time code suggestions with little or no apparent latency to code generation handling (within or without provider network).

Similarly, in various embodiments, code development service may implement code translation. Code translation may implement various computing resources to host and implement LLM code generation in a scalable fashion to deliver on-demand code translations from one programming language to another programming language across large numbers of clients using high-powered machine learning models, such as LLMs or other generative machine learning language models, for high-quality code translation results. For example, code translation may implement workload balancing and request management features to handle and return code translations in a timely manner to provide real-time code translations with little or no apparent latency to code generation handling (within or without provider network).

Similarly, in various embodiments, code development service may implement code refactoring. Code refactoring may implement various computing resources to host and implement LLM code generation in a scalable fashion to deliver on-demand code refactoring replacements to restructure code (e.g., to rewrite code from one programming framework to another) across large numbers of clients using high-powered machine learning models, such as LLMs or other generative machine learning language models, for high-quality code refactoring results. For example, code refactoring may implement workload balancing and request management features to handle and return code refactorings in a timely manner to provide real-time code refactorings with little or no apparent latency to code generation handling (within or without provider network).

In various embodiments, the code generation tasks discussed above (code suggestion, code translation, and code refactoring) may generate code based on text input in a development environment (e.g., utilizing a plug-in or other connection which may provide real-time analysis and suggestion of code as the code is entered into the development environment) or some other interface (e.g., via client(s) utilizing a natural language interface to “translate” or “request” code generation tasks). These tasks may use separate models, such as separate LLMs, or may use one common LLM to generate code (e.g., with different prompts to perform the different tasks). The LLMs may be trained on a large corpus of code, which may include code repositories or snippets from a variety of sources. Depending on the source or owner, the code may be subject to certain licenses or carry other relevant information which may guide the code's usage or reproduction. Since an LLM can sometimes reproduce verbatim, or close to verbatim, matches to the training data, metadata describing the original source may also be provided along with the generated code. Generated code metadata identification, as discussed in detail below, may provide relevant metadata, which may be stored in one or more data stores acting as metadata repositories, identified as descriptive of source(s) of the generated code for one of the various code generation tasks.

Code development service may implement (or have access to) code repositories. Code repositories may store various code files, objects, or other code that may be interacted with by various other features of code development service (e.g., development environment to write, build, compile, and/or test code). Code repositories may implement various version and/or other access controls to track and/or maintain consistent versions of collections of code for various development projects, in some embodiments. In some embodiments, code repositories may be stored or implemented external to provider network (e.g., hosted in private networks or other locations).

Code development service may implement an interface to access and/or utilize various features of code development service. Such an interface may include various types of interfaces, such as a command line interface, graphical user interface, and/or programmatic interface (e.g., Application Programming Interfaces (APIs)) in order to perform requested operations, including operations of development environment. An API refers to an interface and/or communication protocol between a client and a server, such that if the client makes a request in a predefined format, the client should receive a response in a specific format or initiate a defined action. In the cloud provider network context, APIs provide a gateway for customers to access cloud infrastructure by allowing customers to obtain data from or cause actions within the cloud provider network, enabling the development of applications that interact with resources and services hosted in the cloud provider network. APIs can also enable different services of the cloud provider network to exchange data with one another.

Generally speaking, clients may encompass any type of client configurable to submit network-based requests to provider network via network, including requests for services (e.g., a request for code search or suggestion, etc.). For example, a given client may include a suitable version of a web browser, or may include a plug-in module or other type of code module that may execute as an extension to or within an execution environment provided by a web browser. Alternatively, a client may encompass an application (or user interface thereof), a media application, an office application or any other application that may make use of resources in provider network to implement various applications. In some embodiments, such an application may include sufficient protocol support (e.g., for a suitable version of Hypertext Transfer Protocol (HTTP)) for generating and processing network-based services requests without necessarily implementing full browser support for all types of network-based data. That is, client may be an application that may interact directly with provider network. In some embodiments, client may generate network-based services requests according to a Representational State Transfer (REST)-style network-based services architecture, a document- or message-based network-based services architecture, or another suitable network-based services architecture.

In some embodiments, clients may provide access to provider network to other applications in a manner that is transparent to those applications. For example, client may integrate with code development service. However, the operating system or file system may present a different storage interface to applications, such as a conventional file system hierarchy of files, directories and/or folders. In such an embodiment, applications may not need to be modified to make use of the storage system service model. Instead, the details of interfacing to the data storage service may be coordinated by client and the operating system or file system on behalf of applications executing within the operating system environment.

Clients may convey network-based services requests to and receive responses from provider network via network. In various embodiments, network may encompass any suitable combination of networking hardware and protocols necessary to establish network-based communications between clients and provider network. For example, network may generally encompass the various telecommunications networks and service providers that collectively implement the Internet. Network may also include private networks such as local area networks (LANs) or wide area networks (WANs) as well as public or private wireless networks. For example, both a given client and provider network may be respectively provisioned within enterprises having their own internal networks. In such an embodiment, network may include the hardware (e.g., modems, routers, switches, load balancers, proxy servers, etc.) and software (e.g., protocol stacks, accounting software, firewall/security software, etc.) necessary to establish a networking link between given client and the Internet as well as between the Internet and provider network. It is noted that in some embodiments, clients may communicate with provider network using a private network rather than the public Internet.

In some embodiments, provider network may include the hardware (e.g., modems, routers, switches, load balancers, proxy servers, etc.) and software (e.g., protocol stacks, accounting software, firewall/security software, etc.) necessary to establish networking links between different components of provider network, such as virtualization hosts and control plane components, as well as external networks (e.g., the Internet). In some embodiments, provider network may employ an Internet Protocol (IP) tunneling technology to provide an overlay network via which encapsulated packets may be passed through the internal network using tunnels. The IP tunneling technology may provide a mapping and encapsulating system for creating an overlay network and may provide a separate namespace for the overlay layer and the internal network layer. Packets in the overlay layer may be checked against a mapping directory to determine what their tunnel target should be. The IP tunneling technology provides a virtual network topology; the interfaces that are presented to clients may be attached to the overlay network so that when a client provides an IP address that they want to send packets to, the IP address is run in virtual space by communicating with a mapping service that knows where the IP overlay addresses are.
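The mapping-directory lookup described above can be sketched roughly as follows. This is only an illustration: the directory contents, the addresses, and the names `MAPPING_DIRECTORY`, `tunnel_target`, and `encapsulate` are assumptions made for the example, and a real overlay network would use an actual encapsulation protocol rather than a dictionary.

```python
import ipaddress

# Hypothetical mapping directory: overlay subnet -> address of the internal
# host that terminates the tunnel. All values are made up for illustration.
MAPPING_DIRECTORY = {
    ipaddress.ip_network("10.0.1.0/24"): "192.0.2.10",
    ipaddress.ip_network("10.0.2.0/24"): "192.0.2.20",
}


def tunnel_target(overlay_ip):
    """Return the internal-network host for an overlay address, or None."""
    address = ipaddress.ip_address(overlay_ip)
    for network, host in MAPPING_DIRECTORY.items():
        if address in network:
            return host
    return None


def encapsulate(payload, overlay_dst):
    """Wrap a payload with the outer (internal-network) destination."""
    target = tunnel_target(overlay_dst)
    if target is None:
        raise LookupError(f"no tunnel target for {overlay_dst}")
    return {"outer_dst": target, "inner_dst": overlay_dst, "payload": payload}


packet = encapsulate(b"hello", "10.0.2.7")
```

The overlay address stays in the inner header, so the client only ever sees the virtual topology while the outer header carries the packet across the internal network.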

A logical block diagram illustrates interactions to request metadata for generated code, according to some embodiments. Various code generation tasks may submit requests to generated code metadata identification in order to provide relevant metadata along with generated code. Requests may be received through an API or other type of interface implemented for generated code metadata identification. A request for metadata search may be received. The request may include the generated text (e.g., a snippet of code, a line, function, paragraph, or a file). In some embodiments, further similarity parameters may be specified. Such similarity parameters may include information that may guide the search for similar sources to provide relevant metadata, including categories or descriptions of code functionality, programming languages, frameworks, and software libraries, among other information. In some embodiments, relevancy parameters applicable to the metadata to be returned may be included, such as types of metadata to include (e.g., license information, author information, etc.).
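As a rough illustration of what such a request might carry, the sketch below models it as a small data class. The class name, field names, and parameter keys are all hypothetical; the description does not define a concrete request format.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataSearchRequest:
    """Hypothetical request shape; every field name here is an assumption."""
    generated_text: str                                    # snippet, line, function, or file
    similarity_params: dict = field(default_factory=dict)  # guides the similarity search
    relevancy_params: dict = field(default_factory=dict)   # filters the returned metadata


def validate(request):
    """Reject requests that carry no generated text to search for."""
    if not request.generated_text.strip():
        raise ValueError("generated_text must not be empty")
    return request


request = validate(MetadataSearchRequest(
    generated_text="def add(a, b):\n    return a + b",
    similarity_params={"language": "python", "task": "code_suggestion"},
    relevancy_params={"metadata_types": ["license", "author"]},
))
```

Keeping similarity and relevancy parameters in separate fields mirrors the distinction drawn above: the former shape how sources are matched, the latter shape which metadata is returned.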

Metadata results may be provided that include identified metadata, in some embodiments. These results may be ranked or ordered according to similarity scores and/or relevancy (e.g., as determined according to relevancy parameters). An updated metadata search may be supported to refine results, which may include different or additional similarity parameters (or, if none were originally included, initial similarity parameters). Similarly, the update may include different or additional relevancy parameters (or, if none were originally included, initial relevancy parameters). These may be applied to return further metadata results which are refined according to the parameters. In some embodiments, requests that select metadata may be supported; such a request may confirm one out of multiple presented metadata, which may be useful for training or adjusting weighting schemes, as discussed below.

A logical block diagram illustrates generated code metadata identification, according to some embodiments. In some embodiments, weighting scheme selection may be implemented. In some embodiments, a common weighting scheme for similarity analysis may be implemented (e.g., a global weighting scheme that is trained or otherwise determined). In other embodiments, multiple similarity analysis weighting schemes may be available, one of which may be dynamically selected in order to provide an optimal weighting scheme for a particular metadata search. For example, similarity parameters may be provided which can be used to map to a particular similarity analysis weighting scheme according to the respective values (e.g., a refactoring analysis may have a different similarity analysis weighting scheme than code suggestion). Similarity analysis weighting schemes may be statically defined, in some embodiments, or may change over time (e.g., in response to feedback or the addition of new similarity analyses).
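One simple way to picture dynamic selection is a lookup from a similarity parameter to a scheme, with a common scheme as the fallback. The sketch below assumes a hypothetical `task` parameter, and the weight values are invented for illustration; none of them come from the description.

```python
# Fallback (common/global) scheme; equal weights are an arbitrary choice.
DEFAULT_SCHEME = {"token": 1.0, "structural": 1.0, "semantic": 1.0}

# Hypothetical per-task schemes. A scheme may omit an analysis entirely,
# which means that analysis is not performed for the search.
SCHEMES_BY_TASK = {
    "code_refactoring": {"token": 0.2, "structural": 0.5, "semantic": 0.3},
    "code_suggestion": {"token": 0.6, "semantic": 0.4},
}


def select_weighting_scheme(similarity_params=None):
    """Map similarity parameters to a weighting scheme, else use the default."""
    task = (similarity_params or {}).get("task")
    # Return a copy so callers cannot mutate the shared scheme definitions.
    return dict(SCHEMES_BY_TASK.get(task, DEFAULT_SCHEME))


scheme = select_weighting_scheme({"task": "code_refactoring"})
```

A static table like this corresponds to the statically defined case; schemes that change over time would replace the table entries as feedback arrives.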

Once a weighting scheme is selected, similarity scoring may use the similarity analysis weighting scheme to conduct and combine the results of multiple similarity analyses. For example, the weighting scheme may identify two specific similarity analyses to perform and how much to weight each one. Different types of similarity analyses may be performed, such as token similarity type analysis, structured similarity type analysis, and semantic similarity type analysis. As noted above, each may have different strengths, which the selected weighting scheme may consider as part of the indicated weights for combined similarity scoring.

An example of a weighted average indexing scheme is illustrated. If three types of similarity analysis are being used (which may not be the case in all weighting schemes), then each may generate a raw similarity score, which may then be multiplied by its respective weight. These products may be summed to indicate a total similarity. Likewise, the weights may be summed to indicate a total weight. The weighted average similarity may then be determined by dividing the total similarity by the total weight. As an example of the final similarity score using the above scheme, the similarities for two different texts are depicted. Although token similarity analysis, structured similarity analysis, and semantic similarity analysis are discussed, these types of similarity techniques are exemplary. Some similarity analysis techniques may include aspects of multiple types in their implementation (e.g., a similarity technique may include both token-based aspects and graph-based aspects of structured similarity).
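The weighted average described above can be written out in a few lines. The function name and the sample scores and weights below are assumptions chosen for the example; only the arithmetic (sum of weight times score, divided by sum of weights) follows the description.

```python
def weighted_average_similarity(raw_scores, weights):
    """Combine raw similarity scores into one final score.

    raw_scores and weights are dicts keyed by analysis name; only the
    analyses named in `weights` contribute to the result.
    """
    total_similarity = sum(weights[name] * raw_scores[name] for name in weights)
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return total_similarity / total_weight


# Illustrative values only.
weights = {"token": 0.5, "structural": 0.3, "semantic": 0.2}
raw_scores = {"token": 0.9, "structural": 0.6, "semantic": 0.8}

# (0.5*0.9 + 0.3*0.6 + 0.2*0.8) / (0.5 + 0.3 + 0.2), roughly 0.79
final_score = weighted_average_similarity(raw_scores, weights)
```

Dividing by the total weight means the weights do not need to sum to one, so a scheme can raise or lower one weight without renormalizing the others.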

Metadata lookup may utilize a similarity threshold selection, either to apply a statically defined similarity threshold or one determined dynamically for a particular metadata search (e.g., according to similarity parameters which may indicate the relative strength or weakness of the search to perform). Code metadata index may include an index to various metadata for different sources that may be present in or used to train an LLM used to generate the code. Code metadata index can be used to retrieve the metadata for those sources identified as similar according to similarity threshold selection, and that metadata may be returned as part of a metadata result.
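A minimal sketch of threshold selection followed by an index lookup might look like the following. The `strictness` parameter, the threshold values, and the in-memory dictionary standing in for the code metadata index are all assumptions for illustration.

```python
# Hypothetical mapping from a "strictness" similarity parameter to a threshold.
THRESHOLD_BY_STRICTNESS = {"strict": 0.9, "normal": 0.75, "loose": 0.5}


def select_threshold(similarity_params=None, default=0.75):
    """Pick a threshold dynamically from parameters, else use the static default."""
    strictness = (similarity_params or {}).get("strictness")
    return THRESHOLD_BY_STRICTNESS.get(strictness, default)


def lookup_metadata(final_scores, metadata_index, threshold):
    """Return metadata for indexed sources scoring at or above the threshold."""
    matches = [
        (source_id, score)
        for source_id, score in final_scores.items()
        if score >= threshold and source_id in metadata_index
    ]
    matches.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return [
        {"source_id": source_id, "score": score, **metadata_index[source_id]}
        for source_id, score in matches
    ]


# Made-up index entries and scores.
metadata_index = {
    "repo-a/file.py": {"license": "MIT"},
    "repo-b/util.py": {"license": "Apache-2.0"},
}
final_scores = {"repo-a/file.py": 0.92, "repo-b/util.py": 0.6}
results = lookup_metadata(
    final_scores, metadata_index, select_threshold({"strictness": "strict"})
)
```

With the strict threshold only the first source is returned; a looser setting would return both, ordered by score.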

While weighting schemes for combining different similarity analyses can be determined using linear programming techniques to solve for the weighting scheme as an optimization problem, machine learning techniques may also be applied, in some embodiments. Logical block diagrams illustrate different scenarios for reinforcement learning to train weighting schemes, in some embodiments. In one scenario, weighting scheme training, which may perform reinforcement learning techniques, may be used to train or adapt the weighting scheme applied by generated code metadata identification to a specific account (or project) of code development service. The one or more clients that receive generated code may provide feedback for identified metadata (e.g., by selecting the most relevant metadata or indicating that no relevant metadata was provided). Weighting scheme training may utilize this information to generate an updated weighting scheme, which may ultimately be deployed to generated code metadata identification.
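To make the feedback loop concrete, the sketch below applies a simple heuristic update: a technique's weight grows when it scored the user-selected metadata higher than the rejected candidates, and shrinks otherwise. This heuristic is a stand-in chosen for brevity, not the reinforcement learning technique an embodiment would necessarily use, and the function name, learning rate, and scores are assumptions.

```python
def update_weights(weights, selected_scores, rejected_scores, learning_rate=0.1):
    """Nudge weights toward techniques that favored the selected metadata.

    selected_scores: per-technique raw scores for the metadata the user chose.
    rejected_scores: list of per-technique score dicts for the other candidates.
    """
    updated = {}
    for name, weight in weights.items():
        if rejected_scores:
            baseline = sum(r[name] for r in rejected_scores) / len(rejected_scores)
        else:
            baseline = 0.0
        reward = selected_scores[name] - baseline  # positive if technique agreed
        updated[name] = max(0.0, weight + learning_rate * reward)
    total = sum(updated.values())
    if total == 0:
        return dict(weights)  # degenerate case: keep the previous scheme
    return {name: value / total for name, value in updated.items()}


# Illustrative feedback: the semantic analysis ranked the chosen source
# highly, while the token analysis preferred a rejected candidate.
weights = {"token": 0.5, "semantic": 0.5}
new_weights = update_weights(
    weights,
    selected_scores={"token": 0.2, "semantic": 0.9},
    rejected_scores=[{"token": 0.8, "semantic": 0.3}],
)
```

Because the result is renormalized, repeated feedback shifts relative influence between techniques without letting the overall scale drift.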

Another scenario is global reinforcement learning. For example, for a weighting scheme that is applied to identify generated code metadata for multiple accounts with multiple clients, separate feedback for identified metadata may be provided from each account and used to generate the global updated weighting scheme, which can be deployed at generated code metadata identification.

Example user interfaces for providing identified metadata for generated code are illustrated, according to some embodiments. An interface (e.g., an integrated development environment or various other types of interfaces) may implement a code editor element which displays code that may be generated for various code generation tasks as discussed above. Generated code metadata may be displayed for the generated code (e.g., when hovering over the code with a cursor or when accepting the generated code for inclusion). As part of this metadata, an element may be selectable to refine the search results (e.g., by providing additional parameters, as discussed above). Multiple generated code metadata options may be displayed, in some embodiments. These may be arranged or ordered by relevancy. In some embodiments, the provided metadata may be sorted or filtered by relevancy parameters. In some embodiments, elements to select the relevant metadata may be provided. Such selections may be used to provide feedback, as discussed above.

The examples of identifying metadata descriptive of code generated by a large language model discussed above have been given in regard to one example of a code development service. Various other types of text generation systems may implement these techniques, which seek to provide relevant metadata for text generated by a large language model. A high-level flowchart illustrates techniques and methods to implement identifying metadata descriptive of text generated by a large language model, according to some embodiments. These techniques, as well as the techniques discussed below, may be implemented using various components of a provider network as described above or other types of systems implementing an LLM.

A data item generated by a generative machine learning model, trained using a machine learning technique applied to a training data set including multiple data items, may be obtained, according to some embodiments. Generative machine learning models may include LLMs as well as any other machine learning model trained to generate text in response to some input. For instance, while not strictly “large” in terms of model size (e.g., number of model weights) or training data set size, there are many types of machine learning models (e.g., deep neural networks) that can generate text. Generative machine learning models that model language may model languages other than “natural language” (e.g., human language), such as programming languages or other systems of symbols/representation for conveying information. Generative machine learning models may model non-textual information (e.g., image, video or audio) and thus may generate non-text data items (e.g., images, video, and/or audio). Techniques similar to those discussed above, as well as below, could be applicable to determine relevant metadata for non-text data items as well. As discussed above, one example of generated text may be code generated for various code development tasks. Other text generation tasks include text summaries, generating descriptions, stories, reports, or other text formats using text that was used to train the generative machine learning model.

A weighting scheme may be determined for performing similarity analysis with respect to the data item generated by the generative machine learning model, according to some embodiments. For example, the weighting scheme may be a statically assigned weighting scheme applicable across multiple different data item generation requests. In some embodiments, the weighting scheme may be dynamically determined (e.g., according to the techniques discussed below).

Different similarity analysis techniques may be performed that compare the data item with representation(s) of the texts of the training data set, according to some embodiments. As noted above, different types of similarity techniques (which may be applicable to more than code text similarity) may be used. Semantic similarity techniques may generate an embedding (e.g., a feature vector) using a neural network trained to encode the features of the generated data item and then perform a comparison of the embedding with other embeddings (as representations) generated for other data items. Cosine similarity or other vector comparisons in the latent space of the embeddings can be used to determine similarity. Token similarity techniques may parse generated text data items into a set of tokens, the order of which may be encoded and compared with the encoded set of tokens for other text data items. Structural techniques may take graphs of text, images, or other types of data items, such as abstract syntax trees or other graph representations, and perform structural similarity comparisons. Version-based similarity may examine the submission of data items (e.g., code changes) along with a rationale or description of the submission, comparing the rationale/description with a summary of the generated data item produced by a machine learning model.
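Three of these technique families can be sketched with the Python standard library alone. The specific choices below are assumptions made for illustration: token order is encoded with n-gram shingles compared by Jaccard overlap, structure is compared by counting abstract syntax tree node types, and the semantic comparison takes precomputed embedding vectors because the embedding model itself is out of scope here.

```python
import ast
import math
import re
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def token_similarity(text_a, text_b, n=3):
    """Jaccard overlap of token n-grams, which preserves local token order."""
    def shingles(text):
        tokens = re.findall(r"\w+|[^\w\s]", text)
        return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


def structural_similarity(code_a, code_b):
    """Overlap of AST node-type counts for two Python snippets."""
    def node_counts(code):
        return Counter(type(node).__name__ for node in ast.walk(ast.parse(code)))
    a, b = node_counts(code_a), node_counts(code_b)
    total = sum((a | b).values())  # element-wise max of counts
    return sum((a & b).values()) / total if total else 0.0  # element-wise min


# Same shape, different identifiers: structure matches, tokens do not.
raw_scores = {
    "token": token_similarity("x = a + b", "y = c + d"),
    "structural": structural_similarity("x = a + b", "y = c + d"),
}
```

For the renamed-variable pair above the structural score is 1.0 while the token score is 0.0, which is the kind of difference in strengths that a weighting scheme is meant to balance.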

Some combination of such techniques may be performed, each generating its own respective similarity score. Respective final similarity scores may be generated between the data item and the representation(s) of the data items according to the weighting scheme, which indicates respective weights for combining individual similarity scores generated by the different similarity analysis techniques into the respective final similarity scores, according to some embodiments. As discussed above, different weight values may increase or decrease the influence of individual similarity techniques in order to account for their respective strengths.

One representation of the representation(s) may be selected according to the respective final similarity scores between the data item and the representation(s) of the data items, according to some embodiments. For example, a minimum score threshold may be implemented, or the N highest-scoring representations may be selected. The score threshold may be determined in a manner similar to the weighting scheme (e.g., dynamically, through training, or applied statically).
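The two selection rules mentioned above, a minimum score and a highest-N cut, can be combined in one small helper. The function and parameter names and the sample scores are hypothetical.

```python
def select_representations(final_scores, min_score=None, top_n=None):
    """Rank representations by final score, then apply optional cutoffs.

    final_scores maps a representation identifier to its final similarity
    score. Either cutoff may be omitted; with both omitted, every
    representation is returned in ranked order.
    """
    candidates = list(final_scores.items())
    if min_score is not None:
        candidates = [(key, score) for key, score in candidates if score >= min_score]
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n] if top_n is not None else ranked


# Made-up identifiers and scores.
final_scores = {"source-1": 0.91, "source-2": 0.42, "source-3": 0.78}
selected = select_representations(final_scores, min_score=0.5, top_n=1)
```

Applying the threshold before the top-N cut means a search can return nothing at all when no representation is similar enough, rather than always surfacing the best of a weak set.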

Patent Metadata

Filing Date

Unknown

Publication Date

May 5, 2026

Inventors

Unknown

Citation & reuse

Cite as: Patentable. “Identifying provenance information of a data item generated by a generative machine learning model” (US-12619819-B2). https://patentable.app/patents/US-12619819-B2
