Systems, methods, and apparatus for self-evolving decoding at inference. In an aspect, operations include processing, by a Large Language Model (LLM) of N layers, an input by an inference operation of the LLM; obtaining, from the LLM, logits of an evolution layer of the LLM, the evolution layer being subsequent to a first layer of the LLM; for a plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient; based on the approximated gradient and the logits of the evolution layer, generating adjusted logits for the evolution layer; and processing the adjusted logits for the evolution layer to generate an output for the LLM.
Legal claims defining the scope of protection, as filed with the USPTO.
. A computer-implemented method, comprising:
. The computer-implemented method of, wherein obtaining, from the LLM, logits of an evolution layer of the LLM, comprises obtaining, from the LLM, logits of a final layer of the LLM, wherein the final layer of the LLM is the evolution layer.
. The computer-implemented method of, wherein for the plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient comprises for each layer of the plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient.
. The computer-implemented method of, wherein generating adjusted logits for the evolution layer comprises:
. The computer-implemented method of, further comprising:
. The computer-implemented method of, wherein processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient comprises processing a proper subset of the logits of the layer, the proper subset of the logits of the layer corresponding to a set of top k logits of a layer, where k is a value that is fewer than a total number of logits in the layer.
. The computer-implemented method of, wherein processing a proper subset of the logits of the layer comprises processing the proper subset of the logits of the layer that correspond to the top k logits of the evolution layer.
. The computer-implemented method of, wherein for each of a plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient comprises for each layer from an initial layer to the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate the approximated gradient.
. The computer-implemented method of, wherein obtaining, from the LLM, logits of an evolution layer of the LLM, comprises obtaining, from the LLM, logits of a layer of the LLM that is prior to the final layer of the LLM.
. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
. The system of, wherein obtaining, from the LLM, logits of an evolution layer of the LLM, comprises obtaining, from the LLM, logits of a final layer of the LLM, wherein the final layer of the LLM is the evolution layer.
. The system of, wherein for the plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient comprises for each layer of the plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient.
. The system of, wherein generating adjusted logits for the evolution layer comprises:
. The system of, the operations further comprising:
. The system of, wherein processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient comprises processing a proper subset of the logits of the layer, the proper subset of the logits of the layer corresponding to a set of top k logits of a layer, where k is a value that is fewer than a total number of logits in the layer.
. The system of, wherein processing a proper subset of the logits of the layer comprises processing the proper subset of the logits of the layer that correspond to the top k logits of the evolution layer.
. The system of, wherein for each of a plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient comprises for each layer from an initial layer to the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate the approximated gradient.
. The system of, wherein obtaining, from the LLM, logits of an evolution layer of the LLM, comprises obtaining, from the LLM, logits of a layer of the LLM that is prior to the final layer of the LLM.
. A computer storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
. The computer storage medium of, wherein obtaining, from the LLM, logits of an evolution layer of the LLM, comprises obtaining, from the LLM, logits of a final layer of the LLM, wherein the final layer of the LLM is the evolution layer.
Complete technical specification and implementation details from the patent document.
This application claims the benefit of priority under 35 U.S.C. § 119(e) to U.S. Provisional Application No. 63/650,866, filed on May 22, 2024, the contents of which are hereby incorporated by reference.
A significant challenge associated with Large Language Models (LLMs) is their tendency to hallucinate or distort truth, resulting in output that is not factual. This failure undermines the reliability of, and trust in, LLMs. To improve LLM factuality, several methods have been proposed, such as using retrieval techniques and external knowledge bases, fine-tuning the model for better alignment, or employing ensemble learning with multiple models. Despite these efforts, there are still large gaps in understanding and improving LLM truthfulness.
In many cases LLMs have learned the factual content (based on extensive pre-training datasets), but they still fail to produce the correct answer when a user queries the model.
This specification describes systems and methods relating to LLMs, and in particular, to reducing hallucination or other factual errors in LLMs.
To enhance the reliability and truthfulness of large language models (LLMs), a Self-Evolution Decoding (SED) strategy is used. SED may also be referred to as Self Logits Evolution Decoding (SLED). SED does not rely on external knowledge bases or require additional fine-tuning, and operates during inference. SED enhances the quality of LLM outputs by optimizing an implicit objective function using the inherent self-evolution of the hidden states of LLMs. This approach allows for an ongoing refinement of outputs during inference, akin to further training, thus providing improved accuracy and interpretability over conventional decoding methods. Additionally, because the operations occur during inference, there is no need for model retraining. Moreover, the operations as implemented are optimized for efficiency and thus do not significantly impact the inference time of the underlying LLM.
In an implementation, a computer-implemented method comprises processing, by a Large Language Model (LLM) of N layers, an input by an inference operation of the LLM; obtaining, from the LLM, logits of an evolution layer of the LLM, the evolution layer being subsequent to a first layer of the LLM; for a plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient; based on the approximated gradient and the logits of the evolution layer, generating adjusted logits for the evolution layer; and processing the adjusted logits for the evolution layer to generate an output for the LLM. Other embodiments of this aspect include corresponding methods, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
In an implementation in combination with the above, obtaining, from the LLM, logits of an evolution layer of the LLM, comprises obtaining, from the LLM, logits of a final layer of the LLM, wherein the final layer of the LLM is the evolution layer.
In an implementation of any of the above, for the plurality of layers that occur before the evolution layer, the operation of processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient comprises for each layer of the plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient.
In an implementation of any of the above, the operations include for each layer of a plurality of layers that occur before the evolution layer, processing a distribution of the layer and a distribution of the final layer to determine a distance from the approximated gradient; and for each layer of the plurality of layers that occur before the evolution layer, determining, for the layer, weights that are indicative of how closely a difference between the logits of the layer and the logits of the final layer align with the approximated gradient.
In an implementation of any of the above, the operations include determining, based on the weights, a weighted average for each layer; and adjusting the logits of the evolution layer based, in part, on the weighted averages determined for the layers.
In an implementation of any of the above, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient comprises processing a proper subset of the logits of the layer, the proper subset of the logits of the layer corresponding to a set of top k logits of a layer, where k is a value that is fewer than a total number of logits in the layer.
In an implementation of any of the above, processing a proper subset of the logits of the layer comprises processing the proper subset of the logits of the layer that correspond to the top k logits of the evolution layer.
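The top-k restriction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names `top_k_indices` and `restrict`, the toy logit values, and the choice of k = 3 are all assumptions made for the example. Plain Python lists are used to keep the sketch free of dependencies.

```python
def top_k_indices(logits, k):
    """Return the indices of the k largest logits, largest first."""
    if not 0 < k < len(logits):
        raise ValueError("k must be positive and smaller than the vocabulary size")
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def restrict(logits, indices):
    """Keep only the logits at the given vocabulary indices."""
    return [logits[i] for i in indices]


# Toy 5-token vocabulary; the values are illustrative assumptions.
evolution_logits = [2.0, 0.5, 3.1, -1.0, 1.2]
early_logits = [1.5, 0.7, 0.9, -0.5, 2.2]

# The subset is chosen from the evolution layer and reused for the early layer.
subset = top_k_indices(evolution_logits, k=3)
evolution_subset = restrict(evolution_logits, subset)
early_subset = restrict(early_logits, subset)
```

Selecting the indices once from the evolution layer and reusing them for every earlier layer keeps the compared vectors aligned token by token, and limits the later per-layer work to k entries rather than the full vocabulary.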
In an implementation of any of the above, for each of a plurality of layers that occur before the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate an approximated gradient comprises for each layer from an initial layer to the evolution layer, processing the logits of the layer with the logits of the evolution layer to generate the approximated gradient.
In an implementation of any of the above, obtaining, from the LLM, logits of an evolution layer of the LLM, comprises obtaining, from the LLM, logits of a final layer of the LLM, wherein the final layer of the LLM is the evolution layer.
The language model can be any appropriate language model neural network that receives an input sequence made up of text tokens selected from a vocabulary and auto-regressively generates an output sequence made up of text tokens from the vocabulary. For example, the language model can be a Transformer-based language model neural network or a recurrent neural network-based language model.
The above implementations may realize one or more of the following advantages. The decoding strategy supports a wide range of model families, including more advanced architectural configurations such as the mixture of experts (MoE). The decoding strategy scales efficiently and does not significantly impact inference time.
The decoding strategy is versatile across a variety of tasks, such as multiple-choice questions, open-ended generation, and adaptations to chain-of-thought reasoning tasks.
The decoding strategy can be flexibly combined with other decoding methods, enhancing their performance and broadening the scope of its applicability. This interoperability facilitates tailored deployment in systems that require specific decoding enhancements.
The decoding strategy mitigates the repetition issues present in previous factuality methods, ensuring the fluency and high quality of responses. The decoding strategy also incurs negligible additional computational cost, making it suitable for real-time applications.
The decoding strategy provides a new interpretable framework for inference-time computing algorithms, enhancing the development and interpretability of advanced factuality decoding.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Like reference numbers and designations in the various drawings indicate like elements.
Described below is a self-evolving decoding system that enhances the truthfulness of LLMs without relying on external knowledge bases or requiring further fine-tuning or training. From an optimization perspective, the system leverages the latent knowledge embedded within the LLM by contrasting the output logits from an evolution layer with those from layers that precede the evolution layer. In some implementations, the evolution layer is a final layer of the LLM. The system then utilizes an approximate gradient approach to enable latent knowledge to guide the self-refinement of outputs, thereby effectively improving factual accuracy.
The drawings include a block diagram of a factual decoding process. A model is trained on a real-world factuality distribution. For example, the real-world factuality distribution may be ground truth data of a training corpus. After training, the model generates an output distribution at inference time. The factual decoding process leverages the latent knowledge of the LLM during inference and adjusts the output distribution of the model. The goal of the adjustment is to increase accuracy of the model and reduce hallucinations.
The decoding process of the block diagram is a strategy framework for improving LLM factuality. Decoding focuses on how the model selects a next token during the generation process, which can significantly influence the factual accuracy of the output. The decoding processes of this framework can be cost-effective since they do not rely on external knowledge and no additional training is required. Furthermore, decoding methods can be synergistically combined with other techniques aimed at improving LLM factuality, such as retrieving information from external knowledge bases, various fine-tuning strategies for better alignment, or ensemble learning methods.
A common issue with LLMs is learning factual content based on extensive pretraining or fine-tuning, yet failing to produce the correct answer when a user queries the model. Factuality decoding reveals what the model implicitly “knows.” As summarized in the block diagram, the output distribution is derived by applying the softmax function to the output logits from an evolution layer, which is typically the final layer. Other layers, such as a penultimate layer, can also be used, however.
During the training phase, this distribution is optimized based on the real-world factuality distribution represented by the training dataset. However, during the inference phase, the LLM output might still contain factual errors, which implies a discrepancy between the output distribution and the real-world factuality distribution. While the real-world distribution remains inaccessible during the inference phase, the model's latent knowledge may have implicitly learned some factual content correctly during the training phase. The subject matter of this disclosure is a decoding strategy that effectively harnesses the latent knowledge embedded within LLMs to refine the output distribution (logits) during inference. This decoding strategy is illustrated in a system block diagram of a self-evolution decoding system.
The self-evolution decoding system leverages the latent knowledge within LLMs by contrasting an evolution layer's logits with logits of earlier layers. For the remainder of this description, the evolution layer is the final layer of the LLM. However, a layer before the final layer of the LLM can also be used as the evolution layer, so long as the layer so used has preceding layers for comparison. Thus, while the term “final layer” is used in this description, it is to be understood that an earlier layer, such as a penultimate layer, or an even earlier layer, can also be used for evaluation.
During the decoding process, as LLMs progress from early to final layers, they progressively incorporate factual information stored in each layer into the output. The system tracks this evolution process to access latent knowledge within LLMs, and enables the self-evolution of the output distribution to align it more closely with real-world facts. However, the latent knowledge within LLMs, while valuable, may not always be accurate. Thus, in some implementations, the system, instead of simply replacing the original outputs with this latent knowledge, integrates it into the original logits through an operation similar to a gradient descent over the output logits at inference time. This operation reduces divergence between the latent knowledge distribution and the output distribution, effectively balancing the two and mitigating potential drawbacks such as overfitting or biased outputs.
As illustrated in the system block diagram, an input “The capital of British Columbia province is” is provided to an LLM with N layers. An initial distribution for the output logits $\mathrm{logits}_N$ reveals an incorrect prediction of “Vancouver,” as indicated by the most prominent probability.
As will be described in more detail below, a self-logits evolution decoding process is used to harness the latent knowledge $P_{\mathrm{latent}}$ to generate self-evolved logits. The self-evolved logits are then processed to generate an adjusted output distribution that reveals a correct prediction of “Victoria.”
A large language model, equipped with N layers and a vocabulary $V = [v_1, v_2, \ldots, v_d]$, typically generates text in the next-token prediction fashion. For each given prefix, the model computes the logits at the final (N-th) layer, $\mathrm{logits}_N = (\ell_{(1,N)}, \ell_{(2,N)}, \ldots, \ell_{(d,N)})$, which are obtained by applying a linear transformation to the hidden states of the final layer, projecting the high-dimensional hidden state vectors onto the space of the vocabulary size. Subsequently, the output distribution $P_{\mathrm{logits}_N}$ at the final (N-th) layer for the next token is derived by applying the softmax function to the logits:
$$P_{\mathrm{logits}_N} = (p_{(1,N)}, \ldots, p_{(d,N)}) = \mathrm{softmax}(\mathrm{logits}_N)$$
Similarly, logits are derived from early layers by applying the same linear transformation above to their hidden states. Thus, for any early layer n (n < N), the logits are denoted as $\mathrm{logits}_n = (\ell_{(1,n)}, \ldots, \ell_{(d,n)})$, and the corresponding distribution is denoted as:
$$P_{\mathrm{logits}_n} = (p_{(1,n)}, \ldots, p_{(d,n)}) = \mathrm{softmax}(\mathrm{logits}_n)$$
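The per-layer logits and distributions described above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy sizes (hidden size 2, vocabulary size 3, three layers), the projection matrix `unembedding`, and the hidden-state values are hypothetical and chosen only to show the same linear map being applied to every layer.

```python
import math


def project(hidden, weight):
    """Apply a linear map: one output logit per vocabulary row of `weight`."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy model: hidden size 2, vocabulary size 3, N = 3 layers.
unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[0.1, 0.4], [0.5, 0.4], [1.0, 0.3]]  # one vector per layer

# The same projection is applied to every layer's hidden state.
layer_logits = [project(h, unembedding) for h in hidden_states]
layer_probs = [softmax(logits) for logits in layer_logits]
```

In this toy example the most probable token changes from index 1 at the first layer to index 0 at the last layer, which is the kind of layer-to-layer evolution that the decoding strategy tracks.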
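To improve factual accuracy, the correct token $v_i$ receives a higher value in $\mathrm{logits}_N$ to ensure a higher probability value $p_{(i,N)}$ in the output distribution $P_{\mathrm{logits}_N}$. This essentially aligns the model's output distribution $P_{\mathrm{logits}_N}$ closely with the real-world factuality distribution $P_{\mathrm{real}}$. This goal can be expressed as an optimization, and a variety of optimizations can be used. One example optimization is the following loss function $\mathcal{L}$ regarding the logits:
$$\mathcal{L}(\mathrm{logits}) = \mathrm{KL}(P_{\mathrm{real}}, P_{\mathrm{logits}}), \quad P_{\mathrm{logits}} = \mathrm{softmax}(\mathrm{logits}) \qquad (1)$$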
The loss function of equation (1) is also referred to as a logits evolution.
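A loss of this kind, a KL divergence between a target distribution and the softmax of a logits vector, can be sketched as follows. The function names `kl_divergence` and `logits_loss`, the one-hot target, and the toy logits are assumptions for illustration. At inference the true target is not available, so the target here only stands in for it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p, q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def logits_loss(target, logits):
    """Loss of a logits vector measured against a target distribution."""
    return kl_divergence(target, softmax(logits))


# Illustrative one-hot target: token 1 is the correct next token.
target = [0.0, 1.0, 0.0]
loss_wrong = logits_loss(target, [3.0, 1.0, 0.0])  # favors token 0
loss_right = logits_loss(target, [1.0, 3.0, 0.0])  # favors token 1
```

Logits that place more mass on the correct token yield a smaller loss, which is the sense in which raising the correct token's logit aligns the output distribution with the target.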
The training of LLMs aims at minimizing the divergence (typically the KL divergence, as the training loss function is often the cross-entropy loss) between the ground truth $P_{\mathrm{real}}$ and the output distribution $P_{\mathrm{logits}}$. During the training phase, the logits evolution is driven externally by the real-world distribution $P_{\mathrm{real}}$ presented in the training dataset, and the corresponding solution is $\mathrm{logits} = \mathrm{logits}_N$. However, $P_{\mathrm{real}}$ is not accessible during the inference phase. To address this, the system utilizes the model's latent knowledge to estimate $P_{\mathrm{real}}$, which enables self-evolution of the logits. This estimation is denoted as $P_{\mathrm{latent}}$, and the self-logits evolution can be achieved by the following gradient-descent operation:
$$\widetilde{\mathrm{logits}}_N = \mathrm{logits}_N - \alpha \cdot \nabla_{\mathrm{logits}_N} \mathrm{KL}(P_{\mathrm{latent}}, P_{\mathrm{logits}_N})$$
The parameter α defines an Evolution Rate that governs the magnitude of adjustments applied to $\mathrm{logits}_N$ in the direction of the gradient $\nabla_{\mathrm{logits}_N} \mathrm{KL}(P_{\mathrm{latent}}, P_{\mathrm{logits}_N})$. The evolution rate can be set based on empirical data, sensitivity requirements, or any other appropriate factor.
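One gradient-descent step of this kind can be sketched as follows. The sketch relies on a standard identity: when the target distribution sums to one, the gradient of KL(target, softmax(logits)) with respect to the logits equals softmax(logits) minus the target. The function name `evolve_logits`, the latent estimate, the logit values, and the rate of 1.0 are illustrative assumptions, not values from the source.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_logits(logits, latent, rate):
    """One gradient-descent step on KL(latent, softmax(logits)).

    Uses the identity grad = softmax(logits) - latent, which holds when
    `latent` sums to one.
    """
    probs = softmax(logits)
    gradient = [q - p for q, p in zip(probs, latent)]
    return [value - rate * g for value, g in zip(logits, gradient)]


# Illustrative values: token 0 narrowly wins under the original logits,
# while the hypothetical latent estimate favors token 1.
final_logits = [2.0, 1.8, 0.1]
latent = [0.1, 0.8, 0.1]
evolved = evolve_logits(final_logits, latent, rate=1.0)
```

In this toy case the step moves the largest logit from token 0 to token 1. A smaller rate would move the logits less, so the original output and the latent estimate are balanced rather than one replacing the other.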
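In some implementations, to derive $P_{\mathrm{latent}}$ as the estimation of the real-world distribution $P_{\mathrm{real}}$, the system tracks the evolution of the logits through the layers. The system leverages the difference between each early layer's logits and the final layer's logits, i.e., $\mathrm{logits}_n - \mathrm{logits}_N$, to approximate the gradient of $\mathrm{KL}(P_{\mathrm{real}}, P_{\mathrm{logits}})$ at $\mathrm{logits} = \mathrm{logits}_n$. The system then estimates $P_{\mathrm{real}}$ based on this approximation.
As described above, a trained model generates final-layer logits $\mathrm{logits} = \mathrm{logits}_N$ because the final layer's logits $\mathrm{logits}_N$ directly engage with the real-world distribution $P_{\mathrm{real}}$ through the loss function in training. This implies that the final logits $\mathrm{logits}_N$ are a better solution than the logits from an early layer, $\mathrm{logits}_n$, given $\mathrm{KL}(P_{\mathrm{real}}, P_{\mathrm{logits}_N}) < \mathrm{KL}(P_{\mathrm{real}}, P_{\mathrm{logits}_n})$. Thus, by contrasting the final layer's logits with the early layer's logits, the direction (orientation) of $\mathrm{logits}_n - \mathrm{logits}_N$ can approximately align with the direction of the gradient $\nabla_{\mathrm{logits}} \mathrm{KL}(P_{\mathrm{real}}, P_{\mathrm{logits}})$ evaluated at $\mathrm{logits} = \mathrm{logits}_n$.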
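Accordingly, for each early layer n, the following function of cosine similarity is maximized to derive $P_{\mathrm{latent}}^{(n)}$ as the estimate of the real-world distribution $P_{\mathrm{real}}$:
$$P_{\mathrm{latent}}^{(n)} = \arg\max_{P} \; \mathrm{CosSim}\left(\mathrm{logits}_n - \mathrm{logits}_N,\; \nabla_{\mathrm{logits}_n} \mathrm{KL}(P, P_{\mathrm{logits}_n})\right)$$
The cosine similarity measures the similarity between $\mathrm{logits}_n - \mathrm{logits}_N$ and $\nabla_{\mathrm{logits}_n} \mathrm{KL}(P, P_{\mathrm{logits}_n})$, and thus serves as a measure of distance between the logits difference vector, which approximates the gradient, and the gradient itself. In particular, the similarity is a measure of how the distribution of the layer and the distribution of the final layer relate to the approximated gradient.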
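The alignment test for a single early layer can be sketched as follows. Two simplifications are assumptions of this sketch rather than statements from the text above: candidate target distributions are restricted to one-hot vectors (one per token), and the scores are turned into weights by clipping negative values to zero and normalizing. The function names are hypothetical, and the gradient uses the standard identity that the gradient of KL(target, softmax(logits)) with respect to the logits is softmax(logits) minus the target.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def alignment_scores(early_logits, final_logits):
    """Score each token i by how well (early - final) aligns with the
    gradient of KL(one_hot(i), softmax(early)) with respect to the early logits."""
    diff = [e - f for e, f in zip(early_logits, final_logits)]
    probs = softmax(early_logits)
    scores = []
    for i in range(len(early_logits)):
        gradient = [q - (1.0 if j == i else 0.0) for j, q in enumerate(probs)]
        scores.append(cosine_similarity(diff, gradient))
    return scores


def layer_estimate(early_logits, final_logits):
    """Assumed weighting: clip negative scores to zero, then normalize."""
    clipped = [max(s, 0.0) for s in alignment_scores(early_logits, final_logits)]
    total = sum(clipped)
    if total == 0:
        return [1.0 / len(clipped)] * len(clipped)  # no aligned token: uniform
    return [c / total for c in clipped]


# Illustrative logits: only token 1's logit rises from the early to the final layer.
early = [1.0, 1.0, 0.0]
final = [1.0, 3.0, 0.0]
scores = alignment_scores(early, final)
estimate = layer_estimate(early, final)
```

In this toy case only token 1 has a positive score, so the layer's estimate concentrates on the token whose logit grew between the early layer and the final layer.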
November 27, 2025