
US-20250322212-A1

Dynamic Determination of Inference-Time Parameters

Published: October 16, 2025

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

A method for dynamic determination of inference-time parameters to control the stochastic generation process of a generative neural network. The method may include dynamically determining for an inference request, at least from operational context information, at least one of the inference-time parameters.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. A computer-implemented method for dynamic determination of inference-time parameters to control the stochastic generation process of a generative neural network, the method comprising:

. The method of, wherein the generating of output data by the generative neural network follows a stochastic process, wherein the determined at least one of the inference-time parameters control generation of the output data.

. The method of, wherein the generative neural network is applied iteratively to generate the output data, the iterative application of the generative neural network being modulated by the determined at least one of the inference-time parameters.

. The method of, wherein the operational context information comprises one or more of:

. The method of, wherein the inference request is one from a particular sequence of inference requests, the operational context information comprising a sequence identifier identifying the particular sequence among a plurality of sequence of inference requests.

. The method of, wherein the inference request comprises a prompt for use as input to the generative neural network, the at least one inference-time parameter being derived further from the prompt.

. The method of, wherein the inference request comprises a prompt, wherein

. The method of, wherein a time and/or memory use depend on the determined inference-time parameter, wherein

. The method of, wherein the inference-time parameters comprise one or more of sampling parameters selected from: Temperature, Top-k Sampling parameter, Top-p Nucleus Sampling parameter, a classifier-free guidance scale.

. The method of, comprising:

. A system comprising:

. The system of, wherein the generating of output data by the generative neural network follows a stochastic process, wherein the determined at least one of the inference-time parameters control generation of the output data.

. The system of, wherein the generative neural network is applied iteratively to generate the output data, the iterative application of the generative neural network being modulated by the determined at least one of the inference-time parameters.

. The system of, wherein the operational context information comprises one or more of:

. The system of, wherein the inference request comprises a prompt, wherein

. One or more non-transitory computer storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform a method comprising:

. The media of, wherein the generating of output data by the generative neural network follows a stochastic process, wherein the determined at least one of the inference-time parameters control generation of the output data.

. The media of, wherein the generative neural network is applied iteratively to generate the output data, the iterative application of the generative neural network being modulated by the determined at least one of the inference-time parameters.

. The media of, wherein the operational context information comprises one or more of:

. The media of, wherein the inference request comprises a prompt, wherein

Detailed Description

Complete technical specification and implementation details from the patent document.

The present application claims priority to, and the benefit of, European Patent Application No. 24170362.8, filed on 15 Apr. 2024, which is hereby incorporated by reference for all purposes.

The presently disclosed subject matter relates to a method for dynamic determination of inference-time parameters, a system for dynamic determination of inference-time parameters, and a computer storage medium.

The field of machine learning has seen significant advancements, with technologies now integrating into commercial solutions. Artificial Neural Networks, for instance, can easily generate text or image outputs. Large Language Models (LLMs) have proven adept at structuring, summarizing, and interpreting long texts or data inputs. Furthermore, LLMs are used to generate programming code in a variety of programming languages. Likewise, image generation is applied in a range of projects, e.g., generation of synthetic test data and training data.

Utilizing a generative neural network, such as a Large Language Model, requires the configuration of inference-time parameters (sometimes referred to as meta-parameters) to control the characteristics of the models' responses. For example, a commonly employed parameter is the sampling temperature. Higher values, such as 0.8, will make the output more random, while lower values, like 0.2, will make it more deterministic. There are many other parameters that can be used to control the generation of output in a generative neural network.
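The effect of the sampling temperature can be illustrated with a temperature-scaled softmax. The following is a minimal sketch, not taken from the application: the function name `softmax_with_temperature` and the three example scores are assumptions chosen for illustration, while the temperature values 0.2 and 0.8 are the ones mentioned above.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # sharper, closer to deterministic
high = softmax_with_temperature(logits, 0.8)  # flatter, more random

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With the lower temperature, almost all probability mass falls on the highest-scoring token; with the higher temperature, the less likely tokens keep a noticeable share, so sampling becomes more varied.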

Currently, the setting of inference-time parameters, including those that define the behavior of the output such as sampling temperature, prompt weighting, and maximum token count, is done directly in the call to the model or configured beforehand by the system or application making the call. This places the responsibility for determining the appropriate inference-time parameters on the caller.

Having suitable inference-time parameters is important because suboptimal inference-time parameters may lead to outputs that are unsatisfactory or even incorrect. Setting these parameters requires an understanding of the model's behavior, which is challenging for non-experts. Moreover, as the same generative neural network may be used for a variety of different applications, inference-time parameters may have to change from call to call.

Accordingly, there is a need for a system that determines inference-time parameters.

It would be advantageous to have an improved way to set inference-time parameters for a generative neural network.

A method for dynamic determination of inference-time parameters, a system for dynamic determination of inference-time parameters, and a computer storage medium are described in the accompanying claims. Specific embodiments of the invention are set forth in the dependent claims.

An embodiment of a method for dynamic determination of inference-time parameters to control the stochastic generation process of a generative neural network may include dynamically determining for an inference request, at least from operational context information, at least one of the inference-time parameters. The inference request may be a request for application of a generative neural network on an input, sometimes referred to as a prompt. The generative neural network produces outputs using a stochastic generation process which is controlled by the one or more inference-time parameters.

Once the inference-time parameters are determined, the output data is generated by the generative neural network controlled at least by the determined at least one of the inference-time parameters.

A system for determining inference-time parameters may be an electronic device, e.g., a computer. A further aspect is a method for determining inference-time parameters. An embodiment of the method may be implemented on a computer as a computer implemented method, or in dedicated hardware, or in a combination of both. Executable code for an embodiment of the method may be stored on a computer program product. Examples of computer program products include memory devices, optical storage devices, integrated circuits, servers, online software, etc. Preferably, the computer program product comprises non-transitory program code stored on a computer readable medium for performing an embodiment of the method when said program product is executed on a computer.

In an embodiment, the computer program comprises computer program code adapted to perform all or part of the steps of an embodiment of the method when the computer program is run on a computer. Preferably, the computer program is embodied on a computer readable medium.

Another aspect of the presently disclosed subject matter is a method of making the computer program available for downloading. This aspect is used when the computer program is uploaded into a server, and when the computer program is available for downloading from such a server.

The following list of references and abbreviations corresponds to the drawings, and is provided for facilitating the interpretation of the drawings and shall not be construed as limiting the claims.

While the presently disclosed subject matter is susceptible of embodiment in many different forms, there are shown in the drawings and will herein be described in detail one or more specific embodiments, with the understanding that the present disclosure is to be considered as exemplary of the principles of the presently disclosed subject matter and not intended to limit it to the specific embodiments shown and described.

In the following, for the sake of understanding, elements of embodiments are described in operation. However, it will be apparent that the respective elements are arranged to perform the functions being described as performed by them.

Further, the subject matter that is presently disclosed is not limited to the embodiments only, but also includes every other combination of features described herein or recited in mutually different dependent claims.

The field of machine learning has seen significant advancements, with technologies now integrating into commercial solutions. Artificial Neural Networks, for instance, can easily generate text or image outputs. Large Language Models (LLMs) have proven adept at structuring, summarizing, and interpreting long texts or data inputs. Furthermore, LLMs are used to generate programming code in a variety of programming languages. Likewise, image generation is applied in a range of projects, e.g., generation of synthetic test data and training data.

However, utilizing a Large Language Model still necessitates the configuration of inference-time parameters (sometimes referred to as meta-parameters) to control the characteristics of the models' responses. For example, a commonly employed parameter is the sampling temperature. Higher values, such as 0.8, will make the output more random, while lower values, like 0.2, will make it more deterministic. There are many other parameters that can be used to control the generation of output in a generative neural network.

The appropriate parametrization depends on the specific use case. For instance, a temperature value of 0.1 might be suitable for summarizing technical data, e.g., measured from a manufacturing process, whereas a higher temperature could be preferable for text completions that require more creativity, e.g., generating computer code to solve a particular programming problem.

Especially in digital assistants and other chat-like environments, the intent of the end user can vary widely among these examples. Consequently, there is a need to devise a solution that overcomes the limitations of fixed inference-time parameter values, which can lead to suboptimal experiences and, in some instances, incorrect results.

There is no established state of the art addressing this issue. Rather than relying on hardcoded parameters for an LLM completion, embodiments calculate inference-time parameters in real time during execution, depending on the particular request and possibly other available context.

One of the drawings schematically shows an example of an embodiment of a generative neural network request system. Shown are a client device, a configuration device, and a generative neural network device, which may be part of a generative neural network request system.

The client device is configured to make an inference request for a generative neural network. For example, the client device may run an application for computer code generation, e.g., so-called AI pair programming.

In AI pair programming, artificial intelligence is integrated within the development environment to assist programmers by offering real-time code suggestions, completing functions, reviewing code, and/or generating documentation. Other applications are possible, and further examples are given herein.

The configuration device is configured to dynamically determine one or more inference-time parameters that control the stochastic generation process of a generative neural network. Whereas conventionally the inference-time parameters are fixed, or are set by the calling application, the configuration device determines inference-time parameter(s) for a particular inference request. For example, one or more inference-time parameters are determined at inference time. For example, one or more inference-time parameters are determined after receiving an inference request. Determining one or more inference-time parameters may comprise selecting them from multiple candidates, e.g., selecting a set of inference-time parameters from multiple sets of inference-time parameters. Determining one or more inference-time parameters may comprise applying a heuristic or a machine-learnable model to a generative neural network input, e.g., a prompt, and/or to a request context, e.g., information relating to an inference request.
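Selecting a set of inference-time parameters from multiple sets by means of a heuristic may be sketched as follows. This is an illustrative sketch only: the class `InferenceParams`, the preset names, the preset values, the context key `application`, and the keyword rule are all assumptions and do not come from the application. A machine-learnable model could take the place of the keyword rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceParams:
    temperature: float
    top_p: float
    max_tokens: int


# Hypothetical parameter sets; the values are illustrative only.
PRESETS = {
    "summarize": InferenceParams(temperature=0.1, top_p=0.9, max_tokens=256),
    "creative": InferenceParams(temperature=0.8, top_p=0.95, max_tokens=1024),
    "default": InferenceParams(temperature=0.5, top_p=0.9, max_tokens=512),
}


def determine_params(prompt: str, context: dict) -> InferenceParams:
    """Select one preset per request from the request context and the prompt."""
    application = context.get("application", "")  # assumed context field
    text = prompt.lower()
    if application == "report_generator" or "summarize" in text:
        return PRESETS["summarize"]
    if application == "code_assistant" or "brainstorm" in text:
        return PRESETS["creative"]
    return PRESETS["default"]


print(determine_params("Please summarize the measurements", {}))
print(determine_params("Complete this function", {"application": "code_assistant"}))
```

The caller supplies only the prompt and whatever context is available; the choice of temperature, top-p, and token limit is made per request rather than being fixed in the calling application.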

Inference-time parameters should be distinguished from the trained neural network parameters themselves. The neural network parameters are not dynamically determined for a new inference request, but stay constant from request to request. Neural network parameters are determined in a training phase, while the inference-time parameters are determined in a later inference phase. Inference-time parameters control how neural network computations are performed and used, while the neural network parameters determine the content of the neural network computations.
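The distinction can be made concrete with top-k and top-p (nucleus) filtering applied to one fixed next-token distribution. In the sketch below, `model_probs` stands in for what the trained network parameters produce and is left unchanged; only the inference-time parameters `k` and `p_threshold` vary. The function names and the example distribution are assumptions for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p_threshold):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p_threshold, then renormalise."""
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= p_threshold:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Hypothetical next-token distribution produced by fixed, trained weights.
model_probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}

narrow = top_k_filter(model_probs, k=2)
nucleus = top_p_filter(model_probs, p_threshold=0.9)

print(sorted(narrow))
print(sorted(nucleus))
```

The same model output yields different candidate sets depending on the inference-time parameters, which is why they can be chosen per request without retraining.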

The generative neural network device is configured to generate output data from the generative neural network using inputs derived from the inference request, the generation process being controlled by the determined inference-time parameter(s). For example, the generative neural network device may run an LLM (large language model) or an image generator, e.g., a diffusion model.

The client device, configurator device, and generative neural network device may be part of a generative neural network request system. They may each be separate devices, e.g., connected through a computer network. Some of the devices may be combined, however. For example, the client device and configurator device may be combined into one device. For example, the configurator device and generative neural network device may be combined into one device. For example, the client device, configurator device, and generative neural network device could all three be combined into one device.
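The three roles and their possible combination can be sketched as plain functions. This is a simplified illustration under stated assumptions: `configure`, `generate`, and `handle_request` are hypothetical names, the rule inside `configure` is a placeholder, and `generate` is a stand-in for a real model call. When the devices are separate, the in-process calls would be replaced by calls over a computer network.

```python
from typing import Callable, Dict

Params = Dict[str, float]


def configure(request: dict) -> Params:
    """Configurator role: derive parameters for this request (placeholder rule)."""
    if request.get("application") == "summarizer":
        return {"temperature": 0.1}
    return {"temperature": 0.8}


def generate(prompt: str, params: Params) -> str:
    """Generator role: stand-in for applying a generative neural network."""
    return f"[temperature={params['temperature']}] output for: {prompt}"


def handle_request(
    request: dict,
    configurator: Callable[[dict], Params] = configure,
    generator: Callable[[str, Params], str] = generate,
) -> str:
    """Client-side flow: obtain parameters first, then run generation with them."""
    params = configurator(request)
    return generator(request["prompt"], params)


request = {"application": "summarizer", "prompt": "Summarize the sensor log."}
print(handle_request(request))
```

Because the configurator and generator are passed in as callables, the same flow covers the all-in-one case and the case where either role runs on another device behind a thin wrapper.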

The client device may comprise a processor system, a storage, and a communication interface. The configuration device may comprise a processor system, a storage, and a communication interface. The generative neural network device may comprise a processor system, a storage, and a communication interface.

In the various embodiments, the communication interfaces may be selected from various alternatives. For example, the interface may be a network interface to a local or wide area network, e.g., the Internet, a storage interface to an internal or external data storage, an application interface (API), etc.

The storage of each device may be, e.g., electronic storage, magnetic storage, etc. The storage may comprise local storage, e.g., a local hard drive or electronic memory. The storage may comprise non-local storage, e.g., cloud storage. In the latter case, the storage may comprise a storage interface to the non-local storage. The storage may comprise multiple discrete sub-storages that together make up the storage.

The storage may be non-transitory storage. For example, the storage may store data in the presence of power, such as a volatile memory device, e.g., a Random Access Memory (RAM). For example, the storage may store data in the presence of power as well as outside the presence of power, such as a non-volatile memory device, e.g., Flash memory. The storage may comprise a volatile writable part, say a RAM, and a non-volatile writable part, e.g., Flash. The storage may comprise a non-volatile non-writable part, e.g., ROM.

The devices may communicate internally, with each other, with other devices, external storage, input devices, output devices, and/or one or more sensors over a computer network. The computer network may be an internet, an intranet, a LAN, a WLAN, a WAN, etc. The computer network may be the Internet. The devices may comprise a connection interface which is arranged to communicate within the generative neural network request system or outside the generative neural network request system as needed. For example, the connection interface may comprise a connector, e.g., a wired connector, e.g., an Ethernet connector, an optical connector, etc., or a wireless connector, e.g., an antenna, e.g., a Wi-Fi, 4G or 5G antenna.

The communication interface of the client device may be used to send or receive digital data, e.g., an inference request and output data of the generative neural network. The communication interface of the configuration device may be used to send or receive digital data, e.g., an inference request and control data for the generative neural network. The communication interface of the generative neural network device may be used to send or receive digital data, e.g., control data and output data.

The client device, configuration device, and generative neural network device may have a user interface, which may include well-known elements such as one or more buttons, a keyboard, display, touch screen, etc. The user interface may be arranged for accommodating user interaction for performing an inference request.

The execution of the devices may be implemented in a processor system. The devices may comprise functional units to implement aspects of embodiments. The functional units may be part of the processor system. For example, functional units shown herein may be wholly or partially implemented in computer instructions that are stored in a storage of the device and executable by the processor system.

The processor system may comprise one or more processor circuits, e.g., microprocessors, CPUs, GPUs, etc. The devices may comprise multiple processors. A processor circuit may be implemented in a distributed fashion, e.g., as multiple sub-processor circuits. For example, the devices may use cloud computing.

Typically, the client device, configuration device, and generative neural network device each comprise one or more microprocessors which execute appropriate software stored at the device; for example, that software may have been downloaded and/or stored in a corresponding memory, e.g., a volatile memory such as RAM or a non-volatile memory such as Flash.

Instead of using software to implement a function, the devices may, in whole or in part, be implemented in programmable logic, e.g., as a field-programmable gate array (FPGA). The devices may be implemented, in whole or in part, as a so-called application-specific integrated circuit (ASIC), e.g., an integrated circuit (IC) customized for their particular use. For example, the circuits may be implemented in CMOS, e.g., using a hardware description language such as Verilog, VHDL, etc. In particular, the client device, configuration device, and generative neural network device may comprise circuits, e.g., for cryptographic processing and/or arithmetic processing.

In hybrid embodiments, functional units are implemented partially in hardware, e.g., as coprocessors, e.g., neural network coprocessors, and partially in software stored and executed on the device.

Another of the drawings schematically shows an example of an embodiment of a generative neural network request system. The generative neural network request system may comprise multiple client devices; shown are two client devices. The system may comprise one or more configuration devices; shown is one configuration device. The system may comprise one or more generative neural network devices; shown is one generative neural network device. The devices are connected through a computer network, e.g., the Internet. The client devices and the configuration device may be according to an embodiment.

For example, multiple client devices may make requests to a single generative neural network. They may do so through a single configuration device. The multiple client devices may also make requests to multiple generative neural networks, e.g., different ones depending on the request. These requests may be provided with inference-time parameters by the same configuration device, though multiple configuration devices could be used as well.
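A single configuration device serving several clients and more than one generative neural network may be sketched as a small dispatcher. Everything named here is hypothetical: the class `ConfigurationService`, the backend names `text` and `image`, the two stand-in model functions, and the parameter values. The sketch only shows that the parameter names themselves can differ per network, e.g., a temperature for a language model and a classifier-free guidance scale for an image model.

```python
class ConfigurationService:
    """One configuration role shared by several clients and model backends."""

    def __init__(self, backends):
        # Maps a model name to a callable taking (prompt, params).
        self.backends = backends

    def determine_params(self, model_name, request):
        # Placeholder rule with illustrative values.
        if model_name == "image":
            return {"guidance_scale": 7.5}
        return {"temperature": 0.7}

    def handle(self, client_id, request):
        model_name = request.get("model", "text")
        if model_name not in self.backends:
            raise KeyError(f"unknown model: {model_name}")
        params = self.determine_params(model_name, request)
        output = self.backends[model_name](request["prompt"], params)
        return {"client": client_id, "model": model_name,
                "params": params, "output": output}


def fake_text_model(prompt, params):
    return f"text({prompt})"  # stand-in for a language model


def fake_image_model(prompt, params):
    return f"image({prompt})"  # stand-in for an image generator


service = ConfigurationService({"text": fake_text_model, "image": fake_image_model})
first = service.handle("client-1", {"prompt": "Explain PID control"})
second = service.handle("client-2", {"prompt": "A red gearbox", "model": "image"})
print(first)
print(second)
```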

Below, several further optional refinements, details, and embodiments are illustrated.

A further drawing schematically shows an example of an embodiment of a generative neural network request system. The generative neural network request system comprises a client device, a configurator device, and a generative neural network device, e.g., in an embodiment as described with reference to the preceding drawings.

The client device is configured to issue an inference request, also referred to as a completion request. The inference request typically comprises a prompt.

A prompt may comprise a textual input given to a generative neural network, such as a Large Language Model or Image model. The prompt specifies the desired output of the neural network.

The inference request may comprise an optional system role. The system role is a typically textual input given to a generative neural network, typically as part of the prompt, that specifies the intended behavior or function of the model. For example, a system role may be assistant, content creator, translator, etc.

The inference request may also indicate which generative neural network to use, e.g., in case there is more than one, or which version to use.
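The contents of an inference request described above can be captured in a small data structure. This is a sketch under assumptions: the class name `InferenceRequest`, its field names, and the `to_payload` helper are illustrative and are not defined by the application.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class InferenceRequest:
    prompt: str                          # input for the generative neural network
    system_role: Optional[str] = None    # e.g., "assistant" or "translator"
    model: Optional[str] = None          # which generative neural network to use
    model_version: Optional[str] = None  # which version to use

    def to_payload(self) -> dict:
        """Serialise the request, leaving out optional fields that were not set."""
        return {k: v for k, v in asdict(self).items() if v is not None}


request = InferenceRequest(
    prompt="Translate 'Guten Morgen' into English.",
    system_role="translator",
)
print(request.to_payload())
```

Note that no inference-time parameters appear in the request; in the approach described here they would be determined for the request rather than supplied by the caller.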

