Attention-Based Decoder-Only Sequence Transduction Neural Networks

Published: May 13, 2025

Assignee: not available in the USPTO data we have

Inventors: Noam M. Shazeer, Lukasz Mieczyslaw Kaiser, Etienne Pot, Mohammad Saleh, Ben David Goodrich, Peter J. Liu, Ryan Sepassi

Patent Claims

18 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for processing an input sequence comprising a plurality of input tokens, the method comprising: at each of a plurality of generation time steps: generating a combined sequence for the generation time step that includes the input sequence followed by output tokens based on time step outputs that have already been generated as of the generation time step; processing the combined sequence using a self-attention decoder neural network that comprises a plurality of masked self-attention neural network layers, and wherein the self-attention decoder neural network is configured to process the combined sequence to generate a time step output, wherein the masked self-attention neural network layers are masked such that the time step output depends only on the input sequence and the output tokens that have already been generated as of the generation time step and not on any output tokens that are after the last token that had already been generated; and generating an output image using the time step outputs generated at the plurality of generation time steps.
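
The generation loop recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the one-hot `embed` function, the single attention layer with identity projections, and the greedy token choice are all assumptions made only to keep the example small and runnable. What it does show is the combined sequence being rebuilt at every step and a causal mask that keeps each position from attending to later positions.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention over the rows of x.

    Assumption: queries, keys, and values are all x itself (identity projections).
    """
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if mask[i][j] else -math.inf
            for j, k in enumerate(x)
        ]
        top = max(scores)  # finite, because position i can always attend to itself
        weights = [math.exp(s - top) for s in scores]  # masked positions get weight 0.0
        total = sum(weights)
        out.append([sum(weights[j] / total * x[j][c] for j in range(len(x)))
                    for c in range(d)])
    return out

def embed(token, vocab_size):
    """Hypothetical one-hot embedding, used only for illustration."""
    return [1.0 if c == token else 0.0 for c in range(vocab_size)]

def generate(input_tokens, num_steps, vocab_size):
    """Toy autoregressive loop over a combined sequence."""
    outputs = []
    for _ in range(num_steps):
        # Combined sequence: the input followed by the tokens generated so far.
        combined = list(input_tokens) + outputs
        x = [embed(t, vocab_size) for t in combined]
        hidden = masked_self_attention(x, causal_mask(len(x)))
        step_output = hidden[-1]  # time step output for this generation step
        # Assumption: greedy selection of the highest-scoring token.
        outputs.append(max(range(vocab_size), key=lambda c: step_output[c]))
    return outputs

if __name__ == "__main__":
    print(generate([1, 2, 3], num_steps=4, vocab_size=5))
```

Because of the mask, the attention output at any position is unchanged when further tokens are appended to the sequence, which is the dependency property the claim describes.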

2. The method of claim 1, wherein the input sequence and the output tokens that have already been generated as of the generation time step are separated by a predetermined special separator token in the combined sequence.
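
As a small illustration of the separator in claim 2, the sketch below builds a combined sequence with a special token between the input and the already-generated outputs. The token string `"<sep>"` and the helper names are hypothetical; the claim only requires that the separator be predetermined.

```python
SEP = "<sep>"  # hypothetical separator token; any reserved token would do

def build_combined(input_tokens, generated_tokens):
    """Return the input, then the separator, then the tokens generated so far."""
    return list(input_tokens) + [SEP] + list(generated_tokens)

def split_combined(combined):
    """Recover the two parts, assuming SEP never occurs in the input tokens."""
    idx = combined.index(SEP)
    return combined[:idx], combined[idx + 1:]

if __name__ == "__main__":
    print(build_combined(["a", "b", "c"], ["x", "y"]))
```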

3. The method of claim 1, wherein the plurality of masked self-attention neural network layers are masked multi-head attention layers.

4. The method of claim 1, wherein the plurality of masked self-attention neural network layers comprise at least one local attention layer, and wherein each local attention layer comprises a local attention sub-layer that is configured to: receive a layer input sequence comprising a plurality of layer inputs; divide the layer input sequence into a plurality of sub-sequences; generate, for each sub-sequence, a sub-sequence output by performing self-attention on the layer inputs in the sub-sequence; and merge the sub-sequence outputs to generate a layer output sequence.
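
The divide, attend, and merge steps of the local attention sub-layer in claim 4 might look like the following sketch. Several details are assumptions rather than claim language: the sub-sequences are contiguous blocks of a fixed `block_size`, attention inside each block is causally masked, and queries, keys, and values are the layer inputs themselves.

```python
import math

def causal_self_attention(x):
    """Causally masked single-head self-attention over the rows of x.

    Assumption: identity projections, so queries, keys, and values equal x.
    """
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Position i attends only to positions 0..i of this sub-sequence.
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(weights[j] / total * x[j][c] for j in range(i + 1))
                    for c in range(d)])
    return out

def local_attention(x, block_size):
    """Split x into blocks, attend within each block, and concatenate the results."""
    merged = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]          # one sub-sequence
        merged.extend(causal_self_attention(block))  # sub-sequence output
    return merged                                    # layer output sequence

if __name__ == "__main__":
    layer_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
    for row in local_attention(layer_inputs, block_size=2):
        print(row)
```

Restricting attention to blocks means the cost of each attention computation grows with the block length rather than with the full sequence length.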

5. The method of claim 1, wherein the plurality of masked self-attention neural network layers comprise at least one memory-compressed attention layer, and wherein each memory-compressed attention layer comprises a memory-compressed sub-layer that is configured to: obtain an attention input comprising a plurality of keys, values, and queries; apply a strided convolution to the keys to generate a reduced set of keys; apply a strided convolution to the values to generate a reduced set of values; and generate a layer output sequence by performing self-attention using the reduced set of keys, the reduced set of values, and the plurality of queries.

6. The method of claim 5, wherein obtaining the attention input comprises: receiving a layer input sequence comprising a plurality of layer inputs; and projecting the layer input sequence into the keys, values, and queries using respective projection matrices.
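
Claims 5 and 6 together describe projecting the layer inputs into queries, keys, and values, shrinking the keys and values with a strided convolution, and then attending with the full set of queries. The sketch below follows those steps under stated assumptions: the kernel width, stride, and weights are arbitrary illustrative values, the convolution uses no padding, and causal masking is left out to keep the example short.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def strided_conv(x, kernel, stride):
    """1-D convolution along the sequence axis without padding.

    kernel[t] is a (d_in, d_out) matrix applied to the row at offset t.
    """
    width, d_out = len(kernel), len(kernel[0][0])
    out = []
    for start in range(0, len(x) - width + 1, stride):
        acc = [0.0] * d_out
        for t in range(width):
            row = matmul([x[start + t]], kernel[t])[0]
            acc = [a + r for a, r in zip(acc, row)]
        out.append(acc)
    return out

def attention(q, k, v):
    """Unmasked scaled dot-product attention."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(weights[j] / total * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out

def memory_compressed_attention(x, wq, wk, wv, kernel, stride):
    """Project x to Q, K, V, compress K and V, then attend with all queries."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    k_reduced = strided_conv(k, kernel, stride)  # fewer keys
    v_reduced = strided_conv(v, kernel, stride)  # fewer values
    return attention(q, k_reduced, v_reduced)

if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # Illustrative kernel: width 3, each tap is identity / 3 (an average over 3 rows).
    kernel = [[[1.0 / 3, 0.0], [0.0, 1.0 / 3]] for _ in range(3)]
    x = [[float(i), 1.0] for i in range(6)]
    out = memory_compressed_attention(x, identity, identity, identity, kernel, stride=3)
    print(len(out), "output rows from 6 queries and 2 compressed key/value pairs")
```

With a stride of 3, six keys and values are reduced to two, so each query compares against a third as many positions while the output keeps one row per query.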

7. The method of claim 1, wherein the self-attention decoder neural network includes one or more mixture-of-experts layers.

8. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for processing an input sequence comprising a plurality of input tokens, the operations comprising: at each of a plurality of generation time steps: generating a combined sequence for the generation time step that includes the input sequence followed by output tokens based on time step outputs that have already been generated as of the generation time step; processing the combined sequence using a self-attention decoder neural network that comprises a plurality of masked self-attention neural network layers, and wherein the self-attention decoder neural network is configured to process the combined sequence to generate a time step output, wherein the masked self-attention neural network layers are masked such that the time step output depends only on the input sequence and the output tokens that have already been generated as of the generation time step and not on any output tokens that are after the last token that had already been generated; and generating an output image using the time step outputs generated at the plurality of generation time steps.

9. The system of claim 8, wherein the masked self-attention neural network layers are masked such that the time step output depends only on the input sequence and the output tokens that have already been generated as of the generation time step and not on any output tokens that are after the last token that had already been generated.

10. The system of claim 8, wherein the input sequence and the output tokens that have already been generated as of the generation time step are separated by a predetermined special separator token in the combined sequence.

11. A method for generating an output sequence comprising a plurality of output tokens from an input sequence comprising a plurality of input tokens that represent pixels of an input image, the method comprising: at each of a plurality of generation time steps: generating a combined sequence for the generation time step that includes the input sequence followed by the output tokens that have already been generated as of the generation time step; and processing the combined sequence using a self-attention decoder neural network that comprises a plurality of masked self-attention neural network layers, and wherein the self-attention decoder neural network is configured to process the combined sequence to generate a time step output, wherein the masked self-attention neural network layers are masked such that the time step output depends only on the input sequence and the output tokens that have already been generated as of the generation time step and not on any output tokens that are after the last token that had already been generated; and determining an output token using the time step output.

12. The method of claim 11, wherein the output sequence comprises a text describing the input image.

13. The method of claim 11, wherein the masked self-attention neural network layers are masked such that the time step output depends only on the input sequence and the output tokens that have already been generated as of the generation time step and not on any output tokens that are after the last token that had already been generated in the output sequence.

14. The method of claim 11, wherein the input sequence and the output tokens that have already been generated as of the generation time step are separated by a predetermined special separator token in the combined sequence.

15. The method of claim 11, wherein the plurality of masked self-attention neural network layers are masked multi-head attention layers.

16. The method of claim 11, wherein the plurality of masked self-attention neural network layers comprise at least one local attention layer, and wherein each local attention layer comprises a local attention sub-layer that is configured to: receive a layer input sequence comprising a plurality of layer inputs; divide the layer input sequence into a plurality of sub-sequences; generate, for each sub-sequence, a sub-sequence output by performing self-attention on the layer inputs in the sub-sequence; and merge the sub-sequence outputs to generate a layer output sequence.

17. The method of claim 11, wherein the plurality of masked self-attention neural network layers comprise at least one memory-compressed attention layer, and wherein each memory-compressed attention layer comprises a memory-compressed sub-layer that is configured to: obtain an attention input comprising a plurality of keys, values, and queries; apply a strided convolution to the keys to generate a reduced set of keys; apply a strided convolution to the values to generate a reduced set of values; and generate a layer output sequence by performing self-attention using the reduced set of keys, the reduced set of values, and the plurality of queries.

18. The method of claim 17, wherein obtaining the attention input comprises: receiving a layer input sequence comprising a plurality of layer inputs; and projecting the layer input sequence into the keys, values, and queries using respective projection matrices.

Patent Metadata

Filing Date

Unknown

Publication Date

May 13, 2025

Inventors

Noam M. Shazeer

Lukasz Mieczyslaw Kaiser

Etienne Pot

Mohammad Saleh

Ben David Goodrich

Peter J. Liu

Ryan Sepassi
