A method includes receiving training data that includes a set of transcribed speech utterances where each respective transcribed speech utterance is paired with a corresponding transcription. For each respective transcribed speech utterance, the method includes generating an encoded audio representation and an encoded textual representation, generating a higher order audio feature representation for a corresponding encoded audio representation, generating a higher order textual feature representation for a corresponding encoded textual representation, and determining a loss for the respective transcribed speech utterance based on the higher order audio feature representation and the higher order textual feature representation. The method also includes training a speech encoder and a text encoder of a correction model based on the loss determined for each transcribed speech utterance of the set of transcribed speech utterances.
Legal claims defining the scope of protection, as filed with the USPTO.
. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising:
. The computer-implemented method of, wherein:
. The computer-implemented method of, wherein the first and second stack of multi-head self-attention layers comprise a stack of transformer layers or a stack of conformer layers.
. The computer-implemented method of, wherein the operations further comprise:
. The computer-implemented method of, wherein the context data obtained from the user device comprises at least one of:
. The computer-implemented method of, wherein the training data further comprises a set of unspoken textual utterances, each unspoken textual utterance not paired with any corresponding spoken utterance.
. The computer-implemented method of, wherein the operations further comprise generating, using a text-to-speech model, a corresponding synthetic speech utterance for each unspoken textual utterance of the set of unspoken textual utterances.
. The computer-implemented method of, wherein the operations further comprise:
. The computer-implemented method of, wherein the operations further comprise:
. The computer-implemented method of, wherein the operations further comprise:
. A system comprising:
. The system of, wherein:
. The system of, wherein the first and second stack of multi-head self-attention layers comprise a stack of transformer layers or a stack of conformer layers.
. The system of, wherein the operations further comprise:
. The system of, wherein the context data obtained from the user device comprises at least one of:
. The system of, wherein the training data further comprises a set of unspoken textual utterances, each unspoken textual utterance not paired with any corresponding spoken utterance.
. The system of, wherein the operations further comprise generating, using a text-to-speech model, a corresponding synthetic speech utterance for each unspoken textual utterance of the set of unspoken textual utterances.
. The system of, wherein the operations further comprise:
. The system of, wherein the operations further comprise:
. The system of, wherein the operations further comprise:
Complete technical specification and implementation details from the patent document.
This disclosure relates to improving automatic speech recognition accuracy with multimodal embeddings search.
Automatic speech recognition (ASR), the process of taking an audio input and transcribing it into text, has become an important technology that is used in mobile devices and other devices. In general, automatic speech recognition attempts to provide accurate transcriptions of what a person has said by taking an audio input (e.g., speech utterance) and transcribing the audio input into text. Modern ASR models continue to improve in both accuracy (e.g., a low word error rate (WER)) and latency (e.g., delay between a user speaking and the transcription) based on ongoing development of deep neural networks. However, one challenge in developing deep learning-based ASR models is that parameters of the ASR models tend to overfit the training data, thereby resulting in the ASR models having difficulties generalizing to unseen data when the training data is not extensive enough. Thus, some ASR models leverage additional transcriptions to correct any terms that the ASR model initially misrecognized.
One aspect of the disclosure provides a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations for improving automatic speech recognition accuracy with multimodal embeddings search. The operations include receiving training data that includes a set of transcribed speech utterances where each respective transcribed speech utterance is paired with a corresponding transcription. For each respective transcribed speech utterance of the set of transcribed speech utterances, the operations also include: generating, by a shared audio-text encoder of a speech recognition model, an encoded audio representation for the respective transcribed speech utterance and an encoded textual representation for a corresponding transcription of the respective transcribed speech utterance; generating, by a speech encoder of a correction model, a higher order audio feature representation for a corresponding encoded audio representation; generating, by a text encoder of the correction model, a higher order textual feature representation for a corresponding encoded textual representation; and determining a loss for the respective transcribed speech utterance based on the higher order audio feature representation and the higher order textual feature representation each corresponding to the respective transcribed speech utterance. The operations also include training the speech encoder and the text encoder of the correction model based on the loss determined for each respective transcribed speech utterance of the set of transcribed speech utterances.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the operations further include receiving a speech utterance spoken by a user associated with a user device, generating an initial transcription for the speech utterance using the speech recognition model, and generating a second higher order audio feature representation for the speech utterance using the trained speech encoder of the correction model. In these implementations, the operations may further include: generating a list of biasing phrases based on context data of the user device; for each respective biasing phrase in the list of biasing phrases, generating a second higher order textual feature representation for the respective biasing phrase using the trained text encoder of the correction model, and determining a corresponding cosine distance between the second higher order audio feature representation and the second higher order textual feature representation; and determining a nearest neighbor second higher order textual feature representation from the second higher order textual feature representations generated by the trained text encoder for each respective biasing phrase by selecting a respective one of the second higher order textual feature representations that includes a lowest corresponding cosine distance. Here, the operations may further include determining that the initial transcription is an inaccurate transcription for the speech utterance and replacing the initial transcription generated by the speech recognition model with an updated transcription corresponding to the nearest neighbor higher order textual feature representation in response to determining that the initial transcription is an inaccurate transcription for the speech utterance.
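The nearest neighbor selection described above can be illustrated with a minimal sketch. The embedding values, the phrase list, and the function names `cosine_distance` and `nearest_biasing_phrase` are hypothetical and chosen only for illustration; in practice the vectors would be produced by the trained speech encoder and text encoder of the correction model.

```python
import math

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_biasing_phrase(audio_embedding, phrase_embeddings):
    """Select the biasing phrase whose text embedding has the lowest
    cosine distance to the audio embedding.

    phrase_embeddings maps each biasing phrase to a (hypothetical)
    higher order textual feature representation.
    """
    return min(
        phrase_embeddings,
        key=lambda phrase: cosine_distance(audio_embedding, phrase_embeddings[phrase]),
    )

# Toy, hand-written vectors standing in for encoder outputs (assumption).
audio_embedding = [0.9, 0.1, 0.2]
phrase_embeddings = {
    "Ke$ha": [0.8, 0.2, 0.1],
    "Katy Perry": [0.1, 0.9, 0.3],
    "Kanye West": [0.2, 0.3, 0.9],
}
best = nearest_biasing_phrase(audio_embedding, phrase_embeddings)
print(best)
```

With these toy vectors, the phrase whose vector points in nearly the same direction as the audio vector is selected. A linear scan is used here for clarity; a large biasing list would more likely use an approximate nearest neighbor index.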
In some examples, the speech encoder includes a first stack of multi-head self-attention layers and the text encoder includes a second stack of multi-head self-attention layers. In these examples, the first and second stack of multi-head self-attention layers may include a stack of transformer layers or a stack of conformer layers. In some implementations, the operations further include obtaining context data from a user device that receives a speech utterance where the context data indicates a current context of the user device and generating a list of biasing phrases based on the context data. Each biasing phrase in the list of biasing phrases is associated with the current context of the user device. In these implementations the context data obtained from the user device includes at least one of a dialog state of the user device, a device state of the user device, a geographic location of the user device, an application executing on the user device, or a language of a speech utterance received by the user device.
In some examples, the training data further includes a set of unspoken textual utterances. Here, each unspoken textual utterance is not paired with any corresponding spoken utterance. In these examples, the operations may further include generating a corresponding synthetic speech utterance for each unspoken textual utterance of the set of unspoken textual utterances using a text-to-speech model.
Another aspect of the disclosure provides a system that includes data processing hardware and memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations. The operations include receiving training data that includes a set of transcribed speech utterances where each respective transcribed speech utterance is paired with a corresponding transcription. For each respective transcribed speech utterance of the set of transcribed speech utterances, the operations also include: generating, by a shared audio-text encoder of a speech recognition model, an encoded audio representation for the respective transcribed speech utterance and an encoded textual representation for a corresponding transcription of the respective transcribed speech utterance; generating, by a speech encoder of a correction model, a higher order audio feature representation for a corresponding encoded audio representation; generating, by a text encoder of the correction model, a higher order textual feature representation for a corresponding encoded textual representation; and determining a loss for the respective transcribed speech utterance based on the higher order audio feature representation and the higher order textual feature representation each corresponding to the respective transcribed speech utterance. The operations also include training the speech encoder and the text encoder of the correction model based on the loss determined for each respective transcribed speech utterance of the set of transcribed speech utterances.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the operations further include receiving a speech utterance spoken by a user associated with a user device, generating an initial transcription for the speech utterance using the speech recognition model, and generating a second higher order audio feature representation for the speech utterance using the trained speech encoder of the correction model. In these implementations, the operations may further include: generating a list of biasing phrases based on context data of the user device; for each respective biasing phrase in the list of biasing phrases, generating a second higher order textual feature representation for the respective biasing phrase using the trained text encoder of the correction model, and determining a corresponding cosine distance between the second higher order audio feature representation and the second higher order textual feature representation; and determining a nearest neighbor second higher order textual feature representation from the second higher order textual feature representations generated by the trained text encoder for each respective biasing phrase by selecting a respective one of the second higher order textual feature representations that includes a lowest corresponding cosine distance. Here, the operations may further include determining that the initial transcription is an inaccurate transcription for the speech utterance and replacing the initial transcription generated by the speech recognition model with an updated transcription corresponding to the nearest neighbor higher order textual feature representation in response to determining that the initial transcription is an inaccurate transcription for the speech utterance.
In some examples, the speech encoder includes a first stack of multi-head self-attention layers and the text encoder includes a second stack of multi-head self-attention layers. In these examples, the first and second stack of multi-head self-attention layers may include a stack of transformer layers or a stack of conformer layers. In some implementations, the operations further include obtaining context data from a user device that receives a speech utterance where the context data indicates a current context of the user device and generating a list of biasing phrases based on the context data. Each biasing phrase in the list of biasing phrases is associated with the current context of the user device. In these implementations the context data obtained from the user device includes at least one of a dialog state of the user device, a device state of the user device, a geographic location of the user device, an application executing on the user device, or a language of a speech utterance received by the user device.
In some examples, the training data further includes a set of unspoken textual utterances. Here, each unspoken textual utterance is not paired with any corresponding spoken utterance. In these examples, the operations may further include generating a corresponding synthetic speech utterance for each unspoken textual utterance of the set of unspoken textual utterances using a text-to-speech model.
The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
Automatic speech recognition (ASR) systems can suffer from low accuracy for various reasons including, but not limited to, noisy input audio data and using an insufficient amount of training data during training. Moreover, many modern ASR systems include end-to-end models that oftentimes lack separate acoustic and language models that are configured to further process input audio to improve recognition results. As such, some ASR systems improve transcription accuracy by obtaining contextually relevant transcriptions, for example, a set of music artists when a device is actively playing music, and biasing the ASR system to generate a transcription that includes one of the contextually relevant transcriptions. Here, biasing the ASR model with the contextually relevant phrases assumes that a user is more likely to speak particular phrases in certain contexts (e.g., speak a music artist when playing music on the device). A common approach is to use a best speech recognition hypothesis as a query to retrieve contextually relevant transcriptions.
However, a major drawback of this approach arises when the best speech recognition hypothesis is phonetically dissimilar to the actual utterance spoken by the user, thereby causing the ASR system to retrieve implausible contextually relevant transcriptions. Simply put, relying only on the best speech recognition hypothesis as the query (e.g., text query) to obtain contextually relevant transcriptions is a text-only approach that relies on the ASR system generating a best speech recognition hypothesis that is reasonably similar to the utterance actually spoken. Thus, in scenarios where the ASR system generates an implausible best speech recognition hypothesis, retrieving contextually relevant transcriptions is very unlikely to correct the recognition hypothesis because text-only queries have inherently less representational power than audio-based queries.
Accordingly, implementations herein are directed towards methods and systems for improving automatic speech recognition accuracy with multimodal embeddings search. The method includes receiving training data that includes a set of transcribed speech utterances and, for each respective transcribed speech utterance, generating an encoded audio representation and an encoded textual representation by a shared audio-text encoder of a speech recognition model. Notably, the shared audio-text encoder may be trained to generate similar encoded audio and textual representations for related audio and text inputs and generate different encoded audio and textual representations for unrelated audio and text inputs. Stated differently, the distance (e.g., cosine distance) between the encoded audio and textual representations generated by the shared audio-text encoder increases when there is a phonetic dissimilarity between speech and text inputs and decreases when there is phonetic similarity between speech and text inputs. In other examples, the shared audio-text encoder may be trained to generate encoded audio and textual representations for audio-text training input pairs where phonetic similarity is not evident. For instance, the shared audio-text encoder may generate similar audio and text representations for a spoken utterance of “Kesha” and a textual utterance of “Ke$ha.”
The method also includes generating a higher order audio feature representation by a speech encoder of a correction model for a corresponding encoded audio representation and generating a higher order textual feature representation by a text encoder of the correction model for a corresponding encoded textual representation. Thereafter, the method includes determining a loss for the respective transcribed speech utterance based on the higher order audio feature representation and the higher order textual feature representation each corresponding to the respective transcribed speech utterance, and training the speech encoder and the text encoder based on the loss.
As will become apparent, training the speech encoder and the text encoder of the correction model in this manner advantageously enables the speech recognition systems to leverage the higher order audio feature representations and the higher order textual feature representations during inference to obtain a list of biasing phrases (e.g., contextually relevant transcriptions) to bias the speech recognition model. Simply put, using text and audio representations to obtain the list of biasing phrases addresses the shortcomings of using text-only data to obtain contextually relevant transcriptions. Moreover, the method may include generating synthetic speech utterances using unspoken textual utterances to expand the training data used to train the speech encoder and the text encoder of the correction model.
illustrates an automated speech recognition (ASR) system implementing an ASR model that resides on a user device of a user and/or on a remote computing device (e.g., one or more servers of a distributed system executing in a cloud-computing environment) in communication with the user device. Although the user device is depicted as a mobile computing device (e.g., a smart phone), the user device may correspond to any type of computing device such as, without limitation, a tablet device, a laptop/desktop computer, a wearable device, a digital assistant device, a smart speaker/display, a smart appliance, an automotive infotainment system, or an Internet-of-Things (IoT) device, and is equipped with data processing hardware and memory hardware.
The user device includes an audio subsystem configured to receive an utterance spoken by the user (e.g., the user device may include one or more microphones for recording the spoken utterance) and convert the utterance into a corresponding digital format associated with an input sequence of acoustic frames capable of being processed by the ASR system. In the example shown, the user speaks a respective utterance in a natural language of English for the phrase “What is the weather in New York City?” and the audio subsystem converts the utterance into a corresponding sequence of acoustic frames for input to the ASR system. Thereafter, the ASR model receives, as input, the sequence of acoustic frames corresponding to the utterance, and generates/predicts, as output, a corresponding transcription (e.g., recognition result/hypothesis) of the utterance. In the example shown, the user device and/or the remote computing device also executes a user interface generator configured to present a representation of the transcription of the utterance to the user of the user device. In some configurations, the transcription output from the ASR system is processed, e.g., by a natural language understanding (NLU) module executing on the user device or the remote computing device, to execute a user command. Additionally or alternatively, a text-to-speech system (e.g., executing on any combination of the user device or the remote computing device) may convert the transcription into synthesized speech for audible output by another device. For instance, the original utterance may correspond to a message the user is sending to a friend in which the transcription is converted to synthesized speech for audible output to the friend to listen to the message conveyed in the original utterance. As will become apparent, the ASR system includes a correction model configured to correct misrecognized transcriptions generated by the ASR model using contextually relevant biasing phrases.
The ASR model may operate in a streaming fashion, a non-streaming fashion, or some combination thereof. The ASR model operates in the streaming fashion by, while receiving the sequence of acoustic frames, encoding the sequence of acoustic frames and then decoding the encoded sequence of acoustic frames into an initial transcription (e.g., speech recognition result/hypothesis). Thus, the initial transcription may correspond to words, word pieces, and/or individual characters generated by the ASR model as soon as they are spoken. On the other hand, the ASR model operates in the non-streaming fashion by receiving and processing additional right-context to improve upon the initial transcription, thereby generating a final transcription. That is, the ASR model processes additional input audio data or encoded acoustic frames (e.g., right-context) to improve the transcription output by the ASR model, but at increased latency.
Referring to, an example ASR model may include a Recurrent Neural Network-Transducer (RNN-T) model architecture which adheres to latency constraints associated with interactive applications. The use of the RNN-T model architecture is exemplary only, as the ASR model may include other architectures such as transformer-transducer and conformer-transducer model architectures among others. The RNN-T model provides a small computational footprint and has lower memory requirements than conventional ASR architectures, making the RNN-T model architecture suitable for performing speech recognition entirely on the user device (e.g., no communication with a remote server is required). The RNN-T model includes an encoder network, a prediction network, and a joint network. The encoder network, which is roughly analogous to an acoustic model (AM) in a traditional ASR system, includes a stack of self-attention layers (e.g., Conformer or Transformer layers) or a recurrent network of stacked Long Short-Term Memory (LSTM) layers. For instance, the encoder network (e.g., encoder) reads a sequence of d-dimensional feature vectors (e.g., acoustic frames) $x = (x_1, x_2, \ldots, x_T)$, where $x_t \in \mathbb{R}^d$, and produces at each output step a higher-order feature representation (e.g., encoded representation). This higher-order feature representation is denoted as
In some examples, the encoder network includes a dual encoder framework that has a speech encoder and a text encoder.
Similarly, the prediction network is also an LSTM network, which, like a language model (LM), processes the sequence of non-blank symbols output by a final Softmax layer so far, $y_0, \ldots, y_{u_i-1}$, into a dense representation $p_{u_i}$. Finally, with the RNN-T model architecture, the representations produced by the encoder and prediction/decoder networks are combined by the joint network. The prediction network may be replaced by an embedding look-up table to improve latency by outputting looked-up sparse embeddings in lieu of processing dense representations. The joint network then predicts $P(y_i \mid x_{t_i}, y_0, \ldots, y_{u_i-1})$, which is a distribution over the next output symbol. Stated differently, the joint network generates, at each output step (e.g., time step), a probability distribution over possible speech recognition hypotheses. Here, the “possible speech recognition hypotheses” correspond to a set of output labels each representing a symbol/character in a specified natural language. For example, when the natural language is English, the set of output labels may include twenty-seven (27) symbols, e.g., one label for each of the 26 letters in the English alphabet and one label designating a space. Accordingly, the joint network may output a set of values indicative of the likelihood of occurrence of each of a predetermined set of output labels. This set of values can be a vector and can indicate a probability distribution over the set of output labels. In some cases, the output labels are graphemes (e.g., individual characters, and potentially punctuation and other symbols), but the set of output labels is not so limited. For example, the set of output labels can include wordpieces, phonemes, and/or entire words, in addition to or instead of graphemes. The output distribution of the joint network can include a posterior probability value for each of the different output labels.
Thus, if there are 100 different output labels representing different graphemes or other symbols, the output $y_i$ of the joint network can include 100 different probability values, one for each output label. The probability distribution can then be used to select and assign scores to candidate orthographic elements (e.g., graphemes, wordpieces, and/or words) in a beam search process (e.g., by the Softmax layer) for determining the transcription.
The Softmax layer may employ any technique to select the output label/symbol with the highest probability in the distribution as the next output symbol predicted by the RNN-T model at the corresponding output step. In this manner, the RNN-T model does not make a conditional independence assumption; rather, the prediction of each symbol is conditioned not only on the acoustics but also on the sequence of labels output so far. The RNN-T model does assume an output symbol is independent of future acoustic frames, which allows the RNN-T model to be employed in the streaming fashion, the non-streaming fashion, or some combination thereof.
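As a minimal sketch of the label selection described above, the following example converts a vector of scores into a probability distribution over the 27-symbol English label set and selects the most probable label. The score values and the names `softmax` and `pick_next_symbol` are hypothetical; the sketch uses greedy selection rather than a beam search and omits any blank label for simplicity.

```python
import math
import string

# 26 letters plus one space label, matching the 27-symbol English example.
LABELS = list(string.ascii_lowercase) + [" "]

def softmax(scores):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(value - peak) for value in scores]
    total = sum(exps)
    return [value / total for value in exps]

def pick_next_symbol(scores):
    """Return the most probable output label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Hypothetical joint-network scores: every label scores 0.0 except 'n'.
scores = [0.0] * len(LABELS)
scores[LABELS.index("n")] = 3.0
symbol, prob = pick_next_symbol(scores)
print(symbol, round(prob, 4))
```

Subtracting the maximum score before exponentiating leaves the resulting distribution unchanged while avoiding overflow for large scores.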
In some examples, the encoder of the RNN-T model includes a plurality of multi-head (e.g., 8 heads) self-attention layers. For example, the plurality of multi-head self-attention layers may include Conformer layers (e.g., Conformer-encoder), transformer layers, performer layers, convolution layers (including lightweight convolution layers), or any other type of multi-head self-attention layers. The plurality of multi-head self-attention layers may include any number of layers, for instance 16 layers. Moreover, the encoder may operate in the streaming fashion (e.g., the encoder outputs initial higher-order feature representations as soon as they are generated), in the non-streaming fashion (e.g., the encoder outputs subsequent higher-order feature representations by processing additional right-context to improve initial higher-order feature representations), or in a combination of both the streaming and non-streaming fashion.
illustrate an example training process for training the encoder of the ASR model. In some examples, the encoder is a shared audio-text encoder that is compatible with audio and textual inputs, and thus, the encoder may interchangeably be referred to as “the shared audio-text encoder” herein. Notably, after training, the shared audio-text encoder is leveraged to train the correction model (e.g., ASR correction model). The training process may train the shared audio-text encoder using available training data that includes a set of unspoken textual utterances (X), a set of transcribed non-synthetic speech utterances (X), and/or un-transcribed non-synthetic speech utterances (X). Each unspoken textual utterance includes text-only data (i.e., unpaired data) such that each unspoken textual utterance is not paired with any corresponding spoken audio representation (i.e., speech) of the utterance. The unspoken textual utterance may include any sequence of text chunks including words, word-pieces, phonemes, and/or graphemes. Each un-transcribed non-synthetic speech utterance (also referred to as simply “un-transcribed speech utterance”) includes audio-only data (i.e., unpaired data) such that the un-transcribed speech utterance is not paired with any corresponding transcription. On the other hand, each transcribed non-synthetic speech utterance (also referred to as simply “transcribed speech utterance”) includes a corresponding transcription paired with a corresponding non-synthetic speech representation of the corresponding transcribed speech utterance.
For simplicity, the training process includes a contrastive self-supervised loss part, a supervised loss part, and a consistency regularization part. The training process trains the shared audio-text encoder on a total loss (L) based on: contrastive losses (L) derived using the contrastive self-supervised loss part from the unspoken training text utterances (X), a corpus of transcribed non-synthetic speech utterances (X), and un-transcribed non-synthetic speech utterances (X); supervised losses (L) derived using the supervised loss part from the unspoken training text utterances (X) and the transcribed non-synthetic speech utterances (X); and consistency losses ((θ)) derived using the consistency regularization part.
The training process may employ an alignment model that is configured to generate, at each of a plurality of output steps, alignment outputs (i.e., textual representations) for each of a plurality of unspoken training text utterances or for each of a plurality of transcriptions corresponding to transcribed speech utterances. The unspoken textual utterances include unspoken text that is text-only data, i.e., unpaired data, such that each unspoken textual utterance (X) is not paired with any synthesized or non-synthesized speech. Accordingly, the alignment model generates a corresponding alignment output for each of the unspoken textual utterances or for each of the transcriptions.
Referring now to, in some examples, the alignment model includes an embedding extractor, a duration predictor, and an upsampler. The embedding extractor receives the unspoken textual utterance (or transcription) that includes a sequence of text chunks including words, word-pieces, phonemes, and/or graphemes and extracts a corresponding initial textual representation (e). The initial textual representation includes embedding lexical information from the unspoken textual utterance or the transcription corresponding to the transcribed speech utterance. The duration predictor receives the initial textual representation from the embedding extractor and predicts a corresponding text chunk duration (i.e., word, word-piece, phoneme, and/or grapheme duration). The text chunk duration indicates a duration the corresponding text chunk would be spoken if a human (or text-to-speech system) spoke the unspoken textual utterance. For example, the unspoken textual utterance may include a sequence of phonemes and the duration predictor predicts a phoneme duration for each phoneme in the sequence of phonemes. In this example, the duration predictor predicts the phoneme duration by predicting a probability of non-zero duration for each phoneme and predicting a probability of continuous phoneme duration for each phoneme. As the sequence of phonemes includes regular phonemes, silences between word boundaries, and punctuation marks, only the regular phonemes are associated with non-zero duration while the silences and punctuation marks are generally associated with the continuous phoneme duration. Accordingly, the duration predictor may use a sigmoid activation following a first one of two independent projections to predict the probability of non-zero duration and use a softplus activation following a second one of the two independent projections to predict the continuous text chunk duration for each text chunk.
The duration predictor determines, for each text chunk, whether the probability of non-zero duration is less than a threshold value, and when the probability of non-zero duration is less than the threshold value, a multiplier may zero-out the continuous text chunk duration predicted by the softplus activation for the corresponding text chunk. Otherwise, when the probability of non-zero duration is not less than the threshold value, the predicted text chunk duration may be set equal to the continuous phoneme duration predicted by the softplus activation.
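The gating described above can be illustrated with a minimal sketch. The function names, the per-chunk logit inputs, and the threshold value of 0.5 are assumptions made for illustration; the text does not specify them.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def predict_durations(nonzero_logits, duration_logits, threshold=0.5):
    """Gate each continuous duration by its probability of non-zero duration.

    nonzero_logits: output of the first projection, one value per text chunk.
    duration_logits: output of the second projection, one value per text chunk.
    threshold: hypothetical cut-off; the source does not give a value.
    """
    durations = []
    for z, d in zip(nonzero_logits, duration_logits):
        p_nonzero = sigmoid(z)
        continuous = softplus(d)
        # Zero out the duration when the non-zero probability is below threshold.
        durations.append(continuous if p_nonzero >= threshold else 0.0)
    return durations
```

For instance, `predict_durations([2.0, -3.0], [1.0, 1.0])` keeps the softplus duration for the first chunk and zeroes out the second.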
The upsampler receives, for each unspoken textual utterance, the corresponding initial textual representation and the predicted text chunk duration, and generates an alignment output (ê) having a number of frames by upsampling the initial textual representation using the corresponding predicted text chunk duration. In some examples, paired training data is available and the upsampler generates the alignment output as follows:
Here, the upsampler includes resampler and refiner layers that align the initial textual embedding directly with a corresponding encoded audio representation. In other examples, paired training data is not available and the upsampler generates the alignment output as follows.
In particular, the number of frames of the alignment output indicates a predicted speech duration of the unspoken textual utterance. Stated differently, the number of frames of the alignment output maps (i.e., aligns) the sequence of text chunks of the unspoken textual utterance to speech frames. Here, the upsampler includes resampler and refiner layers that replicate the initial textual embedding to match the predicted text chunk duration (i.e., speech duration). As such, the alignment output includes a textual representation of the unspoken textual utterance having a timing component that aligns with how a human would speak the unspoken textual utterance.
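The replication step for the unpaired case can be sketched as follows. This is an illustration only: it covers the duration-based replication and omits the refiner layers, and both the function name and the choice to round each duration to a whole number of frames are assumptions.

```python
def upsample(embeddings, durations):
    """Repeat each text chunk embedding for its predicted number of frames.

    embeddings: one vector (list of floats) per text chunk.
    durations: predicted duration per text chunk, measured in frames.
    Rounding to the nearest whole frame is an assumption of this sketch.
    """
    frames = []
    for vector, duration in zip(embeddings, durations):
        num_frames = max(0, round(duration))
        # Copy the vector so each frame is an independent list.
        frames.extend(list(vector) for _ in range(num_frames))
    return frames
```

With embeddings `[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]` and durations `[2.0, 0.0, 3.2]`, the result has five frames: two copies of the first embedding, none of the second, and three of the third.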
In some implementations, the shared audio-text encoder includes a conformer encoder including a stack of conformer blocks, each of which includes multi-head self-attention, depthwise convolution, and feed-forward layers. Alternatively, the shared audio-text encoder may include another type of encoder having a stack of self-attention layers/blocks, such as a transformer encoder including a stack of transformer blocks. The shared audio-text encoder can naturally be split into a feature encoder, including a convolution subsampling block, and a context network, including a linear layer and a stack of Conformer blocks. In some implementations, the convolution subsampling block has two two-dimensional-convolution layers, both with strides (2, 2), resulting in a 4× reduction in the feature sequence length. The convolution subsampling block receives, as input, a sequence of input features/vectors (e.g., mel-frequency spectrograms such as the acoustic frames) associated with each transcribed non-synthetic speech utterance and each un-transcribed non-synthetic speech utterance, and generates, as output, for each of a plurality of output steps, an encoded audio feature that corresponds to a respective one of the transcribed non-synthetic speech utterances or a respective one of the un-transcribed non-synthetic speech utterances. The convolution subsampling block may also receive, as input, each alignment output and generate, as output, for each of the plurality of output steps, an encoded textual feature that corresponds to a respective one of the alignment outputs.
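The 4× reduction follows from applying the standard convolution output-length formula twice with stride 2. The sketch below assumes a kernel size of 3 and a padding of 1, neither of which is stated in the text; only the stride of 2 and the two-layer structure come from the description.

```python
def conv_output_length(length, kernel=3, stride=2, padding=1):
    """Output length of one convolution along a single axis.

    kernel and padding are assumed values; only the stride of 2 is
    given in the description.
    """
    return (length + 2 * padding - kernel) // stride + 1


def subsampled_length(length, num_layers=2):
    """Length after the two stride-2 convolution layers of the block."""
    for _ in range(num_layers):
        length = conv_output_length(length)
    return length
```

Under these assumptions, a sequence of 100 input frames is reduced to 50 after the first layer and 25 after the second. Because the strides are (2, 2), the feature axis shrinks the same way, for example from 80 mel bins to 20.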
The encoded audio and textual features (interchangeably referred to as “encoded features”) output from the convolution subsampling block may be fed to a masking module, where some of the encoded features are randomly chosen and replaced with a trained feature vector shared between all masked time steps to provide corresponding masked encoded audio features and masked encoded textual features. In some examples, the masking module selects the encoded features for masking by randomly sampling without replacement a certain proportion p of all time steps to be start indices and then masks the subsequent M consecutive time steps from every sampled index, whereby some spans may overlap. After masking is applied, the linear layer and the Conformer blocks of the context network receive the masked encoded features (or the encoded features not chosen by the masking module) and output corresponding contrastive context vectors (i.e., encoded representations) from the masked encoded features. Moreover, a quantizer receives the encoded features as input and generates quantized vectors (i.e., target context vectors) as output. Thereafter, a contrastive loss module derives a contrastive loss (L) between the contrastive context vectors at the masked positions and the target context vectors as follows.
where c_t is the contrastive context vector centered over a masked time step t and q_t represents a target context vector at the time step t in a set of K+1 candidate target context vectors, which includes q_t and K distractors. Distractors may be uniformly sampled from other masked time steps of the same utterance.
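The span masking and the contrastive objective described above can be sketched together. The sketch assumes a softmax over temperature-scaled cosine similarities, which is a common form for this kind of loss; the similarity function, the temperature value, and all function names are assumptions rather than details taken from the text. Vectors are assumed to be non-zero.

```python
import math
import random


def sample_span_mask(num_steps, p, span, seed=None):
    """Pick a proportion p of time steps as span starts, without replacement,
    then mask `span` consecutive steps from each start (spans may overlap)."""
    rng = random.Random(seed)
    starts = rng.sample(range(num_steps), int(p * num_steps))
    mask = [False] * num_steps
    for start in starts:
        for t in range(start, min(start + span, num_steps)):
            mask[t] = True
    return mask


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(c_t, q_t, distractors, temperature=0.1):
    """Negative log-probability of picking q_t among the K+1 candidates.

    The cosine similarity and the temperature are assumptions of this sketch.
    """
    candidates = [q_t] + list(distractors)
    logits = [cosine(c_t, q) / temperature for q in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_denominator - logits[0]
```

The loss is close to zero when the context vector matches its target and differs from every distractor, and it equals log(K+1) when all K+1 candidates are indistinguishable.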
The contrastive loss is optimized between the contrastive context vectors at the masked positions and the target context vectors. After the shared audio-text encoder converges on the un-transcribed non-synthetic speech utterances, the training procedure is repeated on both the alignment outputs corresponding to the unspoken textual utterances and the transcribed non-synthetic speech utterances. Thus, the contrastive loss (L) is optimized for both real/human (non-synthetic) speech and unspoken textual utterances represented by alignment outputs, with additional auxiliary losses on the transcribed non-synthetic speech utterances and the alignment outputs as described in greater detail below. Accordingly, the training process trains the shared audio-text encoder on the derived contrastive loss applied on the corresponding encoded features associated with each alignment output, each transcribed non-synthetic speech utterance, and each un-transcribed non-synthetic speech utterance provided as input to the shared audio-text encoder. Training the shared audio-text encoder may include updating parameters of the shared audio-text encoder based on the contrastive losses.
The supervised loss part of the training process is configured to inject lexical information into the shared audio-text encoder during training based on supervised loss terms derived from the transcribed non-synthetic speech utterances and the alignment outputs corresponding to unspoken textual utterances output by the alignment model. Notably, the supervised loss part leverages one or more auxiliary decoders for generating the supervised loss terms. The auxiliary decoders may include Connectionist Temporal Classification (CTC) decoders, Listen Attend Spell (LAS) decoders, or RNN-T decoders. These auxiliary decoders may include at least one of a phoneme decoder configured to decode a sequence of phonemes or a wordpiece decoder configured to decode a sequence of word pieces. The auxiliary decoders could also include a grapheme decoder configured to decode a sequence of graphemes.
In some implementations, the shared audio-text encoder includes a text encoder configured to receive textual inputs and generate corresponding encodings and a speech encoder configured to receive audio inputs and generate corresponding encodings. That is, the text encoder of the shared audio-text encoder is configured to receive alignment outputs (i.e., text embeddings) from the alignment model and the speech encoder is configured to receive transcribed non-synthetic speech utterances. Thus, the text encoder generates encoded textual representations for alignment outputs (e.g., corresponding to an unspoken textual utterance) and the speech encoder generates encoded audio representations for speech inputs (i.e., transcribed non-synthetic speech utterances). Notably, the shared audio-text encoder generates the encoded textual representations and the encoded audio representations such that both representations are compatible with the auxiliary decoder despite the input modality mismatch between text and audio. Accordingly, the shared audio-text encoder maps the encoded textual representations and the encoded audio representations (e.g., multimodal embeddings) into a shared latent representation space compatible with the auxiliary decoder.
The auxiliary decoder, including the phoneme decoder or the wordpiece decoder, receives, as input, each encoded textual representation and generates, as output, a first probability distribution over possible speech recognition hypotheses for the corresponding alignment output at the corresponding time step. In some examples, the first probability distribution over possible speech recognition hypotheses includes one of possible phoneme labels, possible word piece labels, or possible grapheme labels. Thereafter, a supervised loss module may determine an alignment output loss term based on the first probability distribution over possible speech recognition hypotheses for the alignment output corresponding to the unspoken textual utterance. Here, the corresponding unspoken textual utterance from which the alignment output is generated also serves as a ground-truth transcription. The supervised loss part may train the shared audio-text encoder on the alignment output loss term by updating parameters of the shared audio-text encoder using the alignment output loss term.
Similarly, the auxiliary decoder, including the phoneme decoder or the wordpiece decoder, receives, as input, each encoded audio representation and generates, as output, a second probability distribution over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthetic speech utterance at the corresponding time step. In some examples, the second probability distribution over possible non-synthetic speech recognition hypotheses includes one of the possible phoneme labels, the possible word piece labels, or the possible grapheme labels. Thereafter, the supervised loss module may determine a non-synthetic speech loss term based on the second probability distribution over possible non-synthetic speech recognition hypotheses and the corresponding transcription paired with the transcribed non-synthetic speech utterance. Here, the corresponding transcription serves as a ground-truth transcription and may include a sequence of target phonemes, target word pieces, and/or target graphemes. The supervised loss part may train the shared audio-text encoder on the non-synthetic speech loss term by updating parameters of the shared audio-text encoder using the non-synthetic speech loss term.
The un-transcribed non-synthetic speech utterances and the unspoken textual utterances each correspond to “unpaired” training data, whereby the contrastive loss (L) derived from the unspoken textual utterances (X) may be combined with the supervised loss associated with the alignment output loss term to obtain an unspoken textual loss function as follows.
Likewise, the contrastive loss (L) derived from the un-transcribed non-synthetic speech utterances (X) may be used to express an unsupervised speech loss function as follows.
During training of the shared audio-text encoder, the alignment outputs and the un-transcribed non-synthetic speech utterances may be separated or mixed within each batch. In order to force the shared audio-text encoder to learn representations that are effective for both alignment outputs corresponding to unspoken textual utterances and non-synthetic (human/real) speech, the loss mask ø is applied when combining the loss functions of Equations 5 and 6 to obtain an unpaired data loss function as follows.
The transcribed non-synthetic speech utterances correspond to “paired” and “supervised” training data, whereby the derived contrastive loss (L) and the derived supervised loss associated with the non-synthetic speech loss term may be combined to obtain a paired data loss function as follows.
The consistency regularization part (i.e., modality matching part) of the training process is configured to encourage the shared audio-text encoder to learn consistent predictions between non-synthetic speech (e.g., real/human speech) and alignment outputs corresponding to unspoken textual utterances by generating a consistent loss term between training utterance pairs that each include a corresponding one of the transcribed non-synthetic speech utterances (X) and a paired alignment output of the same utterance as the corresponding transcribed non-synthetic speech utterance. As such, the non-synthetic speech utterance and the paired alignment output of each training utterance pair are associated with a same ground-truth transcription. In short, the consistent loss term between the transcribed non-synthetic speech utterance and the paired alignment output of the same training utterance provides an unsupervised training aspect by encouraging the shared audio-text encoder to behave consistently regardless of whether the training utterance belongs to non-synthetic speech (i.e., speech training data) or the alignment output (i.e., text training data), and independent of supervised loss terms between the ground-truth transcription and each of: non-synthetic speech recognition hypotheses output by the auxiliary decoder; and speech recognition hypotheses output by the auxiliary decoder.
Similar to the alignment outputs generated from the unspoken textual utterances, the alignment model may generate each paired alignment output using the corresponding transcription that is paired with the transcribed non-synthetic speech utterance. Here, the non-synthetic speech representation is associated with the paired alignment output generated by the alignment model mapping the unspoken textual utterance into speech frames.
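One way to picture a consistent loss term between the two modalities is a divergence between the decoder's per-step output distributions for the speech input and for the paired alignment output. The sketch below uses a Kullback-Leibler divergence averaged over time steps. The choice of KL divergence, the assumption that both sequences have the same number of steps, and the function names are all assumptions of this illustration and are not stated in the text above.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def consistency_loss(speech_distributions, text_distributions):
    """Mean per-step divergence between speech-side and text-side predictions.

    Both arguments are sequences of probability distributions, one per
    time step, and are assumed to be frame-aligned and equally long.
    """
    per_step = [
        kl_divergence(p, q)
        for p, q in zip(speech_distributions, text_distributions)
    ]
    return sum(per_step) / len(per_step)
```

The value is zero when the two sets of predictions agree at every step and grows as they diverge, which is the behavior a consistency term is meant to penalize.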
March 17, 2026