A method includes receiving an audio input and providing the audio input to a transducer model. The method also includes predicting, using the transducer model, text associated with the audio input and outputting the predicted text. The transducer model is trained using a training dataset that includes samples including audio frames and a ground truth subword unit transcription. A first view of a training sample is generated by adding first random noise to the audio frames. A second view of the training sample is generated by adding second random noise to the audio frames. The first view and the second view are provided to the transducer model. The transducer model predicts a probability distribution for each of the first view and the second view. The probability distribution includes probabilities of possible frame synchronization decoding between the audio frames and transcriptions. The transducer model is modified based on a transducer loss.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method comprising: receiving, via an audio input device, an audio input; providing the audio input to a transducer model; predicting, using the transducer model, text associated with the audio input; and outputting the predicted text; wherein: the transducer model is trained using a training dataset, samples in the training dataset including audio frames and a ground truth subword unit transcription; a first view of a training sample is generated by adding first random noise to the audio frames; a second view of the training sample is generated by adding second random noise to the audio frames; the first view and the second view are provided to the transducer model; the transducer model predicts a probability distribution for each of the first view and the second view, the probability distribution including probabilities of possible frame synchronization decoding between the audio frames and transcriptions; and the transducer model is modified based on a transducer loss.
2. The method of claim 1, wherein: the transducer model is modified based on the transducer loss via an application of a first transducer divergence loss, wherein the first view is treated as a teacher and the second view is treated as a student; and the transducer model is modified based on the transducer loss via an application of a second transducer divergence loss, wherein the second view is used as the teacher and the first view is used as the student.
3. The method of claim 2, wherein: each possible frame synchronization decoding corresponds to a sequence of token predictions for the audio frames; and a token prediction for a given audio frame is either a transcript token or a blank token.
4. The method of claim 2, wherein: a weighted sum of the first transducer divergence loss and the second transducer divergence loss is used during training; and weights of the weighted sum are dependent upon how much each probability distribution contributes to the transducer loss.
5. The method of claim 1, wherein the transducer model includes an encoder, a predictor, and a joiner.
6. The method of claim 5, wherein the encoder encodes the audio input into an audio feature vector.
7. The method of claim 6, wherein the predictor receives previous tokens to predict a subsequent token.
8. The method of claim 7, wherein the joiner combines the audio feature vector and the subsequent token to output the probability distribution.
9. An electronic device comprising: at least one processing device configured to: receive, via an audio input device, an audio input; provide the audio input to a transducer model; predict, using the transducer model, text associated with the audio input; and output the predicted text; wherein: the transducer model is trained using a training dataset, samples in the training dataset including audio frames and a ground truth subword unit transcription; a first view of a training sample is generated by adding first random noise to the audio frames; a second view of the training sample is generated by adding second random noise to the audio frames; the first view and the second view are provided to the transducer model; the transducer model predicts a probability distribution for each of the first view and the second view, the probability distribution including probabilities of possible frame synchronization decoding between the audio frames and transcriptions; and the transducer model is modified based on a transducer loss.
10. The electronic device of claim 9, wherein: the transducer model is modified based on the transducer loss via an application of a first transducer divergence loss, wherein the first view is treated as a teacher and the second view is treated as a student; and the transducer model is modified based on the transducer loss via an application of a second transducer divergence loss, wherein the second view is used as the teacher and the first view is used as the student.
11. The electronic device of claim 10, wherein: each possible frame synchronization decoding corresponds to a sequence of token predictions for the audio frames; and a token prediction for a given audio frame is either a transcript token or a blank token.
12. The electronic device of claim 10, wherein: a weighted sum of the first transducer divergence loss and the second transducer divergence loss is used during training; and weights of the weighted sum are dependent upon how much each probability distribution contributes to the transducer loss.
13. The electronic device of claim 9, wherein the transducer model includes an encoder, a predictor, and a joiner.
14. The electronic device of claim 13, wherein the encoder is configured to encode the audio input into an audio feature vector.
15. The electronic device of claim 14, wherein the predictor is configured to receive previous tokens and predict a subsequent token.
16. The electronic device of claim 15, wherein the joiner is configured to combine the audio feature vector and the subsequent token to output the probability distribution.
17. A non-transitory machine-readable medium comprising instructions that when executed by at least one processor cause an electronic device to: receive, via an audio input device, an audio input; provide the audio input to a transducer model; predict, using the transducer model, text associated with the audio input; and output the predicted text; wherein: the transducer model is trained using a training dataset, samples in the training dataset including audio frames and a ground truth subword unit transcription; a first view of a training sample is generated by adding first random noise to the audio frames; a second view of the training sample is generated by adding second random noise to the audio frames; the first view and the second view are provided to the transducer model; the transducer model predicts a probability distribution for each of the first view and the second view, the probability distribution including probabilities of possible frame synchronization decoding between the audio frames and transcriptions; and the transducer model is modified based on a transducer loss.
18. The non-transitory machine-readable medium of claim 17, wherein: the transducer model is modified based on the transducer loss via an application of a first transducer divergence loss, wherein the first view is treated as a teacher and the second view is treated as a student; and the transducer model is modified based on the transducer loss via an application of a second transducer divergence loss, wherein the second view is used as the teacher and the first view is used as the student.
19. The non-transitory machine-readable medium of claim 18, wherein: a weighted sum of the first transducer divergence loss and the second transducer divergence loss is used during training; and weights of the weighted sum are dependent upon how much each probability distribution contributes to the transducer loss.
20. The non-transitory machine-readable medium of claim 17, wherein: the transducer model includes an encoder, a predictor, and a joiner; the encoder is configured to encode the audio input into an audio feature vector; the predictor is configured to receive previous tokens and predict a subsequent token; and the joiner is configured to combine the audio feature vector and the subsequent token to output the probability distribution.
Complete technical specification and implementation details from the patent document.
This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/667,647 filed on Jul. 3, 2024, which is hereby incorporated by reference in its entirety.
This disclosure relates generally to machine learning systems and processes. More specifically, this disclosure relates to transducer consistency regularization for speech-to-text applications.
Consistency regularization is an approach for learning how to generate consistent representations of input features with different views or from different models via minimizing a distribution difference. Consistency regularization is used in self-supervised learning and semi-supervised learning to stabilize training and alleviate data perturbation sensitivity. Studies have shown consistency learning is beneficial for speech-to-text tasks, although such studies are limited to certain frameworks and are only optimized with cross entropy loss.
This disclosure relates to transducer consistency regularization for speech-to-text applications.
In a first embodiment, a method includes receiving, via an audio input device, an audio input. The method also includes providing the audio input to a transducer model and predicting, using the transducer model, text associated with the audio input. The method further includes outputting the predicted text. The transducer model is trained using a training dataset, where samples in the training dataset include audio frames and a ground truth subword unit transcription. A first view of a training sample is generated by adding first random noise to the audio frames. A second view of the training sample is generated by adding second random noise to the audio frames. The first view and the second view are provided to the transducer model. The transducer model predicts a probability distribution for each of the first view and the second view, where the probability distribution includes probabilities of possible frame synchronization decoding between the audio frames and transcriptions. The transducer model is modified based on a transducer loss.
In a second embodiment, an electronic device includes at least one processing device configured to receive, via an audio input device, an audio input. The at least one processing device is also configured to provide the audio input to a transducer model and predict, using the transducer model, text associated with the audio input. The at least one processing device is further configured to output the predicted text. The transducer model is trained using a training dataset, where samples in the training dataset include audio frames and a ground truth subword unit transcription. A first view of a training sample is generated by adding first random noise to the audio frames. A second view of the training sample is generated by adding second random noise to the audio frames. The first view and the second view are provided to the transducer model. The transducer model predicts a probability distribution for each of the first view and the second view, where the probability distribution includes probabilities of possible frame synchronization decoding between the audio frames and transcriptions. The transducer model is modified based on a transducer loss.
In a third embodiment, a non-transitory machine-readable medium includes instructions that when executed by at least one processor cause an electronic device to receive, via an audio input device, an audio input. The non-transitory machine-readable medium also includes instructions that when executed by the at least one processor cause the electronic device to provide the audio input to a transducer model and predict, using the transducer model, text associated with the audio input. The non-transitory machine-readable medium further includes instructions that when executed by the at least one processor cause the electronic device to output the predicted text. The transducer model is trained using a training dataset, where samples in the training dataset include audio frames and a ground truth subword unit transcription. A first view of a training sample is generated by adding first random noise to the audio frames. A second view of the training sample is generated by adding second random noise to the audio frames. The first view and the second view are provided to the transducer model. The transducer model predicts a probability distribution for each of the first view and the second view, where the probability distribution includes probabilities of possible frame synchronization decoding between the audio frames and transcriptions. The transducer model is modified based on a transducer loss.
Any single one or any combination of the following features may be used with the first, second, or third embodiments. The transducer model may be modified based on the transducer loss via an application of a first transducer divergence loss, and the first view may be treated as a teacher and the second view may be treated as a student. The transducer model may be modified based on the transducer loss via an application of a second transducer divergence loss, and the second view may be used as the teacher and the first view may be used as the student. Each possible frame synchronization decoding may correspond to a sequence of token predictions for the audio frames, and a token prediction for a given audio frame may be either a transcript token or a blank token. A weighted sum of the first transducer divergence loss and the second transducer divergence loss may be used during training, and weights of the weighted sum may be dependent upon how much each probability distribution contributes to the transducer loss. The transducer model may include an encoder, a predictor, and a joiner. The encoder may encode the audio input into an audio feature vector. The predictor may receive previous tokens to predict a subsequent token. The joiner may combine the audio feature vector and the subsequent token to output the probability distribution.
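As an illustration of the frame synchronization decodings mentioned above, the following minimal Python sketch enumerates alignments under one literal reading of the text: each audio frame emits either the next transcript token or a blank token, and the transcript tokens appear in order. The names `BLANK`, `frame_sync_alignments`, and `collapse` are hypothetical and do not come from this disclosure, and an actual transducer lattice may permit other alignment shapes.

```python
from itertools import combinations

BLANK = "<blank>"  # hypothetical symbol for the blank token


def frame_sync_alignments(num_frames, tokens):
    """Enumerate alignments with exactly one prediction per frame.

    Assumption: every frame emits either the next transcript token or a
    blank, so an alignment is a choice of which frames carry the tokens.
    """
    if len(tokens) > num_frames:
        return []  # not enough frames to emit every token
    alignments = []
    for positions in combinations(range(num_frames), len(tokens)):
        frame_tokens = [BLANK] * num_frames
        for pos, tok in zip(positions, tokens):
            frame_tokens[pos] = tok
        alignments.append(frame_tokens)
    return alignments


def collapse(alignment):
    """Remove blanks to recover the transcript implied by an alignment."""
    return [tok for tok in alignment if tok != BLANK]


if __name__ == "__main__":
    for alignment in frame_sync_alignments(3, ["a", "b"]):
        print(alignment, "->", collapse(alignment))
```

Under this reading, three frames and two tokens yield three alignments, and every alignment collapses to the same transcript.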
Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.
Before undertaking the DETAILED DESCRIPTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The terms “transmit,” “receive,” and “communicate,” as well as derivatives thereof, encompass both direct and indirect communication. The terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation. The term “or” is inclusive, meaning and/or. The phrase “associated with,” as well as derivatives thereof, means to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like.
Moreover, various functions described below can be implemented or supported by one or more computer programs, each of which is formed from computer readable program code and embodied in a computer readable medium. The terms “application” and “program” refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in a suitable computer readable program code. The phrase “computer readable program code” includes any type of computer code, including source code, object code, and executable code. The phrase “computer readable medium” includes any type of medium capable of being accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disc (CD), a digital video disc (DVD), or any other type of memory. A “non-transitory” computer readable medium excludes wired, wireless, optical, or other communication links that transport transitory electrical or other signals. A non-transitory computer readable medium includes media where data can be permanently stored and media where data can be stored and later overwritten, such as a rewritable optical disc or an erasable memory device.
As used here, terms and phrases such as “have,” “may have,” “include,” or “may include” a feature (like a number, function, operation, or component such as a part) indicate the existence of the feature and do not exclude the existence of other features. Also, as used here, the phrases “A or B,” “at least one of A and/or B,” or “one or more of A and/or B” may include all possible combinations of A and B. For example, “A or B,” “at least one of A and B,” and “at least one of A or B” may indicate all of (1) including at least one A, (2) including at least one B, or (3) including at least one A and at least one B. Further, as used here, the terms “first” and “second” may modify various components regardless of importance and do not limit the components. These terms are only used to distinguish one component from another. For example, a first user device and a second user device may indicate different user devices from each other, regardless of the order or importance of the devices. A first component may be denoted a second component and vice versa without departing from the scope of this disclosure.
It will be understood that, when an element (such as a first element) is referred to as being (operatively or communicatively) “coupled with/to” or “connected with/to” another element (such as a second element), it can be coupled or connected with/to the other element directly or via a third element. In contrast, it will be understood that, when an element (such as a first element) is referred to as being “directly coupled with/to” or “directly connected with/to” another element (such as a second element), no other element (such as a third element) intervenes between the element and the other element.
As used here, the phrase “configured (or set) to” may be interchangeably used with the phrases “suitable for,” “having the capacity to,” “designed to,” “adapted to,” “made to,” or “capable of” depending on the circumstances. The phrase “configured (or set) to” does not necessarily mean “specifically designed in hardware to.” Rather, the phrase “configured to” may mean that a device can perform an operation together with another device or parts. For example, the phrase “processor configured (or set) to perform A, B, and C” may mean a general-purpose processor (such as a CPU or application processor) that may perform the operations by executing one or more software programs stored in a memory device or a dedicated processor (such as an embedded processor) for performing the operations.
The terms and phrases as used here are provided merely to describe some embodiments of this disclosure but not to limit the scope of other embodiments of this disclosure. It is to be understood that the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. All terms and phrases, including technical and scientific terms and phrases, used here have the same meanings as commonly understood by one of ordinary skill in the art to which the embodiments of this disclosure belong. It will be further understood that terms and phrases, such as those defined in commonly-used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined here. In some cases, the terms and phrases defined here may be interpreted to exclude embodiments of this disclosure.
Examples of an “electronic device” according to embodiments of this disclosure may include at least one of a smartphone, a tablet personal computer (PC), a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop computer, a netbook computer, a workstation, a personal digital assistant (PDA), a portable multimedia player (PMP), an MP3 player, a mobile medical device, a camera, or a wearable device (such as smart glasses, a head-mounted device (HMD), electronic clothes, an electronic bracelet, an electronic necklace, an electronic accessory, an electronic tattoo, a smart mirror, or a smart watch). Other examples of an electronic device include a smart home appliance. Examples of the smart home appliance may include at least one of a television, a digital video disc (DVD) player, an audio player, a refrigerator, an air conditioner, a cleaner, an oven, a microwave oven, a washer, a dryer, an air cleaner, a set-top box, a home automation control panel, a security control panel, a TV box (such as SAMSUNG HOMESYNC, APPLETV, or GOOGLE TV), a smart speaker or speaker with an integrated digital assistant (such as SAMSUNG GALAXY HOME, APPLE HOMEPOD, or AMAZON ECHO), a gaming console (such as an XBOX, PLAYSTATION, or NINTENDO), an electronic dictionary, an electronic key, a camcorder, or an electronic picture frame. 
Still other examples of an electronic device include at least one of various medical devices (such as diverse portable medical measuring devices (like a blood sugar measuring device, a heartbeat measuring device, or a body temperature measuring device), a magnetic resonance angiography (MRA) device, a magnetic resonance imaging (MRI) device, a computed tomography (CT) device, an imaging device, or an ultrasonic device), a navigation device, a global positioning system (GPS) receiver, an event data recorder (EDR), a flight data recorder (FDR), an automotive infotainment device, a sailing electronic device (such as a sailing navigation device or a gyro compass), avionics, security devices, vehicular head units, industrial or home robots, automatic teller machines (ATMs), point of sales (POS) devices, or Internet of Things (IoT) devices (such as a bulb, various sensors, electric or gas meter, sprinkler, fire alarm, thermostat, street light, toaster, fitness equipment, hot water tank, heater, or boiler). Other examples of an electronic device include at least one part of a piece of furniture or building/structure, an electronic board, an electronic signature receiving device, a projector, or various measurement devices (such as devices for measuring water, electricity, gas, or electromagnetic waves). Note that, according to various embodiments of this disclosure, an electronic device may be one or a combination of the above-listed devices. According to some embodiments of this disclosure, the electronic device may be a flexible electronic device. The electronic device disclosed here is not limited to the above-listed devices and may include new electronic devices depending on the development of technology.
In the following description, electronic devices are described with reference to the accompanying drawings, according to various embodiments of this disclosure. As used here, the term “user” may denote a human or another device (such as an artificial intelligent electronic device) using the electronic device.
Definitions for other certain words and phrases may be provided throughout this patent document. Those of ordinary skill in the art should understand that in many if not most instances, such definitions apply to prior as well as future uses of such defined words and phrases.
None of the description in this application should be read as implying that any particular element, step, or function is an essential element that must be included in the claim scope. The scope of patented subject matter is defined only by the claims. Moreover, none of the claims is intended to invoke 35 U.S.C. § 112(f) unless the exact words “means for” are followed by a participle. Use of any other term, including without limitation “mechanism,” “module,” “device,” “unit,” “component,” “element,” “member,” “apparatus,” “machine,” “system,” “processor,” or “controller,” within a claim is understood by the Applicant to refer to structures known to those skilled in the relevant art and is not intended to invoke 35 U.S.C. § 112(f).
FIGS. 1 through 7, discussed below, and the various embodiments of this disclosure are described with reference to the accompanying drawings. However, it should be appreciated that this disclosure is not limited to these embodiments, and all changes and/or equivalents or replacements thereto also belong to the scope of this disclosure. The same or similar reference denotations may be used to refer to the same or similar elements throughout the specification and the drawings.
As noted above, consistency regularization is an approach for learning how to generate consistent representations of input features with different views or from different models via minimizing a distribution difference. Consistency regularization is used in self-supervised learning and semi-supervised learning to stabilize training and alleviate data perturbation sensitivity. Studies have shown consistency learning is beneficial for speech-to-text tasks, although such studies are limited to certain frameworks and are only optimized with cross entropy loss.
Consistency regularization is a desirable property for speech-to-text tasks since speech data contains different kinds of variations, such as channel variability, speaker variability, and background noise. Various embodiments of this disclosure use a transducer model to perform speech-to-text conversions. Transducer models were first introduced to transform any input sequence into another finite, discrete output sequence and have been adopted in automatic speech recognition (ASR) due to good performance and good support for both streaming and offline decoding modes. As described in this disclosure, transducer models can achieve very competitive results on other challenging tasks, such as speech-to-text translation (ST), which is a non-monotonic alignment task, and text-to-speech synthesis (TTS), which can have a longer target sequence than the input sequence. However, applying consistency regularization to transducer-based modeling is not straightforward due to the complexity of the output distributions. In transducer optimization, the model considers all potential alignments between the input and output sequences. Not all of these alignments are practical for speech-to-text operations, and the distributions from impractical alignments could be far from the desired distributions. Consistency regularization on those distributions could even decrease model performance.
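To make the idea of considering all potential alignments concrete, the following minimal Python sketch computes a transducer log-likelihood with the forward recursion commonly used for RNN-T style models in the general literature. It is an illustration under stated assumptions rather than the training procedure of this disclosure: the function name `transducer_log_likelihood`, the input layout (`log_blank[t][u]` and `log_label[t][u]`), and the lattice that lets several labels be emitted at one frame are all assumptions.

```python
import math


def logsumexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_log_likelihood(log_blank, log_label):
    """Sum the probability of every alignment in the lattice.

    log_blank[t][u]: log-probability of a blank at frame t after u labels
                     (T rows, U + 1 columns).
    log_label[t][u]: log-probability of the next target label at frame t
                     after u labels (T rows, U columns).
    """
    num_frames = len(log_blank)
    num_labels = len(log_blank[0]) - 1
    alpha = [[-math.inf] * (num_labels + 1) for _ in range(num_frames)]
    alpha[0][0] = 0.0
    for t in range(num_frames):
        for u in range(num_labels + 1):
            if t > 0:  # arrive by emitting a blank at the previous frame
                alpha[t][u] = logsumexp(
                    alpha[t][u], alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0:  # arrive by emitting a label at the current frame
                alpha[t][u] = logsumexp(
                    alpha[t][u], alpha[t][u - 1] + log_label[t][u - 1])
    # Every complete alignment ends with a blank at the last frame.
    return alpha[-1][num_labels] + log_blank[-1][num_labels]


if __name__ == "__main__":
    half = math.log(0.5)
    log_blank = [[half, half], [half, half]]  # 2 frames, 1 label
    log_label = [[half], [half]]
    print(math.exp(transducer_log_likelihood(log_blank, log_label)))
```

In the usage example, two alignments each contribute 0.125, so the printed likelihood is close to 0.25. The negative of this log-likelihood is the quantity that a conventional transducer loss minimizes.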
This disclosure provides for training a transducer model for speech-to-text operations and provides for use of the trained transducer model in performing speech-to-text operations on an electronic device. In various embodiments, the transducer model can include a speech encoder that converts audio frames into an audio feature vector, a predictor that takes previous tokens generated based on the audio feature vector and predicts what the next token in the sequence should be, and a joiner that combines the audio feature vector with the predicted token and produces a probability distribution representing the likelihood of different possible outcomes for a next token in the sequence.
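The encoder, predictor, and joiner roles described above can be sketched with a toy pure-Python model. This is a minimal illustration and not the architecture of this disclosure: the class `ToyTransducer`, the random weight initialization, the embedding-averaging predictor, and the additive fusion in the joiner are all assumptions chosen to keep the example short.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class ToyTransducer:
    """Hypothetical toy model; token index 0 stands for the blank token."""

    def __init__(self, feat_dim, hidden, vocab_size, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.enc_w = init(hidden, feat_dim)
        self.embed = init(vocab_size, hidden)
        self.join_w = init(vocab_size, hidden)

    def encode(self, frames):
        """Encoder: map each audio frame to an audio feature vector."""
        return [[math.tanh(x) for x in linear(self.enc_w, f)] for f in frames]

    def predict(self, prev_tokens):
        """Predictor: summarize previous tokens (blank if there are none)."""
        history = prev_tokens or [0]
        vecs = [self.embed[t] for t in history]
        return [math.tanh(sum(col) / len(vecs)) for col in zip(*vecs)]

    def join(self, enc_vec, pred_vec):
        """Joiner: fuse both vectors and output a next-token distribution."""
        fused = [math.tanh(a + b) for a, b in zip(enc_vec, pred_vec)]
        return softmax(linear(self.join_w, fused))


if __name__ == "__main__":
    model = ToyTransducer(feat_dim=3, hidden=4, vocab_size=5)
    enc = model.encode([[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]])
    print(model.join(enc[0], model.predict([])))
```

The joiner output has one probability per vocabulary entry, including the blank, and the probabilities sum to one.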
The transducer model can be trained using a training dataset that includes samples having audio frames and a ground truth subword unit transcription. Using the training dataset, a first view of a training sample can be generated by adding first random noise to the audio frames, and a second view of the training sample can be generated by adding second random noise to the audio frames. The first view and the second view can be provided to the transducer model, and the transducer model can predict a probability distribution for each of the first view and the second view. The probability distribution can include probabilities of possible frame synchronization decoding between the audio frames and transcriptions, and the transducer model can be modified based on a transducer loss.
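The two-view procedure above can be sketched as follows. The sketch is a minimal illustration under stated assumptions: Gaussian noise stands in for the unspecified random noise, a Kullback-Leibler divergence stands in for the transducer divergence loss, and the fixed weights `w_ab` and `w_ba` are placeholders for weights that may instead depend on how much each probability distribution contributes to the transducer loss. The function names are hypothetical.

```python
import math
import random


def add_noise(frames, sigma, rng):
    """Create one view by adding Gaussian noise to every frame value."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in frames]


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions."""
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(teacher, student))


def bidirectional_consistency(dists_a, dists_b, w_ab=0.5, w_ba=0.5):
    """Weighted sum of two divergence terms with swapped roles.

    dists_a and dists_b hold the per-position distributions that the model
    predicted for the first and second views. In an autograd framework the
    teacher side would typically be treated as a constant.
    """
    n = len(dists_a)
    # First view as teacher, second view as student.
    loss_ab = sum(kl_divergence(p, q) for p, q in zip(dists_a, dists_b)) / n
    # Second view as teacher, first view as student.
    loss_ba = sum(kl_divergence(q, p) for p, q in zip(dists_a, dists_b)) / n
    return w_ab * loss_ab + w_ba * loss_ba


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [[0.1, 0.2], [0.3, 0.4]]
    view_1 = add_noise(frames, sigma=0.05, rng=rng)
    view_2 = add_noise(frames, sigma=0.05, rng=rng)
    # Hand-written stand-ins for the model outputs on each view.
    dists_1 = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    dists_2 = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
    print(bidirectional_consistency(dists_1, dists_2))
```

The consistency term is zero when both views produce identical distributions and grows as the predictions diverge. During training it would be combined with the transducer loss.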
Note that while some of the embodiments discussed below are described in the context of use in consumer electronic devices (such as smartphones), this is merely one example. It will be understood that the principles of this disclosure may be implemented in any number of other suitable contexts and may use any suitable device or devices. Also note that while some of the embodiments discussed below may be described based on the assumption that one device (such as a server) performs training of a machine learning model that is deployed to one or more other devices (such as one or more consumer electronic devices), this is also merely one example. It will be understood that the principles of this disclosure may be implemented using any number of devices, including a single device that both trains and uses a machine learning model. In general, this disclosure is not limited to use with any specific type(s) of device(s).
FIG. 1 illustrates an example network configuration 100 including an electronic device in accordance with this disclosure. The embodiment of the network configuration 100 shown in FIG. 1 is for illustration only. Other embodiments of the network configuration 100 could be used without departing from the scope of this disclosure.
According to embodiments of this disclosure, an electronic device 101 is included in the network configuration 100. The electronic device 101 can include at least one of a bus 110, a processor 120, a memory 130, an input/output (I/O) interface 150, a display 160, a communication interface 170, or a sensor 180. In some embodiments, the electronic device 101 may exclude at least one of these components or may add at least one other component. The bus 110 includes a circuit for connecting the components 120-180 with one another and for transferring communications (such as control messages and/or data) between the components.
The processor 120 includes one or more processing devices, such as one or more microprocessors, microcontrollers, digital signal processors (DSPs), application specific integrated circuits (ASICs), or field programmable gate arrays (FPGAs). In some embodiments, the processor 120 includes one or more of a central processing unit (CPU), an application processor (AP), a communication processor (CP), or a graphics processor unit (GPU). The processor 120 is able to perform control on at least one of the other components of the electronic device 101 and/or perform an operation or data processing relating to communication or other functions. As described in more detail below, the processor 120 may perform various operations related to transducer consistency regularization for speech-to-text applications.
The memory 130 can include a volatile and/or non-volatile memory. For example, the memory 130 can store commands or data related to at least one other component of the electronic device 101. According to embodiments of this disclosure, the memory 130 can store software and/or a program 140. The program 140 includes, for example, a kernel 141, middleware 143, an application programming interface (API) 145, and/or an application program (or “application”) 147. At least a portion of the kernel 141, middleware 143, or API 145 may be denoted an operating system (OS).
The kernel 141 can control or manage system resources (such as the bus 110, processor 120, or memory 130) used to perform operations or functions implemented in other programs (such as the middleware 143, API 145, or application 147). The kernel 141 provides an interface that allows the middleware 143, the API 145, or the application 147 to access the individual components of the electronic device 101 to control or manage the system resources. The application 147 may support various functions related to transducer consistency regularization for speech-to-text applications. These functions can be performed by a single application or by multiple applications that each carries out one or more of these functions. The middleware 143 can function as a relay to allow the API 145 or the application 147 to communicate data with the kernel 141, for instance. A plurality of applications 147 can be provided. The middleware 143 is able to control work requests received from the applications 147, such as by allocating the priority of using the system resources of the electronic device 101 (like the bus 110, the processor 120, or the memory 130) to at least one of the plurality of applications 147. The API 145 is an interface allowing the application 147 to control functions provided from the kernel 141 or the middleware 143. For example, the API 145 includes at least one interface or function (such as a command) for filing control, window control, image processing, or text control.
The I/O interface 150 serves as an interface that can, for example, transfer commands or data input from a user or other external devices to other component(s) of the electronic device 101. The I/O interface 150 can also output commands or data received from other component(s) of the electronic device 101 to the user or the other external device.
The display 160 includes, for example, a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a quantum-dot light emitting diode (QLED) display, a microelectromechanical systems (MEMS) display, or an electronic paper display. The display 160 can also be a depth-aware display, such as a multi-focal display. The display 160 is able to display, for example, various contents (such as text, images, videos, icons, or symbols) to the user. The display 160 can include a touchscreen and may receive, for example, a touch, gesture, proximity, or hovering input using an electronic pen or a body portion of the user.
The communication interface 170, for example, is able to set up communication between the electronic device 101 and an external electronic device (such as a first electronic device 102, a second electronic device 104, or a server 106). For example, the communication interface 170 can be connected with a network 162 or 164 through wireless or wired communication to communicate with the external electronic device. The communication interface 170 can be a wired or wireless transceiver or any other component for transmitting and receiving signals.
The wireless communication is able to use at least one of, for example, WiFi, long term evolution (LTE), long term evolution-advanced (LTE-A), 5th generation wireless system (5G), millimeter-wave or 60 GHz wireless communication, Wireless USB, code division multiple access (CDMA), wideband code division multiple access (WCDMA), universal mobile telecommunication system (UMTS), wireless broadband (WiBro), or global system for mobile communication (GSM), as a communication protocol. The wired connection can include, for example, at least one of a universal serial bus (USB), high definition multimedia interface (HDMI), recommended standard 232 (RS-232), or plain old telephone service (POTS). The network 162 or 164 includes at least one communication network, such as a computer network (like a local area network (LAN) or wide area network (WAN)), the Internet, or a telephone network.
The electronic device 101 further includes one or more sensors 180 that can meter a physical quantity or detect an activation state of the electronic device 101 and convert metered or detected information into an electrical signal. For example, one or more sensors 180 can include one or more cameras or other imaging sensors for capturing images of scenes. The sensor(s) 180 can also include one or more buttons for touch input, one or more microphones, a gesture sensor, a gyroscope or gyro sensor, an air pressure sensor, a magnetic sensor or magnetometer, an acceleration sensor or accelerometer, a grip sensor, a proximity sensor, a color sensor (such as an RGB sensor), a bio-physical sensor, a temperature sensor, a humidity sensor, an illumination sensor, an ultraviolet (UV) sensor, an electromyography (EMG) sensor, an electroencephalogram (EEG) sensor, an electrocardiogram (ECG) sensor, an infrared (IR) sensor, an ultrasound sensor, an iris sensor, or a fingerprint sensor. The sensor(s) 180 can further include an inertial measurement unit, which can include one or more accelerometers, gyroscopes, and other components. In addition, the sensor(s) 180 can include a control circuit for controlling at least one of the sensors included here. Any of these sensor(s) 180 can be located within the electronic device 101.
In some embodiments, the first external electronic device 102 or the second external electronic device 104 can be a wearable device or an electronic device-mountable wearable device (such as an HMD). When the electronic device 101 is mounted in the electronic device 102 (such as the HMD), the electronic device 101 can communicate with the electronic device 102 through the communication interface 170. The electronic device 101 can be directly connected with the electronic device 102 to communicate with the electronic device 102 without involving a separate network. The electronic device 101 can also be an augmented reality wearable device, such as eyeglasses, that includes one or more imaging sensors.
The first and second external electronic devices 102 and 104 and the server 106 each can be a device of the same or a different type from the electronic device 101. According to certain embodiments of this disclosure, the server 106 includes a group of one or more servers. Also, according to certain embodiments of this disclosure, all or some of the operations executed on the electronic device 101 can be executed on another or multiple other electronic devices (such as the electronic devices 102 and 104 or server 106). Further, according to certain embodiments of this disclosure, when the electronic device 101 should perform some function or service automatically or at a request, the electronic device 101, instead of executing the function or service on its own or additionally, can request another device (such as electronic devices 102 and 104 or server 106) to perform at least some functions associated therewith. The other electronic device (such as electronic devices 102 and 104 or server 106) is able to execute the requested functions or additional functions and transfer a result of the execution to the electronic device 101. The electronic device 101 can provide a requested function or service by processing the received result as it is or additionally. To that end, a cloud computing, distributed computing, or client-server computing technique may be used, for example. While FIG. 1 shows that the electronic device 101 includes the communication interface 170 to communicate with the external electronic device 104 or server 106 via the network 162 or 164, the electronic device 101 may be independently operated without a separate communication function according to some embodiments of this disclosure.
The server 106 can include the same or similar components 110-180 as the electronic device 101 (or a suitable subset thereof). The server 106 can support driving the electronic device 101 by performing at least one of the operations (or functions) implemented on the electronic device 101. For example, the server 106 can include a processing module or processor that may support the processor 120 implemented in the electronic device 101. As described in more detail below, the server 106 may perform various operations related to transducer consistency regularization for speech-to-text applications.
Although FIG. 1 illustrates one example of a network configuration 100 including an electronic device 101, various changes may be made to FIG. 1. For example, the network configuration 100 could include any number of each component in any suitable arrangement. In general, computing and communication systems come in a wide variety of configurations, and FIG. 1 does not limit the scope of this disclosure to any particular configuration. Also, while FIG. 1 illustrates one operational environment in which various features disclosed in this patent document can be used, these features could be used in any other suitable system.
FIG. 2 illustrates an example speech-to-text system 200 in accordance with this disclosure. For ease of explanation, the system 200 is described as involving the use of the electronic device 101 in the network configuration 100 of FIG. 1. However, the system 200 may be used with any other suitable electronic device(s), such as the server 106, and in any other suitable system(s).
As shown in FIG. 2, the system 200 includes the electronic device 101, which includes the processor 120. The processor 120 is operatively coupled to or otherwise configured to use one or more machine learning models, such as a transducer model 202. The transducer model 202 can be trained to take audio data as input and perform speech-to-text operations to create one or more text transcriptions corresponding to the audio data.
The processor 120 can also be operatively coupled to or otherwise configured to use one or more other machine learning models 204, such as one or more personal assistant, automated speech recognition (ASR), and/or natural language understanding (NLU) models. It will be understood that the machine learning models can be stored in a memory of the electronic device 101 (such as the memory 130) and accessed by the processor 120 to perform speech-to-text or other tasks. However, the machine learning models can be stored in any other suitable manner.
The system 200 also includes an audio input device 206 (such as a microphone), an audio output device 208 (such as a speaker or headphones), and a display 210 (such as a screen or a monitor like the display 160). The processor 120 can receive an audio input from the audio input device 206 and provide the audio input to the transducer model 202. The transducer model 202 can convert audio frames from the audio input into audio feature vectors, generate tokens based on the audio feature vector, predict one or more additional tokens, and combine the audio feature vector and the predicted token(s) to create a probability distribution representing a likelihood of different possible outcomes for a next token for the audio data. This process serves to convert the input audio into a text representation that can be used, such as by one of the other models 204, to perform one or more tasks.
As a particular example, assume an utterance is received from a user via the audio input device 206, such as “hey BIXBY, call mom.” Here, the transducer model 202 converts the audio input into a textual representation, and the textual representation can be used, such as by the other models 204, to determine a task to be performed by the electronic device 101. In this case, for instance, this can cause the processor 120 to instruct the audio output device 208 to output “calling Mom,” and the processor 120 can also cause a phone application or other communication application to begin a communication session with a “mom” contact stored on the electronic device 101 or otherwise in association with the user of the electronic device 101. As another particular example, suppose an utterance of “hey BIXBY, start a timer” is received. In such an example, the transducer model 202 can convert the audio data into a textual representation. Using the textual representation and possibly one or more of the other models 204, the processor 120 may instruct execution of a timer application and display of a timer on the display 210 of the electronic device 101.
Although FIG. 2 illustrates one example of a speech-to-text system 200, various changes may be made to FIG. 2. For example, the audio input device 206, the audio output device 208, and the display 210 can be connected to the processor 120 within the electronic device 101, such as via wired connections or circuitry. In other embodiments, the audio input device 206, the audio output device 208, and the display 210 can be external to the electronic device 101 and connected via wired or wireless connections. Also, in some cases, the transducer model 202, as well as one or more of the other machine learning models 204, can be stored as separate models called upon by the processor 120 to perform certain tasks or can be included in and form a part of one or more larger machine learning models. Further, in some embodiments, one or more of the machine learning models, including the transducer model 202 and one or more of the other machine learning models 204, can be stored remotely from the electronic device 101, such as on the server 106. Here, the electronic device 101 can transmit requests including inputs (such as captured audio data) to the server 106 for processing of the inputs using the machine learning models, and the results can be sent back to the electronic device 101. In addition, in some embodiments, the electronic device 101 can be replaced by the server 106, which can receive audio inputs from a client device and transmit instructions back to the client device to execute functions associated with instructions included in utterances.
FIG. 3 illustrates an example transducer model architecture 300 in accordance with this disclosure. For ease of explanation, the architecture 300 shown in FIG. 3 is described as being implemented on or supported by the electronic device 101 in the network configuration 100 of FIG. 1. However, the architecture 300 shown in FIG. 3 could be used with any other suitable device(s) and in any other suitable system(s), such as when the architecture 300 is implemented on or supported by the server 106. In some cases, the architecture 300 can represent or be a part of the transducer model 202 described with respect to FIG. 2.
As shown in FIG. 3, the transducer model architecture 300 includes a speech encoder 302, a predictor 304, and a joiner 306. The speech encoder 302 receives audio data 308 and transforms the audio data 308 into a hidden audio representation. For example, the speech encoder 302 can convert one or more audio frames of the audio data 308 into an audio feature vector, such as by taking the raw audio data and extracting important information from the raw audio data. The audio feature vector can be used by the architecture 300 for generating predictions.
The predictor 304 receives a previous token and predicts a hidden text representation. For example, the predictor 304 can take previous tokens generated based on the audio feature vector and predict what the next token should be. By using this approach, the transducer model can condition each prediction on the text emitted so far. The joiner 306 uses the hidden audio representation and the hidden text representation to predict a current token. For example, the joiner 306 can combine the audio feature vector with the predicted token and produce a probability distribution. This probability distribution represents the likelihood of different possible outcomes for the next token. Overall, the speech encoder 302, the predictor 304, and the joiner 306 work together to form a transducer model for processing audio data. The architecture 300 can be used to output text predictions corresponding to the audio data 308.
In various embodiments, the speech encoder 302 can be a recurrent neural network (RNN), a convolutional neural network (CNN), a transformer network, etc. Also, in various embodiments, the predictor 304 can be a prediction network decoder such as a recurrent neural network like a long short-term memory (LSTM) decoder, a gated recurrent unit (GRU) decoder, or a transformer decoder. Unlike traditional sequence-to-sequence models, the predictor 304 may operate at the output level, meaning it predicts tokens one by one. In addition, in various embodiments, the joiner 306 can be a joint network that outputs the final sequence of tokens by jointly considering both the acoustic information from the speech encoder 302 and the predicted output context from the predictor 304.
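The way a joiner can turn a hidden audio representation and a hidden text representation into a probability distribution over tokens can be sketched as follows. This is a minimal illustration under stated assumptions, not the joint network of this disclosure: the additive combination followed by `tanh`, the single output projection `w_out`/`b_out`, the function name `joiner`, and the toy sizes are all hypothetical.

```python
import numpy as np

def joiner(enc, pred, w_out, b_out):
    """Combine encoder and predictor states into per-(t, u) token distributions.

    enc:  (T, H) hidden audio representations, one per audio frame
    pred: (U + 1, H) hidden text representations, one per label prefix
    Returns an array of shape (T, U + 1, D) whose last axis sums to 1.
    """
    # Assumption: broadcast-add every frame with every prefix, then squash.
    joint = np.tanh(enc[:, None, :] + pred[None, :, :])   # (T, U + 1, H)
    logits = joint @ w_out + b_out                        # (T, U + 1, D)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)      # softmax over tokens

rng = np.random.default_rng(0)
T, U, H, D = 5, 3, 8, 6   # toy sizes; D counts the blank token as well
probs = joiner(rng.normal(size=(T, H)), rng.normal(size=(U + 1, H)),
               rng.normal(size=(H, D)), np.zeros(D))
print(probs.shape)
```

Each slice `probs[t, u]` is then a distribution over the D possible next tokens given frame t and the first u transcript tokens.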
As also shown in FIG. 3, during training of the transducer model, a transducer loss 310 can be obtained. To train the transducer model, a training dataset can be used that includes audio recordings and corresponding subword unit transcriptions. In some cases, two different versions of each training example can be generated by adding random noise to the audio frames. These versions or “views” are provided to the transducer model, which calculates the likelihood of various alignments between the audio frames and the transcripts. The transducer model predicts a probability distribution for each of the different views, where the probability distribution includes probabilities of possible frame synchronization decoding between the audio frames and transcriptions. The process of frame synchronization decoding involves generating a sequence of predicted tokens for T audio frames. Each token prediction can either be a transcript token or a blank token.
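Generating two views of one training sample can be sketched as follows. The noise type and scale are not specified here, so the zero-mean Gaussian noise, the `noise_std` value, and the helper name `make_views` are assumptions made only for illustration.

```python
import numpy as np

def make_views(frames, noise_std=0.05, seed=None):
    """Return two noisy copies of the same audio frames.

    frames: (T, F) array of T audio frames with F features each.
    Each view gets its own independent noise draw, so the views differ
    from each other while sharing the same ground truth transcription.
    """
    rng = np.random.default_rng(seed)
    view_1 = frames + rng.normal(0.0, noise_std, size=frames.shape)  # first random noise
    view_2 = frames + rng.normal(0.0, noise_std, size=frames.shape)  # second random noise
    return view_1, view_2

frames = np.zeros((4, 80))          # toy input: 4 frames, 80 features
v1, v2 = make_views(frames, seed=0)
print(v1.shape, v2.shape)
```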
In some embodiments, the transducer loss used to train the transducer model can include a combination of two types of losses. The two types of losses can include a first transducer Kullback-Leibler (KL) divergence loss, where a first view of the different views is considered the “teacher” and a second view is considered the “student.” The two types of losses can also include a second transducer KL divergence loss with the roles reversed, where the first view of the different views is considered the “student” and a second view is considered the “teacher.” KL divergence loss is a measure of a statistical distance, meaning the measure of how much a model probability distribution Q is different from a true probability distribution P. This approach allows for the transducer model to effectively learn from multiple noisy versions of the same data. The first transducer KL divergence loss and the second transducer KL divergence loss can be weighted based on their contributions to the overall transducer loss. Training the transducer model in this way has been found to produce more accurate results than other approaches.
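The two-direction divergence can be illustrated with a small NumPy sketch. It assumes that the teacher plays the role of the reference distribution P and the student the role of the model distribution Q, and that the two directions are combined with fixed weights `w12` and `w21`; these choices and all names are illustrative assumptions. In an actual training framework the teacher side would typically be treated as a constant when computing gradients, which plain NumPy cannot express.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q), summed over the last axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def bidirectional_kl(p1, p2, w12=0.5, w21=0.5):
    """Weighted sum of both teacher/student assignments of two views."""
    loss_12 = kl(p1, p2)   # view 1 as teacher, view 2 as student
    loss_21 = kl(p2, p1)   # roles reversed
    return w12 * loss_12 + w21 * loss_21

p1 = np.array([0.7, 0.2, 0.1])   # toy distribution from the first view
p2 = np.array([0.5, 0.3, 0.2])   # toy distribution from the second view
print(bidirectional_kl(p1, p2))
```

With equal weights the combined loss is symmetric in the two views, and it is zero only when both views produce the same distribution.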
Using a transducer model, including the architecture 300, for speech-to-text and text-to-speech can have various advantages depending on the implementation. For example, if a user is using the electronic device 101 for voice commands, the electronic device 101 can benefit from being able to understand what the user is saying and to respond quickly. This involves two main tasks: recognizing what the user said (speech-to-text) and saying something back to the user (text-to-speech). The transducer model acts like a translator to help the electronic device 101 convert the user speech into words (text) and, if needed, turn those words back into speech. Existing approaches for speech-to-text could break this process into several steps, like figuring out the sounds the user is making, matching them to words, and processing those words to understand the user, which tends to be time-consuming. The various embodiments of this disclosure, however, provide for using a transducer model that performs all of this in a single step. The transducer model listens to the user's voice and directly translates user speech into text as the user speaks, which is especially useful for tasks like live conversations or commands. The transducer model can also perform text-to-speech by taking a generated text response to user speech and turning it into speech that sounds natural and smooth by determining the right tone, emphasis, and pauses to make the speech sound like a real person, instead of a robotic voice.
The transducer model of this disclosure processes tasks quickly, including performing speech-to-text as a user speaks, making applications like voice assistants or live subtitles very responsive while also maintaining high accuracy by handling everything at once, reducing mistakes that can happen when speech is processed in separate steps. The transducer model operates by aligning an input sequence with an output sequence dynamically, which can be particularly useful for ASR and speech-to-text tasks in which the length of the input (audio) and output (text) can vary significantly. Unlike traditional models that rely on external alignment techniques like Hidden Markov Models (HMMs) or Connectionist Temporal Classification (CTC), the transducer model performs alignment internally to provide for dynamic alignment. It can predict both “blank” tokens (which skip over input frames) and actual output symbols (such as text characters or words). This allows the transducer model to handle situations where the input and output lengths do not correspond directly. One possible significant advantage of the transducer model is its ability to process streaming inputs. Since the transducer model does not need the entire input sequence beforehand, the transducer model can generate outputs (such as text) incrementally as more audio becomes available, making it suitable for real-time applications.
In various embodiments, during ASR tasks, the transducer model can process an audio signal and convert the audio signal directly into text. In ASR tasks, the speech encoder 302 processes the raw acoustic features (such as Mel-spectrogram or Mel-Frequency Cepstral Coefficients (MFCC) features) and generates a high-level sequence of hidden states representing the speech signal. The predictor 304 generates hypotheses for the next token in the output text based on previous tokens (or blank predictions). The joiner 306 takes the representations from the speech encoder 302 and the predictor 304, combining them to output a probability distribution over possible output symbols at each time step. In some cases, the transducer model can be trained using forward-backward algorithms, optimizing over all possible alignments between the input and output.
In various embodiments, during speech-to-text translation tasks, the transducer model can be extended to directly map audio in one language to text in another language. Here, the transducer model may work similarly to ASR transducers but can be trained with bilingual data, where the input is speech in one language and the output is text in another language. However, the predictor 304 here can generate tokens in the target language instead of the source language, which adds complexity due to the non-monotonic alignment between speech and translated text.
In text-to-speech synthesis (TTS) tasks, the transducer model can be used to convert a textual input into a sequence of Mel-spectrogram frames, which can be converted into speech waveforms using a vocoder (such as WaveNet or WaveGlow). Here, the speech encoder 302 can convert the input text into hidden representations, and the predictor 304 can predict the Mel-spectrogram sequence. In some embodiments, attention mechanisms can be used to align the input text with the output speech.
In some embodiments, the transducer model can be trained end-to-end, meaning the transducer model can be optimized directly for the task of interest (such as speech-to-text or text-to-speech) without the need for intermediate steps like acoustic modeling or language modeling in ASR. Also, the transducer model may allow for streaming and low latency, making the transducer model particularly suited for real-time applications, as the transducer model can process input incrementally and produce output as soon as sufficient data is available. The transducer model can also provide for internal alignment. By allowing “blank” predictions, the transducer can handle variable-length alignment between inputs (such as speech) and outputs (such as text) internally, without relying on external forced alignment tools like hidden Markov models (HMMs) or alignment models in CTC.
Although FIG. 3 illustrates one example of a transducer model architecture 300, various changes may be made to FIG. 3. For example, various components and functions in FIG. 3 may be combined, further subdivided, replicated, rearranged, or omitted according to particular needs. Also, one or more additional components and functions may be included if needed or desired.
FIG. 4 illustrates an example transducer decoding lattice 400 in accordance with this disclosure. For ease of explanation, the transducer decoding lattice 400 is described as involving the use of the electronic device 101 in the network configuration 100 of FIG. 1 and the transducer model 202 of FIG. 2. However, the transducer decoding lattice 400 may be used with any other suitable electronic device (such as the server 106) or a combination of devices (such as the electronic device 101 and the server 106) and in any other suitable system(s).
As shown in FIG. 4, the transducer decoding lattice 400 can represent frame synchronization-based decoding performed by a transducer model, such as the transducer model 202. The lattice 400 includes an input sequence along an x axis and an output target sequence along a y axis. Specifically, let $x_{1:T} = (x_1, x_2, \ldots, x_T)$ denote a length T input sequence, and let $y_{1:U} = (y_1, y_2, \ldots, y_U)$ denote an output target sequence. The transducer model monotonically maps $x_{1:T}$ to $y_{0:U}$, where $y_0$ = (bos) is the beginning of sentence (bos) token. During frame synchronization-based decoding, every $x_t$ can generate one or multiple tokens. Such mappings $\alpha = [(t_i, u_i)]_{1<i<U+T}$ between $x_{1:T}$ and $y_{0:U}$ can be referred to as alignments, where $t_i \in [1, T]$ and $u_i \in [0, U]$.
Each vertical arrow in the lattice 400 of FIG. 4 denotes a new token generation. If no token is generated at $x_t$, a blank token Ø is assigned. For example, as shown in FIG. 4, there are two consecutive horizontal arrows from step 2 to step 4, so a blank token Ø is assigned to $x_3$ in this example. The transducer is trained to maximize the conditional probability $P_r(y|x)$ (equivalently, to reduce or minimize its negative logarithm), which is obtained by marginalizing over all possible alignment paths. In various embodiments, this can be performed using a forward-backward algorithm. For instance, the forward variable $\alpha(t, u)$ is the probability of generating $y_{1:u}$ at step t from the beginning. This can be represented as follows.
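One form of this recursion that agrees with the boundary condition and emission probabilities defined in this passage is the standard transducer forward recursion. It is written here as an assumption rather than as the exact expression of this disclosure, with $\varnothing$ standing for the blank token Ø:

$$\alpha(t, u) = \alpha(t-1, u)\,\varnothing(t-1, u) + \alpha(t, u-1)\,y(t, u-1)$$

The first term reaches node (t, u) by a horizontal (blank) move from the previous frame, and the second term reaches it by a vertical (token) move within the same frame.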
Here, $\alpha(1, 0) = 1$, and $y(t, u) = P_r(y_{u+1} \mid t, u)$ is the emitting probability from token u at time t to generate token u+1 (vertical movement in the lattice from FIG. 4). $\varnothing(t, u) = P_r(\varnothing \mid t, u)$ is the corresponding emitting probability for the blank token at u and t (horizontal movement from step t to t+1 in FIG. 4). The corresponding backward variable $\beta(t, u)$ is the probability of sequence $y_{u+1:U}$ from time t to the end of the utterance. This can be represented as follows.
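Under the same assumption as for the forward variable (the standard transducer recursion, with $\varnothing$ standing for the blank token Ø), the backward recursion can be written as:

$$\beta(t, u) = \beta(t+1, u)\,\varnothing(t, u) + \beta(t, u+1)\,y(t, u)$$

The first term leaves node (t, u) by a blank move to the next frame, and the second term leaves it by emitting token u+1 within the same frame.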
Here, $\beta(T, U) = \varnothing(T, U)$. The overall probability of sequence $y_{0:U}$ at $x_{1:T}$ is $P_r(y|x) = \alpha(T, U)\,\varnothing(T, U)$.
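The forward and backward variables can be checked numerically with a small sketch. It assumes the standard transducer recursions, uses 0-based array indices (so the text's node (1, 0) is `[0, 0]` and (T, U) is `[T - 1, U]`), and works directly with probabilities for readability, whereas a practical implementation would normally work with log-probabilities. The function name and the random toy inputs are illustrative.

```python
import numpy as np

def forward_backward(y, blank):
    """Forward/backward variables over a (T, U + 1) transducer lattice.

    y[t, u]:     probability of emitting token u + 1 at node (t, u); y[:, U] is unused
    blank[t, u]: probability of emitting the blank token at node (t, u)
    """
    T, U1 = blank.shape
    alpha = np.zeros((T, U1))
    beta = np.zeros((T, U1))
    alpha[0, 0] = 1.0                                   # start of the lattice
    for t in range(T):
        for u in range(U1):
            if t > 0:                                   # horizontal (blank) move
                alpha[t, u] += alpha[t - 1, u] * blank[t - 1, u]
            if u > 0:                                   # vertical (token) move
                alpha[t, u] += alpha[t, u - 1] * y[t, u - 1]
    beta[T - 1, U1 - 1] = blank[T - 1, U1 - 1]          # final blank ends the path
    for t in range(T - 1, -1, -1):
        for u in range(U1 - 1, -1, -1):
            if t < T - 1:
                beta[t, u] += beta[t + 1, u] * blank[t, u]
            if u < U1 - 1:
                beta[t, u] += beta[t, u + 1] * y[t, u]
    total = alpha[T - 1, U1 - 1] * blank[T - 1, U1 - 1]  # P(y | x)
    return alpha, beta, total

rng = np.random.default_rng(0)
y = rng.uniform(0.1, 0.5, size=(4, 3))       # toy lattice: T = 4 frames, U = 2 tokens
blank = rng.uniform(0.1, 0.5, size=(4, 3))
alpha, beta, total = forward_backward(y, blank)
print(total, beta[0, 0])
```

Both printed values are the same quantity computed from opposite ends of the lattice, which is a convenient sanity check for an implementation.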
Although FIG. 4 illustrates one example of a transducer decoding lattice 400, various changes may be made to FIG. 4. For example, although a certain number of inputs in the input sequence and outputs in the output sequence are shown, it will be understood that any number of inputs and outputs could be used.
FIG. 5 illustrates an example transducer model training process 500 in accordance with this disclosure. For ease of explanation, the process 500 is described as involving the use of the electronic device 101 in the network configuration 100 of FIG. 1. However, the process 500 may be used with any other suitable electronic device (such as the server 106) or a combination of devices (such as the electronic device 101 and the server 106) and in any other suitable system(s).
As shown in FIG. 5, a training dataset 502 is provided for use in training a transducer model, such as the transducer model 202. The training dataset 502, as also described with respect to FIG. 3, can include audio recordings and corresponding subword unit transcriptions. As shown in FIG. 5, multiple (such as two) copies of the training dataset 502 are created. Augmentation is performed on the copies, such as by adding random noise to the audio frames, to generate a first augmented training dataset 504 (x′) and a second augmented training dataset 506 (x″). These versions or views are provided to the transducer model. The transducer model performs dropout operations on the first and second augmented datasets 504, 506 and calculates the likelihood of various alignments between the audio frames and the transcriptions to provide a first probability distribution 508 and a second probability distribution 510.
Divergence loss, such as KL divergence, is used to identify a first divergence loss 512 associated with the first probability distribution 508 and the first augmented training dataset 504 and a second divergence loss 514 associated with the second probability distribution 510 and the second augmented training dataset 506. The first divergence loss can be represented as $P_r(y|x_i)$, and the second divergence loss can be represented as $P_r(y|x_j)$. The first and second divergence losses 512, 514 are used to ensure that predictions made by the transducer model are similar for both copies of the training dataset 502. The first and second divergence losses 512, 514 are used to determine a transducer consistency regularization (TCR) loss. The TCR loss is a weighted sum of the divergence loss between the first augmented training dataset 504 and the second augmented training dataset 506. The weights are calculated based on the transducer's occupational probability. The occupational probability is obtained during gradient calculation and can be used to perform a weighted sum on the divergence loss.
In some embodiments, the TCR loss can be represented as follows.
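A form of the TCR loss that matches the quantities named in this passage (weights for token and blank terms, per-node KL divergences between the two views, and occupational probabilities) can be sketched as follows. The exact expression, including the summation range and the symbol $\mathcal{L}_{TCR}$, is an assumption made for illustration, with $\varnothing$ standing for the blank token Ø:

$$\mathcal{L}_{TCR} = \sum_{t,u} \Big[ \beta_y\, \omega_y(t, u)\, D_{KL}(y_{u+1}, t, u, x_i, x_j) + \beta_{\varnothing}\, \omega_{\varnothing}(t, u)\, D_{KL}(\varnothing, t, u, x_i, x_j) \Big]$$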
Here, $\beta_y$ and $\beta_{\varnothing}$ are weights for non-blank token and blank token consistency regularization, respectively. Also, $D_{KL}(y_{u+1}, t, u, x_i, x_j)$ (or $D_{KL}(\varnothing, t, u, x_i, x_j)$) is the KL divergence between distributions $y(t, u)$ (or $\varnothing(t, u)$) from data views i and j. Further, $\omega_y$ and $\omega_{\varnothing}$ are the occupational probabilities at time t and symbol u. That is, the TCR loss is based on the forward and backward probability to reach t, u, which can be represented as follows.
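Assuming the usual definition of arc occupation in a transducer lattice (the posterior probability that an alignment path takes a given token or blank move), the occupational probabilities can be written in terms of the forward and backward variables as follows. This is a sketch under that assumption, with $\varnothing$ standing for the blank token Ø:

$$\omega_y(t, u) = \frac{\alpha(t, u)\, y(t, u)\, \beta(t, u+1)}{P_r(y|x)}, \qquad \omega_{\varnothing}(t, u) = \frac{\alpha(t, u)\, \varnothing(t, u)\, \beta(t+1, u)}{P_r(y|x)}$$

For the final node, the blank move ends the path, so $\omega_{\varnothing}(T, U) = \alpha(T, U)\,\varnothing(T, U) / P_r(y|x) = 1$.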
By using a weighted sum, how much each part of the prediction contributes to the overall result can be adjusted. For instance, if an area of the probability distribution has a high likelihood of being occupied, its contribution to the final result can be set to be greater. On the other hand, if an area of the probability distribution has a low likelihood of being occupied, its contribution can be set to be smaller, such as described with respect to FIG. 4. A total loss for the transducer model can be composed of the transducer loss plus the TCR loss. The total loss can be represented as follows.
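Written out directly from the sentence above, with symbol names chosen here only for illustration, the sum is:

$$\mathcal{L}_{total} = \mathcal{L}_{transducer} + \mathcal{L}_{TCR}$$

Any relative weighting of the consistency term is assumed to be carried by the weights inside the TCR loss rather than by a separate coefficient.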
The parameters, such as weights, of the transducer model can be adjusted based on the transducer loss during training to provide a modified and trained transducer model for use in applications such as speech-to-text applications. In some cases, a transducer output may be four-dimensional and include dimensions (B, T, U, D), where B is the batch size, T is the input sequence length, U is the output target sequence length, and D is the total number of tokens. As described with respect to the lattice 400 of FIG. 4, most of the paths from $\alpha(0, 0)$ to $\alpha(T, U)$ in the decoding lattice for the transducer may be implausible. For example, as shown in FIG. 4, a diagonal path through the lattice may be the most plausible, while paths outside the diagonal portion are less plausible. A pruned region including the most plausible paths can thus be determined using the transducer model. A resulting occupational probability can be used to select pruning bounds such that most of the joint output occupational probability mass is retained. The final pruned transducer output thus becomes (B, T, S, D), where S is a constant number and $S \ll U$. In the U dimension, the pruned region is defined by $p_t < u < p_t + S$, where $p_t$ is the lower bound at time t. Thus, the calculation of the lower bound such that the pruned region maximizes the occupational probability can be represented as follows.
Here, $y'(t, u)$ is the derivative of $L_{tot}$ with respect to $y(t, u)$, and $\varnothing'(t, u)$ is the derivative of $L_{tot}$ with respect to $\varnothing(t, u)$, where total loss can also be defined by $L_{tot} = \alpha(T-1, U) + \varnothing(T-1, U)$.
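The idea of choosing a lower bound per frame so that a window of S symbols retains as much occupation mass as possible can be sketched as follows. This is a simplified illustration and not the expression of this disclosure: it assumes a single non-negative score per lattice node (for example, the sum of the token and blank derivatives), treats the window as covering u = p .. p + S - 1, and omits any constraint that the bounds be non-decreasing over time. The function name and toy input are hypothetical.

```python
import numpy as np

def pruning_lower_bounds(occ, S):
    """Pick, for every frame t, the window start p_t with the largest mass.

    occ: (T, U + 1) non-negative occupation scores per lattice node
    S:   number of symbol positions kept per frame
    Returns an integer array p of shape (T,).
    """
    T, U1 = occ.shape
    S = min(S, U1)
    # Prefix sums let every window sum be read off as a difference.
    csum = np.concatenate([np.zeros((T, 1)), np.cumsum(occ, axis=1)], axis=1)
    window = csum[:, S:] - csum[:, :U1 - S + 1]   # window[t, p] = sum occ[t, p:p+S]
    return window.argmax(axis=1)

occ = np.eye(4)                  # toy lattice whose mass lies on the diagonal
p = pruning_lower_bounds(occ, S=2)
print(p)
```

On the toy input the selected windows follow the diagonal, which mirrors the observation that plausible paths concentrate near the diagonal of the lattice.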
In some embodiments, as an alternative to using a weighted sum for the divergence losses, a hard cutoff of which region contributes to the final TCR loss can be used. For example, regions with emission probability less than a certain threshold may not contribute to the final TCR loss, such as regions outside the diagonal areas of the probability distribution as described with respect to FIG. 4.
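The difference between the weighted sum and the hard cutoff can be sketched as follows. The sketch assumes one per-node KL value and one per-node probability score that serves either as the soft weight or as the quantity compared against the threshold; the function name `tcr_loss`, the argument names, and the toy numbers are hypothetical.

```python
import numpy as np

def tcr_loss(kl_map, node_prob, threshold=None):
    """Aggregate per-node KL divergences over the lattice.

    kl_map:    (T, U + 1) KL divergence between the two views at each node
    node_prob: (T, U + 1) per-node probability score
    threshold: None for the weighted sum, or a cutoff for the hard variant
    """
    if threshold is None:
        weights = node_prob                              # soft: weight by score
    else:
        weights = (node_prob >= threshold).astype(float) # hard: keep or drop
    return float(np.sum(weights * kl_map))

kl_map = np.ones((2, 2))
node_prob = np.array([[0.9, 0.1],
                      [0.1, 0.7]])
soft = tcr_loss(kl_map, node_prob)                 # every node contributes a little
hard = tcr_loss(kl_map, node_prob, threshold=0.5)  # only likely nodes contribute
print(soft, hard)
```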
Although FIG. 5 illustrates one example of a transducer model training process 500, various changes may be made to FIG. 5. For example, while shown as a series of steps, various steps in FIG. 5 could overlap, occur in parallel, occur in a different order, or occur any number of times (including zero times).
FIG. 6 illustrates an example method 600 for training a transducer model in accordance with this disclosure. For ease of explanation, the method 600 shown in FIG. 6 is described as being performed using the electronic device 101 in the network configuration 100 of FIG. 1. However, the method 600 could be performed using any other suitable device(s), such as the server 106, and in any other suitable system(s).
At step 602, a training dataset is obtained that includes samples having audio frames and a ground truth subword unit transcription. The training dataset can be, for example, the training dataset 502. At step 604, a first view of a training sample is generated by adding first random noise to the audio frames. At step 606, a second view of the training sample is generated by adding second random noise to the audio frames. The first and second views can be the first and second augmented training datasets 504, 506. Note that additional views may or may not be generated here.
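As a non-limiting illustration of generating the two views, the following minimal sketch adds independently drawn noise to the same audio frames twice. Gaussian noise, the noise_std parameter, and the function names are assumptions of the sketch, since the disclosure requires only that the first and second noise be random.

```python
import random


def make_view(frames, noise_std, rng):
    """Return a copy of `frames` with independent Gaussian noise added
    to every feature value; the input frames are left unchanged."""
    return [[x + rng.gauss(0.0, noise_std) for x in frame] for frame in frames]


def make_two_views(frames, noise_std=0.1, seed=0):
    """Draw two different noise realizations for the same training sample."""
    rng = random.Random(seed)
    first_view = make_view(frames, noise_std, rng)
    second_view = make_view(frames, noise_std, rng)
    return first_view, second_view


# Toy sample: 2 audio frames with 3 features each.
frames = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
view_a, view_b = make_two_views(frames)
```

Both views share the same ground truth transcription, so any difference between the predictions for the two views is attributable to the noise alone.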
At step 608, the first view and the second view are provided to a transducer model, such as the transducer model 202. As described in this disclosure, the transducer model can include an encoder, a predictor, and a joiner. The encoder encodes the audio input into an audio feature vector, the predictor receives previous tokens to predict a subsequent token, and the joiner combines the audio feature vector and the subsequent token to output the probability distribution.
At step 610, the transducer model predicts a probability distribution for each of the first view and the second view. The probability distribution includes probabilities of possible frame synchronization decoding between the audio frames and transcriptions in the first and second views. In various embodiments, each possible frame synchronization decoding corresponds to a sequence of token predictions for the audio frames, and a token prediction for a given audio frame is either a transcript token or a blank token.
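As a non-limiting illustration of how the probabilities of all such decodings can be accumulated, the following minimal sketch applies the standard transducer forward recursion over the lattice, in which a blank token advances one frame and a transcript token advances one target position. The function names, the log_probs layout, and the use of index 0 for the blank token are assumptions of the sketch; the returned value corresponds to α(T−1, U)+Ø(T−1, U) in the log domain.

```python
import math


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_log_likelihood(log_probs, targets, blank=0):
    """Sum over all alignments of `targets` to the frames.

    log_probs[t][u][k] is the log-probability of token k at frame t
    after u target tokens have been emitted (shape T x (U+1) x D).
    """
    T, U = len(log_probs), len(targets)
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at (t-1, u)
                step = alpha[t - 1][u] + log_probs[t - 1][u][blank]
                alpha[t][u] = logaddexp(alpha[t][u], step)
            if u > 0:  # arrive by emitting the u-th transcript token at (t, u-1)
                step = alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]]
                alpha[t][u] = logaddexp(alpha[t][u], step)
    # A final blank at the last node terminates every alignment.
    return alpha[T - 1][U] + log_probs[T - 1][U][blank]


# Toy lattice: T = 2 frames, U = 1 target token, D = 3 tokens, uniform scores.
uniform = [[[math.log(1 / 3)] * 3 for _ in range(2)] for _ in range(2)]
log_likelihood = transducer_log_likelihood(uniform, targets=[1])
```

In this toy case there are two valid alignments of three emissions each, so the summed probability is 2/27.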
At step 612, the transducer model is modified based on a transducer loss, such as described in this disclosure with respect to FIG. 5. For example, in various embodiments, the transducer model can be modified based on the transducer loss via an application of a first transducer divergence loss, where the first view is treated as a teacher and the second view is treated as a student. The transducer model can also be modified based on the transducer loss via an application of a second transducer divergence loss, where the second view is used as the teacher and the first view is used as the student. In various embodiments, a weighted sum of the first transducer divergence loss and the second transducer divergence loss can be used during training, and weights of the weighted sum can be dependent upon how much each probability distribution contributes to the transducer loss.
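As a non-limiting illustration of the two divergence losses, the following minimal sketch computes a Kullback-Leibler divergence in each teacher/student direction at every lattice node and combines the results in a weighted sum. The choice of KL divergence, the per-node weights, and the function names are assumptions of the sketch and represent only one possible reading of the weighted sum described above.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two token distributions at one node.
    In a training framework the teacher would typically be held fixed."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher, student)
    )


def consistency_loss(view_a, view_b, weights):
    """Weighted sum over nodes of both teacher/student directions.

    view_a, view_b: per-node token distributions for the two views.
    weights: one hypothetical weight per node, for example reflecting how
    much that node contributes to the transducer loss.
    """
    total = 0.0
    for dist_a, dist_b, w in zip(view_a, view_b, weights):
        first = kl_divergence(dist_a, dist_b)   # first view as teacher
        second = kl_divergence(dist_b, dist_a)  # second view as teacher
        total += w * (first + second)
    return total


# Toy values: two nodes, two tokens; the second node is weighted less.
view_a = [[0.9, 0.1], [0.5, 0.5]]
view_b = [[0.5, 0.5], [0.5, 0.5]]
loss = consistency_loss(view_a, view_b, weights=[1.0, 0.2])
```

The loss is zero when both views produce identical distributions and grows as the predictions for the two views diverge.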
Although FIG. 6 illustrates one example of a method 600 for training a transducer model, various changes may be made to FIG. 6. For example, while shown as a series of steps, various steps in FIG. 6 could overlap, occur in parallel, occur in a different order, or occur any number of times (including zero times).
FIG. 7 illustrates an example method 700 for performing a speech-to-text task using a transducer model in accordance with this disclosure. For ease of explanation, the method 700 shown in FIG. 7 is described as being performed using the electronic device 101 in the network configuration 100 of FIG. 1. However, the method 700 could be performed using any other suitable device(s), such as the server 106, and in any other suitable system(s).
At step 702, an audio input is received, such as via an audio input device like the audio input device 206. At step 704, the audio input is provided to a transducer model, such as the transducer model 202. At step 706, the transducer model is used to predict text associated with the audio input. For example, as described in this disclosure, the transducer model can include an encoder, a predictor, and a joiner. The encoder encodes the audio input into an audio feature vector, the predictor receives previous tokens to predict a subsequent token, and the joiner combines the audio feature vector and the subsequent token to output the probability distribution.
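As a non-limiting illustration of how an encoder output, a predictor, and a joiner can cooperate at inference time, the following minimal sketch performs greedy decoding: at each frame, the most probable token is taken, a blank token advances to the next frame, and any other token is appended to the output. The greedy strategy, the max_symbols_per_frame limit, and the toy predictor and joiner are assumptions of the sketch and are not a description of any particular trained model.

```python
def greedy_decode(encoder_out, predictor, joiner, blank=0, max_symbols_per_frame=3):
    """Greedy transducer decoding over a sequence of encoded frames."""
    tokens = []
    for frame_vec in encoder_out:
        for _ in range(max_symbols_per_frame):
            probs = joiner(frame_vec, predictor(tokens))
            best = max(range(len(probs)), key=probs.__getitem__)
            if best == blank:
                break  # blank: move on to the next audio frame
            tokens.append(best)  # transcript token: stay on this frame
    return tokens


def toy_predictor(tokens):
    """Toy stand-in: the predictor state is just the last emitted token."""
    return tokens[-1] if tokens else 0


def toy_joiner(frame_vec, pred_state):
    """Toy stand-in: each frame 'wants' token `frame_vec` (0 means silence)
    and emits it once, then emits blank."""
    probs = [0.0, 0.0, 0.0]
    target = frame_vec if frame_vec != pred_state else 0
    probs[target] = 1.0
    return probs


# Toy encoder output: frame 1 carries token 1, frame 2 is silent, frame 3 carries token 2.
decoded = greedy_decode([1, 0, 2], toy_predictor, toy_joiner)
```

The resulting token sequence would then be mapped back to subword units to form the predicted text.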
At step 708, the predicted text is output by the transducer model and can be used by the electronic device in various further tasks. For example, the audio input may include commands to the electronic device to perform a task such as performing a search on the Internet, initiating a communication such as a voice or video call to a contact, setting a timer, storing the text for later use, such as in a personal note, conducting an ongoing conversational machine learning process, etc. It will be understood that any task using the text may be performed. It will also be understood that the transducer model can assist with performing text-to-speech as well, as described in this disclosure.
Although FIG. 7 illustrates one example of a method 700 for performing a speech-to-text task using a transducer model, various changes may be made to FIG. 7. For example, while shown as a series of steps, various steps in FIG. 7 could overlap, occur in parallel, occur in a different order, or occur any number of times (including zero times).
It should be noted that the functions shown in FIGS. 2 through 7 or described above can be implemented in an electronic device 101, 102, 104, server 106, or other device(s) in any suitable manner. For example, in some embodiments, at least some of the functions shown in FIGS. 2 through 7 or described above can be implemented or supported using one or more software applications or other software instructions that are executed by the processor 120 of the electronic device 101, 102, 104, server 106, or other device(s). In other embodiments, at least some of the functions shown in FIGS. 2 through 7 or described above can be implemented or supported using dedicated hardware components. In general, the functions shown in FIGS. 2 through 7 or described above can be performed using any suitable hardware or any suitable combination of hardware and software/firmware instructions. Also, the functions shown in FIGS. 2 through 7 or described above can be performed by a single device or by multiple devices. For instance, the server 106 might be used to train one or more machine learning models such as the transducer model 202, and the server 106 could deploy the trained machine learning model(s) to one or more other devices (such as the electronic device 101) for use.
Although this disclosure has been described with reference to various example embodiments, various changes and modifications may be suggested to one skilled in the art. It is intended that this disclosure encompass such changes and modifications as fall within the scope of the appended claims.
May 2, 2025
January 8, 2026