US-11978433

Multi-encoder end-to-end automatic speech recognition (ASR) for joint modeling of multiple input devices

Published: May 7, 2024
Assignee: not available in the USPTO data we have
Inventors: not available in the USPTO data we have
Technical Abstract

An end-to-end automatic speech recognition (ASR) system includes: a first encoder configured for close-talk input captured by a close-talk input mechanism; a second encoder configured for far-talk input captured by a far-talk input mechanism; and an encoder selection layer configured to select at least one of the first and second encoders for use in producing ASR output. The selection is made based on at least one of short-time Fourier transform (STFT), Mel-frequency Cepstral Coefficient (MFCC) and filter bank derived from at least one of the close-talk input and the far-talk input. If signals from both the close-talk input mechanism and the far-talk input mechanism are present for a speech segment, the encoder selection layer dynamically selects between the close-talk encoder and the far-talk encoder to select the encoder that better recognizes the speech segment. An encoder-decoder model is used to produce the ASR output.
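
The abstract says the encoder selection is driven by features such as the STFT derived from the close-talk input, the far-talk input, or both. The following is a minimal sketch of that idea, not the patented implementation: it computes a magnitude STFT with the standard library and uses a hypothetical scoring rule (mean log spectral energy) as a stand-in for the selection layer, which the abstract does not specify. All function names, the frame sizes, and the toy signals are assumptions made for illustration.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len=8, hop=4):
    """Magnitude STFT with a Hann window; returns one list of bins per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        # Direct DFT of the windowed frame, non-negative frequencies only.
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def mean_log_energy(frames):
    """Hypothetical summary feature: average log spectral energy per frame."""
    total = sum(math.log(sum(b * b for b in bins) + 1e-10) for bins in frames)
    return total / len(frames)


def select_encoder(close_talk, far_talk):
    """Toy selection rule (an assumption): prefer the input with more energy."""
    close_score = mean_log_energy(stft_magnitudes(close_talk))
    far_score = mean_log_energy(stft_magnitudes(far_talk))
    if close_score >= far_score:
        return "close_talk_encoder"
    return "far_talk_encoder"


# Toy signals: a tone captured near the speaker and a weaker copy at a distance.
close = [math.sin(2 * math.pi * 0.25 * n) for n in range(32)]
far = [0.1 * x for x in close]
print(select_encoder(close, far))
```

In a trained end-to-end system the scoring rule would presumably be learned jointly with the encoders rather than hand-written; the sketch only shows where STFT-derived features enter the decision.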

Patent Claims
14 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 2

Original Legal Text

2. The system according to claim 1, wherein the encoder selection layer dynamically switches between the first encoder and the second encoder to select an encoder that better recognizes the speech segment.

Plain English Translation

This claim narrows the system of claim 1, an end-to-end automatic speech recognition (ASR) system with two encoders. Claim 1 is not reproduced on this page, so its description here follows the technical abstract: a first encoder is configured for close-talk input, a second encoder is configured for far-talk input, and an encoder selection layer decides which encoder is used to produce the ASR output. The selection is based on features such as the short-time Fourier transform (STFT), Mel-frequency cepstral coefficients (MFCC), or filter bank features derived from the close-talk input, the far-talk input, or both. Claim 2 adds that the selection is dynamic: the encoder selection layer switches between the first encoder and the second encoder so that the speech segment is handled by whichever encoder recognizes it better. According to the abstract, this applies when signals from both the close-talk and the far-talk input mechanisms are present for a speech segment, and an encoder-decoder model produces the ASR output.
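
As a rough illustration of dynamic switching, the sketch below picks an encoder independently for each speech segment. The per-segment scores are hypothetical stand-ins for whatever the selection layer computes; the claim does not say how "better recognizes" is measured, and the names used here are assumptions.

```python
def switch_per_segment(close_scores, far_scores):
    """Return, for each speech segment, which encoder is selected.

    close_scores / far_scores are hypothetical per-segment suitability
    scores for the first (close-talk) and second (far-talk) encoder.
    """
    if len(close_scores) != len(far_scores):
        raise ValueError("need one score per encoder for every segment")
    choices = []
    for close_s, far_s in zip(close_scores, far_scores):
        # Ties go to the first encoder; this tie-break is an arbitrary choice.
        choices.append("first" if close_s >= far_s else "second")
    return choices


close_scores = [0.9, 0.4, 0.7]
far_scores = [0.2, 0.8, 0.7]
print(switch_per_segment(close_scores, far_scores))
# prints ['first', 'second', 'first']
```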

Claim 3

Original Legal Text

3. The system according to claim 1, wherein only one of the quality of the close-talk input or the quality of the far-talk input is used to select the first encoder, the second encoder, or both for producing ASR output.

Plain English Translation

This claim concerns automatic speech recognition (ASR) systems that receive both close-talk and far-talk input. It narrows claim 1 by specifying what the encoder selection relies on: the quality of only one of the two inputs, either the close-talk input or the far-talk input, is used to select the first encoder, the second encoder, or both for producing ASR output. The first encoder handles close-talk input, captured near the speaker, and the second encoder handles far-talk input, captured at a distance. The selection therefore does not require a quality estimate for both inputs; an assessment of a single input is enough to decide whether one encoder or a combination of both is used. The claim does not specify how quality is measured.
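
One way to picture a selection that uses the quality of only one input is a two-threshold rule, sketched below. The normalized quality score, the thresholds `low` and `high`, and the function name are all assumptions for illustration; the claim does not define a quality measure or a decision rule.

```python
def select_from_single_quality(quality, source, low=0.3, high=0.7):
    """Choose encoders from the quality of ONE input only.

    quality: hypothetical score in [0, 1] for the measured input.
    source:  which input was measured, "close" or "far".
    Returns a set containing "first", "second", or both.
    """
    if source not in ("close", "far"):
        raise ValueError("source must be 'close' or 'far'")
    # The encoder matching the measured input, and the other encoder.
    own, other = ("first", "second") if source == "close" else ("second", "first")
    if quality >= high:
        return {own}          # measured input is good: use its encoder
    if quality <= low:
        return {other}        # measured input is poor: fall back to the other
    return {own, other}       # in between: use both encoders


print(select_from_single_quality(0.9, "close"))
print(select_from_single_quality(0.5, "far"))
```

The first call returns only the first encoder, and the second returns both encoders because the measured quality falls between the two thresholds.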

Claim 4

Original Legal Text

4. The system according to claim 1, wherein the far-talk input mechanism is a multi-channel input mechanism for capturing a multi-channel far-talk input signal.

Plain English Translation

This claim narrows the ASR system of claim 1 by specifying the far-talk input mechanism: it is a multi-channel input mechanism that captures a multi-channel far-talk input signal. A microphone array, which claim 9 names as the second type of input device, is one example. Far-talk speech is captured at a distance from the speaker and is typically degraded by ambient noise and reverberation. Capturing several channels provides spatial information that techniques such as beamforming can exploit. The claim itself requires only the multi-channel capture; it does not name a particular enhancement technique or application.
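
To show why multiple far-talk channels can help, here is a minimal delay-and-sum beamformer. This is a common textbook technique and an assumption on our part; the claim only requires multi-channel capture and does not say the patent uses it. The integer delays are assumed to be known in advance.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   samples by which each channel lags the reference (assumed known).
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    if usable <= 0:
        raise ValueError("delays exceed channel length")
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(usable)
    ]


# Toy three-microphone capture: the same source arrives 0, 1 and 2 samples late.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
channels = [
    source + [0.0, 0.0],
    [0.0] + source + [0.0],
    [0.0, 0.0] + source,
]
enhanced = delay_and_sum(channels, delays=[0, 1, 2])
print(enhanced == source)
```

With noise-free toy data the aligned average reproduces the source exactly; with real recordings the averaging would attenuate noise that is uncorrelated across microphones.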

Claim 6

Original Legal Text

6. The system according to claim 1, wherein the first type of input device is different from the second type of input device.

Plain English Translation

This claim narrows the ASR system of claim 1 by requiring that the two input devices be of different types: the first type of input device is different from the second type of input device. In this patent the input devices are the audio capture devices that feed the two encoders. The other dependent claims give examples. Claim 8 names a headphone or an MP3 recorder as the first type, claim 9 names a microphone array as the second type, and claim 10 makes explicit that the first type captures the close-talk input and the second type captures the far-talk input.

Claim 8

Original Legal Text

8. The system according to claim 1, wherein the first type of input device comprises a headphone or an MP3 recorder.

Plain English Translation

This claim narrows the ASR system of claim 1 by naming the first type of input device: a headphone or an MP3 recorder. As claim 10 makes explicit, this device captures the close-talk input, the speech recorded near the speaker, which the first encoder processes. The second type of input device, which captures the far-talk input for the second encoder, is left open in this claim; claim 9 specifies it as a microphone array.

Claim 9

Original Legal Text

9. The system according to claim 8, wherein the second type of input device comprises a microphone array.

Plain English Translation

This claim builds on claim 8 by naming the second type of input device: a microphone array. Combined with claim 8, the system has a headphone or an MP3 recorder as the first type of input device and a microphone array as the second. As claim 10 makes explicit, the microphone array captures the far-talk input that the second encoder processes. A microphone array is also one example of the multi-channel far-talk input mechanism of claim 4. The claim does not specify the array geometry or any particular array processing, such as beamforming or source localization.

Claim 10

Original Legal Text

10. The system according to claim 9, wherein the encoder selection layer is configured to select the first encoder, the second encoder, or both based on the quality of the close-talk input captured by the headphone or the MP3 recorder, the quality of the far-talk input captured by the microphone array, or both.

Plain English Translation

This claim builds on claim 9 and ties the encoder selection to input quality in an ASR system whose close-talk input is captured by a headphone or an MP3 recorder and whose far-talk input is captured by a microphone array. The encoder selection layer is configured to select the first (close-talk) encoder, the second (far-talk) encoder, or both. The selection is based on the quality of the close-talk input, the quality of the far-talk input, or both. The encoders here are the speech recognition encoders of the end-to-end ASR model, not audio codecs; their outputs are used to produce the ASR output. The claim does not define how quality is measured or how the encoder outputs are combined when both encoders are selected.
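
When both quality estimates are available, one plausible reading of "the first encoder, the second encoder, or both" is a soft weighting of the two encoder outputs. The sketch below uses a two-way softmax over hypothetical quality scores; the scores, the `temperature` parameter, and the weighted-sum combination are assumptions, not details taken from the claim.

```python
import math


def selection_weights(close_quality, far_quality, temperature=1.0):
    """Two-way softmax over hypothetical quality scores; weights sum to 1."""
    top = max(close_quality, far_quality)  # subtract the max for stability
    e_close = math.exp((close_quality - top) / temperature)
    e_far = math.exp((far_quality - top) / temperature)
    total = e_close + e_far
    return e_close / total, e_far / total


def combine(close_vec, far_vec, weights):
    """Weighted sum of the two encoder output vectors (same length assumed)."""
    w_close, w_far = weights
    return [w_close * c + w_far * f for c, f in zip(close_vec, far_vec)]


# Equal quality: both encoders contribute equally.
weights = selection_weights(2.0, 2.0)
print(combine([1.0, 0.0], [0.0, 1.0], weights))  # prints [0.5, 0.5]

# A low temperature approaches a hard choice of the better input's encoder.
print(selection_weights(5.0, 0.0, temperature=0.1)[0] > 0.999)  # prints True
```

The temperature makes the trade-off explicit: high values blend the encoders, low values approximate selecting a single encoder.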

Claim 12

Original Legal Text

12. The computer-implemented method according to claim 11, wherein the encoder selection layer dynamically switches between the first encoder and the second encoder to select an encoder that better recognizes the speech segment.

Plain English Translation

This claim is the method counterpart of claim 2. It narrows the computer-implemented ASR method of claim 11, which, going by the technical abstract, uses a first encoder for close-talk input, a second encoder for far-talk input, and an encoder selection layer that chooses which encoder is used to produce the ASR output. Claim 12 adds that the encoder selection layer dynamically switches between the first encoder and the second encoder, selecting the encoder that better recognizes the speech segment. According to the abstract, this dynamic choice applies when signals from both the close-talk and the far-talk input mechanisms are present for the segment. The claim does not describe how the better encoder is identified beyond the feature-based selection summarized in the abstract.

Claim 13

Original Legal Text

13. The computer-implemented method according to claim 11, wherein only one of the quality of the close-talk input or the quality of the far-talk input is used to select the first encoder, the second encoder, or both for producing ASR output.

Plain English Translation

This claim is the method counterpart of claim 3. It narrows the computer-implemented ASR method of claim 11 for the case where both close-talk input (for example, from a headphone) and far-talk input (for example, from a microphone array) are available. The quality of only one of the two inputs, either the close-talk input or the far-talk input, is used to select the first encoder, the second encoder, or both for producing ASR output. The selected encoder or encoders are then used to generate the ASR output. The claim does not specify how quality is measured, and it does not state a reason for relying on a single input's quality.

Claim 14

Original Legal Text

14. The computer-implemented method according to claim 11, wherein the far-talk input mechanism is a multi-channel input mechanism for capturing a multi-channel far-talk input signal.

Plain English Translation

This claim is the method counterpart of claim 4. It narrows the computer-implemented ASR method of claim 11 by specifying that the far-talk input mechanism is a multi-channel input mechanism that captures a multi-channel far-talk input signal, for example by using several microphones to record a distant speaker. Far-talk speech is typically degraded by ambient noise and reverberation, and multiple channels provide spatial information that techniques such as beamforming can use to improve the signal before recognition. The claim requires only the multi-channel capture; it does not name a specific signal processing technique or application.

Claim 16

Original Legal Text

16. The computer-implemented method according to claim 11, wherein the encoder selection layer is configured to select the first encoder and then switch to the second encoder based on the close-talk input being followed by the far-talk input.

Plain English Translation

This claim narrows the computer-implemented ASR method of claim 11 by describing a sequential case. When a close-talk input is followed by a far-talk input, the encoder selection layer first selects the first encoder, which handles close-talk input, and then switches to the second encoder, which handles far-talk input. This could arise, for example, when a speaker moves away from a close-talk device and continues speaking at a distance, although the claim does not limit why the inputs occur in that order. The result is that each portion of the speech is processed by the encoder configured for that kind of input.
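
The sequential case in claim 16 can be sketched as a schedule over segments. The segment labels ("close", "far") are assumed to be given; how the method detects which input a segment comes from is not stated in the claim, and the function name is hypothetical.

```python
def encoder_schedule(segment_sources):
    """Map each segment's source to an encoder and report the switch points.

    segment_sources: list of "close" or "far" labels, one per segment.
    Returns (schedule, switches), where switches holds the indices at which
    the selected encoder differs from the one used for the previous segment.
    """
    mapping = {"close": "first", "far": "second"}
    schedule = [mapping[source] for source in segment_sources]
    switches = [
        i for i in range(1, len(schedule)) if schedule[i] != schedule[i - 1]
    ]
    return schedule, switches


# Close-talk input followed by far-talk input.
schedule, switches = encoder_schedule(["close", "close", "far", "far"])
print(schedule)   # prints ['first', 'first', 'second', 'second']
print(switches)   # prints [2]
```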

Claim 18

Original Legal Text

18. The computer-implemented method according to claim 11, wherein the first type of input device comprises a headphone or an MP3 recorder.

Plain English Translation

This claim is the method counterpart of claim 8. It narrows the computer-implemented ASR method of claim 11 by naming the first type of input device: a headphone or an MP3 recorder. As claim 20 makes explicit, this device captures the close-talk input, the speech recorded near the speaker, which the first encoder processes. The second type of input device, which captures the far-talk input, is left open in this claim; claim 19 specifies it as a microphone array.

Claim 19

Original Legal Text

19. The computer-implemented method according to claim 18, wherein the second type of input device comprises a microphone array.

Plain English Translation

This claim is the method counterpart of claim 9. It builds on claim 18 by naming the second type of input device: a microphone array. Combined with claim 18, the method uses a headphone or an MP3 recorder as the first type of input device and a microphone array as the second. As claim 20 makes explicit, the microphone array captures the far-talk input that the second encoder processes. The claim does not specify the array geometry or any particular array processing such as noise reduction or beamforming.

Claim 20

Original Legal Text

20. The computer-implemented method according to claim 19, wherein the encoder selection layer is configured to select the first encoder, the second encoder, or both based on the quality of the close-talk input captured by the headphone or the MP3 recorder, the quality of the far-talk input captured by the microphone array, or both.

Plain English Translation

This claim is the method counterpart of claim 10. It builds on claim 19, so the close-talk input is captured by a headphone or an MP3 recorder and the far-talk input is captured by a microphone array. The encoder selection layer is configured to select the first encoder, the second encoder, or both, based on the quality of the close-talk input, the quality of the far-talk input, or both. The encoders are the speech recognition encoders of the end-to-end ASR model, and the selection determines which of them is used to produce the ASR output. The claim does not define the quality measure; signal-to-noise ratio is one plausible example, but it is not stated in the claim.

Patent Metadata

Filing Date

June 22, 2021

Publication Date

May 7, 2024

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “Multi-encoder end-to-end automatic speech recognition (ASR) for joint modeling of multiple input devices” (US-11978433). https://patentable.app/patents/US-11978433

© 2026 Nomic Interactive Technology LLC. Machine-readable context available at /api/llm-context/US-11978433. See llms.txt for full attribution policy.