10818284

Methods of and Electronic Devices for Determining an Intent Associated with a Spoken User Utterance

Published: October 27, 2020
Assignee: not available in USPTO data we have
Technical Abstract

Patent Claims
20 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A method of determining an intent associated with a spoken user utterance, the spoken user utterance having been captured in a form of a digital audio signal, the method executable by a server, the method comprising: executing, by the server, a speech-to-text analysis of the digital audio signal to determine: at least one speech unit of the spoken user utterance, each speech unit having textual data representative of one of a word and a pause, each speech unit having a corresponding segment of the digital audio signal; for each speech unit: generating a respective textual feature vector by: determining, by the server based on the respective textual data, textual features of the respective speech unit; generating, by the server based on the respective textual features, the respective textual feature vector; generating a respective acoustic feature vector by: determining, by the server based on the corresponding segment of the digital audio signal, respective acoustic features of the corresponding segment of the digital audio signal; generating, by the server based on the respective acoustic features, the respective acoustic feature vector; generating, by the server, a respective enhanced feature vector by combining the respective acoustic feature vector and the respective textual feature vector; employing, by the server, a neural network (NN) configured to determine the intent of the spoken user utterance by inputting into the NN the enhanced feature vectors, the NN having been trained to estimate a probability of the intent being of a given type.

Plain English Translation

This invention relates to natural language processing and specifically to determining user intent from spoken language. The problem addressed is accurately identifying the user's intention from an audio recording of their speech. The method involves processing a digital audio signal representing a spoken utterance. First, a speech-to-text analysis is performed to break down the utterance into speech units. Each speech unit corresponds to a word or a pause and is linked to a specific segment of the original audio. For each speech unit, two types of feature vectors are generated. A textual feature vector is created by extracting textual features from the word or pause. Simultaneously, an acoustic feature vector is generated by extracting acoustic features from the corresponding audio segment. These two vectors are then combined to form an enhanced feature vector. Finally, a neural network, previously trained to estimate intent probabilities, receives these enhanced feature vectors as input. The neural network then determines the intent associated with the spoken user utterance by outputting the probability of the intent belonging to a specific type.
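The data flow of claim 1 can be sketched as follows. This is a minimal illustration and not the patented implementation: the `SpeechUnit` class, the toy embedding table, and the two acoustic features used here (mean energy and peak amplitude) are all hypothetical choices made for the example. The sketch stops at the enhanced feature vectors, which would then be fed to the trained neural network.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeechUnit:
    text: Optional[str]   # a word, or None when the unit is a pause
    samples: List[float]  # audio segment aligned with this unit


# Hypothetical toy embedding table; a real system would use a trained model.
TOY_EMBEDDINGS = {"is": [0.1, 0.3], "it": [0.2, -0.1], "raining": [0.7, 0.5]}
TEXT_DIM = 2


def textual_vector(unit: SpeechUnit) -> List[float]:
    """Textual feature vector; pauses (and unknown words) map to zeros here."""
    if unit.text is None:
        return [0.0] * TEXT_DIM
    return list(TOY_EMBEDDINGS.get(unit.text.lower(), [0.0] * TEXT_DIM))


def acoustic_vector(unit: SpeechUnit) -> List[float]:
    """Acoustic feature vector: [mean energy, peak amplitude] (assumed features)."""
    if not unit.samples:
        return [0.0, 0.0]
    energy = sum(s * s for s in unit.samples) / len(unit.samples)
    peak = max(abs(s) for s in unit.samples)
    return [energy, peak]


def enhanced_vector(unit: SpeechUnit) -> List[float]:
    """Combine both vectors; concatenation is one possible way to combine them."""
    return acoustic_vector(unit) + textual_vector(unit)


units = [
    SpeechUnit("is", [0.1, -0.2, 0.1]),
    SpeechUnit(None, [0.0, 0.0]),  # a pause
    SpeechUnit("raining", [0.3, 0.4, -0.5]),
]
enhanced = [enhanced_vector(u) for u in units]
```

Each speech unit yields one enhanced vector, so the neural network receives a sequence whose length equals the number of words and pauses in the utterance.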

Claim 2

Original Legal Text

2. The method of claim 1, wherein the NN is a recurrent neural network (RNN).

Plain English Translation

This claim narrows the method of claim 1: the neural network that determines the intent is a recurrent neural network (RNN). An RNN processes a sequence one element at a time and maintains an internal hidden state that carries information from earlier elements to later ones. Here the sequence consists of the enhanced feature vectors, one per speech unit, so the network can take the order of the words and pauses into account when estimating the intent. The claim does not limit the RNN to a particular variant; long short-term memory (LSTM) and gated recurrent unit (GRU) networks are common examples of recurrent architectures.
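To make the recurrent processing concrete, the following is a minimal sketch of a simple (Elman-style) RNN forward pass over a sequence of enhanced feature vectors, followed by a softmax over intent types. The weights are hypothetical, untrained values chosen only so the code runs; the patent does not specify the architecture, the layer sizes, or the use of a softmax output.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_intent_probabilities(
    sequence: List[List[float]], w_in: Matrix, w_rec: Matrix, w_out: Matrix
) -> List[float]:
    """Run a tanh RNN over the sequence and return softmax probabilities."""
    hidden = [0.0] * len(w_rec)
    for x in sequence:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]  # state carries earlier context
    logits = matvec(w_out, hidden)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights: 3 input features, 2 hidden units, 4 intent types.
W_IN = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_REC = [[0.1, 0.0], [0.0, 0.1]]
W_OUT = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.2, 0.2]]

sequence = [[0.2, 0.1, 0.7], [0.0, 0.0, 0.0], [0.9, 0.4, 0.3]]
probs = rnn_intent_probabilities(sequence, W_IN, W_REC, W_OUT)
```

Because the hidden state is updated at every step, feeding the same vectors in a different order generally produces different probabilities, which is the property that makes a recurrent network suitable for ordered speech units.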

Claim 3

Original Legal Text

3. The method of claim 1, wherein the executing the speech-to-text analysis comprises determining: the textual data of each speech unit; and a time interval of the corresponding segment of the digital audio signal of each speech unit.

Plain English Translation

This claim narrows the speech-to-text analysis of claim 1. For each speech unit (a word or a pause), the analysis determines two things: the textual data of the unit, and the time interval of the segment of the digital audio signal that corresponds to it. The time interval aligns each unit with the exact portion of the audio in which it occurs, so that the acoustic features of claim 1 can be determined from the correct segment.
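As an illustration of how a time interval ties a speech unit to its audio segment, here is a minimal sketch. The `TimedUnit` class, the sample rate, and the placeholder audio are assumptions made for the example; the patent does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimedUnit:
    text: Optional[str]  # a word, or None for a pause
    start_s: float       # start of the unit's time interval, in seconds
    end_s: float         # end of the unit's time interval, in seconds


def segment_for(unit: TimedUnit, samples: List[float], sample_rate: int) -> List[float]:
    """Slice out the audio samples that fall inside the unit's time interval."""
    first = int(round(unit.start_s * sample_rate))
    last = int(round(unit.end_s * sample_rate))
    return samples[first:last]


SAMPLE_RATE = 1000                        # hypothetical: 1000 samples per second
audio = [i / 1000 for i in range(1000)]   # one second of placeholder samples

units = [
    TimedUnit("what", 0.0, 0.25),
    TimedUnit(None, 0.25, 0.4),   # a pause between the two words
    TimedUnit("time", 0.4, 1.0),
]
segments = [segment_for(u, audio, SAMPLE_RATE) for u in units]
```

With contiguous intervals, the segments partition the signal, so every sample belongs to exactly one word or pause.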

Claim 4

Original Legal Text

4. The method of claim 1, wherein the generating the respective textual feature vector is executed by a word embedding process implemented by the server.

Plain English Translation

This claim narrows how the textual feature vector of claim 1 is generated: the server produces it with a word embedding process. A word embedding maps a word to a dense numerical vector in a continuous space in which semantically similar words lie close together. Word2Vec and GloVe are well-known examples of such techniques, although the claim does not name a specific one. The resulting vector serves as the textual feature vector of the speech unit and is later combined with the unit's acoustic feature vector.
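A word embedding process can be pictured as a lookup from words to vectors. The sketch below uses a tiny hand-written table, which is purely hypothetical; a real system would load vectors from a trained embedding model. Mapping unknown words to a zero vector is also an assumption of this example, not something the claim requires.

```python
from typing import Dict, List

EMBEDDING_DIM = 3

# Hypothetical toy table standing in for a trained embedding model.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "play": [0.9, 0.1, 0.0],
    "music": [0.2, 0.8, 0.1],
}


def embed_word(word: str) -> List[float]:
    """Return the embedding of a word; unknown words map to zeros (assumption)."""
    return list(TOY_EMBEDDINGS.get(word.lower(), [0.0] * EMBEDDING_DIM))


vectors = [embed_word(w) for w in ["Play", "music", "loudly"]]
```

Every word yields a vector of the same length, which keeps the textual feature vectors uniform across speech units.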

Claim 5

Original Legal Text

5. The method of claim 1, wherein the textual feature vector of a given speech unit that is a pause is a vector with null values.

Plain English Translation

This claim covers speech units that are pauses rather than words. A pause has no word to describe, so its textual feature vector is a vector with null values. Speech units that are words, by contrast, receive textual feature vectors derived from their textual data. Representing a pause this way keeps it in the sequence of speech units, and its acoustic feature vector still describes the corresponding segment of the audio signal.

Claim 6

Original Legal Text

6. The method of claim 1, wherein the acoustic features are at least some of: volume level; energy level; pitch level; harmonicity; and tempo.

Plain English Translation

This claim lists the acoustic features that may be determined for the audio segment of each speech unit in claim 1. The features are at least some of the following: volume level, energy level, pitch level, harmonicity, and tempo. Volume level relates to the loudness of the segment, and energy level to the power of the signal. Pitch level indicates the fundamental frequency of the voice, harmonicity reflects how strongly harmonic components are present, and tempo reflects the speed of the speech. The wording "at least some of" means that not all five features need to be used.
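Three of the listed features can be estimated with a few lines of code. The sketch below is one possible set of definitions, all of them assumptions for illustration: energy as the mean squared amplitude, volume as the root mean square, and pitch as the strongest autocorrelation peak within an assumed 50 to 400 Hz search range. Harmonicity and tempo are omitted, and the patent does not specify how any of these features are computed.

```python
import math
from typing import List


def energy_level(samples: List[float]) -> float:
    """Mean squared amplitude of the segment (0.0 for an empty segment)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def volume_level(samples: List[float]) -> float:
    """Root-mean-square amplitude, used here as a simple loudness measure."""
    return math.sqrt(energy_level(samples))


def pitch_level(samples: List[float], sample_rate: int,
                min_hz: float = 50.0, max_hz: float = 400.0) -> float:
    """Estimate the fundamental frequency from the best autocorrelation lag."""
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


SAMPLE_RATE = 8000  # hypothetical sample rate
# Synthetic 100 Hz tone with amplitude 0.5, lasting 0.1 seconds.
tone = [0.5 * math.sin(2 * math.pi * 100 * i / SAMPLE_RATE) for i in range(800)]
acoustic_vector = [volume_level(tone), energy_level(tone), pitch_level(tone, SAMPLE_RATE)]
```

The synthetic tone makes the result easy to check by hand: a sine wave of amplitude 0.5 has a mean squared amplitude of 0.125, and its period of 80 samples at 8000 samples per second corresponds to 100 Hz.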

Claim 7

Original Legal Text

7. The method of claim 1, wherein the determining the respective acoustic features of the corresponding segment of the digital audio signal comprises: determining, by the server, respective acoustic features of each sub-segment of the corresponding segment of the digital audio signal by applying a sliding window, and wherein the generating the respective acoustic feature vector comprises: generating, by the server, respective intermediary acoustic feature vectors for each sub-segment based on the respective acoustic features; and generating, by the server based on the respective intermediary acoustic feature vectors, the respective acoustic feature vector for the corresponding segment of the digital audio signal.

Plain English Translation

This claim narrows how the acoustic feature vector of claim 1 is produced. The audio segment that corresponds to a speech unit is divided into smaller sub-segments by applying a sliding window. For each sub-segment, the server determines acoustic features and generates an intermediary acoustic feature vector from them. The server then generates the acoustic feature vector for the whole segment from these intermediary vectors. In this way a word or pause of any duration is analyzed at a fine time scale and still ends up with a single acoustic feature vector.
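The sub-segment approach can be sketched as follows. The window and step sizes, the two features per sub-segment (mean energy and peak amplitude), and the use of a plain average to merge the intermediary vectors are all hypothetical choices for this example. The sketch also assumes a non-empty segment.

```python
from typing import List


def sub_segments(samples: List[float], window: int, step: int) -> List[List[float]]:
    """Slide a window of `window` samples forward by `step` samples at a time."""
    last_start = max(len(samples) - window, 0)
    return [samples[s:s + window] for s in range(0, last_start + 1, step)]


def intermediary_vector(sub: List[float]) -> List[float]:
    """Intermediary acoustic feature vector of one sub-segment (assumed features)."""
    energy = sum(x * x for x in sub) / len(sub)
    peak = max(abs(x) for x in sub)
    return [energy, peak]


def segment_vector(samples: List[float], window: int, step: int) -> List[float]:
    """Merge the intermediary vectors into one vector by averaging each feature."""
    vectors = [intermediary_vector(s) for s in sub_segments(samples, window, step)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


segment = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
windows = sub_segments(segment, window=4, step=2)
vector = segment_vector(segment, window=4, step=2)
```

Whatever the number of sub-segments, the result has the same length as a single intermediary vector, so speech units of different durations produce acoustic feature vectors of equal size.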

Claim 8

Original Legal Text

8. The method of claim 7, wherein each sub-segment is of a pre-determined time length.

Plain English Translation

This claim narrows the sliding-window analysis of claim 7: each sub-segment has a pre-determined time length. Because the window length is fixed in advance, every sub-segment of a speech unit's audio segment covers the same duration, and the acoustic features of different sub-segments are computed over comparable spans of audio.

Claim 9

Original Legal Text

9. The method of claim 7, wherein at least two sub-segments partially overlap.

Plain English Translation

This claim narrows the sliding-window analysis of claim 7: at least two of the sub-segments partially overlap. Overlap arises when the sliding window advances by less than its own length, so that consecutive sub-segments share part of the audio. As a result, every portion of the segment is covered and acoustic changes that fall on the boundary between two window positions are still captured by at least one sub-segment.

Claim 10

Original Legal Text

10. The method of claim 7, wherein the sliding window slides with a time step of another pre-determined time length.

Plain English Translation

This claim narrows the sliding-window analysis of claim 7: the window advances with a time step of another pre-determined time length, that is, a step fixed in advance and specified separately from the length of the window. When the step is shorter than the window, consecutive sub-segments overlap; when the step equals the window length, the sub-segments are adjacent without overlapping. The step therefore controls how finely the audio segment of a speech unit is sampled, with shorter steps producing more sub-segments for the same segment.
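The relationship between window length, time step, and overlap can be checked with simple arithmetic. The sketch below lists the time interval of each window position; the 25 ms window and 10 ms step are hypothetical values chosen for illustration and do not come from the patent.

```python
from typing import List, Tuple


def window_intervals_ms(segment_ms: int, window_ms: int, step_ms: int) -> List[Tuple[int, int]]:
    """Return (start, end) times in milliseconds for each full window position."""
    if step_ms <= 0 or window_ms <= 0:
        raise ValueError("window_ms and step_ms must be positive")
    intervals = []
    start = 0
    while start + window_ms <= segment_ms:
        intervals.append((start, start + window_ms))
        start += step_ms  # the window slides forward by one fixed time step
    return intervals


# Hypothetical values: a 60 ms segment, 25 ms windows, 10 ms time step.
intervals = window_intervals_ms(segment_ms=60, window_ms=25, step_ms=10)

# Consecutive windows overlap by the window length minus the step.
overlap_ms = intervals[0][1] - intervals[1][0]
```

With these values the windows start at 0, 10, 20, and 30 ms, and each pair of consecutive windows shares 15 ms of audio. Setting the step equal to the window length would remove the overlap.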

Claim 11

Original Legal Text

11. The method of claim 7, wherein the generating the respective acoustic feature vector for the corresponding segment of the digital audio signal based on the respective intermediary acoustic feature vectors comprises executing, by the server, a statistically-driven combination of the respective intermediary acoustic feature vectors.

Plain English Translation

This claim narrows the final step of claim 7. The intermediary acoustic feature vectors, one per sub-segment of a speech unit's audio segment, are merged into the acoustic feature vector of the whole segment by a statistically-driven combination executed by the server. The claim does not name the statistics used; typical examples would be the average, a weighted sum, or measures of spread computed for each feature across the sub-segments. The combination summarizes how the acoustic features behave over the duration of the word or pause in a single vector.
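One way to picture a statistically-driven combination is to compute summary statistics per feature across all intermediary vectors. The choice of statistics below (mean, population standard deviation, minimum, and maximum) is an assumption made for this sketch; the claim leaves the exact statistics open.

```python
import statistics
from typing import List


def combine_statistically(intermediary: List[List[float]]) -> List[float]:
    """For each feature, append its mean, standard deviation, minimum, and maximum."""
    combined: List[float] = []
    for column in zip(*intermediary):  # one column per acoustic feature
        combined.extend([
            statistics.mean(column),
            statistics.pstdev(column),
            min(column),
            max(column),
        ])
    return combined


# Hypothetical intermediary vectors from three sub-segments, two features each.
intermediary_vectors = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
segment_vector = combine_statistically(intermediary_vectors)
```

The output length depends only on the number of features and statistics, not on how many sub-segments the segment contained, so long and short speech units remain comparable.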

Claim 12

Original Legal Text

12. The method of claim 1, wherein the combining the respective acoustic feature vector and the respective textual feature vector comprises concatenating, by the server, the respective acoustic feature vector and the respective textual feature vector.

Plain English Translation

This claim narrows how the enhanced feature vector of claim 1 is formed: the server concatenates the acoustic feature vector and the textual feature vector of each speech unit. Concatenation joins the two vectors end to end, so the enhanced vector contains every acoustic value and every textual value unchanged, and its length is the sum of the two lengths. The neural network therefore receives, for each word or pause, both what was said and how it sounded. This matters when the text alone is ambiguous; for example, the same words can form a question or a statement depending on intonation.

Claim 13

Original Legal Text

13. The method of claim 1, wherein the given type is one of: open-ended question type; closed-ended question type; statement type; and exclamation type.

Plain English Translation

This claim specifies the intent types for which the neural network of claim 1 estimates a probability. The given type is one of four: an open-ended question, a closed-ended question, a statement, or an exclamation. An open-ended question invites a detailed or free-form answer, while a closed-ended question expects a brief or binary answer such as yes or no. A statement is a declarative utterance that conveys information, and an exclamation is an utterance marked by emphasis or emotion. Because the utterance is spoken, there is no punctuation to rely on; the network estimates the type from the enhanced feature vectors, which combine the words with acoustic cues from the audio.
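The four intent types can be represented as an enumeration, with the network output turned into one probability per type. In the sketch below, the raw scores are hypothetical placeholders for a network's output, and converting them with a softmax is an assumption of this example; the claim only requires that the network estimates a probability of the intent being of a given type.

```python
import math
from enum import Enum
from typing import List, Tuple


class IntentType(Enum):
    OPEN_ENDED_QUESTION = 0
    CLOSED_ENDED_QUESTION = 1
    STATEMENT = 2
    EXCLAMATION = 3


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_probable_intent(scores: List[float]) -> Tuple[IntentType, float]:
    """Return the intent type with the highest probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return IntentType(best), probs[best]


# Hypothetical raw scores, one per intent type, in enumeration order.
intent, probability = most_probable_intent([0.1, 2.3, 0.4, -1.0])
```

A downstream component could compare `probability` against a threshold before acting on the predicted type.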

Claim 14

Original Legal Text

14. The method of claim 1, wherein the method further comprises: acquiring, by the server, auxiliary data generated by the NN for each inputted enhanced feature vector associated with a given word; responsive to determining that the intent is of the given type: executing, by the server, an auxiliary MLA for determining a target word amongst the at least one word by inputting into the auxiliary MLA the auxiliary data, the target word being indicative of a context of the spoken user utterance.

Plain English Translation

This claim adds a second stage to the method of claim 1. While the neural network (NN) processes the enhanced feature vectors, it generates auxiliary data for each inputted vector that is associated with a word, and the server acquires this data. If the intent of the utterance is determined to be of the given type, the server executes an auxiliary machine learning algorithm (MLA). The auxiliary MLA takes the auxiliary data as input and determines a target word among the words of the utterance. The target word is indicative of the context of the spoken user utterance, that is, it points to what the utterance is about. The auxiliary MLA runs only in response to the intent being of the given type, so the target word is determined only for utterances of that type.
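The two-stage flow can be sketched as follows. Everything specific here is hypothetical: the auxiliary data is modeled as one small vector per word (for instance, a hidden state of the NN), the auxiliary MLA is modeled as a linear scorer with made-up weights, and the intent labels are plain strings. The patent does not disclose these details in the claim.

```python
from typing import List, Optional


def select_target_word(words: List[str], auxiliary_data: List[List[float]],
                       weights: List[float]) -> str:
    """Score each word's auxiliary vector and return the highest-scoring word."""
    scores = [sum(w * a for w, a in zip(weights, aux)) for aux in auxiliary_data]
    best = max(range(len(words)), key=scores.__getitem__)
    return words[best]


def target_word_for_utterance(intent: str, given_type: str, words: List[str],
                              auxiliary_data: List[List[float]],
                              weights: List[float]) -> Optional[str]:
    """Run the auxiliary model only when the intent is of the given type."""
    if intent != given_type:
        return None
    return select_target_word(words, auxiliary_data, weights)


# Hypothetical utterance, per-word auxiliary vectors, and scorer weights.
words = ["is", "kenya", "in", "africa"]
auxiliary = [[0.1, 0.0], [0.9, 0.4], [0.0, 0.1], [0.3, 0.8]]
WEIGHTS = [1.0, 0.5]

target = target_word_for_utterance(
    "closed_ended_question", "closed_ended_question", words, auxiliary, WEIGHTS
)
skipped = target_word_for_utterance(
    "statement", "closed_ended_question", words, auxiliary, WEIGHTS
)
```

Reusing data that the NN already produced means the second stage needs no separate pass over the audio or text.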

Claim 15

Original Legal Text

15. A server for determining an intent associated with a spoken user utterance, the spoken user utterance having been captured in a form of a digital audio signal, the server being configured to: execute a speech-to-text analysis of the digital audio signal to determine: at least one speech unit of the spoken user utterance, each speech unit having textual data representative of one of a word and a pause, each speech unit having a corresponding segment of the digital audio signal; for each speech unit: generate a respective textual feature vector by: determining, by the server based on the respective textual data, textual features of the respective speech unit; generating, by the server based on the respective textual features, the respective textual feature vector; generate a respective acoustic feature vector by: determining, by the server based on the corresponding segment of the digital audio signal, respective acoustic features of the corresponding segment of the digital audio signal; generating, by the server based on the respective acoustic features, the respective acoustic feature vector; generate a respective enhanced feature vector by combining the respective acoustic feature vector and the respective textual feature vector; employ a neural network (NN) configured to determine the intent of the spoken user utterance by inputting into the NN the enhanced feature vectors, the NN having been trained to estimate a probability of the intent being of a given type.

Plain English Translation

The invention relates to a server system for analyzing spoken user utterances to determine their intent. The system processes digital audio signals representing spoken input by first converting the audio into textual data through speech-to-text analysis. This conversion produces speech units, which can be words or pauses, each associated with a segment of the original audio. For each speech unit, the server generates two types of feature vectors: a textual feature vector derived from the unit's textual representation and an acoustic feature vector derived from the corresponding audio segment. These vectors are combined into an enhanced feature vector. The system then uses a neural network trained to classify intents, feeding the enhanced feature vectors into the network to estimate the probability that the utterance corresponds to a specific intent category. The neural network leverages both textual and acoustic information to improve intent recognition accuracy. This approach addresses challenges in natural language understanding by integrating multimodal features for more robust intent detection in spoken interactions.

Claim 16

Original Legal Text

16. The server of claim 15, wherein the NN is a recurrent neural network (RNN).

Plain English Translation

This claim narrows the server of claim 15: the neural network that the server employs to determine the intent is a recurrent neural network (RNN). An RNN processes its input sequence step by step and maintains a hidden state that captures context from earlier inputs. On the server, the input sequence consists of the enhanced feature vectors of the speech units, so the intent estimate reflects the order of the words and pauses in the utterance. This is the server counterpart of method claim 2.

Claim 17

Original Legal Text

17. The server of claim 15, wherein the server configured to execute the speech-to-text analysis comprises the server being configured to determine: the textual data of each speech unit; and a time interval of the corresponding segment of the digital audio signal of each speech unit.

Plain English Translation

This claim narrows the speech-to-text analysis performed by the server of claim 15. For each speech unit (a word or a pause), the server determines the textual data of the unit and the time interval of the segment of the digital audio signal that corresponds to it. The time interval links each unit to the exact portion of the audio in which it occurs, which allows the server to determine acoustic features from the correct segment. This is the server counterpart of method claim 3.

Claim 18

Original Legal Text

18. The server of claim 15, wherein the server configured to determine the respective acoustic features of the corresponding segment of the digital audio signal comprises the server being configured to: determine respective acoustic features of each sub-segment of the corresponding segment of the digital audio signal by applying a sliding window, and wherein the server configured to generate the respective acoustic feature vector comprises the server being configured to: generate respective intermediary acoustic feature vectors for each sub-segment based on the respective acoustic features; and generate, based on the respective intermediary acoustic feature vectors, the respective acoustic feature vector for the corresponding segment of the digital audio signal.

Plain English Translation

This claim narrows the server of claim 15 by specifying how the acoustic feature vector for each speech unit is built. Rather than computing acoustic features once over the whole audio segment that corresponds to a speech unit, the server applies a sliding window to that segment and determines acoustic features for each sub-segment the window covers. From the features of each sub-segment, the server generates an intermediary acoustic feature vector. It then generates the acoustic feature vector for the whole segment from those intermediary vectors. The claim does not name the acoustic features, the window length or step, or the way the intermediary vectors are combined. The result is a single acoustic feature vector per speech unit, regardless of how long the word or pause lasts. Following the steps recited in claim 1, that vector is combined with the textual feature vector of the same speech unit to form the enhanced feature vector that is input into the neural network.
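The sub-segment approach can be sketched in a few lines. The claim does not specify the features, the window parameters, or the aggregation, so everything concrete here is an assumption chosen for illustration: the two features (mean energy and zero-crossing rate), the window and hop sizes, the mean as the aggregation, and the function names themselves.

```python
from typing import List, Sequence


def sliding_windows(segment: Sequence[float], window: int, hop: int) -> List[Sequence[float]]:
    """Split a segment into overlapping sub-segments of `window` samples, stepping by `hop`."""
    return [segment[i:i + window] for i in range(0, len(segment) - window + 1, hop)]


def intermediary_vector(sub: Sequence[float]) -> List[float]:
    """Intermediary acoustic feature vector for one sub-segment (features are assumptions)."""
    energy = sum(x * x for x in sub) / len(sub)
    crossings = sum(1 for a, b in zip(sub, sub[1:]) if (a < 0) != (b < 0))
    zero_crossing_rate = crossings / (len(sub) - 1)  # requires window >= 2
    return [energy, zero_crossing_rate]


def acoustic_feature_vector(segment: Sequence[float], window: int = 4, hop: int = 2) -> List[float]:
    """Aggregate the intermediary vectors into one vector for the whole segment.

    Averaging is an assumed aggregation; the claim only says the final vector
    is generated "based on" the intermediary vectors.
    """
    vectors = [intermediary_vector(s) for s in sliding_windows(segment, window, hop)]
    if not vectors:
        raise ValueError("segment is shorter than the window")
    count = len(vectors)
    return [sum(v[k] for v in vectors) / count for k in range(len(vectors[0]))]


# Toy segment: a signal that alternates sign on every sample.
segment = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
vec = acoustic_feature_vector(segment)
```

Whatever aggregation is used, the output has a fixed length that does not depend on the segment's duration, which is what lets a short pause and a long word produce vectors of the same shape.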

Claim 19

Original Legal Text

19. The server of claim 1, wherein the given type is one of: open-ended question type; closed-ended question type; statement type; and exclamation type.

Plain English Translation

This claim specifies the intent types that the neural network distinguishes. In claim 1, the neural network has been trained to estimate the probability that the intent of the spoken user utterance is of a given type. Here the given type is one of four: an open-ended question, a closed-ended question, a statement, or an exclamation. The claim lists the types without defining them. In ordinary usage, an open-ended question invites a free-form answer, a closed-ended question expects a short answer such as yes or no, a statement is declarative, and an exclamation is an expressive utterance. The claim does not describe a separate classification module or any purely text-based linguistic analysis. The neural network works from the enhanced feature vectors of claim 1, which combine textual and acoustic features for each word and pause, so the determination can draw on how the utterance was spoken as well as on which words it contains.
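Turning a network's output into a probability for each of the four types could look like the sketch below. The claim only says the neural network estimates a probability of the intent being of a given type, so the single softmax over four raw scores is an assumption, as are the names `INTENT_TYPES`, `softmax`, and `most_likely_intent` and the example scores.

```python
import math
from typing import List, Tuple

# The four types named in the claim.
INTENT_TYPES = [
    "open-ended question",
    "closed-ended question",
    "statement",
    "exclamation",
]


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1 (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_likely_intent(scores: List[float]) -> Tuple[str, float]:
    """Return the most probable intent type and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return INTENT_TYPES[best], probs[best]


# Hypothetical raw scores, one per intent type, for a single utterance.
label, probability = most_likely_intent([0.2, 2.5, 0.1, -1.0])
```

An equally plausible reading is a separate yes/no probability for each type; the claim language does not settle which arrangement is used.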

Claim 20

Original Legal Text

20. The server of claim 15, wherein the server is further configure to: acquire auxiliary data generated by the NN for each inputted enhanced feature vector associated with a given word; responsive to determining that the intent is of the given type: execute an auxiliary MLA for determining a target word amongst the at least one word by inputting into the auxiliary MLA the auxiliary data, the target word being indicative of a context of the spoken user utterance.

Plain English Translation

This claim adds a second stage that runs after the intent has been determined. For each enhanced feature vector that is associated with a word and input into the neural network (NN), the server acquires auxiliary data generated by the NN. The claim does not specify what the auxiliary data contains. The second stage is conditional: only when the server determines that the intent is of the given type does it execute an auxiliary machine learning algorithm (MLA). The auxiliary data is input into the auxiliary MLA, which determines a target word among the words of the utterance. The target word is indicative of the context of the spoken user utterance. Note that the enhanced feature vectors are inputs to the NN, not outputs of it. As in claim 1, each one is formed by combining the acoustic feature vector and the textual feature vector of a speech unit. The claim is not directed at improving transcription accuracy or at separating similar-sounding words; it concerns selecting, from an utterance whose intent type is already known, the word that best indicates what the utterance is about.
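The conditional second stage can be sketched as a function that is skipped unless the intent matches. This is a hedged illustration only: the claim does not say what the auxiliary data is or what kind of model the auxiliary MLA is, so the per-word vectors, the linear scorer standing in for the auxiliary MLA, and the name `pick_target_word` are all assumptions.

```python
from typing import List, Optional, Sequence


def pick_target_word(
    words: Sequence[str],
    aux_data: Sequence[Sequence[float]],
    weights: Sequence[float],
    intent: str,
    gated_type: str,
) -> Optional[str]:
    """Return the target word, or None when the intent is not of the gated type.

    `aux_data[i]` is the auxiliary data the NN produced for `words[i]`.
    The weighted sum is a stand-in for the auxiliary MLA (an assumption).
    """
    if intent != gated_type:
        return None  # the auxiliary MLA is only executed for the given intent type
    scores: List[float] = [
        sum(w * x for w, x in zip(weights, vec)) for vec in aux_data
    ]
    best = max(range(len(words)), key=scores.__getitem__)
    return words[best]


# Hypothetical utterance and per-word auxiliary data (pauses carry no entry here).
words = ["is", "it", "raining", "today"]
aux = [[0.1, 0.0], [0.0, 0.1], [0.9, 0.4], [0.3, 0.8]]
weights = [1.0, 0.5]

target = pick_target_word(words, aux, weights, "closed-ended question", "closed-ended question")
skipped = pick_target_word(words, aux, weights, "statement", "closed-ended question")
```

Reusing data the NN has already produced means the second stage does not need to process the audio or the text again; it only ranks the words.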

Patent Metadata

Filing Date

Unknown

Publication Date

October 27, 2020

Inventors

Ivan Aleksandrovich KARPUKHIN

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “METHODS OF AND ELECTRONIC DEVICES FOR DETERMINING AN INTENT ASSOCIATED WITH A SPOKEN USER UTTERANCE” (10818284). https://patentable.app/patents/10818284

© 2026 Nomic Interactive Technology LLC. Machine-readable context available at /api/llm-context/10818284. See llms.txt for full attribution policy.