US-11960636

Multimodal task execution and text editing for a wearable system

Published: April 16, 2024
Assignee: not available in USPTO data we have
Inventors: not available in USPTO data we have
Technical Abstract

Examples of wearable systems and methods can use multiple inputs (e.g., gesture, head pose, eye gaze, voice, and/or environmental factors (e.g., location)) to determine a command that should be executed and objects in the three-dimensional (3D) environment that should be operated on. The multiple inputs can also be used by the wearable system to permit a user to interact with text, such as, e.g., composing, selecting, or editing text.

Patent Claims
12 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 5

Original Legal Text

5. The system of claim 1, wherein the hardware processor is programmed to implement an automated speech recognition (ASR) engine to obtain the transcription.

Plain English Translation

Claim 5 narrows the system of claim 1 (not reproduced on this page) by specifying how the transcription is obtained: the hardware processor runs an automated speech recognition (ASR) engine that converts spoken input into text. Read against the abstract, this transcription is the voice input that the wearable system combines with other inputs, such as gesture, head pose, and eye gaze, to determine a command or to compose, select, or edit text. The claim does not name a particular recognition technique, model, or audio preprocessing step; it requires only that an ASR engine implemented on the processor produce the transcription.

Claim 6

Original Legal Text

6. The system of claim 5, wherein the ASR engine is configured to produce a score associated with one or more words in the string of text, which indicates a likelihood that the ASR engine correctly transcribed such words.

Plain English Translation

Automatic speech recognition (ASR) can mistranscribe words because of background noise, accents, or unclear speech. Claim 6 adds that the ASR engine of claim 5 produces a score for one or more words in the transcribed string of text, and that the score indicates the likelihood that the engine transcribed those words correctly. In other words, the engine reports a confidence value alongside the text it outputs. Such scores let the rest of the system identify uncertain words; claim 7, for example, uses them to decide which words to emphasize on the display. The claim does not specify how the score is computed or what range of values it takes.

Claim 7

Original Legal Text

7. The system of claim 6, wherein the hardware processor is further programmed to cause the HMD to emphasize the one or more words if the likelihood of correct transcription is below a threshold level.

Plain English Translation

Claim 7 builds on the confidence scores of claim 6. When the likelihood that one or more words were transcribed correctly falls below a threshold level, the hardware processor causes the head-mounted display (HMD) to emphasize those words. The user wearing the HMD therefore sees which parts of the transcription the ASR engine is unsure about and can direct attention to them. The claim does not say what form the emphasis takes; visual highlighting, enlarged text, or a different color are plausible examples. It also leaves the threshold value open. Read with the rest of the patent, which covers editing text through combined inputs such as gaze, gesture, and voice, the emphasized words are likely candidates for the user to select and correct.
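The threshold test in claim 7 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `emphasize_low_confidence`, the 0.6 threshold, the 0-to-1 score range, and the use of asterisks as the "emphasis" are all assumptions chosen for illustration, since the claim leaves the scoring scale, the threshold, and the visual treatment open.

```python
from typing import List, Tuple


def emphasize_low_confidence(
    words: List[Tuple[str, float]], threshold: float = 0.6
) -> str:
    """Return the transcription with low-confidence words marked.

    `words` pairs each transcribed word with a hypothetical ASR score in
    [0, 1]. Words scoring below `threshold` are wrapped in asterisks as a
    stand-in for whatever emphasis the display would apply.
    """
    rendered = []
    for word, score in words:
        rendered.append(f"*{word}*" if score < threshold else word)
    return " ".join(rendered)


# Made-up scores for illustration only.
scored = [("send", 0.95), ("the", 0.99), ("fore", 0.41), ("reports", 0.88)]
print(emphasize_low_confidence(scored))  # send the *fore* reports
```

Only "fore" falls below the assumed threshold, so it is the only word marked.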

Claim 9

Original Legal Text

9. The system of claim 8, wherein the hardware processor is further programmed to determine that the user has given the user input command to edit the user-selected portion of the text based on data from the eye gaze tracking device indicating that the user's gaze has lingered on the portion of the text presented by the display for at least a threshold period of time.

Plain English Translation

This invention relates to a system for detecting user intent to edit text using eye gaze tracking. The system addresses the problem of traditional text editing interfaces requiring explicit user actions, such as clicking or tapping, to initiate editing. By leveraging eye gaze tracking, the system enables more intuitive and hands-free text editing, particularly useful in scenarios where physical interaction is limited or inconvenient. The system includes a display for presenting text, an eye gaze tracking device to monitor the user's gaze, and a hardware processor. The processor is configured to detect when a user's gaze lingers on a portion of the displayed text for at least a threshold period of time, interpreting this as an implicit command to edit that portion. The system may also include a microphone for receiving voice commands, allowing the user to further refine or confirm the editing action. The processor processes the gaze data and, if applicable, voice input, to determine the user's editing intent and executes the corresponding text modification. This approach reduces the need for manual selection and streamlines the editing process, enhancing efficiency and accessibility.
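The dwell test in claim 9 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patented method: the `GazeSample` type, the `detect_dwell` function, the one-second threshold, and the idea that each gaze sample already names the text span under the user's gaze are all hypothetical. The claim itself requires only that the gaze linger on the text for at least a threshold period of time.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class GazeSample:
    timestamp: float       # seconds since some reference time
    target: Optional[str]  # id of the text span under the gaze, or None


def detect_dwell(
    samples: Iterable[GazeSample], threshold_s: float = 1.0
) -> Optional[str]:
    """Return the first text span the gaze stays on for >= threshold_s."""
    current: Optional[str] = None
    start = 0.0
    for sample in samples:
        if sample.target != current:
            # Gaze moved to a different span (or off the text): restart timer.
            current = sample.target
            start = sample.timestamp
        elif current is not None and sample.timestamp - start >= threshold_s:
            return current
    return None


samples = [
    GazeSample(0.0, "word-1"),
    GazeSample(0.3, "word-2"),
    GazeSample(0.6, "word-2"),
    GazeSample(0.9, "word-2"),
    GazeSample(1.4, "word-2"),
]
print(detect_dwell(samples))  # word-2
```

The timer restarts whenever the gaze target changes, so brief glances at other words do not accumulate toward the threshold.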

Claim 10

Original Legal Text

10. The system of claim 8, wherein the hardware processor is programmed to determine that the user has given the user input command to edit the user-selected portion of the text based on data from the audio sensing device and data from the eye gaze tracking device indicating that the audio sensing device received a voice command while the user's gaze was focused on the portion of the text presented by the display.

Plain English Translation

Claim 10 describes another way for the system of claim 8 to recognize a command to edit a selected portion of text, this time by combining two inputs. The hardware processor uses data from the audio sensing device together with data from the eye gaze tracking device, and determines that the user has given the edit command when the audio sensing device received a voice command while the user's gaze was focused on the portion of text shown on the display. The gaze indicates which text is meant, and the voice command, received at the same time, signals the intent to edit it. Requiring the two inputs to coincide reduces the chance that stray speech or a passing glance is mistaken for a command, which is useful in augmented reality, virtual reality, and other hands-free settings where keyboards or touchscreens are impractical. The claim does not list which voice commands are recognized or which edit follows.
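The coincidence condition in claim 10 (a voice command received while the gaze is on the text) can be sketched in Python. Everything named here is hypothetical: `GazeInterval`, `VoiceCommand`, and `resolve_edit_target` are illustrative names, and representing gaze focus as time intervals is an assumption, not something the claim specifies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GazeInterval:
    start: float  # seconds; gaze arrived on the span
    end: float    # seconds; gaze left the span
    target: str   # id of the text span that was in focus


@dataclass
class VoiceCommand:
    time: float   # seconds; when the command was received
    text: str     # transcribed command, e.g. "edit"


def resolve_edit_target(
    command: VoiceCommand, intervals: List[GazeInterval]
) -> Optional[str]:
    """Return the span in focus when the command arrived, if any."""
    for interval in intervals:
        if interval.start <= command.time <= interval.end:
            return interval.target
    return None


gaze_log = [
    GazeInterval(0.0, 1.2, "word-3"),
    GazeInterval(1.5, 3.0, "word-7"),
]
print(resolve_edit_target(VoiceCommand(2.1, "edit"), gaze_log))  # word-7
print(resolve_edit_target(VoiceCommand(1.3, "edit"), gaze_log))  # None
```

A command that arrives while the gaze is not on any text span resolves to nothing, which corresponds to the claim's requirement that both inputs coincide.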

Claim 13

Original Legal Text

13. The system of claim 8, wherein the hardware processor is further programmed to produce an automated speech recognition (ASR) score associated with one or more words in the text, which indicates a likelihood that such words are correctly transcribed.

Plain English Translation

Claim 13 adds ASR scoring to the gaze-based text editing system of claim 8. The hardware processor produces an automated speech recognition (ASR) score for one or more words in the text, and the score indicates the likelihood that those words were transcribed correctly. This parallels claim 6, but here the score is attached to the system of claim 8 instead of the ASR engine of claim 5. The claim does not specify how the score is computed. Claim 14 then uses the ASR score as one of three inputs to an aggregated confidence score.

Claim 14

Original Legal Text

14. The system of claim 13, wherein the hardware processor is further programmed to calculate the aggregated confidence score utilizing the first confidence score, the second confidence score, and the ASR score.

Plain English Translation

Claim 14 specifies how the aggregated confidence score is calculated in the system of claim 13: the hardware processor computes it using three inputs, namely the first confidence score, the second confidence score, and the ASR score. The first and second confidence scores, and the aggregated confidence score itself, are introduced in claims that are not reproduced on this page, so their exact definitions cannot be stated here. Given the patent's subject matter, they plausibly correspond to confidence values for individual input modes, but that reading is an inference. The ASR score is the per-word transcription likelihood defined in claim 13. Folding it into the aggregate means that uncertainty in the speech transcription affects the system's overall confidence instead of being evaluated separately. The claim does not prescribe a formula or weighting for combining the three scores.
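Because claim 14 does not say how the three scores are combined, any concrete formula is an assumption. The Python sketch below uses a weighted average purely as one plausible choice; the function name `aggregate_confidence`, the weights, and the 0-to-1 score range are hypothetical and do not come from the patent.

```python
from typing import Sequence


def aggregate_confidence(
    scores: Sequence[float], weights: Sequence[float]
) -> float:
    """Combine confidence scores with a weighted average (an assumed rule)."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(s * w for s, w in zip(scores, weights))
    return weighted_sum / total_weight


# Hypothetical values: first confidence score, second confidence score,
# and ASR score, in that order.
first_score, second_score, asr_score = 0.8, 0.6, 0.4
combined = aggregate_confidence(
    [first_score, second_score, asr_score], weights=[1.0, 1.0, 1.0]
)
print(round(combined, 3))
```

With equal weights the result is the plain mean of the three scores; raising the weight on the ASR score would make transcription uncertainty count for more in the aggregate.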

Claim 17

Original Legal Text

17. The method of claim 15, wherein at least a portion of the text is emphasized on the display where the portion is associated with a low confidence that a translation from the spoken input to the corresponding portion of the text is correct.

Plain English Translation

Claim 17 narrows the method of claim 15 (not reproduced on this page). At least a portion of the displayed text is emphasized when that portion is associated with low confidence that the spoken input was correctly converted into the corresponding text. The claim uses the word "translation" for this speech-to-text conversion; it refers to transcribing speech, not to translating between languages. Emphasis draws the user's attention to words the system may have misheard so they can be checked or corrected. The claim does not limit the form of emphasis; bolding, underlining, and color highlighting are possible examples. This is the method counterpart of the system in claim 7.

Claim 19

Original Legal Text

19. The method of claim 18, wherein the first mode of user input comprises a speech input received from an audio sensor of the wearable device, wherein the method further comprises transcribing the speech input to identify at least one of the selected portion of text, the subject, or the command operation.

Plain English Translation

Claim 19 narrows the method of claim 18 (not reproduced on this page), which, judging from its dependent claims, involves two modes of user input, a selected portion of text, a subject, and a command operation. Here the first mode of user input is speech received from an audio sensor of the wearable device. The method also transcribes that speech and uses the transcription to identify at least one of three things: the selected portion of text, the subject, or the command operation. Per claim 22, the subject is a word, a phrase, or a sentence. Because the speech may supply only some of these elements, the others can come from the second mode of input, which claim 20 lists as a user input device, a gesture, or an eye gaze. This supports hands-free text manipulation in situations where manual input is impractical.

Claim 20

Original Legal Text

20. The method of claim 18, wherein the second mode of user input comprises an input from at least one of: a user input device, a gesture, or an eye gaze.

Plain English Translation

Claim 20 narrows the method of claim 18 by listing what the second mode of user input can be: an input from at least one of a user input device, a gesture, or an eye gaze. Combined with claim 19, where the first mode is speech, this describes multimodal interaction in which a spoken input is paired with a non-speech input, for example looking at a word or gesturing toward it while speaking. The claim only enumerates the permitted sources for the second input and places no further limits on how that input is sensed or processed.

Claim 21

Original Legal Text

21. The method of claim 18, wherein the interaction with the selected portion of text comprises at least one of: selecting, editing, or composing the selected portion of text.

Plain English Translation

Claim 21 narrows the method of claim 18 by stating what the interaction with the selected portion of text can be: at least one of selecting, editing, or composing it. This mirrors the abstract, which says the multiple inputs let a user interact with text by composing, selecting, or editing it. Selecting identifies the text to act on, editing changes existing text, and composing enters new text. The claim does not tie any of these interactions to a particular input mode; the modes are addressed in claims 19 and 20.

Claim 22

Original Legal Text

22. The method of claim 18, wherein the subject comprises one or more of: a word, a phrase, or a sentence.

Plain English Translation

Claim 22 narrows the method of claim 18 by defining the granularity of the subject: it comprises one or more of a word, a phrase, or a sentence. The subject of claim 18 is therefore a unit of text, which can be as small as a single word or as large as a full sentence. Claim 19 adds that the subject may be identified by transcribing the user's speech. The claim does not describe any parsing, ranking, or extraction process; it only states what kind of text the subject can be.

Patent Metadata

Filing Date

December 22, 2021

Publication Date

April 16, 2024

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “Multimodal task execution and text editing for a wearable system” (US-11960636). https://patentable.app/patents/US-11960636

© 2026 Nomic Interactive Technology LLC. Machine-readable context available at /api/llm-context/US-11960636. See llms.txt for full attribution policy.