US-12573405-B2

Word correction using automatic speech recognition (ASR) incremental response

Published: March 10, 2026
Assignee: not available in USPTO data we have
Inventors: not available in USPTO data we have
Technical Abstract

An exemplary automatic speech recognition (ASR) system may receive an audio input including a segment of speech. The segment of speech may be independently processed by general ASR and domain-specific ASR to generate multiple ASR results. A selection between the multiple ASR results may be performed based on respective confidence levels for the general ASR and domain-specific ASR. As incremental ASR is performed, a composite result may be generated based on general ASR and domain-specific ASR.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method, comprising:

2. The method of, further comprising:

3. The method of, wherein the at least one candidate word is further selected based on shared additional candidate words that are included in both the first speech recognition and the second speech recognition.

4. The method of, wherein the first speech recognition is a general speech recognition and the second speech recognition is a domain-specific speech recognition.

5. The method of, wherein the domain-specific speech recognition comprises a plurality of metadata types for each of a plurality of media assets.

6. The method of, wherein the speech pattern comprises at least one of: terse, choppy, narrative, and conversational.

7. The method of, wherein the first confidence value and the second confidence value are based on an edit distance between the incremental speech input and the first additional candidate word and second additional candidate word, as well as edit distances associated with prior words in the incremental speech input.

8. The method of, wherein at least one of the first speech recognition and the second speech recognition is a sequence aware speech recognition.

9. The method of, further comprising:

10. The method of, wherein the first speech recognition is sequence aware, and wherein the method further comprises:

11. A system for identifying words from speech of a user, the system comprising:

12. The system of, wherein the control circuitry is further configured to:

13. The system of, wherein the at least one candidate word is further selected based on shared additional candidate words that are included in both the first speech recognition and the second speech recognition.

14. The system of, wherein the first speech recognition is a general speech recognition and the second speech recognition is a domain-specific speech recognition.

15. The system of, wherein the domain-specific speech recognition comprises a plurality of metadata types for each of a plurality of media assets.

16. The system of, wherein the speech pattern comprises at least one of: terse, choppy, narrative, and conversational.

17. The system of, wherein the first confidence value and the second confidence value are based on an edit distance between the incremental speech input and the first additional candidate word and second additional candidate word, as well as edit distances associated with prior words in the incremental speech input.

18. The system of, wherein at least one of the first speech recognition and the second speech recognition is a sequence aware speech recognition.

19. The system of, wherein the control circuitry is further configured to:

20. The system of, wherein the first speech recognition is sequence aware, and wherein the control circuitry is further configured to:

Detailed Description

Complete technical specification and implementation details from the patent document.

This application is a continuation of U.S. patent application Ser. No. 17/384,970, filed Jul. 26, 2021, which is a continuation of U.S. patent application Ser. No. 16/408,260, filed May 9, 2019, now U.S. Pat. No. 11,107,475, which are hereby incorporated by reference herein in their entireties.

The present disclosure is directed to systems for performing automatic speech recognition (ASR), and in particular, for performing domain-related ASR.

An increasing number of systems and devices utilize ASR to interact with a user, for example, to process user speech to identify commands or other interactions with the system or device. ASR is implemented in smart devices such as phones, tablets, watches, speakers, televisions, and appliances. Vehicles, public transportation and infrastructure, and customer service systems utilize ASR for processing user or customer speech. Thus, ASR integrations are increasingly interacting with more aspects of users' lives and for increasingly broad purposes in an increasingly narrow set of applications.

ASR systems actively listen to audio, and when a verbal input is detected, the ASR system performs speech-to-text conversion and returns recognized text. In an incremental response ASR model, the ASR system provides an entire sequence including some resolved words and also words that include intermediate predicted possibilities. As ASR is increasingly used in particular sub-domains (e.g., niche applications executing on a smart device), the ASR outputs may be more likely to depart from the user's intention. For example, error correction and selection between possible results typically occur based on a corpus of training data, and further based on inferences that can be drawn from generated phonetic variations of the input speech and a model created based on the training data.

In some embodiments of the present disclosure, a speech recognition system (e.g., speech recognition software executing on a device) recognizes that it is operating in a particular domain, such as operation of a motor vehicle, a kiosk in a grocery store, customer service for a particular product, or a media guidance application. The system may employ domain-specific ASR based on the particular domain, for example, that biases results based on the vocabulary and usage peculiar to a particular domain. The system may also employ a general ASR that processes results based on general speech vocabulary and usage, for example, related to the user's language, dialect, location, and other non-domain-specific factors.

When a user speaks, the speech segment may be received by a monitoring component such as a microphone and provided to control circuitry (e.g., of a device including the microphone, or in some embodiments, a remote device such as a server). The control circuitry processes the received speech segment using both the general ASR and the domain-specific ASR, for example, by processing the same speech segment through both ASRs in parallel. Each ASR outputs respective candidate words for the speech segment along with confidence values (e.g., corresponding to edit distances between processed speech data and respective words and/or phrases) for the candidate words. Additional weighting may be provided to the outputs, for example, based on how closely the user is adhering to the domain (e.g., based on high confidence values for previous domain-specific results or a user selection to weight heavily for domain-specific speech). A composite plurality of words may be identified based on the candidate words and the confidence values (weighted, as appropriate) from the respective ASRs, resulting in a speech output that accurately captures domain-specific words and also captures general conversational words. In this manner, a user can speak normally, i.e., without resorting to excessive domain-specific speech patterns and without forgoing accuracy of domain-specific portions of the overall output. The resulting output may then be used for a variety of tasks and functions, for example, based on the end use device and the domain. In an exemplary media guidance application, the user may interact with the media guidance application conversationally to prompt queries and searches, display and playback of media, acquisition of supplemental information for a displayed media asset or actor, interaction with advertisements, coordination with parental control settings, and numerous other related functions.
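As a non-limiting illustration, the parallel processing and composite selection described above might be sketched as follows. The recognizer functions, segment identifiers, confidence values, and the `domain_weight` parameter are hypothetical stand-ins introduced only for this sketch; they are not part of the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real recognizers: each maps a speech segment
# (represented here by a string id) to (candidate word, confidence in [0, 1]).
def general_asr(segment):
    return {"seg-0": ("moves", 0.4), "seg-1": ("with", 0.7), "seg-2": ("cruise", 0.2)}[segment]

def domain_asr(segment):
    return {"seg-0": ("movies", 0.8), "seg-1": ("with", 0.6), "seg-2": ("Cruz", 0.9)}[segment]

def composite(segments, domain_weight=1.0):
    """Run both recognizers on each segment in parallel and keep the
    candidate with the higher (optionally weighted) confidence."""
    words = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for segment in segments:
            general_future = pool.submit(general_asr, segment)
            domain_future = pool.submit(domain_asr, segment)
            g_word, g_conf = general_future.result()
            d_word, d_conf = domain_future.result()
            words.append(d_word if d_conf * domain_weight > g_conf else g_word)
    return words

result = composite(["seg-0", "seg-1", "seg-2"])
```

With these assumed values, the composite output mixes words from both recognizers: the domain-specific candidates win for the first and third segments, and the general candidate wins for the second.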

In some embodiments, the multiple ASR speech recognition system may process incremental speech inputs that are updated dynamically as speech is received and processed, while the user is speaking. As each new incremental segment of speech is received, the processing of the newly received speech identifies new candidate words and confidence values for the general ASR and domain-specific ASR and selects between the candidate words based on confidence values as described herein. In some embodiments, the general ASR and domain-specific ASR are sequence aware, such that current confidence values (e.g., based on edit distances) are partially based on preceding words and phrases that may be formed by combinations therewith. In the context of domain-specific ASR, the phrases may be domain-specific. Previously selected words may also be modified based on the newly identified words and their relations to phrases, as well as respective confidence values for individual words and phrases. In some instances, a previous selection between a general ASR output and domain-specific ASR may be updated.
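One possible way to picture the revision of a previously selected word is the minimal sketch below. It assumes a hypothetical phrase table (`PHRASES`), a hypothetical additive `bonus` for known phrases, and invented confidence values; a real sequence aware ASR would derive such relations from its own models.

```python
from itertools import product

# Hypothetical domain phrases; a real sequence aware ASR would learn these.
PHRASES = {("show", "me"), ("penelope", "cruz")}

def add_segment(words, last_candidates, new_candidates, bonus=0.3):
    """Return the updated word list after one more incremental segment.

    words: words selected so far.
    last_candidates / new_candidates: lists of (word, confidence) for the
    previous position and the newest position.
    """
    if not words:
        return [max(new_candidates, key=lambda c: c[1])[0]]

    def score(pair):
        (prev_word, prev_conf), (new_word, new_conf) = pair
        phrase_bonus = bonus if (prev_word, new_word) in PHRASES else 0.0
        return prev_conf + new_conf + phrase_bonus

    # Jointly rescore the previous and newest positions, so the earlier
    # selection can change when a phrase becomes more likely.
    (prev_word, _), (new_word, _) = max(
        product(last_candidates, new_candidates), key=score
    )
    return words[:-1] + [prev_word, new_word]

first = [("snow", 0.5), ("show", 0.45)]
second = [("me", 0.7)]
initial = add_segment([], [], first)           # only one segment heard so far
updated = add_segment(initial, first, second)  # earlier word is reconsidered
```

In this sketch the first segment is initially resolved to "snow", and the arrival of "me" causes the earlier word to be updated to "show" because the pair forms a known phrase.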

In some embodiments, a variety of metrics may be used as confidence values. For example, an initial output of a speech recognition algorithm may result in a set of characters or phonemes that can be compared to possible words, for example, based on an edit distance between the initial output and each of the possible words. A match may be made from the possible words, and a final confidence value may be assigned, for example, based on the edit distance with the matched word and the relative edit distance as compared to other possible words that were not selected. In an exemplary embodiment of a domain-specific ASR for a media guidance application, the possible words used for matching may be based at least in part on media guidance metadata, which in turn may be structured into categories relevant to media guidance applications such as genre, actors, characters, directors, studio, format, resolution, source, provider, cost, playback time, user preferences, rankings, social scoring, and other similar factors relating to the management and delivery of media assets to a user.
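By way of example only, an edit-distance-based confidence value might be computed as sketched below. The Levenshtein distance is a standard measure; the specific formula that combines closeness to the matched word with the margin over the runner-up is an assumption made for illustration and is not taken from the disclosure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def match(raw, vocabulary):
    """Return (best word, confidence in [0, 1]) for a raw recognizer output.

    Assumed scoring: closeness to the best word, scaled by how far the
    best word is ahead of the second-best word.
    """
    scored = sorted((edit_distance(raw, w), w) for w in vocabulary)
    best_d, best_w = scored[0]
    closeness = 1 - best_d / max(len(raw), len(best_w), 1)
    if len(scored) > 1:
        runner_d = scored[1][0]
        margin = (runner_d - best_d) / max(runner_d, 1)
    else:
        margin = 1.0
    return best_w, closeness * (0.5 + 0.5 * margin)

word, confidence = match("penelopy", ["penelope", "panoply", "people"])
```

The resulting value falls on the zero-to-one scale, with a close and unambiguous match scoring near one.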

In some embodiments, the sequence aware ASR and sequence unaware ASR may be performed in parallel, and for one or both of the general and domain-specific ASR, such that the selection between candidate words includes multiple sets of results. For example, some words or phrases may be identified by a sequence aware ASR based on general assumptions of speech recognition algorithms with respect to common phrases, speech structures, or colloquialisms. Speech that does not conform to these assumptions (e.g., dialects, accented speech, speech from language learners, new vocabulary, and the like) may result in low confidence for a sequence aware ASR that is attempting to fit the speech into generalized patterns in some instances, whereas a sequence unaware ASR may identify the individual words accurately. In this manner, a larger set of users may be able to interact with speech recognition in a conversational manner.

The present disclosure is related to recognition of speech using results from multiple ASRs to generate a composite speech output, for example, for real-time incremental speech recognition of conversational speech. A speech recognition system may utilize both general ASR and domain-specific ASR, as well as sequence aware ASR and sequence unaware ASR (e.g., for one or both of the general ASR and domain-specific ASR). In an exemplary embodiment as described herein, a media guidance system may utilize sequence aware general ASR, sequence aware domain-specific ASR, and sequence unaware general ASR. It will be understood, however, that other ASR combinations may be used in other applications based on the disclosure provided herein.

In an exemplary embodiment, each segment of incremental speech may be provided to each of the three ASRs and processed in parallel and/or asynchronously such that each of the ASRs outputs one or more candidate words along with confidence values that are associated with the candidate words. In exemplary embodiments as described herein, a confidence value may correspond to a zero-to-one score with zero corresponding to a lowest confidence level and one corresponding to a highest confidence level. Other scales may be used rather than a zero-to-one scale, and confidence values can be represented in manners other than scores, such as vector representations of respective edit distances.

Prior to providing a result for the segment of incremental speech (e.g., as an input to command dialogue, an input to a search query, or an output for display), the system may consider each of the respective candidate words and confidence values. In some embodiments, biasing factors may also be applied to the candidate words and confidence values, for example, based on a likelihood that a user is using domain-specific or general speech (e.g., as determined from previous portions of the incremental speech, based on user patterns such as time of day or viewing context, or based on other data such as supplemental information received from other input devices). Based on the candidate words and confidence values from the respective ASRs, and other biasing if applied, the system may select one or more words for the segment of incremental speech. In some instances, portions of speech preceding the newly selected one or more words may be updated, for example, based on updated confidence values for sequence aware general and domain-specific ASRs.
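The biasing described above could, as one hedged sketch, be driven by the share of previously selected words that came from the domain-specific ASR. The `threshold` and `boost` values and the source labels are hypothetical parameters chosen for illustration.

```python
def domain_bias(previous_sources, threshold=0.5, boost=1.25):
    """Return a multiplier for domain-specific confidence values.

    previous_sources: list of "general" / "domain" labels recording which
    ASR supplied each earlier word of the incremental speech.
    """
    if not previous_sources:
        return 1.0
    share = previous_sources.count("domain") / len(previous_sources)
    return boost if share >= threshold else 1.0

def select(general, domain, previous_sources):
    """Choose between (word, confidence) pairs after applying the bias."""
    g_word, g_conf = general
    d_word, d_conf = domain
    d_conf = min(1.0, d_conf * domain_bias(previous_sources))
    return (d_word, "domain") if d_conf > g_conf else (g_word, "general")

unbiased = select(("cruise", 0.8), ("Cruz", 0.7), [])
biased = select(("cruise", 0.8), ("Cruz", 0.7), ["general", "domain", "domain"])
```

Here the same pair of candidates resolves differently once earlier words indicate that the user is speaking within the domain.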

An accompanying figure shows an illustrative embodiment of a user providing a spoken input to a user equipment device, in accordance with some embodiments of the disclosure. In an exemplary embodiment, the user is speaking in an environment that includes a user equipment device such as a smart television, although the speech processing as described herein may be performed by a variety of devices or a combination thereof, including suitable hardware (e.g., microphones) for receiving speech input from a user and control circuitry (e.g., locally on the user equipment device, on another user equipment device via communication over a local network, or on a remote device such as a remote server of a service provider accessible via communication over a communication network).

The speech recognition may be performed at a variety of times and intervals, such as whenever the user equipment device has power (e.g., a passive listening device), when the user equipment device is active (e.g., based on a user interaction device, the detection of speech in proximity to the device, or another external stimulus to initiate speech recognition), or in response to an explicit or implicit request to perform speech recognition (e.g., a user selecting a button or command to initiate speech recognition, or accessing an application that uses speech recognition).

Once speech recognition is initiated, the speech may be processed in a batched or incremental format, depending on the end use application, processing capabilities, and particular ASRs used. For example, batched processing may be utilized where a time-sensitive result (e.g., real-time display of the speech on a screen or real-time control of a game or menu) is not required, or where a local device receives and transcodes speech for remote processing (e.g., for later batched association of viewers' speech with displayed portions of media assets). Batch processing may also be implemented where the speech is likely to follow a given command format (e.g., with particular speech patterns and commands that are easily recognized), although incremental speech recognition is also suitable for such applications. In some embodiments, incremental speech recognition may be utilized in real-time applications in which a most recent word or phrase is displayed immediately or immediately used for a resulting action such as performing a search request, navigating menus, or performing steps in a game. In some embodiments, incremental speech recognition may be utilized where the received speech is likely to include at least partial conversational content that does not correspond to a predetermined command format.

Once the speech is processed, the resulting output may be put to use. In an exemplary embodiment of a media guidance application, the processed speech may be used to perform any operations that are typically performed by a remote or by menus of a program guide or application. For example, the user's speech may be analyzed to download content for viewing, provide recommendations, target advertisements, navigate menus, request related information from non-media sources, and interact with other devices and equipment within a user's home. In some embodiments, a domain used for selection of the domain-specific ASR for use in speech processing may be fixed for a device or application, while in other embodiments a domain-specific ASR may be selected from multiple options. The selection of a domain-specific ASR may be based on user interactions that indicate a domain, on analysis of the user's speech, or on combinations thereof. In an exemplary embodiment, general speech recognition can analyze a portion of speech, and the results can be analyzed to identify a likely domain. In other embodiments, a user's speech may be provided to multiple domain-specific ASRs and an appropriate domain-specific ASR selected based on relative confidence score outputs.
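Selection among multiple domain-specific ASRs based on relative confidence might look like the following sketch. Comparing mean confidence values is an assumed criterion, and the domain names and scores are invented for illustration.

```python
from statistics import mean

def pick_domain(outputs):
    """Return the domain whose ASR reported the highest mean confidence.

    outputs: dict mapping a domain name to a non-empty list of
    (candidate word, confidence) pairs for the same portion of speech.
    """
    return max(outputs, key=lambda name: mean(c for _, c in outputs[name]))

outputs = {
    "media_guidance": [("show", 0.7), ("movies", 0.8)],
    "vehicle": [("slow", 0.3), ("moves", 0.4)],
}
chosen = pick_domain(outputs)
```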

An accompanying figure shows an illustrative embodiment of a multiple ASR speech processing system, in accordance with some embodiments of the disclosure. Although a particular combination of multiple ASRs is depicted in the figure, it will be understood that a variety of combinations of ASRs may be utilized in accordance with the present disclosure. For example, different general ASRs using different methodologies may provide different outputs and confidence scores. Domain-specific ASRs may be combined for particular uses or applications, such as combining a sports-focused ASR with a media guidance ASR for a user navigating a media guidance application for management and viewing of sporting events (e.g., selected based on a user profile and recognition of interest in viewing a sporting event identified during conversational speech). Any of the ASRs (e.g., general or domain-specific) may be sequence aware or sequence unaware. In some embodiments, multiple ASRs may be available (e.g., via a local and/or remote network) and may be selectively activated for particular applications and use cases (e.g., based on user profiles, user selections, analysis of user speech, and/or external information sources).

Input speech may be received at a speech input device, which may be any suitable microphone or other transducer capable of capturing speech information. The speech input device may be located on a suitable user equipment device. The received speech information may be transcoded and the transcoded speech information provided to control circuitry of a user equipment device such as the user equipment device including the speech input device or another device (e.g., a user equipment device or remote server) accessible via a local or wide area network. Transcoded speech may be provided for processing as segments in a streaming or batched fashion, for example, during spaces between utterances or on a periodic basis.

A segment of transcoded speech from the speech input device may be provided to a plurality of ASRs for processing, such as general ASR, domain-specific ASR, and sequence unaware ASR. The respective ASRs may be located on a common device (e.g., the user device including the speech input device or a remote server) or may be located on different devices (e.g., a general ASR and sequence unaware ASR on a local device and a remotely accessible domain-specific ASR). In some embodiments, the domain-specific ASR may be selectable based on multiple available domain-specific ASRs. In the exemplary embodiment described in the context of the figure, the domain-specific ASR may be a sequence aware media guidance ASR.

General ASR may be a sequence aware ASR and may receive the segment of transcoded input speech, e.g., at control circuitry executing instructions on a user equipment device or remote server. General ASR may process the received segment of transcoded input speech to generate one or more candidate words and one or more associated confidence values. In some embodiments, the output of the general ASR may include multiple candidate words and confidence values (e.g., respective edit distances) for each potential word in the segment of input speech. In instances in which the segment of input speech corresponds to a set of multiple words, each potential word in the segment of input speech may be associated with one or more candidate words and confidence values. Because the general ASR is sequence aware, in some embodiments the confidence values associated with candidate words may be partially contingent on previous or subsequent word selections and confidence values. General ASR may output candidate words and confidence values to ASR integration.

Domain-specific ASR may be a sequence aware ASR and may receive the segment of transcoded input speech, e.g., at control circuitry executing instructions on a user equipment device or remote server. Domain-specific ASR may process the received segment of transcoded input speech to generate one or more candidate words and one or more associated confidence values. In some embodiments, the output of the domain-specific ASR may include multiple candidate words and confidence values (e.g., respective edit distances) for each potential word in the segment of input speech as compared to a corpus of domain-specific words. For example, a media guidance ASR may include a limited set of words including information about media assets (e.g., metadata indicating title, genre, plot, actors, characters, run-time, director, studio, etc.), information about media asset sources (e.g., play times, video quality and formats, costs, login information, subscriptions, etc.), information about user equipment devices used for playback (e.g., hardware capabilities, installed applications, available playback operations, etc.), and media guidance operations (e.g., menu navigation operations, playback operations, etc.). Confidence values (e.g., based on respective edit distances) may be calculated with respect to the corpus of domain-specific words. In instances in which the segment of input speech corresponds to a set of multiple words, each potential word in the segment of input speech may be associated with one or more candidate words and confidence values. In instances where the domain-specific ASR is sequence aware, in some embodiments the confidence values associated with candidate words may be partially contingent on previous or subsequent word selections and confidence values. Domain-specific ASR may output candidate words and confidence values to ASR integration.
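As an illustrative sketch, a domain-specific corpus might be built from media asset metadata and then used to score candidates. The asset records below are invented, and the similarity score uses Python's `difflib.SequenceMatcher` ratio as a convenient substitute for the edit-distance-based confidence values the disclosure mentions; it is not the same measure.

```python
from difflib import SequenceMatcher

# Hypothetical media asset metadata, organized by metadata type.
ASSETS = [
    {"title": "Example Movie", "genre": "drama", "actors": ["Penelope Cruz"]},
    {"title": "Another Example", "genre": "comedy", "actors": ["Sample Actor"]},
]

def build_vocabulary(assets):
    """Map each lowercase word to the set of metadata types it appears in."""
    vocab = {}
    for asset in assets:
        for field, value in asset.items():
            values = value if isinstance(value, list) else [value]
            for text in values:
                for word in text.lower().split():
                    vocab.setdefault(word, set()).add(field)
    return vocab

def domain_candidates(raw, vocab, limit=3):
    """Return up to `limit` (word, similarity, metadata types) tuples."""
    scored = [(SequenceMatcher(None, raw.lower(), w).ratio(), w) for w in vocab]
    scored.sort(reverse=True)
    return [(w, round(s, 2), sorted(vocab[w])) for s, w in scored[:limit]]

vocab = build_vocabulary(ASSETS)
best = domain_candidates("penelopy", vocab)[0]
```

Because the corpus is limited to domain words, a slightly misrecognized name resolves to the nearest domain term along with the metadata type in which it occurs.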

Sequence unaware ASR may receive the segment of transcoded input speech, e.g., at control circuitry executing instructions on a user equipment device or remote server. Sequence unaware ASR may process the received segment of transcoded input speech to generate one or more candidate words and one or more associated confidence values. In some embodiments, the output of the sequence unaware ASR may include multiple candidate words and confidence values (e.g., respective edit distances) for each potential word in the segment of input speech. In instances in which the segment of input speech corresponds to a set of multiple words, each potential word in the segment of input speech may be associated with one or more candidate words and confidence values. Because the ASR is sequence unaware, the confidence values associated with candidate words are determined without regard to adjacent selected words and/or other confidence values. Sequence unaware ASR may output candidate words and confidence values to ASR integration.

ASR integration may receive candidate words and associated confidence values from a plurality of ASRs, such as general ASR, domain-specific ASR, and sequence unaware ASR. ASR integration may be located locally with any or all of the input speech device and ASRs as components (e.g., instructions executing on control circuitry) of a common device (e.g., a user equipment device) or on a local network. In some embodiments, ASR integration may be located remotely from some or all of the other components of the speech processing system, for example, as a service executing on a remote server. ASR integration may process the received candidate words and confidence values to generate a composite output. In some embodiments, ASR integration may select the candidate word with the best confidence value (e.g., highest confidence value). In other embodiments, bias may be introduced into the selection such that confidence values received from a particular ASR source are modified (e.g., proportionally increased), or to perform a tie-breaking function. For example, during initial processing of conversational speech, the confidence values of general ASR may receive additional weighting, whereas domain-specific ASR confidence values receive increased weighting once sufficient information (e.g., a threshold number or percentage of words) indicates that the received speech relates to the domain. Sequence unaware ASR may also receive increased weighting, for example, based on speech patterns indicating that the user's speech is relatively terse or choppy. In a similar manner, the general ASR outputs may receive additional weighting for speech patterns that are more narrative or conversational.
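The speech-pattern weighting described above might, as one non-limiting sketch, be expressed as a table of multipliers. The multiplier of 1.2, the source labels, and the confidence values are assumptions made for illustration.

```python
# Assumed multipliers: terse or choppy speech favors the sequence unaware
# ASR; narrative or conversational speech favors the general ASR.
PATTERN_WEIGHTS = {
    "terse": {"sequence_unaware": 1.2},
    "choppy": {"sequence_unaware": 1.2},
    "narrative": {"general": 1.2},
    "conversational": {"general": 1.2},
}

def integrate(results, pattern):
    """Select (word, source) with the highest weighted confidence.

    results: dict mapping an ASR source label to (candidate word, confidence).
    pattern: detected speech pattern; unknown patterns apply no weighting.
    """
    weights = PATTERN_WEIGHTS.get(pattern, {})

    def weighted(item):
        source, (_, conf) = item
        return conf * weights.get(source, 1.0)

    source, (word, _) = max(results.items(), key=weighted)
    return word, source

results = {
    "general": ("week's", 0.6),
    "domain": ("week's", 0.5),
    "sequence_unaware": ("weak", 0.65),
}
conversational_choice = integrate(results, "conversational")
terse_choice = integrate(results, "terse")
```

With these assumed values, the same candidates resolve to "week's" for conversational speech and to "weak" for terse speech, which shows how the weighting can change the selection.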

ASR integration outputs the selected candidate words for further processing by other programs and processes, such as a media guidance application, a vehicle control system, a smart home controller, kitchen appliances, or retail operations. The output may be apportioned in a variety of manners, for example, by providing only the word or words corresponding to the most recent portion of speech or by providing particular portions of speech, such as phrases or sentences. In some instances, for example, where a previously provided word has been modified (e.g., based on sequence aware modifications to prior words due to a high-confidence later word), one or more flags or indicators may be provided by the ASR integration to trigger processing by the recipient program or process. Other flags may be provided to other programs and processes, for example, to highlight words associated with high or low confidence values, to highlight likely commands or selections of domain-specific words, or to provide information regarding the ASR integration (e.g., implementation of domain-specific biasing based on reception of a threshold portion of domain-specific words). ASR integration may also receive information to influence its processing and selections (e.g., biasing) from other processes and programs (e.g., based on user selections, results determined by other programs and processes, etc.).
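One hypothetical shape for an output that carries such flags is sketched below. The `IncrementalOutput` type, its field names, and the `low_threshold` value are assumptions introduced for illustration and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IncrementalOutput:
    words: List[str]
    revised_indices: List[int]         # earlier words changed since last output
    low_confidence_indices: List[int]  # words below the assumed threshold

def build_output(previous_words, current, low_threshold=0.5):
    """Compare the new selection against the previous output and flag changes.

    previous_words: words provided in the prior incremental output.
    current: list of (word, confidence) for the full selection so far.
    """
    words = [w for w, _ in current]
    revised = [i for i, w in enumerate(words[:len(previous_words)])
               if w != previous_words[i]]
    low = [i for i, (_, c) in enumerate(current) if c < low_threshold]
    return IncrementalOutput(words, revised, low)

out = build_output(["snow"], [("show", 0.6), ("me", 0.7)])
```

A recipient program could inspect `revised_indices` to decide whether an action already taken on an earlier word needs to be repeated.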

An accompanying figure shows an illustrative embodiment of incremental speech processing for a media guidance application with multiple ASRs, in accordance with some embodiments of the disclosure. In the exemplary embodiment of the figure, incremental speech processing is applied using a sequence aware general ASR, sequence aware domain-specific ASR, and sequence unaware general ASR. The user's input speech is depicted in one column, with incremental speech input depicted in successive rows. A second column corresponds to ASR results and a third column corresponds to confidence factors. Each of the rows includes 3 sub-rows corresponding to the ASR results column and confidence factor column, with the top sub-row corresponding to output from a sequence aware general ASR, the middle sub-row corresponding to output from a sequence aware domain-specific ASR, and the bottom sub-row corresponding to output from a sequence unaware general ASR.

In row, the user has provided input speech of “show.” The sequence aware general ASR and sequence aware domain-specific ASR, lacking context to consider in forming a recommendation, may output candidate words of “snow” and “show,” each with respective confidence values of 0.4. The sequence unaware ASR may correctly recognize the input speech as corresponding to “show” and may have a relatively higher confidence value of 0.7. The ASR integration may thus select “show” as the initial incremental output corresponding to the speech received in row.

In row, the user has provided additional input speech of “me.” The sequence aware general ASR may be initially working from incorrect context (e.g., the candidate word of “snow”) and may thus have difficulty constructing a proper sequence of words including the new input, resulting in a confidence value of 0.2. The confidence value for the domain-specific ASR may increase to 0.6 for the candidate words “show me,” based on correspondence to a likely media guidance command. The sequence unaware ASR may properly capture the word “me,” providing additional confirmation based on a confidence value of 0.7. As a result, the ASR integration may provide an incremental output in column/rowof “show me.”

In row, the user has provided additional input speech of “this.” Based on feedback of the previous incremental outputs, the sequence aware general ASR may correctly identify “this” as a candidate word with a confidence value of 0.6. The sequence unaware general ASR may recognize “this” as a candidate word with a confidence value 0.7. The domain-specific ASR may not recognize “this,” for example, based on “this” lacking a correspondence to common usage in the media guidance application of the domain-specific ASR, and may instead identify “miss” (e.g., as in, did the user “miss” a show) with a low confidence value of 0.3. Based on the multiple high confidence values for “this,” the ASR integration may provide an incremental output in column/rowof “show me this.”

In row, the user has provided additional input speech of “week's.” Both of the sequence aware ASRs (general and domain-specific) may recognize the common sequenced speech of “this week's” including recognition that “week's” is possessive based on the related context. The resulting respective confidence values of 0.7 and 0.5 may be based in part on the confidence from the previous analysis of “this.” The sequence unaware ASR may identify the most recent word as “weak” and may have a relatively high confidence value of 0.6, for example, based on the user not pronouncing the ending “s” clearly or emphatically. The ASR integration may nonetheless select “week's” for output based on the relative scores and the agreement between the sequence aware general and domain-specific ASRs. Thus, the resulting complete incremental output corresponding to column/rowmay be “show me this week's.”

In row, the user has provided additional input speech of “movies.” The domain-specific ASR may recognize this speech as corresponding to “movies” with a high confidence value of 0.8, for example, based on a limited number of options having a similar sound profile within the domain-specific ASR. Both of the general ASRs (sequence aware and sequence unaware) may recognize the input speech as “moves” but with relatively low confidence values of 0.4 and 0.5, respectively. Based on these confidence values, the resulting complete incremental output corresponding to column/rowmay be “show me this week's movies.”

In row, the user has provided additional input speech of “with.” Both of the sequence aware ASRs (general and domain-specific) may recognize the common sequenced speech of “movies with,” in view of the previous inclusion of “movies” as the last word in the previous incremental output, resulting in respective confidence values of 0.7 and 0.6. The sequence unaware ASR may identify the most recent word as “this” and may have a relatively low confidence value of 0.4. The ASR integration may select “with” for output based on the relative scores and the agreement between the sequence aware general and domain-specific ASRs. Thus, the resulting complement incremental output corresponding to column/rowmay be “show me this week's movies with.”

In the next row, the user has provided additional input speech of “Penelope.” The domain-specific ASR may recognize this speech as corresponding to “Penelope” with a high confidence value of 0.8, for example, based on the use of the name “Penelope” in media asset titles and in the names of actresses. Both of the general ASRs (sequence aware and sequence unaware) may recognize the input speech as “panoply” but with relatively low confidence values of 0.3 and 0.2, for example, based on a large edit distance between the raw speech data and the most likely selection of “panoply.” Based on these confidence values, the resulting complete incremental output corresponding to this row may be “show me this week's movies with Penelope.”
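The edit-distance reasoning in the row above can be illustrated with a minimal sketch. It assumes a character-level Levenshtein distance as a stand-in for whatever comparison an actual system would make against the raw speech data (which might instead operate on phoneme sequences), and the normalization into a confidence value is likewise an assumption rather than anything stated in the disclosure. The function names `levenshtein` and `confidence_from_distance` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def confidence_from_distance(heard: str, candidate: str) -> float:
    """Assumed mapping: 1.0 for an exact match, falling toward 0.0 as distance grows."""
    longest = max(len(heard), len(candidate)) or 1
    return 1.0 - levenshtein(heard.lower(), candidate.lower()) / longest


if __name__ == "__main__":
    heard = "penelope"  # hypothetical text rendering of the raw speech
    for candidate in ("Penelope", "panoply"):
        print(candidate, round(confidence_from_distance(heard, candidate), 2))
```

Under this assumed mapping, a candidate that is far from the heard input scores lower than one that matches it, which mirrors the low confidence values given for “panoply.”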

In the next row, the user has provided additional input speech of “Cruz.” The domain-specific ASR may recognize this speech as corresponding to “Cruz” with a high confidence value of 0.9, for example, based on the use of the name “Cruz” by actors and actresses and the correspondence to the previous word “Penelope” for the actress Penelope Cruz. The sequence aware general ASR may recognize the input speech as “cruise” but may nonetheless provide a low confidence value of 0.2, based on a lack of usage of “panoply” (the previous candidate word from the sequence aware ASR) in sequence with the word “cruise.” The sequence unaware ASR may output a candidate word of “cruise” with a relatively high confidence value of 0.8. The ASR integration may select the output from the domain-specific ASR based on the high confidence value, and also based on an applied bias due to previous language (e.g., “show me this week's movies”) being indicative that the user is using vocabulary specific to the media guidance domain. The resulting complete incremental output corresponding to this row may be “show me this week's movies with Penelope Cruz.”
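The row-by-row selection described above can be sketched as follows. This is an illustrative sketch only: the confidence values are the ones given in the rows above, but the scoring rule (best confidence per word, plus an assumed bonus when the two sequence aware ASRs agree and an assumed bias toward the domain-specific ASR) is one possible integration rule, not necessarily the one used in any embodiment. The engine labels, `AGREEMENT_BONUS`, `DOMAIN_BIAS`, `select_word`, and `incremental_outputs` are all hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical labels for the three recognizers in the example.
GENERAL_SEQ = "general_sequence_aware"
DOMAIN_SEQ = "domain_sequence_aware"
GENERAL_UNSEQ = "general_sequence_unaware"

AGREEMENT_BONUS = 0.1  # assumed reward when both sequence aware ASRs agree
DOMAIN_BIAS = 0.05     # assumed boost when earlier words look domain-specific


@dataclass(frozen=True)
class Candidate:
    engine: str
    word: str
    confidence: float


def select_word(candidates, domain_context=False):
    """Return the candidate word with the highest integrated score."""
    scores, proposers = {}, {}
    for c in candidates:
        conf = c.confidence
        if domain_context and c.engine == DOMAIN_SEQ:
            conf += DOMAIN_BIAS
        scores[c.word] = max(scores.get(c.word, 0.0), conf)
        proposers.setdefault(c.word, set()).add(c.engine)
    for word, engines in proposers.items():
        if {GENERAL_SEQ, DOMAIN_SEQ} <= engines:
            scores[word] += AGREEMENT_BONUS
    return max(scores, key=scores.get)


# Candidate words and confidence values taken from the rows described above.
ROWS = [
    [Candidate(GENERAL_SEQ, "week's", 0.7), Candidate(DOMAIN_SEQ, "week's", 0.5),
     Candidate(GENERAL_UNSEQ, "weak", 0.6)],
    [Candidate(GENERAL_SEQ, "moves", 0.4), Candidate(DOMAIN_SEQ, "movies", 0.8),
     Candidate(GENERAL_UNSEQ, "moves", 0.5)],
    [Candidate(GENERAL_SEQ, "with", 0.7), Candidate(DOMAIN_SEQ, "with", 0.6),
     Candidate(GENERAL_UNSEQ, "this", 0.4)],
    [Candidate(GENERAL_SEQ, "panoply", 0.3), Candidate(DOMAIN_SEQ, "Penelope", 0.8),
     Candidate(GENERAL_UNSEQ, "panoply", 0.2)],
    [Candidate(GENERAL_SEQ, "cruise", 0.2), Candidate(DOMAIN_SEQ, "Cruz", 0.9),
     Candidate(GENERAL_UNSEQ, "cruise", 0.8)],
]


def incremental_outputs(prefix, rows, domain_context=True):
    """Append one selected word per row and record each incremental output."""
    words, outputs = list(prefix), []
    for row in rows:
        words.append(select_word(row, domain_context))
        outputs.append(" ".join(words))
    return outputs


if __name__ == "__main__":
    for line in incremental_outputs(["show", "me", "this"], ROWS):
        print(line)
```

With the values above, the last incremental output is “show me this week's movies with Penelope Cruz,” matching the final row. A different choice of bonus or bias could change which candidate wins when confidence values are close, such as “Cruz” at 0.9 against “cruise” at 0.8.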

The accompanying figures describe exemplary devices, systems, servers, and related hardware for creating, distributing, analyzing, and displaying media assets and content in accordance with the present disclosure. Such devices may interact with the speech recognition devices and methods described herein. As referred to herein, the terms “media asset” and “content” should be understood to mean an electronically consumable user asset, such as television programming, as well as pay-per-view programs, on-demand programs (as in video-on-demand (VOD) systems), Internet content (e.g., streaming content, downloadable content, Webcasts, etc.), video clips, audio, content information, pictures, rotating images, documents, playlists, websites, articles, books, electronic books, blogs, chat sessions, social media, applications, games, and/or any other media or multimedia and/or combination of the same. As referred to herein, the term “multimedia” should be understood to mean content that utilizes at least two different content forms described above, for example, text, audio, images, video, or interactivity content forms. Content may be recorded, played, displayed or accessed by user equipment devices, but can also be part of a live performance.

The application and/or any instructions for performing any of the embodiments discussed herein may be encoded on computer readable media. Computer readable media includes any media capable of storing data. The computer readable media may be transitory, including, but not limited to, propagating electrical or electromagnetic signals, or may be non-transitory including, but not limited to, volatile and non-volatile computer memory or storage devices such as a hard disk, floppy disk, USB drive, DVD, CD, media cards, register memory, processor caches, Random Access Memory (“RAM”), etc.

As referred to herein, the phrase “user equipment device,” “user equipment,” “user device,” “electronic device,” “electronic equipment,” “media equipment device,” or “media device” should be understood to mean any device for accessing the content described above, such as a television, a Smart TV, a set-top box, an integrated receiver decoder (IRD) for handling satellite television, a digital storage device, a digital media receiver (DMR), a digital media adapter (DMA), a streaming media device, a DVD player, a DVD recorder, a connected DVD, a local media server, a BLU-RAY player, a BLU-RAY recorder, a personal computer (PC), a laptop computer, a tablet computer, a WebTV box, a personal computer television (PC/TV), a PC media server, a PC media center, a hand-held computer, a stationary telephone, a personal digital assistant (PDA), a mobile telephone, a portable video player, a portable music player, a portable gaming machine, a smartphone, or any other television equipment, computing equipment, or wireless device, and/or combination of the same. In some embodiments, the user equipment device may have a front facing screen and a rear facing screen, multiple front screens, or multiple angled screens. In some embodiments, the user equipment device may have a front facing camera and/or a rear facing camera. On these user equipment devices, users may be able to navigate among and locate the same content available through a television. Consequently, a user interface in accordance with the present disclosure may be available on these devices, as well. The user interface may be for content available only through a television, for content available only through one or more of other types of user equipment devices, or for content available both through a television and one or more of the other types of user equipment devices. 
The user interfaces described herein may be provided as online applications (i.e., provided on a web-site), or as stand-alone applications or clients on user equipment devices. Various devices and platforms that may implement the present disclosure are described in more detail below.

The devices and systems described herein may allow a user to provide user profile information or may automatically compile user profile information. An application may, for example, monitor the content the user accesses and/or other interactions the user may have with the system and media assets provided through the system. Additionally, the application may obtain all or part of other user profiles that are related to a particular user (e.g., from other websites on the Internet the user accesses, such as www.Tivo.com, from other applications the user accesses, from other interactive applications the user accesses, from another user equipment device of the user, etc.), and/or obtain information about the user from other sources that the application may access. As a result, a user can be provided with a unified experience across the user's different user equipment devices. Additional personalized application features are described in greater detail in Ellis et al., U.S. Patent Application Publication No. 2005/0251827, filed Jul. 11, 2005, Boyer et al., U.S. Pat. No. 7,165,098, issued Jan. 16, 2007, and Ellis et al., U.S. Patent Application Publication No. 2002/0174430, filed Feb. 21, 2002, which are hereby incorporated by reference herein in their entireties.

Users may access content and applications from one or more of their user equipment devices. Generalized embodiments of an illustrative user equipment device and an illustrative user equipment system are shown in the figures. For example, the user equipment device can be a smartphone device. In another example, the user equipment system can be a user television equipment system. In another example, the user equipment system may be an in-vehicle entertainment system and/or vehicle control system. The user equipment system may comprise a set top box. The set top box may be communicatively connected to a speaker and a display. In some embodiments, the display may be a television display or a computer display. In some embodiments, the set top box may be communicatively connected to a user interface input. In some embodiments, the user interface input may be a remote control device. The set top box may include a circuit board. In some embodiments, the circuit board may include processing circuitry, control circuitry, and storage (e.g., RAM, ROM, Hard Disk, Removable Disk, etc.). In some embodiments, the circuit board may include an input/output path. Additional implementations of user equipment devices are discussed below. Each one of the user equipment device and the user equipment system may receive content and data via an input/output (hereinafter “I/O”) path. The I/O path may provide content (e.g., broadcast programming, on-demand programming, Internet content, content available over a local area network (LAN) or wide area network (WAN), and/or other content) and data to control circuitry, which includes processing circuitry and storage. The control circuitry may be used to send and receive commands, requests, and other suitable data using the I/O path. The I/O path may connect the control circuitry (and specifically the processing circuitry) to one or more communication paths (described below). I/O functions may be provided by one or more of these communication paths, but are shown as a single path to avoid overcomplicating the drawing.

The control circuitry may be based on any suitable processing circuitry. As referred to herein, processing circuitry should be understood to mean circuitry based on one or more microprocessors, microcontrollers, digital signal processors, programmable logic devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc., and may include a multi-core processor (e.g., dual-core, quad-core, hexa-core, or any suitable number of cores) or supercomputer. In some embodiments, processing circuitry may be distributed across multiple separate processors or processing units, for example, multiple of the same type of processing units (e.g., two Intel Core i7 processors) or multiple different processors (e.g., an Intel Core i5 processor and an Intel Core i7 processor). In some embodiments, the control circuitry executes instructions for an application stored in memory (i.e., storage). Specifically, the control circuitry may be instructed by applications to perform the functions discussed above and below. For example, applications may provide instructions to the control circuitry to generate displays. In some implementations, any action performed by the control circuitry may be based on instructions received from the applications.

In client/server-based embodiments, the control circuitry may include communications circuitry suitable for communicating with an application server or other networks or servers. The instructions for carrying out the above-mentioned functionality may be stored on the application server. Communications circuitry may include a cable modem, an integrated services digital network (ISDN) modem, a digital subscriber line (DSL) modem, telephone modem, Ethernet card, or wireless modem for communications with other equipment, or any other suitable communications circuitry. Such communications may involve the Internet or any other suitable communications networks or paths (which are described in more detail below). In addition, communications circuitry may include circuitry that enables peer-to-peer communication of user equipment devices, or communication of user equipment devices in locations remote from each other (described in more detail below).

Memory may be an electronic storage device provided as storage that is part of the control circuitry. As referred to herein, the phrase “electronic storage device” or “storage device” should be understood to mean any device for storing electronic data, computer software, or firmware, such as random-access memory, read-only memory, hard drives, optical drives, digital video disc (DVD) recorders, compact disc (CD) recorders, BLU-RAY disc (BD) recorders, BLU-RAY 3D disc recorders, digital video recorders (DVR, sometimes called a personal video recorder, or PVR), solid state devices, quantum storage devices, gaming consoles, gaming media, or any other suitable fixed or removable storage devices, and/or any combination of the same. The storage may be used to store various types of content described herein as well as data described above. Nonvolatile memory may also be used (e.g., to launch a boot-up routine and other instructions). Cloud-based storage may be used to supplement the storage or instead of the storage.

The control circuitry may include video generating circuitry and tuning circuitry, such as one or more analog tuners, one or more MPEG-2 decoders or other digital decoding circuitry, high-definition tuners, or any other suitable tuning or video circuits or combinations of such circuits. Encoding circuitry (e.g., for converting over-the-air, analog, or digital signals to MPEG signals for storage) may also be provided. The control circuitry may also include scaler circuitry for upconverting and downconverting content into the preferred output format of each one of the user equipment device and the user equipment system. The circuitry may also include digital-to-analog converter circuitry and analog-to-digital converter circuitry for converting between digital and analog signals. The tuning and encoding circuitry may be used by the user equipment device to receive and to display, to play, or to record content. The tuning and encoding circuitry may also be used to receive guidance data. The circuitry described herein, including for example, the tuning, video generating, encoding, decoding, encrypting, decrypting, scaler, and analog/digital circuitry, may be implemented using software running on one or more general purpose or specialized processors. Multiple tuners may be provided to handle simultaneous tuning functions (e.g., watch and record functions, picture-in-picture (PIP) functions, multiple-tuner recording, etc.). If the storage is provided as a separate device from each one of the user equipment device and the user equipment system, the tuning and encoding circuitry (including multiple tuners) may be associated with the storage.

A user may send instructions to the control circuitry using a user input interface. The user input interface may be any suitable user interface, such as a remote control, mouse, trackball, keypad, keyboard, touch screen, touchpad, stylus input, joystick, microphone, voice recognition interface, or other user input interfaces. The display may be provided as a stand-alone device or integrated with other elements of each one of the user equipment device and the user equipment system. For example, the display may be a touchscreen or touch-sensitive display. In such circumstances, the user input interface may be integrated with or combined with the display. The display may be one or more of a monitor, a television, a liquid crystal display (LCD) for a mobile device, amorphous silicon display, low temperature poly silicon display, electronic ink display, electrophoretic display, active matrix display, electro-wetting display, electrofluidic display, cathode ray tube display, light-emitting diode display, electroluminescent display, plasma display panel, high-performance addressing display, thin-film transistor display, organic light-emitting diode display, surface-conduction electron-emitter display (SED), laser television, carbon nanotubes, quantum dot display, interferometric modulator display, or any other suitable equipment for displaying visual images. In some embodiments, the display may be HDTV-capable. In some embodiments, the display may be a 3D display, and the interactive application and any suitable content may be displayed in 3D. A video card or graphics card may generate the output to the display. The video card may offer various functions such as accelerated rendering of 3D scenes and 2D graphics, MPEG-2/MPEG-4 decoding, TV output, or the ability to connect multiple monitors. The video card may be any processing circuitry described above in relation to the control circuitry. The video card may be integrated with the control circuitry. Speakers may be provided as integrated with other elements of each one of the user equipment device and the user equipment system or may be stand-alone units. The audio component of videos and other content displayed on the display may be played through the speakers. In some embodiments, the audio may be distributed to a receiver (not shown), which processes and outputs the audio via the speakers.

Applications may be implemented using any suitable architecture. For example, they may be stand-alone applications wholly implemented on each one of the user equipment device and the user equipment system. In such an approach, instructions of the applications are stored locally (e.g., in storage), and data for use by the application is downloaded on a periodic basis (e.g., from an out-of-band feed, from an Internet resource, or using another suitable approach). The control circuitry may retrieve instructions of the application from the storage and process the instructions to generate any of the displays discussed herein. Based on the processed instructions, the control circuitry may determine what action to perform when input is received from the input interface. For example, movement of a cursor on a display up/down may be indicated by the processed instructions when the input interface indicates that an up/down button was selected.

In some embodiments, the application is a client/server-based application. Data for use by a thick or thin client implemented on each one of the user equipment device and the user equipment system is retrieved on demand by issuing requests to a server remote to each one of the user equipment device and the user equipment system. In one example of a client/server-based application, the control circuitry runs a web browser that interprets web pages provided by a remote server. For example, the remote server may store the instructions for the application in a storage device. The remote server may process the stored instructions using circuitry (e.g., control circuitry) and generate the displays discussed above and below. The client device may receive the displays generated by the remote server and may display the content of the displays locally on each one of the user equipment device and the user equipment system. This way, the processing of the instructions is performed remotely by the server while the resulting displays are provided locally on each one of the user equipment device and the user equipment system. Each one of the user equipment device and the user equipment system may receive inputs from the user via the input interface and transmit those inputs to the remote server for processing and generating the corresponding displays. For example, each one of the user equipment device and the user equipment system may transmit a communication to the remote server indicating that an up/down button was selected via the input interface. The remote server may process instructions in accordance with that input and generate a display of the application corresponding to the input (e.g., a display that moves a cursor up/down). The generated display is then transmitted to each one of the user equipment device and the user equipment system for presentation to the user.

In some embodiments, the application is downloaded and interpreted or otherwise run by an interpreter or virtual machine (run by the control circuitry). In some embodiments, the application may be encoded in the ETV Binary Interchange Format (EBIF), received by the control circuitry as part of a suitable feed, and interpreted by a user agent running on the control circuitry. For example, the application may be an EBIF application. In some embodiments, the application may be defined by a series of JAVA-based files that are received and run by a local virtual machine or other suitable middleware executed by the control circuitry. In some of such embodiments (e.g., those employing MPEG-2 or other digital media encoding schemes), the application may be, for example, encoded and transmitted in an MPEG-2 object carousel with the MPEG audio and video packets of a program.

Each one of the user equipment device and the user equipment system can be implemented in a system as user television equipment, user computer equipment, a wireless user communications device, or any other type of user equipment suitable for accessing content, such as a non-portable gaming machine. For simplicity, these devices may be referred to herein collectively as user equipment or user equipment devices, and may be substantially similar to user equipment devices described above. User equipment devices, on which an application may be implemented, may function as stand-alone devices or may be part of a network of devices. Various network configurations of devices may be implemented and are discussed in more detail below.

A user equipment device utilizing at least some of the system features described above may not be classified solely as user television equipment, user computer equipment, or a wireless user communications device. For example, user television equipment may, like some user computer equipment, be Internet-enabled, allowing for access to Internet content, while user computer equipment may, like some television equipment, include a tuner allowing for access to television programming. Applications may have the same layout on various different types of user equipment or may be tailored to the display capabilities of the user equipment. For example, on user computer equipment, applications may be provided as a website accessed by a web browser. In another example, applications may be scaled down for wireless user communications devices.

In the system, there is typically more than one of each type of user equipment device, but only one of each is shown to avoid overcomplicating the drawing. In addition, each user may utilize more than one type of user equipment device and also more than one of each type of user equipment device.

In some embodiments, a user equipment device (e.g., user television equipment, user computer equipment, wireless user communications device) may be referred to as a “second screen device.” For example, a second screen device may supplement content presented on a first user equipment device. The content presented on the second screen device may be any suitable content that supplements the content presented on the first device. In some embodiments, the second screen device provides an interface for adjusting settings and display preferences of the first device. In some embodiments, the second screen device is configured for interacting with other second screen devices or for interacting with a social network. The second screen device can be located in the same room as the first device, a different room from the first device but in the same house or building, or in a different building from the first device. In some embodiments, speech may be monitored from any of the user equipment devices and communicated between the devices for processing as described herein.
