US-8521523

Selecting speech data for speech recognition vocabulary

Published: August 27, 2013

Assignee: not available in the USPTO data

Inventors: not available in the USPTO data

Technical Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting training data. In one aspect, a method comprises: selecting a target out of vocabulary rate; selecting a target percentage of user sessions; and determining a minimum training data freshness for a vocabulary of words, the minimum training data freshness corresponding to the target percentage of user sessions experiencing the target out of vocabulary rate.

Patent Claims

12 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A computer-implemented method comprising: selecting, by one or more computers, a target out of vocabulary rate that indicates a rate at which a word included in a search query is not included in a vocabulary; selecting, by one or more computers, a target percentage of user sessions, wherein the target percentage represents a percentage of user sessions that include search queries that include words that are included in a vocabulary of words at a rate that satisfies the target out of vocabulary rate; obtaining, by one or more computers, a training set of user sessions, each user session in the training set comprising one or more search queries that each include one or more words; and determining, by one or more computers, and based on the training set, a minimum training data freshness for a vocabulary of words, the minimum training data freshness corresponding to at least the target percentage of the user sessions in the training set experiencing the target out of vocabulary rate.

Plain English Translation

The method has computers select a target out-of-vocabulary (OOV) rate, meaning the rate at which words in a search query are missing from the vocabulary. It also selects a target percentage of user sessions, meaning the share of sessions whose queries should meet that OOV rate. The method then obtains a training set of user sessions, each containing one or more search queries, and uses it to determine the minimum "training data freshness" (how recent the vocabulary's training data must be) at which at least the target percentage of sessions in the training set experience the target OOV rate. In practice, this indicates how often the vocabulary needs to be refreshed.
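The steps of claim 1 can be illustrated with a minimal Python sketch. This is not the patent's implementation: the names `session_oov_rate`, `fraction_meeting_target`, and `minimum_freshness` are hypothetical, freshness is expressed as an age in days, candidate vocabularies are assumed to be precomputed per age, and the scan assumes that staler vocabularies never perform better than fresher ones.

```python
from typing import Dict, List, Optional, Set

# A session is a list of queries; a query is a list of words (assumed representation).
Session = List[List[str]]


def session_oov_rate(session: Session, vocab: Set[str]) -> float:
    """Fraction of the words in a session that are missing from the vocabulary."""
    words = [word for query in session for word in query]
    if not words:
        return 0.0
    missing = sum(1 for word in words if word not in vocab)
    return missing / len(words)


def fraction_meeting_target(sessions: List[Session], vocab: Set[str],
                            target_oov: float) -> float:
    """Share of sessions whose OOV rate is at or below the target."""
    if not sessions:
        return 0.0
    ok = sum(1 for s in sessions if session_oov_rate(s, vocab) <= target_oov)
    return ok / len(sessions)


def minimum_freshness(sessions: List[Session],
                      vocab_by_age_days: Dict[int, Set[str]],
                      target_oov: float,
                      target_fraction: float) -> Optional[int]:
    """Return the stalest vocabulary age (in days) that still meets both targets.

    Ages are scanned from freshest to stalest and the scan stops at the first
    failure (assumption: quality does not improve as the data gets older).
    Returns None if even the freshest vocabulary misses the targets.
    """
    best = None
    for age in sorted(vocab_by_age_days):
        met = fraction_meeting_target(sessions, vocab_by_age_days[age], target_oov)
        if met >= target_fraction:
            best = age
        else:
            break
    return best
```

For example, with three sessions and vocabularies aged 0, 7, and 30 days, a target OOV rate of 1% and a target of 60% of sessions, the function returns the oldest age whose vocabulary still keeps at least 60% of sessions at or below 1% OOV.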

Claim 2

Original Legal Text

2. The method of claim 1, wherein the vocabulary comprises a plurality of unique words that were included in search queries received during previous search sessions.

Plain English Translation

The method of claim 1 (selecting a target out-of-vocabulary rate and a target percentage of user sessions, obtaining a training set of user sessions, and determining the minimum training data freshness that lets the target percentage of sessions experience the target rate), where the vocabulary is made up of unique words that appeared in search queries received during previous search sessions.

Claim 3

Original Legal Text

3. The method of claim 2, wherein the training data freshness indicates a length of time between an end of a first period of time during which the previous search sessions occurred and a beginning of a second period of time during which the training data freshness is determined.

Plain English Translation

The method of claim 2, where "training data freshness" is a length of time: the interval between the end of a first period, during which the previous search sessions that supplied the vocabulary's words occurred, and the beginning of a second period, during which the freshness is determined. In effect, it measures how old the training data is considered to be.
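The interval in claim 3 can be sketched as a simple date difference. The function name `training_data_freshness` and the use of calendar dates are assumptions made for illustration; the claim itself only requires a length of time between the two periods.

```python
from datetime import date, timedelta


def training_data_freshness(training_period_end: date,
                            evaluation_period_start: date) -> timedelta:
    """Time elapsed between the end of the period in which the training
    sessions occurred and the start of the period in which freshness is
    determined (hypothetical helper, not from the patent text)."""
    if evaluation_period_start < training_period_end:
        raise ValueError("evaluation period must not start before training ends")
    return evaluation_period_start - training_period_end
```

A vocabulary built from sessions ending on January 1 and assessed starting January 8 would have a freshness of seven days under this definition.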

Claim 4

Original Legal Text

4. The method of claim 1, further comprising: generating, from a plurality of unique words that were included in search queries received during search sessions, a speech recognition vocabulary having a training data freshness equal to or more recent than the minimum training data freshness; and providing the speech recognition vocabulary to a data processing apparatus that operates a speech recognition service.

Plain English Translation

The method of claim 1, with two further steps. First, it generates a speech recognition vocabulary from unique words that appeared in search queries received during search sessions, such that the vocabulary's training data freshness is equal to or more recent than the minimum training data freshness. Second, it provides that vocabulary to a data processing apparatus that operates a speech recognition service.
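The vocabulary-generation step of claim 4 might look like the following sketch. The function `build_vocabulary`, the `(date, query text)` log format, lowercasing, and whitespace tokenization are all assumptions for illustration; handing the result to a speech recognition service is outside the scope of the sketch.

```python
from datetime import date, timedelta
from typing import Iterable, Set, Tuple


def build_vocabulary(logged_queries: Iterable[Tuple[date, str]],
                     use_date: date,
                     min_freshness: timedelta) -> Set[str]:
    """Collect the unique words from logged queries, refusing to build a
    vocabulary whose training data is staler than the minimum freshness.

    Staleness is measured from the most recent logged query to use_date.
    """
    queries = list(logged_queries)
    if not queries:
        raise ValueError("no training queries")
    training_end = max(day for day, _ in queries)
    if use_date - training_end > min_freshness:
        raise ValueError("training data is staler than the minimum freshness")
    # Lowercasing and whitespace splitting are simplifying assumptions.
    return {word for _, text in queries for word in text.lower().split()}
```

With queries logged through August 20 and a vocabulary put into use on August 24, a seven-day minimum freshness is satisfied, while a two-day minimum is not and the function raises an error.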

Claim 5

Original Legal Text

5. A system comprising: a data processing apparatus; and a storage device storing instructions executable by the data processing apparatus that, upon execution by the data processing apparatus, cause the data processing apparatus to perform operations comprising: selecting a target out of vocabulary rate that indicates a rate at which a word included in a search query is not included in a vocabulary; selecting a target percentage of user sessions, wherein the target percentage represents a percentage of user sessions that include search queries that include words that are included in a vocabulary of words at a rate that satisfies the target out of vocabulary rate; obtaining a training set of user sessions, each user session in the training set comprising one or more search queries that each include one or more words; and determining, based on the training set, a minimum training data freshness for a vocabulary of words, the minimum training data freshness corresponding to at least the target percentage of the user sessions in the training set experiencing the target out of vocabulary rate.

Plain English Translation

A system with a data processing apparatus and a storage device whose stored instructions carry out the same steps as claim 1. The system selects a target out-of-vocabulary (OOV) rate (the rate at which words in a search query are missing from the vocabulary) and a target percentage of user sessions whose queries should meet that rate. It obtains a training set of user sessions with search queries, then determines the minimum "training data freshness" (how recent the vocabulary's training data must be) at which at least the target percentage of sessions in the training set experience the target OOV rate. In practice, this indicates how often the vocabulary needs to be refreshed.

Claim 6

Original Legal Text

6. The system of claim 5, wherein the vocabulary comprises a plurality of unique words that were included in search queries received during previous search sessions.

Plain English Translation

The system of claim 5 (a data processing apparatus and storage device that select a target out-of-vocabulary rate and a target percentage of user sessions, obtain a training set of user sessions, and determine the minimum training data freshness that lets the target percentage of sessions experience the target rate), where the vocabulary is made up of unique words that appeared in search queries received during previous search sessions.

Claim 7

Original Legal Text

7. The system of claim 6, wherein the training data freshness indicates a length of time between an end of a first period of time during which the previous search sessions occurred and a beginning of a second period of time during which the training data freshness is determined.

Plain English Translation

The system of claim 6, where "training data freshness" is a length of time: the interval between the end of the period during which the previous search sessions occurred and the beginning of the period during which the freshness is determined. In effect, it measures how old the training data is considered to be.

Claim 8

Original Legal Text

8. The system of claim 5, wherein the operations further comprise: generating, from a plurality of unique words that were included in search queries received during search sessions, a speech recognition vocabulary having a training data freshness equal to or more recent than the minimum training data freshness; and providing the speech recognition vocabulary to a data processing apparatus that operates a speech recognition service.

Plain English Translation

The system of claim 5, with two further operations. First, it generates a speech recognition vocabulary from unique words that appeared in search queries received during search sessions, such that the vocabulary's training data freshness is equal to or more recent than the minimum training data freshness. Second, it provides that vocabulary to a data processing apparatus that operates a speech recognition service.

Claim 9

Original Legal Text

9. A computer readable storage device encoded with a computer program, the program comprising instructions that when executed by a data processing apparatus cause the data processing apparatus to perform operations comprising: selecting a target out of vocabulary rate that indicates a rate at which a word included in a search query is not included in a vocabulary; selecting a target percentage of user sessions, wherein the target percentage represents a percentage of user sessions that include search queries that include words that are included in a vocabulary of words at a rate that satisfies the target out of vocabulary rate; obtaining a training set of user sessions, each user session in the training set comprising one or more search queries that each include one or more words; and determining, based on the training set, a minimum training data freshness for a vocabulary of words, the minimum training data freshness corresponding to at least the target percentage of the user sessions in the training set experiencing the target out of vocabulary rate.

Plain English Translation

A computer-readable storage device holds a program that, when executed, carries out the same steps as claim 1. The program selects a target out-of-vocabulary (OOV) rate (the rate at which words in a search query are missing from the vocabulary) and a target percentage of user sessions whose queries should meet that rate. It obtains a training set of user sessions with search queries, then determines the minimum "training data freshness" (how recent the vocabulary's training data must be) at which at least the target percentage of sessions in the training set experience the target OOV rate. In practice, this indicates how often the vocabulary needs to be refreshed.

Claim 10

Original Legal Text

10. The computer storage device of claim 9, wherein the vocabulary comprises a plurality of unique words that were included in search queries received during previous search sessions.

Plain English Translation

The computer storage device of claim 9 (holding a program that selects a target out-of-vocabulary rate and a target percentage of user sessions, obtains a training set of user sessions, and determines the minimum training data freshness that lets the target percentage of sessions experience the target rate), where the vocabulary is made up of unique words that appeared in search queries received during previous search sessions.

Claim 11

Original Legal Text

11. The computer storage device of claim 10, wherein the training data freshness indicates a length of time between an end of a first period of time during which the previous search sessions occurred and a beginning of a second period of time during which the training data freshness is determined.

Plain English Translation

The computer storage device of claim 10, where "training data freshness" is a length of time: the interval between the end of the period during which the previous search sessions occurred and the beginning of the period during which the freshness is determined. In effect, it measures how old the training data is considered to be.

Claim 12

Original Legal Text

12. The computer storage device of claim 9, wherein the operations further comprise: generating, from a plurality of unique words that were included in search queries received during search sessions, a speech recognition vocabulary having a training data freshness equal to or more recent than the minimum training data freshness; and providing the speech recognition vocabulary to a data processing apparatus that operates a speech recognition service.

Plain English Translation

The computer storage device of claim 9, with two further operations. First, the program generates a speech recognition vocabulary from unique words that appeared in search queries received during search sessions, such that the vocabulary's training data freshness is equal to or more recent than the minimum training data freshness. Second, it provides that vocabulary to a data processing apparatus that operates a speech recognition service.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G10L G06F

Patent Metadata

Filing Date

August 24, 2012

Publication Date

August 27, 2013
