Legal claims defining the scope of protection, as filed with the USPTO.
1. A system for dynamic selection of speech recognition engines, the system comprising:
  a memory having computer readable instructions stored thereon; and
  at least one processor configured to execute the computer readable instructions to:
    receive a plurality of audio files, the plurality of audio files comprising a plurality of sample training audio files and at least one run-time audio file;
    segment the at least one run-time audio file into a plurality of blocks;
    compute at least one first variable value for a first block of the at least one run-time audio file and at least one second variable value for a subsequent block of the at least one run-time audio file following the first block, wherein the at least one first variable value and the at least one second variable value are at least one of a language, an accent, a business domain, a context, a rate of speech, or a noise level;
    dynamically select, based on the at least one first variable value, a first speech recognition engine from among a set of available speech recognition engines for the first block of the plurality of blocks of the at least one run-time audio file;
    process the first block of the plurality of blocks using the first speech recognition engine;
    determine an accuracy measure of the processing of the first block using the first speech recognition engine; and
    dynamically select, based on the at least one second variable value and the accuracy measure of the processing of the first block, a second speech recognition engine from among the set of available speech recognition engines for the subsequent block of the plurality of blocks of the at least one run-time audio file, the second speech recognition engine being different from the first speech recognition engine; and
    transcribe the at least one run-time audio file using the first and second speech recognition engines.
2. The system of claim 1, wherein the at least one processor is configured to compute the at least one first variable value and the at least one second variable value by: analyzing the plurality of sample training audio files to determine a plurality of influencing variables associated with each sample training audio file of the plurality of sample training audio files; transcribing each sample training audio file of the plurality of sample training audio files into a corresponding text file, the text file generated using a speech recognition engine selected from the set of available speech recognition engines; and generating an accuracy map for each speech recognition engine of the set of available speech recognition engines based on the plurality of influencing variables.
3. The system of claim 2, wherein the at least one processor is configured to analyze the plurality of sample training audio files by: dividing each sample training audio file into a plurality of sample training audio file segments; and determining the plurality of influencing variables for each sample training audio file segment of the plurality of sample training audio file segments.
4. The system of claim 3, wherein the at least one processor is further configured to: compute a variable value for each sample training audio file based on the plurality of influencing variables for each sample training audio file segment of the corresponding sample training audio file.
5. The system of claim 2, wherein the at least one processor is configured to generate the accuracy map by computing an accuracy index for each available speech recognition engine, the accuracy index based on the transcribed sample training audio file corresponding to each available speech recognition engine.
6. The system of claim 2, wherein the at least one processor is configured to dynamically select the first speech recognition engine based on the at least one first variable value and the accuracy map.
7. The system of claim 1, wherein the at least one processor is configured to dynamically select the second speech recognition engine for the subsequent block of the plurality of blocks by: selecting a third speech recognition engine from the set of available speech recognition engines based on the accuracy measure of the processing of the first block; and re-processing the first block using the third speech recognition engine.
8. The system of claim 7, wherein the at least one processor is configured to dynamically select the second speech recognition engine for the subsequent block of the plurality of blocks by: selecting the third speech recognition engine for the subsequent block based on re-processing of the first block using the third speech recognition engine; and processing the subsequent block using the third speech recognition engine.
9. The system of claim 1, wherein the at least one processor is configured to segment the at least one run-time audio file into the plurality of blocks based on one or more attributes, wherein the one or more attributes comprise at least one of silent periods, time intervals, noise level, or combinations thereof.
10. The system of claim 1, wherein the at least one first variable value comprises a first language, the at least one second variable value comprises a second language that is different from the first language, the first speech recognition engine corresponds to the first language, and the second speech recognition engine corresponds to the second language.
11. A method for dynamic selection of speech recognition engines, the method comprising:
  receiving, using at least one processor, a plurality of audio files, the plurality of audio files comprising a plurality of sample training audio files and at least one run-time audio file;
  segmenting the at least one run-time audio file into a plurality of blocks;
  computing, using the at least one processor, at least one first variable value for a first block of the at least one run-time audio file and at least one second variable value for a subsequent block of the at least one run-time audio file following the first block, wherein the at least one first variable value and the at least one second variable value are at least one of a language, an accent, a business domain, a context, a rate of speech, or a noise level;
  dynamically selecting, using the at least one processor, based on the at least one first variable value, a first speech recognition engine from among a set of available speech recognition engines for the first block of the plurality of blocks of the at least one run-time audio file;
  processing, using the at least one processor, the first block of the plurality of blocks using the first speech recognition engine;
  determine, using the at least one processor, an accuracy measure of the processing of the first block using the first speech recognition engine; and
  dynamically selecting, using the at least one processor, and based on the at least one second variable value and the accuracy measure of the processing of the first block, a second speech recognition engine from among the set of available speech recognition engines for the subsequent block of the plurality of blocks of the at least one run-time audio file, the second speech recognition engine being different from the first speech recognition engine; and
  transcribing, using the at least one processor, the at least one run-time audio file using the first and second speech recognition engines.
12. The method of claim 11, wherein the computing the at least one first variable value and the at least one second variable value comprises: analyzing the plurality of sample training audio files to determine a plurality of influencing variables associated with each sample training audio file of the plurality of sample training audio files; transcribing each sample training audio file of the plurality of sample training audio files into a corresponding text file, the text file generated using a speech recognition engine selected from the set of available speech recognition engines; and generating an accuracy map for each speech recognition engine of the set of available speech recognition engines based on the plurality of influencing variables.
13. The method of claim 12, wherein the analyzing the plurality of sample training audio files comprises: dividing each sample training audio file into a plurality of sample training audio file segments; and determining the plurality of influencing variables for each sample training audio file segment of the plurality of sample training audio file segments.
14. The method of claim 13, further computing, using the at least one processor, a variable value for each sample training audio file based on the plurality of influencing variables for each sample training audio file segment of the corresponding sample training audio file.
15. The method of claim 12, wherein generating the accuracy map comprises: computing an accuracy index for each available speech recognition engine, the accuracy index based on the transcribed sample training audio file corresponding to each available speech recognition engine.
16. The method of claim 12, wherein dynamically selecting the first speech recognition engine is based on the at least one first variable value and the accuracy map.
17. The method of claim 11, wherein dynamically selecting the second speech recognition engine for the subsequent block of the plurality of blocks further includes: selecting a third speech recognition engine from the set of available speech recognition engines based on the accuracy measure of the processing of the first block; and re-processing the first block using the third speech recognition engine.
18. The method of claim 17, wherein dynamically selecting the second speech recognition engine for the subsequent block of the plurality of blocks further includes: selecting the third speech recognition engine for the subsequent block based on re-processing of the first block using the third speech recognition engine; and processing the subsequent block using the third speech recognition engine.
19. The method of claim 11, wherein segmenting the at least one run-time audio file into the plurality of blocks is based on one or more attributes, wherein the one or more attributes comprise at least one of silent periods, time intervals, noise level, or combinations thereof.
20. The method of claim 11, wherein the at least one first variable value comprises a first language, the at least one second variable value comprises a second language that is different from the first language, the first speech recognition engine corresponds to the first language, and the second speech recognition engine corresponds to the second language.
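For readers who find the claim language dense, the flow recited in claims 1 and 11 (select an engine per block, measure accuracy, let that measure influence the next selection) and the accuracy map of claims 2, 5, 12, and 15 can be loosely illustrated in Python. This is only a sketch under stated assumptions, not the patented implementation: the engine interface, the use of a word-similarity ratio as the accuracy index, the confidence threshold, and every name below (`build_accuracy_map`, `select_engine`, `transcribe`, `make_toy_engine`) are hypothetical. Audio is represented by tagged strings so the example runs without any speech library.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical engine interface: audio -> (transcript, confidence in [0, 1]).


def build_accuracy_map(engines, samples):
    """Map (engine name, variable value) to an average accuracy index.

    samples are (audio, variable value, reference transcript) triples.
    Assumption: the accuracy index is word-level similarity to a reference.
    """
    scores = defaultdict(list)
    for audio, value, reference in samples:
        for name, engine in engines.items():
            text, _ = engine(audio)
            ratio = SequenceMatcher(None, text.split(), reference.split()).ratio()
            scores[(name, value)].append(ratio)
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}


def select_engine(acc_map, value, exclude=frozenset()):
    """Pick the engine with the highest accuracy index for this variable value."""
    candidates = {name: idx for (name, v), idx in acc_map.items()
                  if v == value and name not in exclude}
    if not candidates:  # every engine was excluded: ignore the exclusion
        candidates = {name: idx for (name, v), idx in acc_map.items() if v == value}
    return max(candidates, key=candidates.get)


def transcribe(blocks, engines, acc_map, threshold=0.6):
    """blocks are (audio, variable value) pairs; returns (engine name, text) pairs.

    The accuracy measure of one block (here, the engine's own confidence)
    feeds into the selection for the following block.
    """
    results = []
    rejected = set()  # engine whose previous accuracy measure was too low
    for audio, value in blocks:
        name = select_engine(acc_map, value, exclude=rejected)
        text, confidence = engines[name](audio)
        rejected = {name} if confidence < threshold else set()
        results.append((name, text))
    return results


def make_toy_engine(language):
    """Toy engine: 'recognizes' audio strings tagged with its own language."""
    def engine(audio):
        tag, _, words = audio.partition("|")
        return (words, 0.9) if tag == language else ("???", 0.2)
    return engine


ENGINES = {"engine_en": make_toy_engine("en"), "engine_es": make_toy_engine("es")}
TRAINING = [("en|hello world", "en", "hello world"),
            ("es|hola mundo", "es", "hola mundo")]
ACC_MAP = build_accuracy_map(ENGINES, TRAINING)

# A run-time file whose first block is English and second block is Spanish.
print(transcribe([("en|good morning", "en"), ("es|buenos dias", "es")],
                 ENGINES, ACC_MAP))
```

In this sketch the variable value is a language label, which corresponds most closely to claims 10 and 20; an accent, business domain, rate of speech, or noise level could be substituted as the map key. The re-processing of the first block with a third engine (claims 7, 8, 17, and 18) is not modeled here.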
August 10, 2021