US-20250372083-A1

Performance Optimization for Real-Time Large Language Speech to Text Systems

Published: December 4, 2025
Assignee: not available in the USPTO data we have
Inventors: not available in the USPTO data we have
Technical Abstract

Methods and systems for transcribing communications are provided. Methods may include receiving a communication. Methods may include splitting the communication into a plurality of communication segments. Each communication segment may include two or more words. Methods may include transcribing each segment included in the plurality of communication segments, in parallel. The transcribing may include using a transformer neural network to transcribe each segment included in the plurality of communication segments. Methods may include generating a transcription from the transcribing. The transcription may be generated by combining the transcription of each of the communication segments into a combined transcription. Methods may include correcting the combined transcription.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for maintaining accuracy in transcribing a communication, the method comprising:

2. The method of wherein the communication occurs between a human caller and an interactive voice response system.

3. The method of wherein each segment comprises thirty seconds of the communication.

4. The method of wherein each segment comprises a snippet of less than thirty seconds of the communication.

Detailed Description

Complete technical specification and implementation details from the patent document.

This application is a continuation of U.S. patent application Ser. No. 18/204,981 filed on Jun. 2, 2023, and entitled “PERFORMANCE OPTIMIZATION FOR REAL-TIME LARGE LANGUAGE SPEECH TO TEXT SYSTEMS” which is hereby incorporated by reference herein in its entirety.

Aspects of the disclosure relate to speech to text systems.

Speech to text systems may be used by interactive voice response systems. The speech to text systems may transcribe the communications between human callers and the interactive voice response systems.

Available speech to text systems may be computationally expensive and introduce latency. Therefore, it may be desirable to split each communication into multiple segments. It would be further desirable to process each of the multiple segments in parallel.

It should be noted that a layer of accuracy may be lost in the splitting of each communication because legacy speech to text systems may not be designed to decipher phrases accurately. Rather, the legacy speech to text systems may only be designed to decipher an entire conversation accurately.

Therefore, it would be further desirable to utilize a combination processing system to combine the multiple segments and correct inaccuracies after the communication segments are transcribed.

Apparatus and methods for performance optimization that reduce resource consumption are provided. The method includes splitting the call into smaller chunks. The chunks may be thirty seconds long. The chunks may be shorter than thirty seconds. The chunks may be any other suitable size. The various chunks of the conversation may be processed in parallel. Although a layer of accuracy may be lost in the segmenting of the call, a combination process may correct inaccuracies after the conversation has been transcribed. It should be noted that legacy systems may be designed to decipher an entire conversation and not phrases.

Breaking down conversations into smaller chunks may enable rapid transcription, because the communication chunks may be transcribed using parallel processing techniques. Such a system may initially generate less than completely accurate responses. The system may also include a correction model that may fix mistakes after the parallel processing. The system may create a complete, more accurate transcription in shorter time periods and with less resource consumption than legacy systems.

Apparatus and methods for maintaining accuracy in transcribing a communication are provided. Methods may include receiving a communication. The communication may be received at a first environment. The first environment may include a processor. The first environment may include associated computing components. The processor and the associated computing components may be specialized for transcription.

The communication may occur between a human caller and an interactive voice response system. The communication may be an interaction between any two suitable parties. The communication may be a real-time communication. The communication may be a historical or recorded communication.

Methods may include transcribing the communication. The transcribing may generate a first transcription. The transcribing may occur using a robust speech recognition model. The robust speech recognition model may use a large-scale weak supervision model. The weak supervision model may be a machine learning model where noisy, limited, or imprecise sources may be used to provide a supervision signal for labeling large amounts of training data in a supervised learning setting.

The robust speech recognition via large-scale weak supervision model may include a transformer neural network. The transformer neural network may be a deep learning model that may use self-attention to identify a significance weight for each portion of the input data. Transformer neural networks may be used in natural language processing. Unlike recurrent neural networks, transformer neural networks may process the entire input in one complete iteration. The transformer neural network may take an input sequence and convert it into a vector called an encoding, and then decode it back into another sequence. Transformer neural networks may be used to solve sequence-to-sequence tasks and may be capable of processing long-range dependencies.
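The self-attention step described above can be illustrated with a minimal sketch. It assumes identity projections (queries, keys, and values are the input vectors themselves) and uses made-up toy vectors; a trained transformer would apply learned projection matrices and many stacked layers, so this is an illustration of the weighting idea only, not the model referred to in this disclosure.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Assumption: queries, keys, and values are the inputs themselves.
    Returns the mixed output vectors and the per-position weights.
    """
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position in the input.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)  # the "significance weight" per portion
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional input vectors (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(frames)
```

Every position attends to the whole input in a single pass, which is the contrast with recurrent networks drawn above.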

Methods may include receiving the communication in a second environment. The second environment may include a processor. The second environment may include associated computing components. The processor and the associated computing components may be specialized for transcription. Methods may include splitting the communication into a plurality of communication segments. Each communication segment may include two or more words. Each communication segment may include thirty seconds of the communication. Each communication segment may include less than thirty seconds of the communication. The communication may be split using a predetermined amount of time, such as thirty seconds, twenty seconds or other suitable time period. The communication may be split using a predetermined number of words, such as ten words, twenty words or other suitable number of words.
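The two splitting strategies mentioned above (a fixed time period or a fixed number of words) might be sketched as follows. The function names and the choice to merge a one-word tail into the previous segment are illustrative assumptions, not details taken from this disclosure.

```python
def split_by_time(duration_seconds, segment_seconds=30.0):
    """Return (start, end) boundaries covering the whole communication."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    boundaries = []
    start = 0.0
    while start < duration_seconds:
        # The final segment may be shorter than segment_seconds.
        end = min(start + segment_seconds, duration_seconds)
        boundaries.append((start, end))
        start = end
    return boundaries


def split_by_words(words, words_per_segment=10):
    """Split a word list into segments of a predetermined number of words."""
    if words_per_segment < 2:
        raise ValueError("each segment should hold two or more words")
    segments = [words[i:i + words_per_segment]
                for i in range(0, len(words), words_per_segment)]
    # Assumption: a trailing one-word segment is merged into its neighbour
    # so that every segment keeps at least two words.
    if len(segments) > 1 and len(segments[-1]) < 2:
        tail = segments.pop()
        segments[-1] = segments[-1] + tail
    return segments


time_segments = split_by_time(75, segment_seconds=30.0)
word_segments = split_by_words(["word"] * 21, words_per_segment=10)
```

For a 75-second communication, the time-based split yields two thirty-second segments followed by one fifteen-second segment.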

Methods may include identifying a number of communication segments included in the plurality of communication segments. Methods may include instantiating a plurality of instances of the robust speech recognition via large-scale weak supervision model. The integer number of instances, included in the plurality of instances, may be equivalent to the number of communication segments. As such, each communication segment may be assigned, or linked to, an instance of the model.

Methods may include assigning each communication segment to an instance of the robust speech recognition via large-scale weak supervision model. The instance may be one of the plurality of instances of the robust speech recognition via large-scale weak supervision model.

Methods may include transcribing each communication segment. The transcribing may include using parallel processing of the remaining instances. The transcribing may include using the assigned instance of the robust speech recognition via large-scale weak supervision model. The transcribing may transcribe the communication segment into a transcribed communication segment.
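One way to picture the per-segment instantiation and parallel transcription is the sketch below. `StubSpeechModel` is a hypothetical stand-in that simply joins already-known words; a real instance would decode audio. The use of a thread pool is also an assumption; a deployment might instead use separate processes or accelerators.

```python
from concurrent.futures import ThreadPoolExecutor


class StubSpeechModel:
    """Hypothetical stand-in for one instance of the speech recognition model."""

    def __init__(self, instance_id):
        self.instance_id = instance_id

    def transcribe(self, segment):
        # A real model would decode audio; this stub joins known words.
        return " ".join(segment)


def transcribe_in_parallel(segments):
    """Create one model instance per segment and run them concurrently."""
    # The number of instances equals the number of segments.
    instances = [StubSpeechModel(i) for i in range(len(segments))]
    with ThreadPoolExecutor(max_workers=max(1, len(segments))) as pool:
        futures = [pool.submit(instance.transcribe, segment)
                   for instance, segment in zip(instances, segments)]
        # Collect results in submission order so segment order is preserved.
        return [future.result() for future in futures]


segments = [["hello", "thank", "you"], ["for", "calling"],
            ["how", "can", "i", "help"]]
transcripts = transcribe_in_parallel(segments)
```

Collecting results in submission order matters here because the later combining step relies on the segments staying in their original sequence.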

Methods may include combining the transcribed communication segments into a combined transcription. Methods may include correcting the combined transcription using a domain-specific correction module. The domain-specific correction module may be specific to a discipline, such as a financial industry. As such, the correction module may be able to fine tune transcriptions that may be associated with a financial industry.
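The combining and correcting steps could look roughly like the following. The correction table is a hypothetical example for a financial setting, and a phrase lookup is only one simple possibility; this disclosure does not specify how the domain-specific correction module works internally.

```python
import re

# Hypothetical domain-specific corrections for a financial setting.
FINANCE_CORRECTIONS = {
    "roth ira": "Roth IRA",
    "a p r": "APR",
}


def combine(transcribed_segments):
    """Join transcribed segments, in order, into one combined transcription."""
    return " ".join(piece.strip() for piece in transcribed_segments
                    if piece.strip())


def correct(text, corrections=FINANCE_CORRECTIONS):
    """Replace known mis-transcriptions with the preferred domain terms."""
    for wrong, right in corrections.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text, flags=re.IGNORECASE)
    return text


pieces = ["i want to open a roth", "ira and check my a p r"]
combined = combine(pieces)
corrected = correct(combined)
```

In this toy input the term "roth ira" straddles a segment boundary, so it can only be repaired after the segments are combined, which illustrates why correction follows combination.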

Methods may include using the second environment to transcribe incoming communications. As such, the second environment may be used to transcribe communications as received in real-time and/or historical communications.

Methods may include a test environment. The test environment may include a processor. The test environment may include associated computing components. The processor and the associated computing components may be specialized for testing. The test environment may identify a first quantifiable resources value. The first quantifiable resources value may be calculated based on the amount of resources (or number of processor cycles) consumed by transcribing the communication using the robust speech recognition via large-scale weak supervision model.

The test environment may identify a first accuracy level. The first accuracy level may be the level of accuracy of the first transcription. The test environment may identify a second quantifiable resources value.

The second quantifiable resources value may be calculated based on the amount of resources consumed by transcribing the communication using the plurality of instances of the robust speech recognition via large-scale weak supervision model.

The test environment may determine that the first quantifiable resources value is greater than the second quantifiable resources value. The first quantifiable resources value may be greater than the second quantifiable resources value by over a predetermined resources value threshold.

The test environment may guarantee that the first quantifiable resources value is greater than the second quantifiable resources value. The test environment may guarantee or confirm that the first quantifiable resources value is greater than the second quantifiable resources value by over a predetermined resources value threshold.

The test environment may identify a second accuracy level. The second accuracy level may be the level of accuracy of the combined transcription.

The test environment may determine that the first accuracy level is greater than the second accuracy level. The first accuracy level may be greater than the second accuracy level by over a predetermined accuracy level threshold.

The test environment may guarantee or confirm that the first accuracy level is greater than the second accuracy level. The test environment may guarantee that the first accuracy level is greater than the second accuracy level by over a predetermined accuracy level threshold.
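This disclosure does not state how an accuracy level is measured. As one possible assumption, the sketch below scores a transcription against a reference transcript using word error rate (word-level edit distance divided by reference length) and then applies a threshold comparison of the kind described above. The sentences and the threshold are made-up values.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1] / len(ref)


def accuracy_level(reference, hypothesis):
    """Assumed accuracy metric: 1 minus word error rate, floored at 0."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


reference = "please transfer funds to my savings account"
first_accuracy = accuracy_level(reference, reference)
second_accuracy = accuracy_level(
    reference, "please transfer fund to my saving account")

ACCURACY_THRESHOLD = 0.1  # hypothetical predetermined threshold
exceeds_threshold = first_accuracy - second_accuracy > ACCURACY_THRESHOLD
```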

The test environment may identify a third quantifiable resources value. The third quantifiable resources value may be calculated based on an amount of resources (or number of processor cycles) consumed by correcting the combined transcription using the domain-specific correction module.

The test environment may identify a third accuracy level. The third accuracy level may be based on the level of accuracy of the combined transcription upon completion of correcting the combined transcription using the domain-specific correction module.

The test environment may determine that the third accuracy level is equivalent to, or greater than, the first accuracy level.

The test environment may guarantee that the third accuracy level is equivalent to, or greater than, the first accuracy level.

The test environment may identify a fourth quantifiable resources value. The fourth quantifiable resources value may be calculated based on, and may include, the second quantifiable resources value and the third quantifiable resources value.

The test environment may determine that the fourth quantifiable resources value is less than the first quantifiable resources value. The test environment may guarantee, or confirm, that the fourth quantifiable resources value is less than the first quantifiable resources value.
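The comparisons performed by the test environment can be summarized in a short sketch. It assumes the fourth resources value is the sum of the second (parallel transcription) and third (correction) values, and every number below is a made-up illustration rather than a measured result.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    resources: float  # e.g. processor cycles consumed
    accuracy: float   # e.g. fraction of words transcribed correctly


def evaluate(first, second, correction_resources, third_accuracy,
             resources_threshold, accuracy_threshold):
    """Run the test-environment checks and return each outcome by name."""
    # Assumption: fourth value = parallel transcription cost + correction cost.
    fourth_resources = second.resources + correction_resources
    return {
        "parallel_saves_resources":
            first.resources - second.resources > resources_threshold,
        "segmenting_lost_accuracy":
            first.accuracy - second.accuracy > accuracy_threshold,
        "correction_restores_accuracy": third_accuracy >= first.accuracy,
        "fourth_resources": fourth_resources,
        "pipeline_cheaper_overall": fourth_resources < first.resources,
    }


# Hypothetical measurements for illustration only.
report = evaluate(
    first=Measurement(resources=1000, accuracy=0.95),   # single model
    second=Measurement(resources=400, accuracy=0.85),   # parallel instances
    correction_resources=200,                           # third value
    third_accuracy=0.96,
    resources_threshold=300,
    accuracy_threshold=0.05,
)
```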

Apparatus and methods for maintaining accuracy in transcribing a communication are provided. Systems may include a first environment. The first environment may include a processor. The first environment may include associated computing components. The processor and the associated computing components may be specialized for transcription.

The first environment may include a transformer neural network. The transformer neural network may be a deep learning model that may use self-attention to identify a significance weight for each portion of the input data. Transformer neural networks may be used in natural language processing. Unlike recurrent neural networks, transformer neural networks may process the entire input in one complete iteration. The transformer can take an input sequence and convert it into a vector called an encoding, and then decode it back into another sequence. Transformer neural networks may be used to solve sequence-to-sequence tasks and may be capable of processing long-range dependencies.

The transformer neural network may receive an audio communication. The audio communication may occur between a human caller and an interactive voice response system. The audio communication may be any suitable audio communication. The transformer neural network may transcribe the audio communication into a first transcription.

Systems may include a second environment. The second environment may include a processor. The second environment may include associated computing components. The processor and the associated computing components may be specialized for transcription. Systems may include using the second environment to transcribe incoming communications.

The second environment may include a receiver. The receiver may receive the audio communication. The audio communication may be a real-time audio communication and/or a recorded or historical audio communication. The second environment may include a segmentation model. The segmentation model may segment the audio communication into a plurality of communication segments. Each communication segment may include thirty seconds or less of the audio communication.
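For audio held as a sequence of samples, the thirty-seconds-or-less constraint could be enforced by slicing on sample counts, as in this minimal sketch. The sample rate and the fixed-length slicing are assumptions; the segmentation model described here may choose boundaries differently, for example at pauses.

```python
def segment_audio(samples, sample_rate_hz, max_segment_seconds=30):
    """Slice audio samples into segments of at most max_segment_seconds."""
    samples_per_segment = sample_rate_hz * max_segment_seconds
    if samples_per_segment <= 0:
        raise ValueError("sample rate and segment length must be positive")
    return [samples[start:start + samples_per_segment]
            for start in range(0, len(samples), samples_per_segment)]


SAMPLE_RATE_HZ = 8000                 # hypothetical telephony sample rate
audio = [0] * (SAMPLE_RATE_HZ * 70)   # 70 seconds of silence as toy input
audio_segments = segment_audio(audio, SAMPLE_RATE_HZ)
segment_lengths_seconds = [len(s) / SAMPLE_RATE_HZ for s in audio_segments]
```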

The second environment may include a transcriber. The transcriber may be a transcription module operating on a processor. The transcriber may instantiate an instance of a transformer neural network for each communication segment included in the plurality of communication segments. The transcriber may transcribe, using parallel processing, each communication segment. Each communication segment may be included in the plurality of communication segments. The transcriber may transcribe using the instance of the transformer neural network instantiated for the communication segment. The transcriber may combine the transcribed communication segments into a combined transcription. The transcriber may correct the combined transcription using a domain-specific corrector.

Systems may include a test environment. The test environment may include a processor. The test environment may include associated computing components. The processor and the associated computing components may be specialized for testing. The test environment may identify a first quantifiable resources value. The first quantifiable resources value may be calculated based on the amount of resources (number of processor cycles) consumed by transcribing the communication using the robust speech recognition via large-scale weak supervision model.

The test environment may identify a first accuracy level. The first accuracy level may be the level of accuracy of the first transcription. The test environment may identify a second quantifiable resources value.

The second quantifiable resources value may be calculated based on the amount of resources (number of processor cycles) consumed by transcribing the communication using the plurality of instances of the robust speech recognition via large-scale weak supervision model.

The test environment may determine that the first quantifiable resources value is greater than the second quantifiable resources value. The first quantifiable resources value may be greater than the second quantifiable resources value by over a predetermined resources value threshold.

The test environment may identify a second accuracy level. The second accuracy level may be the level of accuracy of the combined transcription.

The test environment may determine that the first accuracy level is greater than the second accuracy level. The first accuracy level may be greater than the second accuracy level by over a predetermined accuracy level threshold.

The test environment may identify a third quantifiable resources value. The third quantifiable resources value may be calculated based on an amount of resources consumed by correcting the combined transcription using the domain-specific correction module.

The test environment may identify a third accuracy level. The third accuracy level may be based on the level of accuracy of the combined transcription upon completion of correcting the combined transcription using the domain-specific correction module.

The test environment may determine that the third accuracy level is equivalent to or greater than the first accuracy level.

The test environment may identify a fourth quantifiable resources value. The fourth quantifiable resources value may be calculated based on, and may include, the second quantifiable resources value and the third quantifiable resources value.

The test environment may determine that the fourth quantifiable resources value is less than the first quantifiable resources value.

Apparatus and methods described herein are illustrative. Apparatus and methods in accordance with this disclosure will now be described in connection with the figures, which form a part hereof. The figures show illustrative features of apparatus and method steps in accordance with the principles of this disclosure. It is to be understood that other embodiments may be utilized and that structural, functional and procedural modifications may be made without departing from the scope and spirit of the present disclosure.



Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “PERFORMANCE OPTIMIZATION FOR REAL-TIME LARGE LANGUAGE SPEECH TO TEXT SYSTEMS” (US-20250372083-A1). https://patentable.app/patents/US-20250372083-A1
