Phone-Based Sub-Word Units for End-to-End Speech Recognition

Published: May 10, 2022

Assignee: not available in USPTO data

Inventors: Weiran Wang, Yingbo Zhou, Caiming Xiong

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for converting a spoken utterance into a text word, comprising: receiving the spoken utterance at an ensemble byte pair encoding (BPE) system, wherein the BPE system includes a phone BPE system and a character BPE system; identifying one or more first words using the phone BPE system; determining a first score for each of the one or more first words; converting each of the one or more first words into a character sequence; identifying, using the character BPE system, one or more second words from each character sequence; determining a second score for each second word; for each first word that matches a second word in the one or more second words, combining the first score of the first word and the second score of the second word; and determining the text word as a word that corresponds to a highest combined score.

2. The method of claim 1, wherein the identifying the one or more first words further comprising: traversing a prefix tree of a language model to identify one or more sub-words until a word boundary is met, wherein the word boundary is met when pronunciation of the one or more sub-words matches pronunciation of the one or more words at a node in the prefix tree; and identifying the one or more words at the node in the prefix tree as the first one or more words.

3. The method of claim 2, wherein the language model is a recurrent neural network.

4. The method of claim 1, wherein the phone BPE system includes a multi-level language model that identifies the one or more first words and wherein the multi-level language model includes a sub-word language model and a word language model.

5. The method of claim 4, wherein the phone BPE system includes an acoustic model and further comprising determining the first score for the each of the one or more first words after the multi-level language model identifies the one or more first words.

6. The method of claim 1, wherein the character BPE system includes a multi-level language model and further comprising: traversing a prefix tree of the multi-level language model to identify one or more sub-words until a word boundary is met, wherein the word boundary is met when pronunciation of the one or more sub-words matches pronunciation of the one or more words at a node in the prefix tree; and identifying the one or more words at the node in the prefix tree as the second one or more words.

7. The method of claim 6, wherein the multi-level language model is a recurrent neural network.

8. The method of claim 6, wherein the character BPE system includes an acoustic model, and further comprising determining the second score for each of the one or more second words after the multi-level language model identifies the one or more second words.

9. A system for converting a spoken utterance into a text word, comprising: an ensemble byte pair encoding (BPE) system that includes a phone BPE system and a character BPE system and configured to receiving the spoken utterance; the phone BPE system configured to: identify one or more first words; and determine a first score for each of the one or more first words; an encoder configured to convert each of the one or more first words into a character sequence; the character BPE system configured to: identify one or more second words from each character sequence; determine a second score for each second word in the one or more second words; and for each first word that matches a second word in the one or more second words, combine the first score of the first word and the second score of the second word; and the phone BPE system further configured to determine the text word as a word that corresponds to a highest combined score.

10. The system of claim 9, wherein to identify the one or more first words the phone BPE system is further configured to: traverse a prefix tree of a multi-level language model to identify one or more sub-words until a word boundary is met, wherein the word boundary is met when pronunciation of the one or more sub-words matches pronunciation of the one or more words at a node in the prefix tree; and identify the one or more words at the node in the prefix tree as the first one or more words.

11. The system of claim 10, wherein the multi-level language model is a recurrent neural network.

12. The system of claim 9, wherein the phone BPE system includes a multi-level language model that identifies the one or more first words, wherein the multi-level language model includes a sub-word language model and a word language model.

13. The system of claim 12, wherein the phone BPE system includes an acoustic model that is configured to determine the first score for each of the one or more first words after the multi-level language model identifies the one or more first words.

14. The system of claim 9, wherein the character BPE system includes a multi-level language model and is further configured to: traverse a prefix tree of the multi-level language model to identify one or more sub-words until a word boundary is met, wherein the word boundary is met when pronunciation of the one or more sub-words matches pronunciation of the one or more words at a node in the prefix tree; and identify the one or more words at the node in the prefix tree as the second one or more words.

15. The system of claim 14, wherein the multi-level language model is a recurrent neural network.

16. The system of claim 14, wherein the character BPE system includes an acoustic model and the acoustic model is configured to determine the second score for each of the one or more second words after the multi-level language model identifies the one or more second words.

17. A non-transitory machine-readable medium having stored thereon machine-readable instructions executable to cause a machine to perform operations for converting a spoken utterance into a text word, comprising: receiving the spoken utterance at an ensemble byte pair encoding (BPE) system, wherein the BPE system includes a phone BPE system that includes at least one first neural network and a character BPE system that includes at least one second neural network; identifying one or more first words using the phone BPE system; determining a first score for each of the one or more first words; converting each of the one or more first words into a character sequence; identifying, using the character BPE system, one or more second words from each character sequence; determining a second score for each second word; for each first word that matches a second word in the one or more second words, combining the first score of the first word and the second score of the second word; and determining the text word as a word that corresponds to a highest combined score.

18. The non-transitory machine-readable medium of claim 17, wherein the identifying the one or more first words further causes the machine to perform operations comprising: traversing a prefix tree of a multi-level language model to identify one or more sub-words until a word boundary is met, wherein the word boundary is met when pronunciation of the one or more sub-words matches pronunciation of the one or more words at a node in the prefix tree; and identifying the one or more words at the node in the prefix tree as the first one or more words.

19. The non-transitory machine-readable medium of claim 18, wherein the phone BPE system includes an acoustic model that determines the first score for each of the one or more first words after the multi-level language model identifies the one or more first words.

20. The non-transitory machine-readable medium of claim 17, wherein the character BPE system includes a multi-level language model and the operation further comprise: traversing a prefix tree of the multi-level language model to identify one or more sub-words until a word boundary is met, wherein the word boundary is met when pronunciation of the one or more sub-words matches pronunciation of the one or more words at a node in the prefix tree; and identifying the one or more words at the node in the prefix tree as the second one or more words.
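
Illustrative sketch (not part of the claims)

The flow recited in claims 1 and 2 could be sketched roughly as follows: walk a prefix tree over pronunciations until a word boundary is reached, then combine the phone-system and character-system scores for words that both systems propose and keep the highest. This is a minimal sketch under stated assumptions, not the patented implementation. The toy lexicon, the phone symbols, the score values, the names `build_prefix_tree`, `words_at_boundary` and `pick_word`, and the weighted sum used to combine scores are all hypothetical; the claims do not say how the scores are combined or how the models produce them.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # phone -> TrieNode
    words: list = field(default_factory=list)     # words whose pronunciation ends here


def build_prefix_tree(lexicon):
    """Build a prefix tree from a {word: tuple_of_phones} lexicon (assumed format)."""
    root = TrieNode()
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, TrieNode())
        node.words.append(word)
    return root


def words_at_boundary(root, subwords):
    """Follow the phones of each sub-word unit down the tree.

    Returns the words stored at the node reached, i.e. the words whose
    pronunciation matches the concatenated sub-words, or [] if none match.
    """
    node = root
    for unit in subwords:
        for phone in unit:
            node = node.children.get(phone)
            if node is None:
                return []
    return list(node.words)


def pick_word(first_scores, second_scores, weight=0.5):
    """Combine scores for words found by both systems and return the best word.

    The weighted sum of log-scores is an assumption made for illustration.
    """
    combined = {
        word: (1.0 - weight) * score + weight * second_scores[word]
        for word, score in first_scores.items()
        if word in second_scores
    }
    if not combined:
        return None
    return max(combined, key=combined.get)


# Toy data: three homophones share one pronunciation path in the tree.
lexicon = {
    "two": ("T", "UW"),
    "too": ("T", "UW"),
    "to": ("T", "UW"),
    "tune": ("T", "UW", "N"),
}
root = build_prefix_tree(lexicon)
candidates = words_at_boundary(root, [("T",), ("UW",)])

# Hypothetical log-scores from the phone system and the character system.
first_scores = {"two": -1.2, "too": -1.5, "to": -0.9}
second_scores = {"two": -0.4, "to": -1.6}
best = pick_word(first_scores, second_scores)
```

In this toy run the phone-side traversal yields several homophones, and only words that the character side also proposes take part in the combined ranking, which mirrors the matching step in claim 1.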

Patent Metadata

Filing Date: Unknown

Publication Date: May 10, 2022

Inventors: Weiran Wang, Yingbo Zhou, Caiming Xiong
