A system for subvocalized speech recognition includes a plurality of sensors, a controller, and a processor. The sensors are coupled to a near-eye display (NED) and configured to capture non-audible and subvocalized commands provided by a user wearing the NED. The controller, interfaced with the plurality of sensors, is configured to combine data acquired by each of the plurality of sensors. The processor, coupled to the controller, is configured to extract one or more features from the combined data, compare the one or more extracted features with a pre-determined set of commands, and determine a command of the user based on the comparison.
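The combine, extract, compare, and determine steps described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the patented method: the function names, the sensor keys, the choice of per-channel mean and RMS energy as features, and the nearest-template comparison by Euclidean distance are all hypothetical.

```python
import math


def combine(sensor_frames):
    """Concatenate time-aligned samples from every sensor into one vector per time step.

    sensor_frames maps a (hypothetical) sensor name to a list of per-time-step
    sample lists. Sensors are ordered by name so the layout is deterministic.
    """
    names = sorted(sensor_frames)
    length = min(len(sensor_frames[name]) for name in names)
    return [
        [value for name in names for value in sensor_frames[name][t]]
        for t in range(length)
    ]


def extract_features(combined):
    """Illustrative features: per-channel mean followed by per-channel RMS energy."""
    channels = list(zip(*combined))
    means = [sum(c) / len(c) for c in channels]
    rms = [math.sqrt(sum(x * x for x in c) / len(c)) for c in channels]
    return means + rms


def determine_command(features, templates):
    """Compare features with each pre-determined command template; return the nearest."""
    return min(templates, key=lambda cmd: math.dist(features, templates[cmd]))


# Usage with made-up numbers: three sensors, three time steps.
frames = {
    "air_mic": [[0.1], [0.2], [0.1]],
    "camera": [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]],
    "nam_mic": [[0.5], [0.4], [0.6]],
}
features = extract_features(combine(frames))
templates = {"select": features, "back": [0.0] * len(features)}
command = determine_command(features, templates)
```

A template store keyed by command name keeps the comparison step independent of how the features were produced, so a different feature extractor could be swapped in without touching the decision rule.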
Legal claims defining the scope of protection, as filed with the USPTO.
1. A system for subvocalized speech recognition, the system comprising:
    a plurality of sensors coupled to a near-eye display (NED) configured to capture non-audible and subvocalized commands provided by a user wearing the NED, the plurality of sensors comprising an air microphone, a camera, and a non-audible murmur (NAM) microphone,
        the camera oriented toward a mouth of the user and configured to:
            generate one or more video signals from capturing a profile view of lips of the user, and
            detect a non-audible speech based on movement of the lips in the one or more video signals,
        the NAM microphone configured to:
            sit against a skin behind an ear of the user, and
            measure vibrations in the skin during a mouthed subvocalized speech by the user;
    a controller interfaced with the plurality of sensors configured to combine data acquired by all the sensors; and
    a processor coupled to the controller, the processor configured to:
        extract one or more features from the combined data,
        compare the one or more extracted features with a pre-determined set of commands,
        determine a command of the user based on the comparison, the command provided by the user to the NED as a whisper in a subvocalized manner, and
        provide the command to the NED for operating the NED, the NED being a part of an artificial reality system and the commands are directed at the artificial reality system.
2. The system of claim 1, wherein the plurality of sensors further comprises an in-ear proximity sensor.
3. The system of claim 1, wherein the camera is integrated in a frame of the NED in proximity of the mouth for capturing images of the lips.
4. The system of claim 2, wherein the in-ear proximity sensor is located inside an ear canal of the ear.
5. The system of claim 4, wherein the in-ear proximity sensor is coupled to a frame of the NED and is configured to measure one or more deformations of the ear canal when the user moves a jaw during the non-audible speech.
6. The system of claim 1, wherein the NAM microphone is coupled to a frame of the NED.
7. The system of claim 1, wherein the processor is further configured to perform a frame-based classification to classify a sequence of the combined data into the command based on the extracted one or more features.
8. The system of claim 1, wherein the processor is further configured to perform a sequence-based classification to classify a sequence of the combined data into the command based on a hidden Markov model associated with the extracted one or more features.
9. The system of claim 1, wherein the processor is further configured to perform a sequence-to-sequence phoneme classification to classify a sequence of the combined data into the command based on an acoustic model and a language model.
10. The system of claim 9, wherein the acoustic model is configured as a recurrent neural network.
11. The system of claim 9, wherein the language model is trained using an n-gram model.
12. The system of claim 9, wherein the processor is further configured to: run the acoustic model to map the sequence of the combined data into a phoneme sequence; and map the phoneme sequence into the command using the language model.
13. A system configured to:
    acquire, by a plurality of sensors coupled to a near-eye display (NED), data with features related to non-audible and subvocalized commands provided by a user, the plurality of sensors comprising an air microphone, a camera, and a non-audible murmur (NAM) microphone,
        the camera oriented toward a mouth of the user and configured to:
            generate one or more video signals from capturing a profile view of lips of the user, and
            detect a non-audible speech based on movement of the lips in the one or more video signals,
        the NAM microphone configured to:
            sit against a skin behind an ear of the user, and
            measure vibrations in the skin during a mouthed subvocalized speech by the user;
    combine data acquired by all the sensors;
    perform feature extraction on the combined data based on machine learning to determine a command of the user, the command provided by the user to the NED as a whisper in a subvocalized manner; and
    provide the commands to an artificial reality system for operating the artificial reality system, the NED being a part of the artificial reality system.
14. The system of claim 13, wherein the plurality of sensors further comprises an in-ear proximity sensor.
15. The system of claim 13, wherein the camera is integrated in a frame of the NED in proximity of the mouth for capturing images of the lips.
16. The system of claim 14, wherein the in-ear proximity sensor is configured to measure one or more deformations of an ear canal of the ear when the user moves a jaw during the non-audible speech.
17. The system of claim 13, wherein the system is further configured to perform a sequence-based classification to classify a sequence of the combined data into the command based on a hidden Markov model associated with the extracted features of the combined data.
18. The system of claim 13, wherein the system is further configured to perform a sequence-to-sequence phoneme classification to classify a sequence of the combined data into the command based on an acoustic model and a language model.
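The two-stage mapping recited in the claims (sensor data to a phoneme sequence via an acoustic model, then phoneme sequence to a command via a language model) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation. The recurrent acoustic model is not reproduced; the sketch assumes it has already produced per-frame phoneme probabilities and applies a swapped-in greedy decode (pick the best symbol per frame, merge repeats, drop a hypothetical blank symbol `_`). The language model is assumed to be one add-one-smoothed bigram model per command, trained on example pronunciations; the claims do not specify how the n-gram model is used, and the phoneme labels and command names are made up.

```python
import math
from collections import Counter

BLANK = "_"  # hypothetical "no phoneme" symbol emitted between phonemes


def greedy_phonemes(frame_posteriors):
    """Stand-in for the acoustic model's output stage.

    Takes one {phoneme: probability} dict per frame, keeps the most likely
    symbol per frame, merges consecutive repeats, then drops blanks.
    """
    best = [max(p, key=p.get) for p in frame_posteriors]
    merged = [s for i, s in enumerate(best) if i == 0 or s != best[i - 1]]
    return [s for s in merged if s != BLANK]


def train_bigram(sequences, vocab):
    """Return a log-probability function for an add-one smoothed phoneme bigram model."""
    counts, context = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    size = len(vocab) + 1  # possible next symbols: every phoneme plus </s>

    def log_prob(seq):
        padded = ["<s>"] + list(seq) + ["</s>"]
        return sum(
            math.log((counts[(a, b)] + 1) / (context[a] + size))
            for a, b in zip(padded, padded[1:])
        )

    return log_prob


def map_to_command(phonemes, models):
    """Pick the command whose bigram model gives the phoneme sequence the highest score."""
    return max(models, key=lambda cmd: models[cmd](phonemes))


# Usage with made-up posteriors and pronunciations.
vocab = {"p", "l", "ey", "s", "t", "aa"}
models = {
    "play": train_bigram([["p", "l", "ey"]], vocab),
    "stop": train_bigram([["s", "t", "aa", "p"]], vocab),
}
posteriors = [
    {"p": 0.9, "_": 0.1},
    {"p": 0.6, "l": 0.4},
    {"_": 0.8, "l": 0.2},
    {"l": 0.7, "ey": 0.3},
    {"ey": 0.9, "_": 0.1},
]
phonemes = greedy_phonemes(posteriors)
command = map_to_command(phonemes, models)
```

Keeping the two stages as separate functions mirrors the split between acoustic model and language model: either stage could be replaced (for example, by a trained recurrent network or a higher-order n-gram model) without changing the other.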
October 23, 2017
May 26, 2020