US-10573300

Method and apparatus for automatic speech recognition

Published: February 25, 2020

Assignee: not available in the USPTO data

Inventors: not available in the USPTO data

Technical Abstract

The invention provides a method of automatic speech recognition. The method includes receiving a speech signal; dividing the speech signal into time windows; for each time window, determining acoustic parameters of the speech signal within that window and identifying phonological features from the acoustic parameters, such that a sequence of phonological features is generated for the speech signal; separating the sequence of phonological features into a sequence of zones; and comparing the sequence of zones with lexical entries in a stored lexicon, each lexical entry comprising a sequence of phonological segments, to identify one or more words in the speech signal.

Patent Claims

17 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of automatic speech recognition, the method comprising the steps of: receiving a speech signal, dividing the speech signal into time windows, for each time window: determining acoustic parameters of the speech signal within that window, and identifying phonological features from the acoustic parameters, such that a sequence of phonological features is generated for the speech signal, separating the sequence of phonological features into a sequence of zones, and comparing the sequence of zones with lexical entries comprising sequences of phonological segments in a stored lexicon to identify one or more words in the speech signal, wherein: the step of separating the sequence of phonological features into a sequence of zones comprises identifying stable and unstable zones, wherein the unstable zones are in between stable zones; and the step of separating the sequence of phonological features into a sequence of zones comprises determining an instability score for each time point in the sequence of phonological features, the instability score being determined by comparing features extracted at a time-point with those at time-points preceding the time point, back to a configurable number; the method further comprising determining an unstable zone penalty for each feature in a first unstable zone depending on features present in matched phonological segments of lexical entries aligned to stable zones on each side of the first unstable zone; and calculating a matching score from the determined unstable zone penalty to identify the one or more words in the speech signal.

2. A method according to claim 1 wherein the lowest unstable zone penalty is selected to contribute to a matching score from one or more penalties.
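
As an illustrative sketch only (not part of the claims), one possible reading of the unstable zone penalty in claims 1 and 2 is shown below in Python. It assumes that each feature observed in an unstable zone is scored once against the lexical segment aligned to the stable zone on its left and once against the segment on its right, and that the lower of the two penalties contributes to the matching score. The function name, the flat `mismatch_penalty` value, and the feature labels are hypothetical.

```python
def unstable_zone_penalty(zone_features, left_segment, right_segment,
                          mismatch_penalty=1.0):
    """Sum, over the features in an unstable zone, the lowest penalty
    obtained against the two neighbouring lexical segments.

    Assumption: a feature explained by either neighbour costs nothing;
    otherwise it costs a flat `mismatch_penalty`.
    """
    total = 0.0
    for feature in zone_features:
        candidates = [
            0.0 if feature in segment else mismatch_penalty
            for segment in (left_segment, right_segment)
        ]
        # Only the lowest candidate penalty contributes to the score.
        total += min(candidates)
    return total


# Hypothetical feature labels, for illustration only.
penalty = unstable_zone_penalty(
    {"HIGH", "LOW", "NASAL"},
    left_segment={"HIGH"},
    right_segment={"LOW"},
)
print(penalty)  # 1.0: only NASAL is unexplained by both neighbours
```

Under this reading, a transition between two segments is not penalised for carrying features of either neighbour, which seems consistent with treating unstable zones as lying in between stable zones.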

3. A method according to claim 1 wherein each time window is 20 ms.
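
For illustration, dividing a sampled signal into 20 ms time windows could be sketched as follows. The claims do not say whether windows overlap or how a trailing partial window is handled; this sketch assumes consecutive, non-overlapping windows and drops any incomplete final window. The function name and the 16 kHz sample rate are hypothetical.

```python
def split_into_windows(samples, sample_rate_hz, window_ms=20.0):
    """Split samples into consecutive, non-overlapping windows.

    Assumption: samples that do not fill a whole final window are dropped.
    """
    window_len = int(round(sample_rate_hz * window_ms / 1000.0))
    if window_len <= 0:
        raise ValueError("window must contain at least one sample")
    n_windows = len(samples) // window_len
    return [
        samples[i * window_len:(i + 1) * window_len]
        for i in range(n_windows)
    ]


# 1000 samples at 16 kHz: each 20 ms window holds 320 samples.
windows = split_into_windows(list(range(1000)), sample_rate_hz=16000)
print(len(windows), len(windows[0]))  # 3 320
```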

4. A method according to claim 1 wherein the acoustic parameters of the speech signal within each time window comprise one or more of: the root mean square amplitude; the fundamental frequency of the speech signal; the frequency of one or more formants F1, F2, F3 in the speech signal; and a spectrum of the speech signal.
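
Of the acoustic parameters listed, the root mean square amplitude is the simplest to illustrate. A minimal Python sketch, assuming a window is given as a non-empty sequence of numeric samples (the function name is hypothetical):

```python
import math


def rms_amplitude(window):
    """Root mean square amplitude of one time window of samples."""
    if not window:
        raise ValueError("window must not be empty")
    return math.sqrt(sum(x * x for x in window) / len(window))


print(rms_amplitude([3.0, -3.0, 3.0, -3.0]))  # 3.0
```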

5. A method according to claim 4 wherein a spectrum of the speech signal is calculated, the method further comprising determining: an overall steepness value by calculating the slope of a regression line over the whole spectrum, a first steepness value by calculating the slope of a regression line over a first frequency range, and a second steepness value by calculating the slope of a regression line over a second frequency range.
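
The three steepness values can be sketched as ordinary least-squares slopes. The following Python example is an assumption-laden illustration: it takes the spectrum as paired lists of frequencies (Hz) and levels (assumed here to be in dB), and its default frequency ranges are the ones recited in claim 16. The function names are hypothetical.

```python
def regression_slope(xs, ys):
    """Slope of the least-squares regression line of ys on xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def steepness_values(freqs_hz, levels,
                     first_range=(300.0, 1500.0),
                     second_range=(1500.0, 5000.0)):
    """Overall, first-range and second-range spectral slopes."""
    def slope_in(lo, hi):
        # Keep only spectrum points whose frequency lies in [lo, hi].
        pts = [(f, v) for f, v in zip(freqs_hz, levels) if lo <= f <= hi]
        return regression_slope([p[0] for p in pts], [p[1] for p in pts])

    return {
        "overall": regression_slope(freqs_hz, levels),
        "first": slope_in(*first_range),
        "second": slope_in(*second_range),
    }


# Made-up spectrum: flat up to 1500 Hz, then falling by 2 dB per kHz.
freqs = [500.0, 1000.0, 1500.0, 2500.0, 3500.0, 4500.0]
levels = [10.0, 10.0, 10.0, 8.0, 6.0, 4.0]
steepness = steepness_values(freqs, levels)
```

In this made-up spectrum the first-range slope is zero and the second-range slope is -0.002 dB/Hz, so the two band-limited values distinguish a shape that the overall slope alone would blur.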

6. A method according to claim 1 comprising generating a sequence of phonological features by determining the speech features active in each time window and outputting the speech features in chronological order.

7. A method according to claim 1 wherein the step of separating the sequence of phonological features into a sequence of zones further comprises: comparing the instability scores with an instability threshold and a minimum stable zone length, wherein a sequence of time points having a length greater than the minimum stable zone length and an instability score less than the instability threshold is determined to form a stable zone, such that features lying within the stable zone are deemed to be part of the same phonological segment.
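
A hedged sketch of this zoning step in Python follows. It assumes the instability scores are already available as one number per time point, measures the minimum stable zone length in time points rather than milliseconds, and treats every run that fails the stable-zone test as part of an unstable zone. The function name and the `(kind, start, end)` tuple layout are hypothetical.

```python
def segment_zones(scores, threshold, min_stable_len):
    """Label runs of time points as stable or unstable zones.

    Returns a list of (kind, start, end) tuples with `end` exclusive.
    A run is stable when every score is below `threshold` and the run is
    longer than `min_stable_len` time points; anything else is unstable.
    """
    zones = []

    def add(kind, start, end):
        # Merge with the previous zone when it has the same kind.
        if zones and zones[-1][0] == kind:
            zones[-1] = (kind, zones[-1][1], end)
        else:
            zones.append((kind, start, end))

    i, n = 0, len(scores)
    while i < n:
        below = scores[i] < threshold
        j = i
        while j < n and (scores[j] < threshold) == below:
            j += 1
        is_stable = below and (j - i) > min_stable_len
        add("stable" if is_stable else "unstable", i, j)
        i = j
    return zones


zones = segment_zones([0, 0, 0, 3, 4, 0, 0, 0, 0], threshold=2, min_stable_len=2)
print(zones)  # [('stable', 0, 3), ('unstable', 3, 5), ('stable', 5, 9)]
```

Low-scoring runs that are too short to count as stable are absorbed into the surrounding unstable zone, so the output alternates between the two kinds.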

8. A method according to claim 7 wherein the minimum stable zone length is 30 ms.

9. A method according to claim 7 wherein the instability score for a time point is increased for one or more of the following: for each feature present in the preceding time point but not present in the time point; for each feature present in the time point but not present in the preceding time point; and where the time point and the preceding time point comprise phonological features forming mutually exclusive pairs.
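
The scoring rules above, combined with the configurable look-back of claim 1, can be sketched in Python as follows. The claims do not give the size of each increment or say how earlier time points are weighted; this sketch assumes an increment of one per rule and equal weight for every preceding time point within the look-back. The function name and the feature labels are hypothetical.

```python
def instability_scores(frames, lookback=1, exclusive_pairs=frozenset()):
    """Instability score per time point.

    frames: list of sets of feature labels, one set per time point.
    lookback: how many preceding time points to compare against.
    exclusive_pairs: set of (a, b) tuples of mutually exclusive features.
    """
    scores = []
    for t, current in enumerate(frames):
        score = 0
        for k in range(1, lookback + 1):
            if t - k < 0:
                break
            previous = frames[t - k]
            score += len(previous - current)  # features that disappeared
            score += len(current - previous)  # features that appeared
            for a, b in exclusive_pairs:
                # One feature of the pair then, the other one now.
                if (a in current and b in previous) or \
                   (b in current and a in previous):
                    score += 1
        scores.append(score)
    return scores


frames = [{"HIGH"}, {"HIGH"}, {"LOW"}, {"LOW"}]
scores = instability_scores(frames, lookback=1,
                            exclusive_pairs={("HIGH", "LOW")})
print(scores)  # [0, 0, 3, 0]
```

The score peaks at the time point where one feature is replaced by its mutually exclusive counterpart and returns to zero once the new feature persists.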

10. A method according to claim 7 wherein comparing the sequence of zones with lexical entries in a stored lexicon to identify one or more words in the speech signal comprises the steps of, for a lexical entry comprising a description of a word in terms of phonological segments: matching the stable zones to the phonological segments of the lexical entry; for each stable zone, determining a penalty for each feature depending on the features present in the matched phonological segment of the lexical entry; and calculating a matching score from the determined penalties to identify the one or more words in the speech signal.

11. A method according to claim 10 wherein no penalty is determined for a feature in the stable zone if the same feature is present in the matched phonological segment of the lexical entry.

12. A method according to claim 10 wherein the penalty is dependent on the fraction of the stable zone in which the feature is active.
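
Claims 10 to 12 can be illustrated together with a short Python sketch. It assumes a stable zone is given as a list of per-time-point feature sets, that a feature also present in the matched lexical segment incurs no penalty (claim 11), and that the penalty for any other feature grows linearly with the fraction of the zone in which it is active (one possible reading of claim 12). The function name, the `mismatch_weight` parameter, and the feature labels are hypothetical.

```python
def stable_zone_penalty(zone_frames, segment_features, mismatch_weight=1.0):
    """Penalty for matching one stable zone to one lexical segment.

    zone_frames: list of sets of features, one per time point in the zone.
    segment_features: set of features of the matched lexical segment.
    """
    n = len(zone_frames)
    if n == 0:
        raise ValueError("zone must contain at least one time point")
    observed = set().union(*zone_frames)
    penalty = 0.0
    for feature in observed:
        if feature in segment_features:
            continue  # same feature in the lexical segment: no penalty
        # Fraction of the zone's time points in which the feature is active.
        fraction = sum(feature in frame for frame in zone_frames) / n
        penalty += mismatch_weight * fraction
    return penalty


zone = [{"HIGH", "NASAL"}, {"HIGH"}, {"HIGH"}, {"HIGH", "NASAL"}]
print(stable_zone_penalty(zone, {"HIGH"}))  # 0.5
```

Scaling by the active fraction means a feature that flickers briefly within a zone costs less than one that is present throughout.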

13. A method according to claim 10 comprising comparing the zonally-classified sequence to a plurality of lexical entries and identifying a word from the lexical entry with the lowest matching score.
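
Selecting the word with the lowest matching score can be sketched as a simple search over the lexicon. In this Python illustration the lexicon is assumed to be a dictionary from words to lists of segment feature sets, and the scoring function is passed in as a parameter. The `toy_score` function, the entry names, and the feature labels are all hypothetical stand-ins, not the scoring described in the claims.

```python
def best_match(zones, lexicon, score_fn):
    """Return (word, score) for the lexical entry with the lowest score."""
    best_word, best_score = None, float("inf")
    for word, segments in lexicon.items():
        score = score_fn(zones, segments)
        if score < best_score:
            best_word, best_score = word, score
    return best_word, best_score


def toy_score(zones, segments):
    """Hypothetical score: position-wise count of unshared features,
    plus a flat cost of 10 per missing or extra segment."""
    score = 10.0 * abs(len(zones) - len(segments))
    for zone, segment in zip(zones, segments):
        score += len(zone ^ segment)
    return score


lexicon = {
    "word_a": [{"NASAL", "LABIAL"}, {"HIGH"}],
    "word_b": [{"NASAL", "CORONAL"}, {"LOW"}],
}
observed = [{"NASAL", "LABIAL"}, {"HIGH"}]
print(best_match(observed, lexicon, toy_score))  # ('word_a', 0.0)
```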

14. An apparatus comprising a processing circuitry and a machine readable medium containing instructions, which when read by the processing circuitry cause the processing circuitry to: receive a speech signal, divide the speech signal into time windows, for each time window: determine acoustic parameters of the speech signal within that window, and identify phonological features from the acoustic parameters, such that a sequence of phonological features is generated for the speech signal, separate the sequence of phonological features into a sequence of zones, and compare the sequence of zones with lexical entries comprising sequences of phonological segments in a stored lexicon to identify one or more words in the speech signal; wherein: the step of separating the sequence of phonological features into a sequence of zones comprises identifying stable and unstable zones, wherein the unstable zones are in between stable zones; the step of separating the sequence of phonological features into a sequence of zones comprises determining an instability score for each time point in the sequence of phonological features, the instability score being determined by comparing features extracted at a time-point with those at time-points preceding the time point, back to a configurable number; the instructions further causing the processing circuitry to: determine an unstable zone penalty for each feature in a first unstable zone depending on features present in matched phonological segments of lexical entries aligned to stable zones on each side of the first unstable zone; and calculate a matching score from the determined unstable zone penalty to identify the one or more words in the speech signal.

15. A non-transitory machine readable medium containing instructions, which when read by the processing circuitry cause the processing circuitry to: receive a speech signal, divide the speech signal into time windows, for each time window: determine acoustic parameters of the speech signal within that window, and identify phonological features from the acoustic parameters, such that a sequence of phonological features is generated for the speech signal, separate the sequence of phonological features into a sequence of zones, and compare the sequence of zones with lexical entries comprising sequences of phonological segments in a stored lexicon to identify one or more words in the speech signal; wherein: the step of separating the sequence of phonological features into a sequence of zones comprises identifying stable and unstable zones, wherein the unstable zones are in between stable zones; the step of separating the sequence of phonological features into a sequence of zones comprises determining an instability score for each time point in the sequence of phonological features, the instability score being determined by comparing features extracted at a time-point with those at time-points preceding the time point, back to a configurable number; the instructions further causing the processing circuitry to: determine an unstable zone penalty for each feature in a first unstable zone depending on features present in matched phonological segments of lexical entries aligned to stable zones on each side of the first unstable zone; and calculate a matching score from the determined unstable zone penalty to identify the one or more words in the speech signal.

16. A method according to claim 5, wherein the first frequency range is from 300 Hz to 1500 Hz and the second frequency range is from 1500 Hz to 5000 Hz.

17. A method according to claim 13, further comprising only comparing the sequential phonological segments to a lexical entry if the number of phonological segments in the lexical entry is within a limited range of the number of zones in the sequential phonological segments.
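
The pre-filter in claim 17 can be sketched as a length check applied before any scoring. The claim does not specify the size of the "limited range"; the Python sketch below assumes a symmetric tolerance, `max_diff`, counted in segments. The function name, the lexicon layout (words mapped to lists of segment feature sets), and the entries are hypothetical.

```python
def candidate_entries(n_zones, lexicon, max_diff=1):
    """Keep only lexical entries whose segment count is within
    `max_diff` of the number of zones in the observed sequence."""
    return {
        word: segments
        for word, segments in lexicon.items()
        if abs(len(segments) - n_zones) <= max_diff
    }


# Placeholder entries: only the number of segments matters here.
lexicon = {
    "short": [set(), set()],
    "medium": [set(), set(), set()],
    "long": [set()] * 6,
}
print(sorted(candidate_entries(3, lexicon)))  # ['medium', 'short']
```

Filtering on length first avoids computing matching scores for entries that could not plausibly align with the observed zones.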

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G10L

Patent Metadata

Filing Date

August 22, 2018

Publication Date

February 25, 2020
