Identifying Collocations in a Corpus of Text in a Distributed Computing Environment

Published: January 19, 2016

Assignee: not available in USPTO data

Inventors: Xiong Zhang, Hung-chih Yang, Danny Lange

Technical Abstract

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method executed by at least one computing device in a distributed computing environment, the method comprising: identifying n-grams of differing lengths in a corpus of text, the corpus of text comprising tokens arranged in an order, each identified n-gram comprises a plurality of sequential tokens in the corpus of text; for each n-gram identified in the corpus of text, identifying a prefix of the n-gram and a suffix of the n-gram; computing, for each identified n-gram in the corpus of text and in a single pass over the corpus of text, a metric that is indicative of a frequency of occurrence of the n-gram in the corpus of text relative to an expected frequency of occurrence of the n-gram in the corpus of text if the tokens were arranged randomly in the corpus of text, the metric computed based upon the identifying of the prefix of the n-gram and the suffix of the n-gram, the computing of the metric performed by the at least one computing device in the distributed computing environment; and when the metric is above a predefined threshold, assigning a label to the n-gram that indicates that the n-gram is representative of content of the corpus of text.

2. The method of claim 1, wherein assigning the label to the n-gram comprises: labeling the n-gram as being a collocation in the corpus of text.

3. The method of claim 1, further comprising: generating a first key/value pair and a second key/value pair for each identified n-gram, the first key/value pair corresponding to the prefix of the n-gram, the second key/value pair corresponding to the suffix of the n-gram, the first key/value pair comprising a first key that comprises: a first value that identifies that the first key corresponds to the prefix; and the prefix, the second key/value pair comprising a second key that comprises: a second value that identifies that the second key corresponds to the suffix; and the suffix, wherein the metric for the several n-grams is computed based upon the first key/value pair and the second key/value pair.

4. The method of claim 3, wherein the first key comprises a third value that identifies length of the n-gram, and wherein the second key comprises the third value.

5. The method of claim 1, wherein identifying n-grams of the differing lengths in the corpus of text is performed by a first computing device in the distributed computing environment, and computing the metric is performed by a second computing device in the distributed computing environment.

6. The method of claim 1, further comprising assigning a second label to the n-gram based upon the metric, the second label indicates that the n-gram is new vernacular.

7. The method of claim 1, wherein identifying the prefix of the n-gram comprises identifying multiple tokens of the n-gram to include in the prefix.

8. The method of claim 1, further comprising: computing a first frequency value that is indicative of a number of occurrences of the prefix and the suffix as a prefix/suffix combination in the identified n-grams; computing a second frequency value that is indicative of a number of occurrences of the prefix, as a prefix and without the suffix, in the identified n-grams; computing a third frequency value that is indicative of a number of occurrences of the suffix, as a suffix and without the prefix, in the identified n-grams, wherein the first frequency value, the second frequency value, and the third frequency value are computed subsequent to identifying the prefix and the suffix of each identified n-gram, and computing the metric for each n-gram in the identified n-grams based upon the first frequency value, the second frequency value, and the third frequency value.

9. The method of claim 8, further comprising, for each n-gram in the identified n-grams: comparing the first frequency value with a predefined threshold; and computing the metric for the n-gram only if the first frequency value for the n-gram is greater than the predefined threshold.

10. The method of claim 8, further comprising: counting a number of n-grams of length N for a particular value of N in the corpus of text; computing a fourth frequency value that is indicative of a number of n-grams of length N that include neither the prefix nor the suffix based upon the counting of the number of n-grams of length N in the corpus of text; and computing the metric for n-grams of length N for the particular value of N based upon the fourth frequency value.

11. A distributed computing system that facilitates labelling an n-gram as being representative of content of a corpus of text, the system comprising: a plurality of computing devices that are in communication with one another, the plurality of computing devices comprising: respective processing units; and memory that comprises instructions that, when collectively executed by the processing units, cause the processing units to perform acts comprising: receiving the corpus of text, the corpus of text comprising a plurality of tokens that are arranged in an order; computing, for each n-gram in a plurality of identified n-grams in the corpus of text, a metric that is indicative of a frequency of occurrence of the n-gram relative to an expected frequency of occurrence of the n-gram, wherein the identified n-grams comprise n-grams of differing length, and wherein the metric for each n-gram is computed in one pass over the corpus of text; and for each n-gram in the plurality of n-grams, assigning a label to the n-gram if the metric computed for the n-gram is above a predefined threshold, the label indicates that the n-gram is representative of the content of the corpus of text.

12. The system of claim 11, wherein the corpus of text consumes multiple terabytes of computer-readable data storage.

13. The system of claim 11, wherein computing the metric comprises: identifying, for each n-gram in the identified n-grams, a prefix of the n-gram and a suffix of the n-gram; and computing the metric for the n-gram based at least in part upon the identifying the prefix of the n-gram and the suffix of the n-gram.

14. The system of claim 13, wherein at least one of the prefix or the suffix comprises at least two tokens.

15. The system of claim 13, wherein computing the metric comprises computing, for each n-gram, a plurality of frequency values, the frequency values comprising: a first frequency value that is indicative of a number of occurrences of the n-gram in the corpus of text; a second frequency value that is indicative of a number of occurrences of the prefix in the corpus of text, as a prefix and without the suffix; and a third frequency value that is indicative of a number of occurrences of the suffix in the corpus of text, as a suffix and without the prefix, wherein the metric is computed based upon the first frequency value, the second frequency value, and the third frequency value.

16. The system of claim 15, wherein computing the metric for each n-gram comprises: determining a number of n-grams in the corpus of text with length of the n-gram; and computing a fourth frequency value that is indicative of a number of n-grams that include neither the prefix nor the suffix based upon the number of n-grams in the corpus of text with length of the n-gram, wherein the metric for the n-gram is computed based upon the fourth frequency value.

17. The system of claim 11, wherein the acts further comprise: providing the n-gram to a third party based upon the metric, wherein the metric indicates that use of the n-gram is trending upwards.

18. Non-transitory computer-readable media comprising instructions that, when executed by at least one processor, cause the at least one processor to perform acts comprising: parsing a corpus of text to identify n-grams of differing lengths therein, the corpus of text comprises a plurality of words in an order, each identified n-gram comprises at least two words that appear sequentially in the corpus of text; for each identified n-gram, identifying a prefix and suffix of the n-gram; for each identified n-gram, computing, in a single pass over the corpus of text, a value that is indicative of a frequency of occurrence of the n-gram in the corpus of text relative to an expected frequency of occurrence of the n-gram in the corpus of text if the words in the corpus of text were arranged randomly, the value computed based upon the identifying of the prefix and the suffix of the n-gram; and for each identified n-gram, assigning a label to the n-gram that indicates that the n-gram is representative of content of the corpus of text when the value computed for the n-gram is above a predefined threshold.

19. The computer readable storage device of claim 18, wherein, for each n-gram comprising at least three words, at least one of the prefix or the suffix of the n-gram comprises at least two words.

20. The computer-readable storage device of claim 18, the acts further comprising assigning the n-gram to the corpus of text as being representative of content of the corpus of text.
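
Illustrative sketch (not part of the claims). The claims describe splitting each n-gram into a prefix and a suffix, emitting tagged key/value pairs for each, deriving four frequency values, and labeling n-grams whose metric exceeds a threshold. The Python below is a minimal single-machine sketch of that flow, under several assumptions that are not stated in the claims: the prefix is taken to be all tokens but the last, the metric is the log-likelihood ratio (G²) over a 2x2 contingency table, and all function names, the "P"/"S" tags, and the parameters are hypothetical. It counts in memory and makes no attempt to reproduce the distributed, single-pass computation.

```python
import math
from collections import Counter


def ngrams(tokens, min_n=2, max_n=3):
    """Yield every run of n sequential tokens for n in [min_n, max_n]."""
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def split(gram):
    """Assumed split: prefix = all tokens but the last, suffix = last token."""
    return gram[:-1], gram[-1:]


def emit_pairs(gram):
    """Emit one key/value pair per side; each key carries a tag and the length."""
    prefix, suffix = split(gram)
    n = len(gram)
    yield ("P", n, prefix), suffix   # key identifies a prefix of an n-gram of length n
    yield ("S", n, suffix), prefix   # key identifies a suffix of an n-gram of length n


def count(tokens, min_n=2, max_n=3):
    """Count prefix/suffix combinations, per-side totals, and n-grams per length."""
    combos, sides, totals = Counter(), Counter(), Counter()
    for gram in ngrams(tokens, min_n, max_n):
        prefix, suffix = split(gram)
        combos[(len(gram), prefix, suffix)] += 1
        totals[len(gram)] += 1
        for key, _ in emit_pairs(gram):
            sides[key] += 1
    return combos, sides, totals


def g2(n11, n12, n21, n22):
    """Log-likelihood ratio of a 2x2 table (an assumed choice of metric)."""
    total = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    observed = ((n11, n12), (n21, n22))
    score = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                score += o * math.log(o / expected)
    return 2.0 * score


def label_collocations(tokens, threshold, min_count=0, min_n=2, max_n=3):
    """Return {n-gram: score} for n-grams whose score exceeds the threshold."""
    combos, sides, totals = count(tokens, min_n, max_n)
    labeled = {}
    for (n, prefix, suffix), n11 in combos.items():
        if n11 <= min_count:       # skip rare combinations before scoring
            continue
        n12 = sides[("P", n, prefix)] - n11   # prefix without this suffix
        n21 = sides[("S", n, suffix)] - n11   # suffix without this prefix
        n22 = totals[n] - n11 - n12 - n21     # neither prefix nor suffix
        expected = (n11 + n12) * (n11 + n21) / totals[n]
        score = g2(n11, n12, n21, n22)
        # G2 is two-sided, so also require more occurrences than chance predicts.
        if n11 > expected and score > threshold:
            labeled[prefix + suffix] = score
    return labeled
```

As a usage example, calling label_collocations("new york a new york b new york c d e f g".split(), threshold=10.0, max_n=2) labels only ("new", "york"). Carrying the length in every key keeps counts for n-grams of different lengths separate, so the fourth frequency value can be derived from the per-length total rather than counted directly.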

Patent Metadata

Filing Date: Unknown

Publication Date: January 19, 2016

Inventors: Xiong Zhang, Hung-chih Yang, Danny Lange
