Processing Noisy Data and Determining Word Similarity

Published: March 11, 2008

Assignee: not available in USPTO data

Inventors: Hua Wu, Ming Zhou, Chang-Ning Huang

Patent Claims

16 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of determining similarity between words, comprising: receiving as an input a first word and a first dependency structure that includes the first word; receiving a data structure indicative of a second word and a second dependency structure that includes the second word; selecting one of a plurality of different weighting measures to weight a similarity measure based on whether a frequency indicator indicative of a frequency of occurrence of the second dependency structure in training data meets a frequency threshold value; and calculating the similarity between the first and second words based on the similarity measure weighted with the selected weighting measure.

2. The method of claim 1 wherein the plurality of weighting measures includes a co-occurrence weighting measure and a mutual information (MI) weighting measure.

3. The method of claim 2 wherein weighting the similarity measure comprises: weighting the similarity measure with the co-occurrence frequency measure if the frequency indicator indicates the frequency of occurrence is below the frequency threshold value.

4. The method of claim 3 wherein weighting the similarity measure comprises: weighting the similarity measure with the MI measure if the frequency indicator indicates the frequency of occurrence is above the frequency threshold value.

5. The method of claim 1 wherein receiving a data structure indicative of a second word comprises: accessing a data store that stores records that include words and associated dependency structures and frequency indicators.

6. The method of claim 5 wherein the associated dependency structures and frequency indicators in the data store are stored as vectors associated with the words, and wherein accessing a data store comprises: accessing the words and associated vectors.

7. The method of claim 5 wherein accessing the data store comprises: identifying candidate words in the data store by reducing the search space of records in the data store.

8. The method of claim 7 wherein identifying candidate words comprises: accessing a lexical knowledge base to identify possible candidate words in the data store.
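
As an illustration of claims 1 to 4, the following Python sketch shows one way the threshold-based choice of weighting measure might look. It is not taken from the patent. The function names, the default threshold of 5, the use of the raw count as the co-occurrence weight, the log2 pointwise form of mutual information, the handling of a count exactly equal to the threshold, and the summed score are all assumptions made for this example.

```python
import math
from typing import Dict, Iterable, Tuple

Attribute = Tuple[str, str]        # (relation type, related word); hypothetical layout
Vector = Dict[Attribute, int]      # attribute -> frequency indicator (raw count here)


def cooccurrence_weight(count: int) -> float:
    """Assumed co-occurrence weighting: the raw frequency itself."""
    return float(count)


def mi_weight(count: int, word_total: int, attr_total: int, grand_total: int) -> float:
    """Assumed MI weighting: pointwise mutual information in bits."""
    return math.log2((count * grand_total) / (word_total * attr_total))


def select_weight(count: int, word_total: int, attr_total: int,
                  grand_total: int, threshold: int = 5) -> Tuple[str, float]:
    """Pick the weighting measure from the frequency indicator.

    Below the threshold the co-occurrence weight is used, otherwise MI.
    Treating count == threshold as "above" is an assumption; the claims
    only say "below" and "above".
    """
    if count < threshold:
        return "cooccurrence", cooccurrence_weight(count)
    return "mi", mi_weight(count, word_total, attr_total, grand_total)


def similarity(input_attrs: Iterable[Attribute], candidate: Vector,
               attr_totals: Dict[Attribute, int], grand_total: int,
               threshold: int = 5) -> float:
    """Sum the selected weights over attributes shared with the input word.

    The summed form is a placeholder similarity measure, not the patent's.
    """
    word_total = sum(candidate.values())
    score = 0.0
    for attr in input_attrs:
        count = candidate.get(attr, 0)
        if count == 0:
            continue  # attribute not observed with the candidate word
        _, w = select_weight(count, word_total, attr_totals[attr],
                             grand_total, threshold)
        score += w
    return score
```

In this sketch a rarely seen dependency structure contributes its plain count, while a frequently seen one contributes a mutual information score, so low-frequency data is kept rather than filtered out.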

9. A natural language processing system, comprising: a data store storing head words and associated attributes, each of the attributes including a related word that was related to the head word in a training corpus, a relation type indicator indicating a type of relation between the head word and the related word, and a frequency indicator indicative of a frequency with which the attribute occurred relative to the head word in the training corpus; and a similarity generator configured to receive an input word and an associated input dependency structure and to access the data store and calculate a similarity between the input word and head words in the data store based on the input word and associated input dependency structure and the head words and associated dependency structures using a similarity measure that weights a similarity corresponding to a given head word with a first weighting measure if the frequency indicator associated with the given head word meets a predetermined frequency threshold value and with a second weighting measure different from the first weighting measure, if the frequency indicator does not meet the predetermined frequency threshold value.

10. The system of claim 9 wherein the similarity generator is configured to select a co-occurrence frequency weighting measure if the frequency indicator is below the predetermined threshold value.

11. The system of claim 10 wherein the similarity generator is configured to select a mutual information weighting measure if the frequency indicator is above the predetermined threshold value.

12. The system of claim 9 and further comprising: a lexical knowledge base, the similarity generator being configured to access the lexical knowledge base to identify a subset of the head words in the data store as candidate words prior to calculating the similarity.

13. The system of claim 9 wherein the data store stores the attributes as vectors.
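
The data store of claim 9 and the candidate selection of claim 12 can be pictured with a small Python sketch. Everything named here is hypothetical: the `Attribute` record, the dictionary-based store, and a lexical knowledge base modeled as a plain mapping from a word to a set of related words. The patent does not specify these representations.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Attribute:
    """One attribute of a head word, following the fields listed in claim 9."""
    relation: str       # relation type indicator, e.g. "obj"
    related_word: str   # word related to the head word in the training corpus
    frequency: int      # how often this attribute occurred with the head word


# Hypothetical store layout: head word -> its attribute vector.
DataStore = Dict[str, List[Attribute]]


def candidate_head_words(input_word: str, store: DataStore,
                         lexical_kb: Dict[str, Set[str]]) -> List[str]:
    """Keep only head words that the lexical knowledge base relates to the input.

    This narrows the search space before any similarity is calculated.
    """
    related = lexical_kb.get(input_word, set())
    return sorted(word for word in store if word in related)
```

A usage example under the same assumptions:

```python
store = {
    "coffee": [Attribute("obj", "drink", 12)],
    "car": [Attribute("obj", "drive", 30)],
}
lexical_kb = {"tea": {"coffee", "juice"}}
candidate_head_words("tea", store, lexical_kb)  # only "coffee" is in both
```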

14. A system for calculating similarity between words using annotated data, comprising: a parser configured to receive a textual input and parse the textual input into dependency structures including words and relation types indicative of relations between the words in the textual input and generate a vector corresponding to each dependency structure, the vector including a related word, a relation type indicator, and a frequency indicator indicating a frequency with which the dependency structure occurred in the textual input; a data store configured to store the words and corresponding vectors regardless of the frequency with which the dependency structures occurred in the textual input; and a similarity generator configured to receive an input word and an associated input dependency structure and to access the data store and calculate a similarity between the input word and words in the data store based on the input word and associated input dependency structure and the words and associated dependency structures in the data store using a similarity measure that weights a similarity corresponding to a given word in the data store with a first weighting measure if the frequency indicator associated with the given word meets a predetermined frequency threshold value and with a second weighting measure, different from the first weighting measure, if the frequency indicator does not meet the predetermined frequency threshold value.

15. The system of claim 14 wherein the frequency indicator comprises a normalized count value.

16. The system of claim 14 wherein the parser is configured to parse the textual input into dependency triples.
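
Claims 14 to 16 describe turning parsed dependency triples into per-word vectors with a frequency indicator. The Python sketch below assumes the triples have already been produced by some parser (no parser is implemented here) and assumes that "normalized count" means the count divided by the total number of triples seen for that word. The function name `build_vectors` and the triple layout `(word, relation, related_word)` are illustrative choices, not the patent's.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]      # (word, relation type, related word)
AttributeKey = Tuple[str, str]     # (relation type, related word)


def build_vectors(triples: Iterable[Triple]) -> Dict[str, Dict[AttributeKey, float]]:
    """Group triples by word and attach a normalized count to each attribute.

    Every observed triple is kept, however rare, mirroring the claim that
    vectors are stored regardless of frequency.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, relation, related in triples:
        counts[word][(relation, related)] += 1

    vectors: Dict[str, Dict[AttributeKey, float]] = {}
    for word, attr_counts in counts.items():
        total = sum(attr_counts.values())
        # Relative frequency per word is the assumed normalization.
        vectors[word] = {attr: n / total for attr, n in attr_counts.items()}
    return vectors
```

A usage example under the same assumptions:

```python
triples = [
    ("drink", "obj", "tea"),
    ("drink", "obj", "tea"),
    ("drink", "obj", "coffee"),
    ("drink", "subj", "he"),
]
vectors = build_vectors(triples)
vectors["drink"][("obj", "tea")]  # 2 of the 4 triples for "drink"
```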

Patent Metadata

Filing Date: Unknown

Publication Date: March 11, 2008

Inventors: Hua Wu, Ming Zhou, Chang-Ning Huang
