Matching a Query to a Set of Sentences Using a Multidimensional Relevancy Determination

Published: October 20, 2020

Assignee: not available in the USPTO data we have

Inventors: JING ZHAI, Richard Chun-Ching Wang, Weide Zhang

Patent Claims

18 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computer-implemented method, comprising: extracting one or more tokens from a sentence received by a search server as a query, each token comprising one or more words, the one or more words forming at least a partial sentence; determining, automatically, a domain category for the sentence; determining a set of candidate sentences having the same domain category as the sentence, wherein the candidate sentences contain one or more of the tokens or one or more similar words to the tokens, wherein the one or more tokens or one or more similar words of each candidate sentence comprise words specific to the domain category, wherein the set of candidate sentences are determined from an index, and wherein at least part of the index is created from information obtained by processing each candidate sentence, the processing including extracting one or more words as tokens from each candidate sentence; determining an interrogative intent of the sentence, comprising performing a semantic analysis on the sentence; for each candidate sentence, determining an aggregate similarity score between the candidate sentence and the sentence, wherein determining the aggregate similarity score includes determining a multi-dimensional token similarity score between one or more tokens of the sentence and one or more corresponding tokens of the candidate sentence, comprising at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein tokens comprising words specific to the domain category are given a higher weighting value when determining each of the at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein determining the aggregate similarity score further includes determining an intent similarity score between the determined interrogative intent for the sentence and a determined interrogative intent for the candidate sentence, wherein extracting one or more words as token from the sentence and extracting one or more words as tokens from each candidate sentence further include: determining a position of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score includes comparing the position of the one or more tokens of the sentence with the position of the one or more corresponding tokens of the candidate sentence; and providing query results corresponding to one or more of the set of candidate sentences ranked based on the determined aggregate similarity scores of each candidate sentence.

2. The method of claim 1, wherein the sentence comprises a phrase, a clause, a partial sentence or a full sentence having at least a subject and an object.

3. The method of claim 1, wherein the similar words for each token are associated with the token in a precompiled dictionary, and wherein determining the aggregate similarity score further includes determining a confidence score between the similar words and the one or more corresponding tokens in the candidate sentence.
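
Claim 3 does not say how the precompiled dictionary is laid out or how the confidence score is computed. The following is only an illustrative sketch, assuming a hypothetical dictionary that maps each token to its similar words with a confidence in [0, 1], and assuming the per-token confidences are averaged. The names `SIMILAR_WORDS`, `match_confidence`, and `confidence_score` are invented for this example.

```python
# Hypothetical precompiled dictionary (an assumption, not from the claims):
# token -> {similar word: confidence in [0, 1]}.
SIMILAR_WORDS = {
    "buy": {"purchase": 0.9, "order": 0.6},
    "laptop": {"notebook": 0.8},
}


def match_confidence(query_token, candidate_tokens, dictionary=SIMILAR_WORDS):
    """Return 1.0 for an exact match, else the best similar-word confidence, else 0.0."""
    if query_token in candidate_tokens:
        return 1.0
    similar = dictionary.get(query_token, {})
    return max(
        (conf for word, conf in similar.items() if word in candidate_tokens),
        default=0.0,
    )


def confidence_score(query_tokens, candidate_tokens):
    """Average the per-token confidences over all query tokens (assumed aggregation)."""
    if not query_tokens:
        return 0.0
    candidate_set = set(candidate_tokens)
    total = sum(match_confidence(t, candidate_set) for t in query_tokens)
    return total / len(query_tokens)


score = confidence_score(["buy", "laptop"], ["purchase", "a", "notebook"])
```

Here both query tokens match only through the dictionary, so the score is the mean of the two stored confidences rather than 1.0.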

4. The method of claim 1, wherein extracting one or more words as tokens from the sentence and extracting one or more words as tokens from each candidate sentence further include determining a role and an importance weighting of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score further includes comparing the role and the importance weighting of the one or more tokens of the sentence with the role and the importance weighting of the one or more corresponding tokens of the candidate sentence.

5. The method of claim 4, wherein the token similarity score is calculated as a Dice similarity coefficient, a Jaccard similarity coefficient, or a Cosine similarity coefficient.
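
For reference, the three coefficients named in claim 5 can be sketched over plain token lists as follows. This is a minimal illustration using the standard textbook definitions. It does not include the role, importance, or position comparisons that claims 1 and 4 also recite, and the function names are chosen for this example only.

```python
import math
from collections import Counter


def dice(a, b):
    """Dice coefficient on token sets: 2|A ∩ B| / (|A| + |B|)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty token sets are treated as identical
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def jaccard(a, b):
    """Jaccard coefficient on token sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between token-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = ["how", "to", "reset", "password"]
candidate = ["reset", "my", "password"]
scores = (dice(query, candidate), jaccard(query, candidate), cosine(query, candidate))
```

All three return values in [0, 1], which makes them easy to combine with other normalized scores.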

6. The method of claim 1, wherein determining the edit distance score includes determining a character-level edit distance between characters of the one or more tokens of the sentence and characters of the one or more corresponding tokens of the candidate sentence.

7. The method of claim 6, wherein determining the edit distance score further includes determining a token-level edit distance between the tokens of the sentence and the tokens of the candidate sentence.

8. The method of claim 7, wherein the edit distances are calculated using one or more of a Levenshtein distance, a Longest Common Subsequence (LCS) distance, a Hamming distance, or a Jaro-Winkler distance.
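
The distances listed in claims 6 to 8 can be illustrated with a short sketch. The functions below accept any sequence, so the same code gives a character-level distance when passed strings and a token-level distance when passed lists of tokens. Jaro-Winkler is omitted for brevity, and the normalization in `edit_similarity` is an assumption of this example, since the claims do not state how a distance becomes a score.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_distance(a, b):
    """Insert/delete-only distance: len(a) + len(b) - 2 * len(LCS(a, b))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return len(a) + len(b) - 2 * prev[-1]


def hamming(a, b):
    """Number of differing positions; defined only for equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_similarity(a, b):
    """Assumed normalization: map Levenshtein distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


char_level = levenshtein("kitten", "sitting")
token_level = levenshtein(["reset", "my", "password"], ["reset", "password"])
```

In this example the character-level call compares two words letter by letter, while the token-level call treats each word as a single unit, so dropping "my" counts as one edit.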

9. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations comprising: extracting one or more tokens from a sentence received by a search server as a query, each token comprising one or more words, the one or more words forming at least a partial interrogative sentence; determining, automatically, a domain category for the sentence; determining a set of candidate sentences having the same domain category as the sentence, wherein the candidate sentences contain one or more of the tokens or one or more similar words to the tokens, wherein the one or more tokens or one or more similar words of each candidate sentence comprise words specific to the domain category, wherein the set of candidate sentences are determined from an index, and wherein at least part of the index is created from information obtained by processing each candidate sentence, the processing including extracting one or more words as tokens from each candidate sentence; determining an interrogative intent of the sentence, comprising performing a semantic analysis on the sentence; for each candidate sentence, determining an aggregate similarity score between the candidate sentence and the sentence, wherein determining the aggregate similarity score includes determining a multi-dimensional token similarity score between one or more tokens of the sentence and one or more corresponding tokens of the candidate sentence, comprising at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein tokens comprising words specific to the domain category are given a higher weighting value when determining each of the at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein determining the aggregate similarity score further includes determining an intent similarity score between the determined interrogative intent for the sentence and a determined interrogative intent for the candidate sentence, wherein extracting one or more words as token from the sentence and extracting one or more words as tokens from each candidate sentence further include: determining a position of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score includes comparing the position of the one or more tokens of the sentence with the position of the one or more corresponding tokens of the candidate sentence; and providing query results corresponding to one or more of the set of candidate sentences ranked based on the determined aggregate similarity scores of each candidate sentence.

10. The medium of claim 9, wherein the sentence comprises a phrase, a clause, a partial sentence, or a full sentence having at least a subject and an object.

11. The medium of claim 9, wherein the similar words for each token are associated with the token in a precompiled dictionary, and wherein determining the aggregate similarity score further includes determining a confidence score between the similar words and the one or more corresponding tokens in the candidate sentence.

12. The medium of claim 9, wherein extracting one or more words as tokens from the sentence and extracting one or more words as tokens from each candidate sentence further include determining a role and an importance weighting of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score further includes comparing the role and the importance weighting of the one or more tokens of the sentence with the role and the importance weighting of the one or more corresponding tokens of the candidate sentence.

13. The medium of claim 12, wherein the token similarity score is calculated as a Dice similarity coefficient, a Jaccard similarity coefficient, or a Cosine similarity coefficient.

14. The medium of claim 9, wherein determining the edit distance score includes determining a character-level edit distance between characters of the one or more tokens of the sentence and characters of the one or more corresponding tokens of the candidate sentence.

15. The medium of claim 14, wherein determining the edit distance score further includes determining a token-level edit distance between the tokens of the sentence and the tokens of the candidate sentence.

16. The medium of claim 15, wherein the edit distances are calculated using one or more of a Levenshtein distance, a Longest Common Subsequence (LCS) distance, a Hamming distance, or a Jaro-Winkler distance.

17. A system, comprising: a processor; and a memory coupled to the processor, the memory storing instructions, which when executed by the processor, cause the processor to perform operations comprising: extracting one or more tokens from a sentence received by a search server as a query, each token comprising one or more words, the one or more words forming at least a partial interrogative sentence; determining, automatically a domain category for the sentence; identifying similar words for each token of the sentence from a precompiled dictionary; determining a set of candidate sentences having the same domain category as the sentence, wherein the candidate sentences contain one or more of the tokens or one or more of the similar words, wherein the one or more tokens or one or more similar words of each candidate sentence comprise words specific to the domain category, wherein the set of candidate sentences are determined from an index, and wherein at least part of the index is created from information obtained by processing each candidate sentence, the processing including extracting one or more words as tokens from each candidate sentence; determining an interrogative intent of the sentence, comprising performing a semantic analysis on the sentence; for each candidate sentence, determining an aggregate similarity score between the candidate sentence and the sentence, wherein determining the aggregate similarity score includes determining a multi-dimensional token similarity score between one or more tokens of the sentence and one or more corresponding tokens of the candidate sentence, comprising at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein tokens comprising words specific to the domain category are given a higher weighting value when determining each of the at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein determining the aggregate similarity score further includes determining an intent similarity score between the determined interrogative intent for the sentence and a determined interrogative intent for the candidate sentence, wherein extracting one or more words as token from the sentence and extracting one or more words as tokens from each candidate sentence further include: determining a position of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score includes comparing the position of the one or more tokens of the sentence with the position of the one or more corresponding tokens of the candidate sentence; and providing query results corresponding to one or more of the set of candidate sentences ranked based on the determined aggregate similarity scores of each candidate sentence.

18. The system of claim 17, wherein determining the aggregate similarity score further includes determining one or more of an edit distance score between one or more tokens of the sentence and one or more corresponding tokens of the candidate sentence, a confidence score between one or more of the similar words and one or more corresponding tokens of the candidate sentence; and an intent similarity score between a determined intent for the sentence and a determined intent for the candidate sentence.
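
To make the overall flow of the independent claims easier to follow, here is a rough sketch of one way an aggregate score with domain weighting and ranking could look. It is not the patented implementation. The domain vocabulary, the weight values, the keyword-based intent detector (a toy stand-in for the semantic analysis the claims recite), and the linear combination are all assumptions made for this example. It combines two of the claimed dimensions (token similarity with position comparison, and intent similarity) and leaves out edit distance for brevity.

```python
# Everything below is hypothetical and chosen only for illustration.
DOMAIN_TERMS = {"password", "login"}   # assumed domain-specific vocabulary
DOMAIN_WEIGHT = 2.0                    # assumed boost for domain-specific tokens
INTERROGATIVES = {"how", "what", "why", "when", "where", "who"}


def token_weight(token):
    """Give domain-specific tokens a higher weight than other tokens."""
    return DOMAIN_WEIGHT if token in DOMAIN_TERMS else 1.0


def weighted_jaccard(q, c):
    """Jaccard coefficient in which each token counts by its weight."""
    sq, sc = set(q), set(c)
    union = sum(token_weight(t) for t in sq | sc)
    return sum(token_weight(t) for t in sq & sc) / union if union else 1.0


def position_agreement(q, c):
    """Compare relative positions (0 = first, 1 = last) of shared tokens."""
    shared = set(q) & set(c)
    if not shared:
        return 0.0
    total = 0.0
    for t in shared:
        pos_q = q.index(t) / max(len(q) - 1, 1)
        pos_c = c.index(t) / max(len(c) - 1, 1)
        total += 1.0 - abs(pos_q - pos_c)
    return total / len(shared)


def intent_of(tokens):
    """Toy intent detector: the first interrogative word, if any."""
    return next((t for t in tokens if t in INTERROGATIVES), None)


def intent_similarity(q, c):
    """1.0 when both sentences share the same detected intent, else 0.0."""
    intent = intent_of(q)
    return 1.0 if intent is not None and intent == intent_of(c) else 0.0


def aggregate_score(q, c, weights=(0.5, 0.2, 0.3)):
    """Assumed linear combination of the individual similarity dimensions."""
    w_token, w_position, w_intent = weights
    return (w_token * weighted_jaccard(q, c)
            + w_position * position_agreement(q, c)
            + w_intent * intent_similarity(q, c))


def rank(query, candidates):
    """Return (score, candidate) pairs sorted from best to worst match."""
    q = query.lower().split()
    scored = [(aggregate_score(q, c.lower().split()), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


results = rank("how to reset password", [
    "what is the weather today",
    "password reset steps",
    "how do i reset my password",
])
```

With these assumed weights, the candidate that shares both the interrogative word and the domain term in a similar position ranks first, and the unrelated candidate ranks last.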

Patent Metadata

Filing Date: Unknown

Publication Date: October 20, 2020

Inventors: JING ZHAI, Richard Chun-Ching Wang, Weide Zhang
