Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A computer-implemented method, comprising: extracting one or more tokens from a sentence received by a search server as a query, each token comprising one or more words, the one or more words forming at least a partial sentence; determining, automatically, a domain category for the sentence; determining a set of candidate sentences having the same domain category as the sentence, wherein the candidate sentences contain one or more of the tokens or one or more similar words to the tokens, wherein the one or more tokens or one or more similar words of each candidate sentence comprise words specific to the domain category, wherein the set of candidate sentences are determined from an index, and wherein at least part of the index is created from information obtained by processing each candidate sentence, the processing including extracting one or more words as tokens from each candidate sentence; determining an interrogative intent of the sentence, comprising performing a semantic analysis on the sentence; for each candidate sentence, determining an aggregate similarity score between the candidate sentence and the sentence, wherein determining the aggregate similarity score includes determining a multi-dimensional token similarity score between one or more tokens of the sentence and one or more corresponding tokens of the candidate sentence, comprising at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein tokens comprising words specific to the domain category are given a higher weighting value when determining each of the at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein determining the aggregate similarity score further includes determining an intent similarity score between the determined interrogative intent for the sentence and a determined interrogative intent for the candidate sentence, wherein extracting one or more words as token from the sentence and extracting one or more words as tokens from each candidate sentence further include: determining a position of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score includes comparing the position of the one or more tokens of the sentence with the position of the one or more corresponding tokens of the candidate sentence; and providing query results corresponding to one or more of the set of candidate sentences ranked based on the determined aggregate similarity scores of each candidate sentence.
2. The method of claim 1, wherein the sentence comprises a phrase, a clause, a partial sentence or a full sentence having at least a subject and an object.
This dependent claim narrows claim 1 by specifying what counts as the query "sentence". The input received by the search server need not be a grammatically complete sentence: it may be a phrase, a clause, a partial sentence, or a full sentence having at least a subject and an object. The remaining steps of claim 1 (token extraction, domain categorization, candidate retrieval from the index, intent analysis, similarity scoring, and ranking) apply unchanged to any of these input forms. In practice, this means that short or fragmentary queries, such as a noun phrase typed into a search box, are covered alongside fully formed questions.
3. The method of claim 1, wherein the similar words for each token are associated with the token in a precompiled dictionary, and wherein determining the aggregate similarity score further includes determining a confidence score between the similar words and the one or more corresponding tokens in the candidate sentence.
This invention relates to natural language processing (NLP) and text similarity analysis, specifically improving the accuracy of semantic matching between sentences. The problem addressed is the challenge of accurately comparing sentences with similar meanings but different word choices, which purely lexical methods often fail to capture. Building on claim 1, the method associates each token of the query sentence with a set of similar words in a precompiled dictionary. When a candidate sentence is matched through one of these similar words rather than through the token itself, the system determines a confidence score between the similar words and the corresponding tokens in the candidate sentence. This confidence score forms part of the aggregate similarity score between the query sentence and the candidate sentence, which allows a dictionary-based match to contribute according to how reliable it is rather than being treated as an exact match. By incorporating this dictionary-based approach and confidence scoring, the method improves the precision of semantic matching when the query and the candidate are worded differently.
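The dictionary lookup and confidence scoring described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary contents, the confidence values, the function name `synonym_match_score`, and the rule of averaging the best confidence per query token are all assumptions made for the example.

```python
# Hypothetical precompiled dictionary: token -> {similar word: confidence}.
# The entries and confidence values are invented for illustration.
SIMILAR_WORDS = {
    "laptop": {"notebook": 0.9, "computer": 0.7},
    "fix": {"repair": 0.95, "mend": 0.8},
}


def synonym_match_score(query_tokens, candidate_tokens, dictionary=SIMILAR_WORDS):
    """Average per-token match confidence of a query against a candidate.

    An exact match counts as 1.0. Otherwise the token earns the highest
    confidence among its dictionary similar words found in the candidate,
    or 0.0 when none of them appears.
    """
    if not query_tokens:
        return 0.0
    candidate = set(candidate_tokens)
    total = 0.0
    for token in query_tokens:
        if token in candidate:
            total += 1.0
            continue
        similar = dictionary.get(token, {})
        total += max(
            (conf for word, conf in similar.items() if word in candidate),
            default=0.0,
        )
    return total / len(query_tokens)


if __name__ == "__main__":
    query = ["fix", "laptop", "screen"]
    candidate = ["repair", "notebook", "screen"]
    print(synonym_match_score(query, candidate))
```

Here "fix" and "laptop" match only through their similar words, so they contribute their dictionary confidences, while "screen" matches exactly and contributes the full value.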
4. The method of claim 1, wherein extracting one or more words as tokens from the sentence and extracting one or more words as tokens from each candidate sentence further include determining a role and an importance weighting of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score further includes comparing the role and the importance weighting of the one or more tokens of the sentence with the role and the importance weighting of the one or more corresponding tokens of the candidate sentence.
This invention relates to natural language processing (NLP) and text similarity analysis, specifically improving the accuracy of sentence similarity scoring by incorporating token roles and importance weighting. The problem addressed is the limitation of traditional similarity measures that treat all words equally, leading to imprecise comparisons between sentences. The solution involves extracting tokens from both a reference sentence and candidate sentences, then analyzing each token's role (e.g., subject, object, modifier) and assigning an importance weight based on its syntactic or semantic significance. When comparing sentences, the system calculates a token similarity score by matching corresponding tokens while accounting for their roles and weights, ensuring that semantically or syntactically critical words contribute more to the overall similarity score. This approach enhances precision in applications like document retrieval, plagiarism detection, and semantic search by prioritizing meaningful linguistic relationships over superficial word matches. The method dynamically adjusts for contextual relevance, improving accuracy in diverse linguistic contexts.
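One way to picture the comparison of roles and importance weightings is the sketch below. It is only an illustration under stated assumptions: the `Token` class, the role labels, the numeric weights, and the scoring rule (full credit for a matching role, half credit otherwise, scaled by how close the two weights are) are hypothetical and are not taken from the claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    role: str      # hypothetical labels such as "subject", "verb", "object"
    weight: float  # importance weighting, assumed non-negative


def weighted_token_similarity(query, candidate):
    """Share of the query's total importance that the candidate covers.

    A query token is matched against candidate tokens with the same text.
    The match is scaled by a role factor (1.0 if roles agree, else 0.5)
    and by the ratio of the smaller to the larger importance weight.
    """
    total = sum(t.weight for t in query)
    if total == 0:
        return 0.0
    score = 0.0
    for q in query:
        best = 0.0
        for c in candidate:
            if q.text != c.text:
                continue
            role_factor = 1.0 if q.role == c.role else 0.5
            high = max(q.weight, c.weight)
            weight_factor = min(q.weight, c.weight) / high if high > 0 else 1.0
            best = max(best, role_factor * weight_factor)
        score += q.weight * best
    return score / total


if __name__ == "__main__":
    query = [
        Token("printer", "subject", 3.0),
        Token("print", "verb", 2.0),
        Token("not", "modifier", 1.0),
    ]
    same_roles = [Token("printer", "subject", 3.0), Token("print", "verb", 2.0)]
    other_roles = [Token("printer", "object", 3.0), Token("buy", "verb", 2.0)]
    print(weighted_token_similarity(query, same_roles))
    print(weighted_token_similarity(query, other_roles))
```

The second candidate shares the word "printer" but uses it in a different role, so it scores noticeably lower than the first, which is the effect the summary describes.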
5. The method of claim 4, wherein the token similarity score is calculated as a Dice similarity coefficient, a Jaccard similarity coefficient, or a Cosine similarity coefficient.
This invention relates to methods for calculating token similarity scores in natural language processing or text analysis systems. The problem addressed is the need for efficient and accurate similarity measurements between tokens or text segments, which is crucial for tasks such as text classification, information retrieval, and machine translation. The method involves computing a similarity score between two tokens or text segments using a statistical similarity coefficient. Specifically, the similarity score is calculated using one of three well-known coefficients: the Dice similarity coefficient, the Jaccard similarity coefficient, or the Cosine similarity coefficient. These coefficients measure the overlap or similarity between sets or vectors representing the tokens, providing a numerical value that quantifies how similar the tokens are. The Dice coefficient is defined as twice the size of the intersection of the two sets divided by the sum of the sizes of the sets. The Jaccard coefficient is the size of the intersection divided by the size of the union of the sets. The Cosine coefficient measures the cosine of the angle between two vectors in a multi-dimensional space, often used in text analysis where tokens are represented as vectors. This approach allows for flexible and computationally efficient similarity comparisons, which can be applied in various natural language processing applications where token similarity is a key factor. The use of these coefficients ensures robustness and adaptability to different types of text data.
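The three coefficients defined above can be written directly from their definitions. The sketch below treats each sentence as a set of tokens for Dice and Jaccard and as a token-count vector for Cosine; the function names and the convention of returning 1.0 for two empty sets are assumptions of this example, and the role and importance weighting of claim 4 is left out for brevity.

```python
import math
from collections import Counter


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|) over token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def jaccard(a, b):
    """Jaccard coefficient: |A & B| / |A | B| over token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine of the angle between the two token-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = ["how", "reset", "password"]
    candidate = ["how", "change", "password"]
    print(dice(query, candidate))     # 2*2 / (3+3)
    print(jaccard(query, candidate))  # 2 / 4
    print(cosine(query, candidate))   # 2 / (sqrt(3) * sqrt(3))
```

For the same pair of sentences, Jaccard is never larger than Dice, so thresholds tuned for one coefficient do not carry over to the other.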
6. The method of claim 1, wherein determining the edit distance score includes determining a character-level edit distance between characters of the one or more tokens of the sentence and characters of the one or more corresponding tokens of the candidate sentence.
This invention relates to natural language processing and text similarity analysis, specifically improving the accuracy of sentence comparison by incorporating character-level edit distance metrics. The problem addressed is the limitation of traditional token-based similarity measures, which may overlook subtle differences in character sequences that affect meaning or correctness. The solution involves enhancing sentence similarity assessment by calculating a character-level edit distance between tokens in the original and candidate sentences. This edit distance quantifies the number of character insertions, deletions, or substitutions required to transform one token into another, providing finer-grained similarity metrics. The method integrates this character-level analysis with higher-level token comparisons to produce a more precise similarity score. This approach is particularly useful in applications like text correction, plagiarism detection, and machine translation evaluation, where character-level discrepancies can significantly impact results. By combining token and character-level analysis, the invention improves the robustness of text similarity assessments, reducing false positives and negatives in automated text processing systems.
7. The method of claim 6, wherein determining the edit distance score further includes determining a token-level edit distance between the tokens of the sentence and the tokens of the candidate sentence.
This invention relates to natural language processing and text similarity analysis, specifically improving the accuracy of edit distance calculations between sentences. The problem addressed is the need for more precise text comparison methods that account for token-level differences, which a purely character-level edit distance does not capture. The invention enhances text similarity assessment by incorporating a token-level edit distance calculation between the tokens of a sentence and the tokens of a candidate sentence. This involves treating whole words or multi-word tokens as the units of comparison rather than individual characters, so that an inserted, deleted, or replaced token counts as a single edit. The method builds on claims 1 and 6: candidate sentences are retrieved from an index, a character-level edit distance is determined between corresponding tokens, and the edit distance score is then extended with the token-level edit distance over the two token sequences. By considering token-level discrepancies, the invention improves the detection of meaningful differences in sentence structure, vocabulary, or phrasing. The token-level edit distance complements the character-level edit distance, so that both coarse and fine-grained textual variations are measured. This refinement leads to more reliable text similarity assessments, benefiting systems that rely on precise linguistic comparisons.
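The difference between character-level and token-level edit distance can be shown with a single routine, because the same dynamic program works on any sequence. The sketch below uses Levenshtein distance as an assumed choice of metric; the helper names `edit_distance` and `edit_score`, and the normalization by the longer sequence length, are illustrative and not taken from the claims.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences.

    Works on strings (character level) and on lists of tokens (token level),
    counting insertions, deletions, and substitutions of sequence elements.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def edit_score(a, b):
    """Map a distance to a similarity in [0, 1] by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    query = "how do i reset my pasword".split()
    candidate = "how do i reset my password".split()

    # Token level: "pasword" vs "password" is one whole-token substitution.
    print(edit_distance(query, candidate), edit_score(query, candidate))

    # Character level on the corresponding tokens: one missing letter.
    print(edit_distance("pasword", "password"), edit_score("pasword", "password"))
```

At the token level the misspelled word counts as a full mismatch, while the character-level comparison of the corresponding tokens shows they differ by a single letter, which is why the two levels are useful together.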
8. The method of claim 7, wherein the edit distances are calculated using one or more of a Levenshtein distance, a Longest Common Subsequence (LCS) distance, a Hamming distance, or a Jaro-Winkler distance.
This invention relates to methods for comparing text sequences, here the characters and tokens of a query sentence and a candidate sentence, to determine their similarity or dissimilarity. The problem addressed is the need for an efficient and accurate measure of how far apart two such sequences are. The claim specifies which metrics may be used to calculate the character-level and token-level edit distances of claim 7, naming four well-known options, any one or more of which may be used: the Levenshtein distance, the Longest Common Subsequence (LCS) distance, the Hamming distance, and the Jaro-Winkler distance. Each metric has distinct properties: the Levenshtein distance counts the minimum number of insertions, deletions, and substitutions needed to transform one sequence into the other; the LCS distance is based on the longest subsequence shared by both sequences and in effect allows only insertions and deletions; the Hamming distance counts the positions at which two sequences of equal length differ; and the Jaro-Winkler distance is based on matching characters and transpositions, with extra weight given to a shared prefix, making it useful for short strings. By allowing these metrics, the method provides flexibility in choosing the most appropriate distance measure for the data at hand, for example a Levenshtein distance for typing errors or a Jaro-Winkler distance for short names.
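Two of the named metrics are short enough to sketch here. The example below implements the LCS distance, using the common definition len(a) + len(b) - 2 * LCS(a, b), and the Hamming distance; Levenshtein and Jaro-Winkler are omitted for brevity. The function names and the choice of that LCS distance definition are assumptions of this illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a, b):
    """Edits needed when only insertions and deletions are allowed."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(lcs_distance("password", "pasword"))  # one deleted character
    print(hamming("toner", "tuner"))            # one differing position
```

Because the Hamming distance is undefined for sequences of different lengths, a scoring pipeline that uses it would need a fallback metric for most real token pairs.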
9. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations comprising: extracting one or more tokens from a sentence received by a search server as a query, each token comprising one or more words, the one or more words forming at least a partial interrogative sentence; determining, automatically, a domain category for the sentence; determining a set of candidate sentences having the same domain category as the sentence, wherein the candidate sentences contain one or more of the tokens or one or more similar words to the tokens, wherein the one or more tokens or one or more similar words of each candidate sentence comprise words specific to the domain category, wherein the set of candidate sentences are determined from an index, and wherein at least part of the index is created from information obtained by processing each candidate sentence, the processing including extracting one or more words as tokens from each candidate sentence; determining an interrogative intent of the sentence, comprising performing a semantic analysis on the sentence; for each candidate sentence, determining an aggregate similarity score between the candidate sentence and the sentence, wherein determining the aggregate similarity score includes determining a multi-dimensional token similarity score between one or more tokens of the sentence and one or more corresponding tokens of the candidate sentence, comprising at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein tokens comprising words specific to the domain category are given a higher weighting value when determining each of the at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein determining the aggregate similarity score further includes determining an intent similarity score between the determined interrogative intent for the sentence and a determined interrogative intent for the candidate sentence, wherein extracting one or more words as token from the sentence and extracting one or more words as tokens from each candidate sentence further include: determining a position of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score includes comparing the position of the one or more tokens of the sentence with the position of the one or more corresponding tokens of the candidate sentence; and providing query results corresponding to one or more of the set of candidate sentences ranked based on the determined aggregate similarity scores of each candidate sentence.
This invention relates to a search system that processes natural language queries, particularly interrogative sentences, to improve search accuracy by leveraging domain-specific terminology and semantic analysis. The system extracts tokens (words or phrases) from a user's query, which forms at least part of an interrogative sentence, and automatically determines the domain category of the query. It then retrieves candidate sentences from an index that share the same domain category and contain tokens or similar words from the query, where those matching words are specific to the domain category. The system performs semantic analysis to determine the interrogative intent of the query, and compares it with the interrogative intent determined for each candidate sentence. For each candidate, it calculates an aggregate similarity score that includes a multi-dimensional score built from at least two of an edit distance score, a token similarity score, and an intent similarity score, with token positions compared as part of the token similarity score. Tokens made up of domain-specific words are given a higher weighting in each of these scores. The candidate sentences are ranked based on their aggregate scores, and the corresponding results are returned as query responses. At least part of the index is built in advance by processing the candidate sentences to extract their tokens, which supports efficient retrieval of relevant candidates. This approach enhances search precision by prioritizing domain-relevant terms and semantic context.
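The overall flow described above (tokenize, pick a domain, fetch candidates, score, rank) can be sketched end to end. Everything in this example is a simplified stand-in chosen for illustration: the domain term lists, the in-memory `INDEX`, the keyword-based `intent` function in place of real semantic analysis, the domain-boosted Jaccard score, and the 0.7/0.3 weights are all assumptions, and token positions and edit distance are left out to keep the sketch short.

```python
# Hypothetical domain vocabularies and a tiny in-memory index.
DOMAIN_TERMS = {
    "printing": {"printer", "toner", "cartridge"},
    "billing": {"invoice", "refund", "charge"},
}
INDEX = {
    "printing": [
        "how do i replace the toner cartridge",
        "why is my printer offline",
        "how do i connect my printer to wifi",
    ],
    "billing": ["how do i request a refund"],
}
QUESTION_WORDS = {"how", "why", "what", "where", "when", "who"}


def tokenize(sentence):
    return sentence.lower().split()


def classify_domain(tokens):
    """Pick the domain whose vocabulary overlaps the query the most."""
    return max(DOMAIN_TERMS, key=lambda d: len(DOMAIN_TERMS[d] & set(tokens)))


def intent(tokens):
    """Crude stand-in for semantic analysis: the first question word."""
    return next((t for t in tokens if t in QUESTION_WORDS), "unknown")


def weighted_jaccard(query, candidate, domain_terms, boost=2.0):
    """Jaccard overlap in which domain-specific tokens count `boost` times."""
    def weight(token):
        return boost if token in domain_terms else 1.0

    sq, sc = set(query), set(candidate)
    union = sum(weight(t) for t in sq | sc)
    return sum(weight(t) for t in sq & sc) / union if union else 0.0


def rank(query_sentence, w_token=0.7, w_intent=0.3):
    """Return (score, candidate) pairs for the query's domain, best first."""
    query = tokenize(query_sentence)
    domain = classify_domain(query)
    scored = []
    for sentence in INDEX[domain]:
        candidate = tokenize(sentence)
        token_score = weighted_jaccard(query, candidate, DOMAIN_TERMS[domain])
        intent_score = 1.0 if intent(query) == intent(candidate) else 0.0
        scored.append((w_token * token_score + w_intent * intent_score, sentence))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, sentence in rank("how do i connect printer to wifi"):
        print(round(score, 3), sentence)
```

In this toy run the candidate about connecting a printer to wifi ranks first, and the "why" question ranks last because it shares only one token with the query and has a different interrogative intent.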
10. The medium of claim 9, wherein the sentence comprises a phrase, a clause, a partial sentence, or a full sentence having at least a subject and an object.
This dependent claim narrows the machine-readable medium of claim 9 by specifying what counts as the query "sentence". The input received by the search server need not be a grammatically complete sentence: it may be a phrase, a clause, a partial sentence, or a full sentence having at least a subject and an object. The claim does not concern sentence segmentation or boundary detection; it only defines the forms the query may take. The remaining operations of claim 9 (token extraction, domain categorization, candidate retrieval from the index, intent analysis, similarity scoring, and ranking) apply unchanged to any of these input forms, so short or fragmentary queries are covered alongside fully formed questions.
11. The medium of claim 9, wherein the similar words for each token are associated with the token in a precompiled dictionary, and wherein determining the aggregate similarity score further includes determining a confidence score between the similar words and the one or more corresponding tokens in the candidate sentence.
This invention relates to natural language processing (NLP) and text similarity analysis, specifically improving the accuracy of semantic matching between sentences. The problem addressed is the challenge of accurately comparing sentences with similar meanings but different word choices, which traditional lexical or syntactic methods often fail to capture. The invention involves computing an aggregate similarity score between the query sentence and a candidate sentence by analyzing the tokens in both sentences. For each token of the query sentence, similar words are looked up in a precompiled dictionary that maps each token to its similar words, enabling efficient lookup during comparison. A confidence score is then calculated between these similar words and the corresponding tokens in the candidate sentence, reflecting how closely a similar word from the dictionary corresponds to the token it is matched against. The aggregate similarity score takes these confidence scores into account, providing a more nuanced measure of semantic similarity. By incorporating these scores, the method improves the robustness of similarity assessments, particularly for candidate sentences that paraphrase the query. This approach benefits search and question answering by capturing semantic relationships beyond exact word matches.
12. The medium of claim 9, wherein extracting one or more words as tokens from the sentence and extracting one or more words as tokens from each candidate sentence further include determining a role and an importance weighting of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score further includes comparing the role and the importance weighting of the one or more tokens of the sentence with the role and the importance weighting of the one or more corresponding tokens of the candidate sentence.
This invention relates to natural language processing (NLP) and information retrieval, specifically improving the accuracy of sentence similarity matching by incorporating token roles and importance weighting. The problem addressed is the limitation of traditional similarity scoring methods that treat all words equally, leading to inaccurate comparisons between sentences with similar but contextually different meanings. The system extracts tokens from a source sentence and candidate sentences, then assigns each token a role (e.g., subject, object, modifier) and an importance weight based on its syntactic and semantic contribution. The similarity score between sentences is computed by comparing not just the tokens themselves but also their roles and importance weights. For example, a verb in a sentence may be weighted higher than an adjective, and a subject noun may be more critical than a preposition. By aligning tokens with matching roles and similar importance, the system achieves more precise semantic matching. This approach enhances applications like duplicate detection, question answering, and document retrieval by reducing false positives where structurally similar but semantically different sentences are incorrectly matched. The method ensures that contextually significant words influence the similarity score more than less important ones, improving overall accuracy.
13. The medium of claim 12, wherein the token similarity score is calculated as a Dice similarity coefficient, a Jaccard similarity coefficient, or a Cosine similarity coefficient.
This invention relates to systems and methods for calculating token similarity scores in natural language processing (NLP) or text analysis applications. The technology addresses the challenge of accurately measuring the similarity between tokens (e.g., words, phrases, or subword units) in text data, which is essential for tasks like text classification, information retrieval, and machine translation. The invention improves upon existing similarity metrics by providing a flexible framework that supports multiple similarity coefficients, including the Dice similarity coefficient, Jaccard similarity coefficient, and Cosine similarity coefficient. These coefficients are well-suited for comparing token sets or vectors, offering different strengths depending on the application. The Dice coefficient emphasizes shared elements, the Jaccard coefficient normalizes by the union of sets, and the Cosine coefficient measures the angle between vectors, making them adaptable to various text analysis scenarios. The invention enhances computational efficiency and accuracy by allowing the selection of the most appropriate similarity metric for a given task, improving the performance of NLP models and text processing systems.
14. The medium of claim 9, wherein determining the edit distance score includes determining a character-level edit distance between characters of the one or more tokens of the sentence and characters of the one or more corresponding tokens of the candidate sentence.
This invention relates to natural language processing (NLP) and text similarity analysis, specifically improving the accuracy of sentence similarity scoring by incorporating character-level edit distance metrics. The problem addressed is the limitation of traditional token-based similarity measures, which may overlook subtle differences in character sequences that affect meaning or correctness. The solution involves computing a character-level edit distance between tokens of a reference sentence and corresponding tokens of a candidate sentence, then integrating this metric into an overall edit distance score. This approach enhances precision by capturing fine-grained variations, such as typos, inflectional differences, or formatting discrepancies, that token-level methods might miss. The edit distance score is derived by comparing individual characters within tokens, allowing for operations like insertion, deletion, or substitution at the character level. This method is particularly useful in applications requiring high-fidelity text comparison, such as plagiarism detection, machine translation evaluation, or automated proofreading. By combining character-level granularity with broader token-based analysis, the system achieves more nuanced and reliable similarity assessments. The invention may be implemented in software or hardware systems processing textual data, where accurate similarity scoring is critical.
15. The medium of claim 14, wherein determining the edit distance score further includes determining a token-level edit distance between the tokens of the sentence and the tokens of the candidate sentence.
This invention relates to natural language processing and text similarity analysis, specifically improving the accuracy of edit distance calculations between sentences. The problem addressed is the need for more precise text comparison methods that account for token-level differences, which traditional edit distance algorithms often overlook. The invention enhances text similarity assessment by incorporating token-level edit distance into the scoring process. This involves analyzing individual tokens (words or subwords) within sentences to compute a refined edit distance score. The method compares a sentence and a candidate sentence by breaking them into tokens, then calculating the token-level edit distance, which measures the number of insertions, deletions, or substitutions required to transform one token sequence into another. This token-level analysis is integrated into the overall edit distance score, providing a more nuanced evaluation of textual similarity. The approach improves applications like plagiarism detection, machine translation evaluation, and text correction systems by capturing finer-grained differences between sentences. The invention ensures that the edit distance score accurately reflects the structural and semantic variations between texts, enhancing the reliability of text comparison tasks.
16. The medium of claim 15, wherein the edit distances are calculated using one or more of a Levenshtein distance, a Longest Common Subsequence (LCS) distance, a Hamming distance, or a Jaro-Winkler distance.
This invention relates to analyzing and comparing text data using established distance metrics to improve accuracy in identifying similarities or differences between text strings. It addresses the problem of accurately measuring textual similarity between a query sentence and candidate sentences, where a single distance metric may not capture all kinds of textual variation. The edit distances of claim 15 are calculated using one or more of four distance metrics: Levenshtein distance, Longest Common Subsequence (LCS) distance, Hamming distance, and Jaro-Winkler distance. Each metric provides a different perspective on textual similarity: Levenshtein distance measures the minimum number of single-element insertions, deletions, or substitutions required to change one sequence into another, LCS distance is based on the longest subsequence common to both sequences, Hamming distance counts the number of positions at which corresponding elements of two equal-length sequences differ, and Jaro-Winkler distance emphasizes matching prefixes, making it useful for short strings. By supporting these metrics, alone or together, the claimed operations make the text comparison more robust to different kinds of variation. This approach improves the reliability of the edit distance score that feeds into the aggregate similarity score.
17. A system, comprising: a processor; and a memory coupled to the processor, the memory storing instructions, which when executed by the processor, cause the processor to perform operations comprising: extracting one or more tokens from a sentence received by a search server as a query, each token comprising one or more words, the one or more words forming at least a partial interrogative sentence; determining, automatically a domain category for the sentence; identifying similar words for each token of the sentence from a precompiled dictionary; determining a set of candidate sentences having the same domain category as the sentence, wherein the candidate sentences contain one or more of the tokens or one or more of the similar words, wherein the one or more tokens or one or more similar words of each candidate sentence comprise words specific to the domain category, wherein the set of candidate sentences are determined from an index, and wherein at least part of the index is created from information obtained by processing each candidate sentence, the processing including extracting one or more words as tokens from each candidate sentence; determining an interrogative intent of the sentence, comprising performing a semantic analysis on the sentence; for each candidate sentence, determining an aggregate similarity score between the candidate sentence and the sentence, wherein determining the aggregate similarity score includes determining a multi-dimensional token similarity score between one or more tokens of the sentence and one or more corresponding tokens of the candidate sentence, comprising at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein tokens comprising words specific to the domain category are given a higher weighting value when determining each of the at least two of: an edit distance score, a token similarity score, or an intent similarity score, wherein determining the aggregate similarity score further includes determining an intent similarity score between the determined interrogative intent for the sentence and a determined interrogative intent for the candidate sentence, wherein extracting one or more words as token from the sentence and extracting one or more words as tokens from each candidate sentence further include: determining a position of each token within the sentence and the candidate sentence respectively, and wherein determining the token similarity score includes comparing the position of the one or more tokens of the sentence with the position of the one or more corresponding tokens of the candidate sentence; and providing query results corresponding to one or more of the set of candidate sentences ranked based on the determined aggregate similarity scores of each candidate sentence.
This claim covers a system, made up of a processor and a memory storing instructions, that processes search queries phrased as sentences. The system extracts tokens from a query sentence received by a search server, where each token consists of one or more words forming at least part of an interrogative sentence, and it determines the position of each token within the sentence. It automatically determines the domain category of the query and identifies similar words for each token using a precompiled dictionary. The system then retrieves candidate sentences from an index: each candidate shares the query's domain category and contains one or more of the query tokens or their similar words, and those matching words are specific to the domain category. At least part of the index is built by processing each candidate sentence to extract its tokens. The system performs semantic analysis to determine the interrogative intent of the query. For each candidate sentence, it calculates an aggregate similarity score that includes a multi-dimensional token similarity score made up of at least two of three components (edit distance, token similarity, and intent similarity), with domain-specific tokens given a higher weight in each component used. The aggregate score also includes an intent similarity score comparing the interrogative intent of the query with that of the candidate, and the token similarity score compares the positions of the query tokens with the positions of the corresponding candidate tokens. Finally, the system provides query results corresponding to the candidate sentences, ranked by aggregate similarity score. The approach aims to improve search accuracy by using domain-specific context and semantic analysis to match interrogative queries with relevant results.
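The candidate-retrieval step can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `SIMILAR_WORDS`, `DOMAIN_TERMS`, `tokenize`, `build_index`, and `retrieve` are hypothetical, the dictionary and vocabulary contents are made-up sample data, tokens are simplified to single words, and the domain category is passed in by hand rather than determined automatically.

```python
from collections import defaultdict

# Hypothetical data for illustration; none of these names come from the patent.
SIMILAR_WORDS = {"reset": {"restore", "recover"}, "password": {"passcode"}}
DOMAIN_TERMS = {
    "account": {"password", "passcode", "reset", "restore", "recover", "email"},
    "billing": {"billing", "invoice", "refund"},
}


def tokenize(sentence):
    """Lowercase the sentence, drop trailing punctuation, split into one-word tokens."""
    return sentence.lower().rstrip("?.!").split()


def build_index(candidates):
    """Map (domain, token) to the ids of candidates containing that domain-specific token."""
    index = defaultdict(set)
    for cand_id, (domain, sentence) in enumerate(candidates):
        for token in tokenize(sentence):
            if token in DOMAIN_TERMS.get(domain, set()):
                index[(domain, token)].add(cand_id)
    return index


def retrieve(index, domain, query):
    """Return ids of same-domain candidates that contain a query token or a similar word."""
    hits = set()
    for token in tokenize(query):
        for word in {token} | SIMILAR_WORDS.get(token, set()):
            hits |= index.get((domain, word), set())
    return hits


candidates = [
    ("account", "How can I recover my passcode?"),
    ("account", "Where do I change my email address?"),
    ("billing", "How do I request a refund?"),
]
index = build_index(candidates)
print(sorted(retrieve(index, "account", "How do I reset my password?")))  # [0]
```

In this sketch the first candidate is retrieved only through the similar words "recover" and "passcode", which mirrors the claim's allowance for candidates that contain similar words rather than the query tokens themselves. Indexing only domain-specific words is one possible reading of the claim language, chosen here for brevity.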
18. The system of claim 17, wherein determining the aggregate similarity score further includes determining one or more of an edit distance score between one or more tokens of the sentence and one or more corresponding tokens of the candidate sentence, a confidence score between one or more of the similar words and one or more corresponding tokens of the candidate sentence; and an intent similarity score between a determined intent for the sentence and a determined intent for the candidate sentence.
This dependent claim narrows the system of claim 17 by specifying what the aggregate similarity score between the query sentence and a candidate sentence can include. The score includes one or more of three components. The first is an edit distance score between tokens of the query sentence and the corresponding tokens of the candidate sentence; edit distance measures the number of changes (insertions, deletions, or substitutions) needed to transform one sequence into the other. The second is a confidence score between the similar words identified for the query tokens and the corresponding tokens of the candidate sentence, which assesses how likely it is that those words have the same meaning. The third is an intent similarity score that compares the intent determined for the query with the intent determined for the candidate, so that sentences asking the same thing in different words can still be matched. Combining these scores gives a more robust similarity assessment than methods that rely solely on lexical or syntactic matching, which benefits applications that depend on sentence similarity, such as information retrieval, question answering, chatbots, and automated content analysis.
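The way these components might combine can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method the patent prescribes: the function names are hypothetical, the edit distance is a standard Levenshtein distance computed over whole tokens, the confidence and intent scores are taken as precomputed inputs in the range 0 to 1, and the weights in `aggregate_score` are arbitrary example values. The claim requires only one or more of the components; the sketch uses all three.

```python
def token_edit_distance(a, b):
    """Levenshtein distance over token lists (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_distance_score(a, b):
    """Normalize the distance into [0, 1], where 1.0 means identical token lists."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - token_edit_distance(a, b) / longest


def aggregate_score(edit_score, confidence_score, intent_score,
                    weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the component scores; the weights are illustrative assumptions."""
    w_edit, w_conf, w_intent = weights
    return w_edit * edit_score + w_conf * confidence_score + w_intent * intent_score


query = ["how", "do", "i", "reset", "my", "password"]
candidate = ["how", "can", "i", "recover", "my", "passcode"]
print(token_edit_distance(query, candidate))  # 3
# Confidence (0.8) and intent (1.0) are made-up inputs standing in for real models.
print(aggregate_score(edit_distance_score(query, candidate), 0.8, 1.0))
```

A weighted sum is only one way to aggregate; the claim does not say how the components are combined, so a product, a maximum, or a learned model would fit the claim language equally well.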
October 20, 2020