Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A system comprising: one or more processors; and a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to: create, by a database system, a first trie by using values stored in a first field by multiple records, a second trie by using values stored in a second field by the multiple records, and a third trie by using values stored in a third field by the multiple records; associate, by the database system, a node in the third trie with a record of the multiple records by using a value stored in the third field by the record; associate, by the database system, the node with a first dispersion measure and a second dispersion measure, the first dispersion measure being based on values stored in the first field by records associated with the node and second dispersion measure being based on values stored in the second field by records associated with the node; identify, by the database system, using a prospective value stored in the third field by a prospective record, a branch sequence in the third trie as a match key for the prospective record; identify, by the database system, using the match key for the prospective record, a subset of the multiple records, which match the prospective record; determine, by the database system whether the first dispersion measure is greater than the second dispersion measure, in response to a determination that a count of records in the subset exceeds a threshold, identify, by the database system, another branch sequence in the first trie as another match key for a prospective record, in response to a determination that the first dispersion measure is greater than the second dispersion measure; and identify, by the database system, using the match key and the other match key for the prospective record, at least one record of the subset as a multi-key match with the prospective record.
The system improves database search efficiency by using multiple trie data structures to index and retrieve records based on field values. The problem addressed is the inefficiency of traditional single-key searches in large datasets, which can return excessive matches or miss relevant records. The system creates three tries from values in three different fields of records. Each node in the third trie is linked to records sharing the same value in the third field and is annotated with two dispersion measures. The first dispersion measure quantifies the variability of values in the first field among records linked to the node, while the second dispersion measure quantifies the variability of values in the second field among the same records. When a new record is processed, the system uses its third field value to locate a branch sequence in the third trie, which serves as a match key to retrieve a subset of records. If the subset exceeds a threshold size, the system compares the dispersion measures. If the first dispersion measure is higher, indicating greater variability in the first field, the system also uses a branch sequence from the first trie as an additional match key. The combined keys refine the search, identifying records that match the new record across multiple fields, improving accuracy and reducing false positives. This approach optimizes multi-field record matching in large datasets.
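The selection logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: match keys are simplified to whole field values rather than trie branch sequences, dispersion is taken to be the share of distinct values (the claim does not define the measure), and the field names `company`, `city`, and `title` are hypothetical stand-ins for the third, first, and second fields. Falling back to the second field when the first is not more dispersed is also an assumption; the claim only recites the case where the first measure is greater.

```python
def dispersion(values):
    # Assumption: dispersion = share of distinct values among the records.
    return len(set(values)) / len(values)


def multi_key_match(records, prospective, threshold):
    # Step 1: single-key match on the third field (here: "company").
    subset = [r for r in records if r["company"] == prospective["company"]]
    if len(subset) <= threshold:
        return subset  # few enough candidates, no second key needed

    # Step 2: compare how varied the first and second fields are within the subset.
    d_first = dispersion([r["city"] for r in subset])
    d_second = dispersion([r["title"] for r in subset])

    # Step 3: add the more dispersed field as the other match key.
    # (Using "title" in the else case is an assumption, see the lead-in.)
    other_key = "city" if d_first > d_second else "title"
    return [r for r in subset if r[other_key] == prospective[other_key]]


records = [
    {"id": 0, "city": "Austin", "title": "Engineer", "company": "acme"},
    {"id": 1, "city": "Boston", "title": "Engineer", "company": "acme"},
    {"id": 2, "city": "Chicago", "title": "Engineer", "company": "acme"},
    {"id": 3, "city": "Austin", "title": "Manager", "company": "acme"},
    {"id": 4, "city": "Austin", "title": "Engineer", "company": "globex"},
]
prospective = {"city": "Austin", "title": "Engineer", "company": "acme"}

matches = multi_key_match(records, prospective, threshold=2)
match_ids = [r["id"] for r in matches]
print(match_ids)  # [0, 3]
```

In this toy data the city values vary more than the title values within the "acme" subset, so city is the field that narrows the four candidates down to two.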
2. The system of claim 1, wherein creating the third trie comprises: tokenizing, by the database system, the values stored in the third field by the multiple records; and creating, by the database system, the third trie from the tokenized values, each branch in the third trie labeled with one of the tokenized values, each node storing a count indicating a number of the multiple records associated with a tokenized value sequence beginning from a root of the third trie.
A database system processes records containing multiple fields, where each record includes a third field storing values that need to be efficiently searched and analyzed. The system creates a third trie data structure to optimize these operations. The process involves tokenizing the values stored in the third field across multiple records, breaking them into individual tokens. The system then constructs the third trie from these tokenized values, where each branch is labeled with one of the tokens. Each node in the trie stores a count representing the number of records associated with a specific sequence of tokens starting from the root of the trie. This structure allows for fast prefix-based searches and frequency analysis of token sequences within the third field. The trie enables efficient retrieval of records based on partial or complete token sequences, improving search performance and enabling statistical analysis of the data. The system may also create additional tries for other fields, each optimized for their respective data, ensuring comprehensive and scalable data processing capabilities.
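The trie construction described above can be sketched as follows. This is an illustrative example only: the whitespace tokenizer, the `TrieNode` class, and the sample company names are assumptions, since the claim does not prescribe a tokenizer or a node layout.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    count: int = 0  # number of records whose token sequence reaches this node
    children: dict = field(default_factory=dict)  # branch label (token) -> child node


def tokenize(value):
    # Assumption: lowercase, whitespace-separated tokens.
    return value.lower().split()


def build_trie(values):
    root = TrieNode()
    for value in values:
        node = root
        node.count += 1  # the root counts every record
        for token in tokenize(value):
            # Each branch is labeled with one token; create it on first use.
            node = node.children.setdefault(token, TrieNode())
            node.count += 1
    return root


companies = ["Acme Corp", "Acme Corp", "Acme Labs", "Globex Inc"]
trie = build_trie(companies)

print(trie.children["acme"].count)                   # 3
print(trie.children["acme"].children["corp"].count)  # 2
```

Each count answers the question "how many records start with this token sequence?", which is what later claims use to decide how deep a match key should go.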
3. The system of claim 1, wherein associating the node in the third trie with the record comprises: tokenizing, by the database system, the value stored in the third field by the record; identifying, by the database system, each node, beginning from a root of the third trie, corresponding to a token value sequence associated with the tokenized value, until a node is identified that stores a count less than a node threshold; identifying, by the database system, a branch sequence comprising each identified node as a match key for the record; and associating, by the database system, the match key with the node, and the record with the match key.
This invention relates to database systems that use trie data structures for efficient record association and retrieval. The problem addressed is the need for scalable and accurate methods to associate records with nodes in a trie, particularly when dealing with large datasets and complex value fields. The system improves upon prior art by dynamically adjusting node associations based on tokenized field values and node counts. The system tokenizes a value stored in a record's field and traverses a trie structure starting from its root. Each node in the trie corresponds to a token value sequence derived from the tokenized field value. The traversal continues until a node is found where the stored count is below a predefined threshold. The sequence of nodes encountered during this traversal forms a branch sequence, which serves as a match key for the record. The match key is then associated with the identified node, and the record is linked to this match key. This approach ensures that records are efficiently mapped to trie nodes while maintaining performance and accuracy, even as the dataset grows. The node threshold dynamically controls the granularity of associations, allowing the system to balance between precision and computational overhead. This method is particularly useful in applications requiring fast lookups and updates in large-scale databases.
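The traversal described above can be sketched as follows. This is a minimal example under assumptions: the dictionary-based node layout, the whitespace tokenizer, the sample names, and the choice to stop when a token is missing from the trie are all illustrative and not taken from the claim.

```python
def build_trie(values):
    root = {"count": 0, "children": {}}
    for value in values:
        node = root
        node["count"] += 1
        for token in value.lower().split():
            node = node["children"].setdefault(token, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def match_key(root, value, node_threshold):
    """Follow the value's tokens from the root and stop at the first node
    whose count is below node_threshold; the path walked is the match key."""
    node, branch = root, []
    for token in value.lower().split():
        child = node["children"].get(token)
        if child is None:  # assumption: stop if the path is not in the trie
            break
        branch.append(token)
        node = child
        if node["count"] < node_threshold:
            break  # this node is specific enough; it ends the branch sequence
    return tuple(branch)


names = ["john smith", "john smith", "john smythe", "john doe", "jane doe"]
root = build_trie(names)

# Associate each record (by index) with its match key.
key_to_records = {}
for record_id, name in enumerate(names):
    key = match_key(root, name, node_threshold=3)
    key_to_records.setdefault(key, []).append(record_id)

print(key_to_records[("john", "smith")])  # [0, 1]
print(key_to_records[("jane",)])          # [4]
```

Because "john" is shared by four records (at or above the threshold of 3), its keys extend to a second token, while "jane" appears only once and therefore stops after one token.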
4. The system of claim 1, wherein the values stored in the first field by records associated with the node in the third trie are associated with a node in the first trie, the node in the first trie being at a same node depth as the node in the third trie, and the values stored in the second field by the records associated with the node in the third trie are associated with a node in the second trie, the node in the second trie being at the same node depth as the node in the third trie.
This claim narrows the system of claim 1 by specifying how the three tries line up with one another. The records associated with a given node in the third trie also store values in the first and second fields. The first-field values of those records are associated with a node in the first trie, and their second-field values are associated with a node in the second trie. Both of those nodes sit at the same node depth as the node in the third trie. Keeping the depths aligned means that, for any node in the third trie, the system can refer to nodes at the corresponding level of the first and second tries when it works with the first-field and second-field values of the same group of records. The claim itself does not state the purpose of the alignment; it is presumably what allows the dispersion measures of claim 1 to be computed and compared at a consistent level of detail.
5. The system of claim 1, wherein identifying the branch sequence as the match key for the prospective record comprises: tokenizing, by the database system, the prospective value stored in the third field by the prospective record; identifying, by the database system, each node, beginning from a root of the third trie, corresponding to a token value sequence associated with the tokenized prospective value, until a node is identified that stores a count that is less than a node threshold; and identifying, by the database system, a match key associated with the identified node as the match key for the prospective record.
This claim narrows the system of claim 1 by specifying how the match key for a prospective record is found. The system tokenizes the prospective value stored in the third field of the prospective record. It then walks the third trie from the root, following the nodes that correspond to the token sequence, until it reaches a node whose stored count is less than a node threshold. The match key already associated with that node is used as the match key for the prospective record. Because this is the same traversal rule used when existing records were associated with nodes, a prospective record lands on the same node, and therefore the same match key, as existing records that share its leading tokens. A count below the threshold indicates that relatively few records share the token sequence, so the key is specific enough to retrieve a limited set of candidates.
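The lookup for a prospective record can be sketched as follows. The example is illustrative only: the hand-built `TRIE`, the `POSTINGS` table mapping match keys to record ids, and the decision to return `None` when a token is missing are assumptions introduced for this sketch.

```python
# Hand-built third trie. Each node holds a count and, where one was assigned
# when existing records were indexed, a stored match key.
TRIE = {
    "count": 5, "key": None, "children": {
        "john": {"count": 4, "key": None, "children": {
            "smith": {"count": 2, "key": ("john", "smith"), "children": {}},
            "smythe": {"count": 1, "key": ("john", "smythe"), "children": {}},
            "doe": {"count": 1, "key": ("john", "doe"), "children": {}},
        }},
        "jane": {"count": 1, "key": ("jane",), "children": {}},
    },
}

# Hypothetical table from match key to the ids of existing records.
POSTINGS = {
    ("john", "smith"): [0, 1],
    ("john", "smythe"): [2],
    ("john", "doe"): [3],
    ("jane",): [4],
}


def prospective_match_key(root, value, node_threshold):
    node = root
    for token in value.lower().split():
        child = node["children"].get(token)
        if child is None:
            return None  # assumption: no key when the path is not in the trie
        node = child
        if node["count"] < node_threshold:
            return node["key"]  # reuse the key stored at the identified node
    return node["key"]


key = prospective_match_key(TRIE, "John Smith Jr", node_threshold=3)
candidates = POSTINGS.get(key, [])
print(key)         # ('john', 'smith')
print(candidates)  # [0, 1]
```

The trailing token "Jr" is never examined: the walk stops at "smith" because its count (2) is already below the threshold, so the prospective record is compared only against records 0 and 1.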
6. The system of claim 1, wherein identifying the branch sequence as the match key for the prospective record is based on a post-list size associated with the branch sequence.
This claim narrows the system of claim 1 by specifying that the choice of a branch sequence as the match key for the prospective record depends on a post-list size associated with that branch sequence. The post-list size refers to the number of existing records that would be retrieved if the branch sequence were used as the key. A very large post-list means the key is too common to narrow the search much, while a small post-list leaves fewer candidate records to compare, which reduces the work needed to find true matches. The claim states only that the selection is based on this size; it does not recite a particular rule, such as always choosing the branch sequence with the smallest post-list.
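One possible way to use a post-list size is sketched below. The claim does not say how the size is applied, so the rule shown here (take the shortest branch sequence whose post-list is no larger than a cap) is purely an assumption, as are the `POST_LISTS` table and the `max_post_list_size` parameter.

```python
# Hypothetical post-lists: branch sequence -> ids of records stored under it.
POST_LISTS = {
    ("john",): [0, 1, 2, 3],
    ("john", "smith"): [0, 1],
    ("john", "smith", "jr"): [1],
}


def pick_branch(tokens, post_lists, max_post_list_size):
    # Try progressively longer branch sequences for the prospective value.
    for depth in range(1, len(tokens) + 1):
        branch = tuple(tokens[:depth])
        post_list = post_lists.get(branch)
        if post_list is None:
            break  # the trie has no such branch sequence
        if len(post_list) <= max_post_list_size:
            return branch  # small enough to be a useful match key
    return None


chosen = pick_branch(["john", "smith", "jr"], POST_LISTS, max_post_list_size=2)
print(chosen)  # ('john', 'smith')
```

Here ("john",) is rejected because it would retrieve four records, and ("john", "smith") is accepted because its post-list holds two.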
7. The system of claim 1, wherein identifying the record of the subset is based on using the match key and the other match key for the prospective record, in response to a determination that an estimated count of the multiple records that match the prospective record, based on the count of the subset and a dispersion measure corresponding to the other match key, does not exceed the threshold, and is further based on using an additional match key for the prospective record, in response to a determination that the estimated count of the multiple records that match the prospective record exceeds the threshold.
This claim narrows the system of claim 1 by adding a check before the multi-key match is carried out. After the match key from the third trie has produced a subset of records and another match key has been selected, the system estimates how many records would still match the prospective record if both keys were applied. The estimate is based on the count of records in the subset and on the dispersion measure that corresponds to the other match key: the more varied the values in that field, the fewer records are expected to share the prospective record's value. If the estimated count does not exceed the threshold, the system identifies matching records using the match key and the other match key. If the estimated count still exceeds the threshold, the system also uses an additional match key for the prospective record to narrow the result further. The number of keys applied therefore grows only when the estimate indicates that two keys would leave too many candidates.
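The decision in this claim can be sketched as follows. The claim says the estimate is based on the subset count and a dispersion measure but gives no formula, so the division used here (dispersion taken as the number of distinct values, with records assumed to be spread evenly across them) is an assumption, as are the function and key names.

```python
def estimated_matches(subset_count, dispersion):
    # Assumption: dispersion is the number of distinct values in the field
    # and records are spread evenly across those values.
    return subset_count / dispersion


def choose_keys(subset_count, other_key_dispersion, threshold):
    # The match key and the other match key are always applied.
    keys = ["match_key", "other_match_key"]
    # Add a further key only if two keys are still expected to leave
    # more candidates than the threshold allows.
    if estimated_matches(subset_count, other_key_dispersion) > threshold:
        keys.append("additional_match_key")
    return keys


two_keys = choose_keys(subset_count=1000, other_key_dispersion=50, threshold=25)
three_keys = choose_keys(subset_count=1000, other_key_dispersion=10, threshold=25)
print(two_keys)    # ['match_key', 'other_match_key']
print(three_keys)  # ['match_key', 'other_match_key', 'additional_match_key']
```

With 50 distinct values the estimate is 20 records, which does not exceed the threshold of 25; with only 10 distinct values the estimate is 100, so a third key is added.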
8. A computer program product comprising computer-readable program code to be executed by one or more processors when retrieved from a non-transitory computer-readable medium, the program code including instructions to: create, by a database system, a first trie by using values stored in a first field by multiple records, a second trie by using values stored in a second field by the multiple records, and a third trie by using values stored in a third field by the multiple records; associate, by the database system, a node in the third trie with a record of the multiple records by using a value stored in the third field by the record; associate, by the database system, the node with a first dispersion measure and a second dispersion measure, the first dispersion measure being based on values stored in the first field by records associated with the node and second dispersion measure being based on values stored in the second field by records associated with the node; identify, by the database system, using a prospective value stored in the third field by a prospective record, a branch sequence in the third trie as a match key for the prospective record; identify, by the database system, using the match key for the prospective record, a subset of the multiple records, which match the prospective record; determine, by the database system whether the first dispersion measure is greater than the second dispersion measure, in response to a determination that a count of records in the subset exceeds a threshold, identify, by the database system, another branch sequence in one of the first trie as another match key for a prospective record, in response to a determination that the first dispersion measure is greater than the second dispersion measure; and identify, by the database system, using the match key and the other match key for the prospective record, at least one record of the subset as a multi-key match with the prospective record.
This invention relates to database systems and methods for efficiently identifying matching records using multiple fields. The problem addressed is the challenge of accurately and efficiently matching records in large datasets where multiple fields may contribute to the matching criteria. The solution involves constructing multiple trie data structures, each representing values from different fields in the database records. A first trie is built using values from a first field, a second trie from a second field, and a third trie from a third field. Each node in the third trie is associated with a record and two dispersion measures. The first dispersion measure is based on the diversity of values in the first field among records linked to the node, while the second dispersion measure is based on the diversity of values in the second field among those records. When a new prospective record is processed, the system uses its third field value to locate a branch sequence in the third trie, which serves as a match key. This key retrieves a subset of records that match the prospective record. If the subset exceeds a predefined threshold, the system compares the dispersion measures. If the first dispersion measure is greater, the system identifies another branch sequence in the first trie as an additional match key. The prospective record is then matched against the subset using both keys, resulting in a multi-key match. This approach improves matching accuracy by dynamically selecting the most discriminative fields based on dispersion analysis.
9. The computer program product of claim 8, wherein creating the third trie comprises: tokenizing, by the database system, the values stored in the third field by the multiple records; and creating, by the database system, the third trie from the tokenized values, each branch in the third trie labeled with one of the tokenized values, each node storing a count indicating a number of the multiple records associated with a tokenized value sequence beginning from a root of the third trie.
This invention relates to database systems that use trie data structures to efficiently index and search field values in records. The problem addressed is the need for fast and scalable search operations on large datasets, particularly when dealing with multi-field queries or tokenized values. The invention describes a method for creating and using a trie data structure to index values stored in a field of a database record. The trie is constructed by tokenizing the values in the field and building a hierarchical structure where each branch is labeled with a tokenized value. Each node in the trie stores a count indicating how many records in the database are associated with a particular sequence of tokenized values starting from the root of the trie. This allows for efficient prefix-based searches and counting of matching records. The trie can be used to process queries by traversing the structure to find matches for search terms, leveraging the stored counts to quickly determine the number of records that satisfy the query conditions. The invention also supports the creation of multiple tries for different fields in the database, enabling multi-field search operations. The counts stored in the nodes allow for efficient aggregation of results across different fields. This approach improves search performance by reducing the need for full table scans and enabling faster query processing in large-scale database systems.
10. The computer program product of claim 8, wherein associating the node in the third trie with the record comprises: tokenizing, by the database system, the value stored in the third field by the record; identifying, by the database system, each node, beginning from a root of the third trie, corresponding to a token value sequence associated with the tokenized value, until a node is identified that stores a count less than a node threshold; identifying, by the database system, a branch sequence comprising each identified node as a match key for the record; and associating, by the database system, the match key with the node, and the record with the match key.
This invention relates to database systems that use trie data structures for efficient record association and retrieval. The problem addressed is optimizing the storage and lookup of records based on tokenized field values, particularly when dealing with large datasets where traditional indexing methods may be inefficient. The system involves a database that stores records, each containing multiple fields. A trie data structure is used to index values from a specific field (the third field in this case) for fast searching. When a record is processed, the value in the third field is tokenized into a sequence of tokens. The system then traverses the trie starting from its root, following nodes that match the token sequence until it encounters a node where the stored count is below a predefined node threshold. The sequence of nodes encountered during this traversal forms a match key. The match key is associated with the node where the traversal stopped, and the record is associated with the match key, so that the key can later be used to retrieve the record. The node threshold determines how deep the traversal goes: token prefixes shared by many records are extended by further tokens, while the traversal stops as soon as a node's count drops below the threshold. This keeps match keys short for rare values and more specific for common ones.
11. The computer program product of claim 8, wherein the values stored in the first field by records associated with the node in the third trie are associated with a node in the first trie, the node in the first trie being at a same node depth as the node in the third trie, and the values stored in the second field by the records associated with the node in the third trie are associated with a node in the second trie, the node in the second trie being at the same node depth as the node in the third trie.
This claim narrows the computer program product of claim 8 by specifying how the three tries relate to one another. Each of the three tries is built from a different field of the same records: the first trie from the first field, the second trie from the second field, and the third trie from the third field. For the records associated with a given node in the third trie, the values those records store in the first field are associated with a node in the first trie, and the values they store in the second field are associated with a node in the second trie. In both cases the node is at the same node depth as the node in the third trie. This depth alignment lets the program code refer to corresponding levels of the first and second tries when it handles the group of records attached to a third-trie node. The claim does not state a purpose for the alignment; it is presumably what allows the two dispersion measures of claim 8 to be computed and compared at a consistent level of detail.
12. The computer program product of claim 8, wherein identifying the branch sequence as the match key for the prospective record comprises: tokenizing, by the database system, the prospective value stored in the third field by the prospective record; identifying, by the database system, each node, beginning from a root of the third trie, corresponding to a token value sequence associated with the tokenized prospective value, until a node is identified that stores a count that is less than a node threshold; and identifying, by the database system, a match key associated with the identified node as the match key for the prospective record.
This invention relates to database systems and specifically to methods for identifying match keys in a database using trie data structures. The problem addressed is efficiently determining a match key for a prospective record by analyzing a field value and using a trie to guide the matching process. The prospective record stores a prospective value in its third field. The database system tokenizes this prospective value, breaking it into individual tokens. The system then traverses the third trie, starting from its root, to find nodes corresponding to the sequence of token values derived from the prospective value. During traversal, the system checks each node's stored count against a predefined node threshold. When a node's count falls below this threshold, the traversal stops, and the match key associated with that node is selected as the match key for the prospective record. A count below the threshold indicates that relatively few existing records share the token sequence, so the resulting key retrieves a limited set of candidates. The trie structure allows for fast lookup of token sequences, reducing computational overhead in key identification.
13. The computer program product of claim 8, wherein identifying the branch sequence as the match key for the prospective record is based on a post-list size associated with the branch sequence.
This invention relates to data processing systems, specifically methods for identifying efficient match keys for record retrieval. The problem addressed is the computational cost of matching a prospective record against a large dataset when a match key retrieves too many candidate records. The claim narrows the computer program product of claim 8 by specifying that the selection of a branch sequence as the match key is based on a post-list size associated with the branch sequence. A branch sequence here is a path of token-labeled branches in the third trie. The post-list size represents the number of records that would be retrieved if the branch sequence were used as the match key, with a smaller post-list size leaving fewer records to process. The claim states only that the selection is based on this size; it does not recite that multiple branch sequences are compared or that the smallest post-list is always chosen.
14. The computer program product of claim 8, wherein identifying the record of the subset is based on using the match key and the other match key for the prospective record, in response to a determination that an estimated count of the multiple records that match the prospective record, based on the count of the subset and a dispersion measure corresponding to the other match key, does not exceed the threshold, and is further based on using an additional match key for the prospective record, in response to a determination that the estimated count of the multiple records that match the prospective record exceeds the threshold.
This invention relates to data matching and deduplication in computer systems, specifically improving the efficiency and accuracy of identifying duplicate records in large datasets. The problem addressed is the computational cost and potential inaccuracy of traditional matching techniques when processing large volumes of records, particularly when using multiple match keys (e.g., name, address, or identifier fields) to determine record similarity. The invention describes a method for selecting records from a dataset where a prospective record is compared against a subset of records using a primary match key. If an estimated count of potential matches, based on the subset size and a dispersion measure of a secondary match key, falls below a predefined threshold, the system identifies the record using only the primary and secondary keys. If the estimated count exceeds the threshold, an additional match key is used to refine the selection, reducing false positives and improving accuracy. The dispersion measure quantifies how evenly the secondary match key values are distributed, helping predict the likelihood of duplicates. This adaptive approach balances computational efficiency with matching precision, dynamically adjusting the number of keys used based on the estimated match count. The system avoids unnecessary comparisons when the subset is small or the secondary key is highly discriminative, while ensuring thorough validation when duplicates are likely. This method is particularly useful in databases, data warehouses, or applications requiring real-time deduplication.
15. A method comprising: creating, by a database system, a first trie by using values stored in a first field by multiple records, a second trie by using values stored in a second field by the multiple records, and a third trie by using values stored in a third field by the multiple records; associating, by the database system, a node in the third trie with a record of the multiple records by using a value stored in the third field by the record; associating, by the database system, the node with a first dispersion measure and a second dispersion measure, the first dispersion measure being based on values stored in the first field by records associated with the node and second dispersion measure being based on values stored in the second field by records associated with the node; identifying, by the database system, using a prospective value stored in the third field by a prospective record, a branch sequence in the third trie as a match key for the prospective record; identifying, by the database system, using the match key for the prospective record, a subset of the multiple records, which match the prospective record; determining, by the database system whether the first dispersion measure is greater than the second dispersion measure, in response to a determination that a count of records in the subset exceeds a threshold, identifying, by the database system, another branch sequence in the first trie as another match key for a prospective record, in response to a determination that the first dispersion measure is greater than the second dispersion measure; and identifying, by the database system, using the match key and the other match key for the prospective record, at least one record of the subset as a multi-key match with the prospective record.
This invention relates to database systems and methods for efficiently identifying matching records using multiple fields. The problem addressed is the challenge of accurately and efficiently finding records in a database that match a prospective record based on multiple fields, especially when dealing with large datasets where direct comparisons are computationally expensive. The method involves creating three tries (prefix trees) from values stored in three different fields of multiple records in a database. Each trie is constructed using values from a specific field, allowing for fast prefix-based searches. Nodes in the third trie are associated with records based on their values in the third field. Each node is also linked to two dispersion measures: one based on the diversity of values in the first field among records associated with that node, and another based on the diversity of values in the second field. When a prospective record is processed, its value in the third field is used to identify a branch sequence in the third trie, which serves as a match key. This key is used to retrieve a subset of records that match the prospective record. If the subset exceeds a predefined threshold, the system compares the dispersion measures. If the first dispersion measure (based on the first field) is greater than the second (based on the second field), another branch sequence is identified in the first trie as an additional match key. The prospective record is then matched against the subset using both the original and the additional match keys, resulting in a multi-key match for improved accuracy. This approach optimizes record matching by dynamically selecting the most discriminative fields based on dispersion analysis.
16. The method of claim 15, wherein creating the third trie comprises: tokenizing, by the database system, the values stored in the third field by the multiple records; and creating, by the database system, the third trie from the tokenized values, each branch in the third trie labeled with one of the tokenized values, each node storing a count indicating a number of the multiple records associated with a tokenized value sequence beginning from a root of the third trie.
This invention relates to database systems that use trie data structures to efficiently index and search field values across multiple records. The problem addressed is the need for scalable and fast search operations on large datasets, particularly when dealing with fields containing complex or multi-token values. The method involves creating a trie data structure for a field in a database table, where the field contains values that can be broken down into tokens. The database system tokenizes the values stored in the field across multiple records, then constructs a trie where each branch is labeled with a tokenized value. Each node in the trie stores a count indicating how many records are associated with a particular token sequence starting from the root. This allows for efficient prefix-based searches and counting of records matching specific token patterns. The method can be applied to multiple fields in the same table, with each field having its own trie. For example, a third field in the table can be processed similarly by tokenizing its values and building a third trie, where each node again stores a count of records associated with a token sequence. This approach enables fast lookups and aggregations without scanning the entire dataset, improving query performance in large-scale databases.
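A minimal sketch of such a counting trie, assuming a lowercase whitespace tokenizer (the claim does not specify one). The names `TrieNode`, `tokenize`, `build_trie`, and `count_for` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Number of records whose token sequence passes through this node.
    count: int = 0
    # Outgoing branches, keyed by token label.
    children: dict = field(default_factory=dict)


def tokenize(value):
    """Assumed tokenizer: lowercase the value and split on whitespace."""
    return value.lower().split()


def build_trie(values):
    """Build a trie in which each branch is a token and each node stores a count."""
    root = TrieNode()
    for value in values:
        node = root
        node.count += 1  # the root counts every record
        for token in tokenize(value):
            node = node.children.setdefault(token, TrieNode())
            node.count += 1
    return root


def count_for(root, tokens):
    """Return the count stored at the node reached by following tokens from the root."""
    node = root
    for token in tokens:
        node = node.children.get(token)
        if node is None:
            return 0  # no record starts with this token sequence
    return node.count


# Hypothetical field values from three records.
root = build_trie(["Acme Corp", "Acme Inc", "Beta Corp"])
```

With these three values, `count_for(root, ["acme"])` is 2 and `count_for(root, ["acme", "corp"])` is 1, so the number of records sharing a prefix is available without scanning the records.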
17. The method of claim 15, wherein associating the node in the third trie with the record comprises: tokenizing, by the database system, the value stored in the third field by the record; identifying, by the database system, each node, beginning from a root of the third trie, corresponding to a token value sequence associated with the tokenized value, until a node is identified that stores a count less than a node threshold; identifying, by the database system, a branch sequence comprising each identified node as a match key for the record; and associating, by the database system, the match key with the node, and the record with the match key.
This invention relates to database systems that use trie data structures for efficient record association and retrieval. The problem addressed is optimizing the storage and lookup of records based on tokenized field values, particularly when dealing with large datasets where traditional indexing methods may be inefficient. The method involves a database system that maintains multiple tries, each corresponding to a different field in a record. For a given record, the system tokenizes the value stored in a specific field (e.g., a text field) and traverses a trie associated with that field. Starting from the root of the trie, the system identifies nodes corresponding to the sequence of token values derived from the field value. The traversal continues until a node is found where the stored count (indicating the number of records associated with that node) is below a predefined threshold. The sequence of nodes encountered during this traversal forms a "branch sequence," which serves as a match key for the record. The system then associates this match key with the identified node and links the record to the match key. This approach allows for efficient record retrieval by leveraging the hierarchical structure of the trie and dynamically adjusting the granularity of associations based on node counts. The method ensures that records are grouped in a way that balances storage efficiency and query performance.
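The traversal described above can be sketched as follows. As a simplification, the trie is represented here as a dictionary that maps each token prefix (a tuple) to its record count. The tokenizer, the threshold value, and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def tokenize(value):
    """Assumed tokenizer: lowercase the value and split on whitespace."""
    return value.lower().split()


def build_counts(values):
    """Flattened trie: map each token prefix to the number of records sharing it."""
    counts = defaultdict(int)
    for value in values:
        tokens = tokenize(value)
        for depth in range(1, len(tokens) + 1):
            counts[tuple(tokens[:depth])] += 1
    return counts


def match_key(counts, value, node_threshold):
    """Walk from the root and stop at the first node whose count is below the threshold.

    If the tokens run out first, the full token sequence is used as the key.
    """
    tokens = tokenize(value)
    key = ()
    for depth in range(1, len(tokens) + 1):
        key = tuple(tokens[:depth])
        if counts.get(key, 0) < node_threshold:
            break
    return key


def index_records(values, node_threshold):
    """Associate each record id with the match key derived from its value."""
    counts = build_counts(values)
    index = defaultdict(list)
    for record_id, value in enumerate(values):
        index[match_key(counts, value, node_threshold)].append(record_id)
    return dict(index)


# Hypothetical third-field values for four records.
values = ["acme corp boston", "acme corp denver", "acme inc", "beta llc"]
index = index_records(values, node_threshold=3)
```

With a node threshold of 3, the common prefix `acme` (count 3) is extended by one more token, while the rare prefix `beta` (count 1) stops immediately, so the two `acme corp` records share one match key.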
18. The method of claim 15, wherein the values stored in the first field by records associated with the node in the third trie are associated with a node in the first trie, the node in the first trie being at a same node depth as the node in the third trie, and the values stored in the second field by the records associated with the node in the third trie are associated with a node in the second trie, the node in the second trie being at the same node depth as the node in the third trie.
This invention relates to database record matching that uses several trie data structures built over different fields of the same records. The problem addressed is keeping those tries structurally consistent with one another, so that information gathered at a node in one trie can be related to nodes in the others. The method uses at least three tries: a first trie, a second trie, and a third trie, one per field. The records associated with a node in the third trie also store values in the first and second fields. The first-field values of those records are associated with a node in the first trie, and their second-field values are associated with a node in the second trie. In both cases, the node is at the same depth as the node in the third trie. This depth alignment keeps the three tries consistent, which supports traversal and retrieval across them when more than one field is used to match a record.
19. The method of claim 15, wherein identifying the branch sequence as the match key for the prospective record is based on a post-list size associated with the branch sequence, and comprises: tokenizing, by the database system, the prospective value stored in the third field by the prospective record; identifying, by the database system, each node, beginning from a root of the third trie, corresponding to a token value sequence associated with the tokenized prospective value, until a node is identified that stores a count that is less than a node threshold; and identifying, by the database system, a match key associated with the identified node as the match key for the prospective record.
This invention relates to database systems and methods for efficiently identifying match keys in a trie data structure. The problem addressed is choosing a match key for a prospective record that retrieves a manageable number of candidate records from a large dataset. Per the claim, the choice is based on a post-list size associated with the branch sequence. The method tokenizes the prospective value stored in the third field of the prospective record and traverses the third trie along the resulting token sequence. The traversal begins at the root and continues until a node is found whose stored count is below a node threshold. The count represents the number of records associated with that node, so stopping at the threshold keeps the selected match key from being too broad. Once that node is identified, the match key associated with it is selected as the match key for the prospective record. This lets the database system adjust the granularity of the match key to the distribution of values in the dataset: common prefixes yield longer, more specific keys, while rare prefixes yield shorter ones.
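A short sketch of this lookup, assuming the trie counts and the per-key record lists already exist. `COUNTS` and `POSTINGS` are hypothetical sample data, the whitespace tokenizer is an assumption, and stopping at the deepest existing node when a token has no branch is a design choice of this sketch, not something the claim states.

```python
# Hypothetical trie contents: token prefix -> number of records sharing it.
COUNTS = {
    ("acme",): 3,
    ("acme", "corp"): 2,
    ("acme", "inc"): 1,
    ("beta",): 1,
}

# Hypothetical record lists: match key -> ids of records stored under that key.
POSTINGS = {
    ("acme", "corp"): [0, 1],
    ("acme", "inc"): [2],
    ("beta",): [3],
}


def prospective_match_key(value, counts, node_threshold):
    """Follow the prospective value's tokens until a node's count drops below the threshold."""
    tokens = value.lower().split()
    key = ()
    for depth in range(1, len(tokens) + 1):
        prefix = tuple(tokens[:depth])
        if prefix not in counts:
            break  # no such branch: keep the deepest node reached so far
        key = prefix
        if counts[prefix] < node_threshold:
            break
    return key


def candidates(value, counts, postings, node_threshold):
    """Return the ids of records stored under the prospective record's match key."""
    key = prospective_match_key(value, counts, node_threshold)
    return postings.get(key, [])


found = candidates("Acme Corp Austin", COUNTS, POSTINGS, node_threshold=3)
```

The prospective value `Acme Corp Austin` stops at the prefix `acme corp`, whose count of 2 is below the threshold of 3, and retrieves the two records stored under that key.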
20. The method of claim 15, wherein identifying the record of the subset is based on using the match key and the other match key for the prospective record, in response to a determination that an estimated count of the multiple records that match the prospective record, based on the count of the subset and a dispersion measure corresponding to the other match key, does not exceed the threshold, and is further based on using an additional match key for the prospective record, in response to a determination that the estimated count of the multiple records that match the prospective record exceeds the threshold.
This invention relates to data matching and deduplication, specifically improving the efficiency and accuracy of identifying matching records in a dataset. The problem addressed is the computational cost and potential inaccuracy of traditional matching methods when dealing with large datasets, where multiple records may partially match a prospective record based on different match keys. The method involves selecting a subset of records from a dataset that partially match a prospective record using a primary match key. The subset is then refined by applying an additional match key to identify the most relevant records. The refinement process is guided by an estimated count of matching records, which is calculated using the size of the subset and a dispersion measure associated with the secondary match key. If the estimated count exceeds a predefined threshold, the method further refines the subset by applying an additional match key to improve accuracy. This adaptive approach ensures that the matching process remains efficient while maintaining high precision, particularly in datasets where multiple records may share similar attributes. The dispersion measure helps quantify the variability of the secondary match key, allowing the system to dynamically adjust the matching criteria based on data distribution. This method is particularly useful in applications such as customer data deduplication, fraud detection, and record linkage in large-scale databases.
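One possible reading of the estimate is sketched below. It assumes that a dispersion measure is the number of distinct values in a field and that matches spread evenly and independently across those values, so each added key divides the expected count by that number. The claim does not give a formula, and the function names are hypothetical.

```python
def estimated_matches(current_estimate, distinct_values):
    """Assumed estimate: matches spread evenly over the field's distinct values."""
    return current_estimate / max(distinct_values, 1)


def keys_needed(subset_count, dispersions, threshold):
    """Add match keys, most dispersed field first, until the estimate fits the threshold.

    dispersions maps a field name to its assumed distinct-value count.
    Returns the fields used as extra keys and the final estimated match count.
    """
    used = []
    estimate = float(subset_count)
    by_dispersion = sorted(dispersions.items(), key=lambda item: item[1], reverse=True)
    for field_name, distinct in by_dispersion:
        if estimate <= threshold:
            break  # the keys chosen so far are expected to be selective enough
        estimate = estimated_matches(estimate, distinct)
        used.append(field_name)
    return used, estimate


# Hypothetical numbers: 1000 single-key matches, two candidate extra fields.
one_key = keys_needed(1000, {"first": 50, "second": 4}, threshold=25)
two_keys = keys_needed(1000, {"first": 50, "second": 4}, threshold=10)
```

With a threshold of 25, the first field alone brings the estimate to 20, so no further key is added. With a threshold of 10, the estimate of 20 still exceeds the threshold, so the second field is added as an additional key and the estimate drops to 5.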
October 27, 2020