Legal claims defining the scope of protection, as filed with the USPTO.
1. A method comprising, at a computer system: generating a first plurality of hash signatures, wherein each hash signature of the first plurality of hash signatures are generated based on first profile metadata for a different column in a first plurality of columns in a first dataset stored at a first data source; generating a second plurality of hash signatures, wherein each hash signature of the second plurality of hash signatures are generated based on second profile metadata for a different column in a second plurality of columns in a second dataset stored at a second data source; and determining a set of column pairs based on comparison of the first plurality of hash signatures with the second plurality of hash signatures, wherein each column pair in the set of column pairs includes a different first column of the first plurality of columns and a different second column of the second plurality of columns based on similarity between a first hash signature for the different first column and a second hash signature for the different second column.
2. The method according to claim 1, further comprising: generating information about a comparison of similarity of the first dataset to the second dataset; and generating a graphical interface to display the information about the comparison of similarity of the first dataset to the second dataset.
3. The method according to claim 2, wherein the graphical interface provides metrics indicating granular similarity between a column of the first dataset and a column of the second dataset based on a comparison of metrics between the column of the first dataset and the column of the second dataset.
4. The method according to claim 3, wherein the graphical interface is interactive.
5. The method according to claim 1, further comprising: receiving input corresponding to a selection for combining the first dataset with the second dataset based on a measure of similarity between the first dataset and the second dataset based; and generating a transform script for generating a third dataset based on the input for combining the first dataset and the second dataset.
6. The method according to claim 1, wherein a hash signature based on profile metadata for a column is generated using a scalar hashing function.
7. The method according to claim 6, wherein profile metadata includes a type profile of the column, a subtype profile of the column, a compounding attribute of the column, a pattern of data in the column, one or more delimiters of the column, or a combination thereof.
8. The method according to claim 1, wherein the comparison to determine set of column pairs includes analyzing the first plurality of hash signatures with the second plurality of hash signatures.
9. The method according to claim 1, wherein the computer system is a cloud-based storage system.
10. A system comprising: one or more processors; and a memory accessible to the one or more processors, the memory storing instructions that, upon execution by the one or more processors, causes the one or more processors to: generate a first plurality of hash signatures, wherein each hash signature of the first plurality of hash signatures are generated based on first profile metadata for a different column in a first plurality of columns in a first dataset stored at a first data source; generate a second plurality of hash signatures, wherein each hash signature of the second plurality of hash signatures are generated based on second profile metadata for a different column in a second plurality of columns in a second dataset stored at a second data source; and determine a set of column pairs based on comparison of the first plurality of hash signatures with the second plurality of hash signatures, wherein each column pair in the set of column pairs includes a different first column of the first plurality of columns and a different second column of the second plurality of columns based on similarity between a first hash signature for the different first column and a second hash signature for the different second column.
11. The system according to claim 10, wherein the instructions, upon execution by the one or more processors, further causes the one or more processors to: generate information about a comparison of similarity of the first dataset to the second dataset; and generate a graphical interface to display the information about the comparison of similarity of the first dataset to the second dataset.
12. The system according to claim 11, wherein the graphical interface provides metrics indicating granular similarity between a column of the first dataset and a column of the second dataset based on a comparison of metrics between the column of the first dataset and the column of the second dataset.
13. The system according to claim 11, wherein the graphical interface is interactive.
14. The system according to claim 10, wherein a hash signature based on profile metadata for a column is generated using a scalar hashing function.
15. The system according to claim 14, wherein profile metadata includes a type profile of the column, a subtype profile of the column, a compounding attribute of the column, a pattern of data in the column, one or more delimiters of the column, or a combination thereof.
16. The system according to claim 10, wherein the comparison to determine set of column pairs includes analyzing the first plurality of hash signatures with the second plurality of hash signatures.
17. The system according to claim 10, wherein the system is a cloud-based storage system.
18. A non-transitory computer readable medium storing instructions that are executable by one or more processors to cause the one or more processors to: generate a first plurality of hash signatures, wherein each hash signature of the first plurality of hash signatures are generated based on first profile metadata for a different column in a first plurality of columns in a first dataset stored at a first data source; generate a second plurality of hash signatures, wherein each hash signature of the second plurality of hash signatures are generated based on second profile metadata for a different column in a second plurality of columns in a second dataset stored at a second data source; and determine a set of column pairs based on comparison of the first plurality of hash signatures with the second plurality of hash signatures, wherein each column pair in the set of column pairs includes a different first column of the first plurality of columns and a different second column of the second plurality of columns based on similarity between a first hash signature for the different first column and a second hash signature for the different second column.
19. The non-transitory computer readable medium according to claim 18, wherein a hash signature based on profile metadata for a column is generated using a scalar hashing function.
20. The non-transitory computer readable medium according to claim 18, further storing instructions that are executable by one or more processors to cause the one or more processors to: generate information about a comparison of similarity of the first dataset to the second dataset; and generate a graphical interface to display the information about the comparison of similarity of the first dataset to the second dataset.
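Illustrative sketch (not part of the claims). The flow recited in claims 1, 6, and 7 can be approximated in a few lines of Python: build a signature for each column from its profile metadata, compare signatures across two datasets, and keep pairs in which each column appears at most once. The claims do not specify a hash function, a similarity measure, or a pairing strategy, so everything below is an assumption made for illustration: the `ColumnProfile` fields, the use of truncated SHA-256 as the scalar hash, the fraction-of-matching-fields similarity, the greedy pairing, and the 0.5 threshold are all hypothetical choices, not the patented implementation.

```python
import hashlib
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ColumnProfile:
    """Hypothetical profile metadata for one column (fields loosely follow claim 7)."""
    name: str
    type_profile: str
    subtype_profile: str
    pattern: str
    delimiters: str = ""


def scalar_hash(value: str) -> int:
    """Map one metadata value to a 32-bit integer (assumed choice of hash)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def signature(profile: ColumnProfile) -> tuple:
    """One scalar hash per metadata field; the column name itself is not hashed."""
    fields = (profile.type_profile, profile.subtype_profile,
              profile.pattern, profile.delimiters)
    # Prefix each value with its field index so equal strings in
    # different fields do not produce the same hash.
    return tuple(scalar_hash(f"{i}:{v}") for i, v in enumerate(fields))


def similarity(sig_a: tuple, sig_b: tuple) -> float:
    """Fraction of signature positions that agree (assumed similarity measure)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def match_columns(first, second, threshold=0.5):
    """Greedily pair columns so each column is used in at most one pair."""
    sigs_a = {p.name: signature(p) for p in first}
    sigs_b = {p.name: signature(p) for p in second}
    scored = sorted(
        ((similarity(sa, sb), a, b)
         for (a, sa), (b, sb) in product(sigs_a.items(), sigs_b.items())),
        key=lambda item: -item[0],  # best-scoring candidates first
    )
    used_a, used_b, pairs = set(), set(), []
    for score, a, b in scored:
        if score < threshold:
            break  # remaining candidates all score lower
        if a in used_a or b in used_b:
            continue
        used_a.add(a)
        used_b.add(b)
        pairs.append((a, b, score))
    return pairs


# Made-up example datasets, described only by their column profiles.
first = [
    ColumnProfile("email", "string", "email", "a@a.a"),
    ColumnProfile("signup", "date", "iso_date", "9999-99-99", "-"),
]
second = [
    ColumnProfile("created_on", "date", "iso_date", "9999-99-99", "-"),
    ColumnProfile("contact", "string", "email", "a@a.a"),
]
pairs = match_columns(first, second)
print(pairs)  # [('email', 'contact', 1.0), ('signup', 'created_on', 1.0)]
```

Because the signature is built from profile metadata rather than from column names or raw values, "email" pairs with "contact" even though the names differ. A real system would likely use a locality-sensitive or weighted scheme so that near-matches in a single field still contribute to the score; exact per-field equality is used here only to keep the sketch short.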
November 2, 2021