Legal claims defining the scope of protection, as filed with the USPTO.
1. A computerized method for dataset verification in a zero-trust computing environment, the method comprising: receiving a sample dataset comprising records; generating a sample vector for the entire sample dataset normalized by the total number of records in the sample dataset; selecting a subset of data from the sample dataset; generating a matrix from the subset of data; dividing the matrix into a series of vectors; generating an example vector, wherein the example vector is generated by summing the series of vectors and normalizing the sum by the number of records in the subset of data; calculating a difference between the sample vector and the example vector; when the difference is below a threshold, applying a machine learning algorithm to the subset of data; and when the difference is above a threshold, rejecting the subset of data.
2. The method of claim 1, wherein the calculating the difference is by framing the distance as a p-value in a hypothesis test, compared against a threshold.
3. The method of claim 1, wherein the generating the sample vector includes: encoding the dataset according to a set of classes; generating a matrix of the encoded dataset, wherein each row of the matrix is a patient, and each column is a class or subset of classes in the set of classes; converting the generated matrix into a series of vector spaces; summing the vector spaces; and normalizing the summed vector spaces by the total number of records.
4. The method of claim 1, wherein a plurality of vector sets are from a plurality of data stewards and the plurality of vector sets are combined into the sample vector.
5. The method of claim 1, wherein the example vector is generated from a synthetic dataset.
6. A zero-trust computing system for dataset verification, the system comprising: a datastore, hosted within a data steward computing environment, comprising a non-transitory memory device for receiving a sample dataset comprising records, and receiving a sample vector set, wherein the sample vector set is generated from the entire sample dataset normalized by the total number of records in the sample dataset; and a runtime server, hosted within the data steward computing environment, comprising a processor unit for selecting a subset of data from the sample dataset, generating a matrix from the subset of data, dividing the matrix into a series of vectors, generating an example vector, wherein the example vector is generated by summing the series of vectors and normalizing the sum by the number of records in the subset of data, calculating a difference between the sample vector and the example vector, when the difference is below a threshold, applying a machine learning algorithm to the subset of data; and when the difference is above a threshold, rejecting the subset of data.
7. The system of claim 6, wherein the calculating the difference is by framing the distance as a p-value in a hypothesis test, compared against a threshold.
8. The system of claim 6, wherein the generating the sample vector includes: encoding the dataset according to a set of classes; generating a matrix of the encoded dataset, wherein each row of the matrix is a patient, and each column is a class or subset of classes in the set of classes; converting the generated matrix into a series of vector spaces; summing the vector spaces; and normalizing the summed vector spaces by the total number of records.
9. The system of claim 6, wherein a plurality of vector sets are from a plurality of data stewards and the plurality of vector sets are combined into the sample vector.
10. The system of claim 6, wherein the example vector is generated from a synthetic dataset.
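The verification flow recited in claims 1, 3, and 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The claims do not name a distance metric, an encoding scheme, or a threshold value, so the Euclidean distance, the one-hot encoding, the 0.3 threshold, the toy records, and every function and variable name below are assumptions made for illustration only. The claims also leave the case where the difference equals the threshold undefined; this sketch treats it as a rejection.

```python
from math import sqrt

# Hypothetical class vocabulary; the claims only say "a set of classes".
CLASSES = ["A", "B", "C"]


def encode(records, classes):
    """Build a matrix with one row per record and one column per class (one-hot, assumed)."""
    return [[1.0 if c in rec else 0.0 for c in classes] for rec in records]


def mean_vector(matrix):
    """Sum the row vectors and normalize the sum by the number of records."""
    n = len(matrix)
    if n == 0:
        raise ValueError("cannot summarize an empty dataset")
    width = len(matrix[0])
    return [sum(row[j] for row in matrix) / n for j in range(width)]


def combine_steward_vectors(parts):
    """Combine (vector, record_count) pairs from several data stewards.

    Weighting by record count is an assumption; it makes the result equal
    to the mean vector of the pooled records.
    """
    total = sum(n for _, n in parts)
    width = len(parts[0][0])
    return [sum(vec[j] * n for vec, n in parts) / total for j in range(width)]


def distance(u, v):
    """Euclidean distance, chosen here only as one plausible 'difference'."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def verify_subset(sample_vector, subset, classes, threshold):
    """Return (accepted, difference) for a candidate subset of records."""
    example_vector = mean_vector(encode(subset, classes))
    diff = distance(sample_vector, example_vector)
    return diff < threshold, diff


# Toy sample dataset: each record is the set of classes it belongs to.
records = [{"A"}, {"A", "B"}, {"B"}, {"C"}]
sample_vector = mean_vector(encode(records, CLASSES))

# The same sample vector assembled from two stewards holding two records each.
steward_1 = (mean_vector(encode(records[:2], CLASSES)), 2)
steward_2 = (mean_vector(encode(records[2:], CLASSES)), 2)
combined = combine_steward_vectors([steward_1, steward_2])

# One subset close to the sample profile, one far from it.
accepted, diff_ok = verify_subset(sample_vector, [records[0], records[2]], CLASSES, 0.3)
rejected_ok, diff_bad = verify_subset(sample_vector, records[:2], CLASSES, 0.3)
print(accepted, diff_ok)
print(rejected_ok, diff_bad)
```

In this sketch an accepted subset would be passed on to the machine learning step, and a rejected one would be discarded. A p-value formulation as in claims 2 and 7 would replace the raw distance comparison with a hypothesis test, which is not shown here.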
September 23, 2025