US-10430440

Apparatus program and method for data property recognition

Published: October 1, 2019

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

A data property recognition apparatus, includes a storage unit; a model data acquisition processor acquiring a plurality of model sets of data entries, each data entry individually representing an identified property common to the model set and being of a data type common to the model set; a feature vector generation processor receiving an input set of data entries, recognizing a data type common to the input set of data entries from among a plurality of supported data types, selecting a set of statistical characteristics representing the input set of data entries in dependence upon the recognised data type, generating a value of each of the selected set of statistical characteristics from the input set of data entries, and outputting a feature vector composed of the generated values of statistical characteristics.

Patent Claims

10 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A data set reconciliation apparatus, comprising a processor, a memory, and a user interface, wherein: the user interface is configured to accept from a user a specification of a plurality of model sets of data entries from a first data source, and a specification of a plurality of query sets of data entries from further data sources, the first data source and the further data sources being disparate data sources with heterogeneous schema; and the processor is configured to perform a process including: acquiring the plurality of model sets of data entries specified by the user, each individual model set of data entries being a plurality of data entries individually representing an identified property common to the model set of data entries and being of a data type common to the model set of data elements; for each of the acquired plurality of model sets of data entries as an input set of data entries, recognising a data type common to the input set of data entries from among plural supported data types, selecting a set of statistical characteristics representing the input set of data entries in dependence upon the recognised data type, generating a value of each of the selected set of statistical characteristics from the plurality of data entries, and outputting a feature vector composed of the generated values of statistical characteristics; for each of the acquired plurality of model sets of data entries, obtaining the output feature vector and submitting to the memory the output feature vector in association with the identified property common to the model set of data entries; storing, in the memory, the submitted feature vectors in association with the respective identified property, as a reference set of feature vectors; obtaining the plurality of query sets of data entries from the further data sources specified by the user; recognising a data type common to the query set of data entries from among plural supported data types, selecting one of the sets of statistical characteristics representing input sets of data entries in dependence upon the recognised data type of the query set of data entries, generating a value of each of the selected set of statistical characteristics from the query set of data entries, and outputting a feature vector composed of the generated values of statistical characteristics; executing comparisons between the feature vector output for the query set of data entries and the stored reference set of feature vectors to identify a best match feature vector among the stored reference set of feature vectors for the feature vector output for the query set of data entries, recognising the identified property stored in association with the best match feature vector as a recognised data property, from the first data source specified by the user, represented by the individual data entries in the query set of data entries, from the further data source specified by the user, and outputting the recognised data property; and submitting, to the memory, for storage in a data store accessible via a single database management system, a copy of the first data source and a copy of each of the further data sources, wherein the query sets of data entries from the further data sources, for which a recognised data property was output by the processor, are stored in association with the respectively output recognised data property.
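
For illustration only, the overall flow of claim 1 (build reference feature vectors from model sets, then match a query set against them) can be sketched in Python as below. This is a minimal sketch, not the patented implementation: it handles numeric entries only, the function names and sample data are hypothetical, and Euclidean distance is an assumed comparison, since the claim does not name a specific one.

```python
from statistics import mean, pstdev

def numeric_features(entries):
    """Illustrative feature vector: count, min, max, mean, standard deviation."""
    values = [float(v) for v in entries]
    return [len(values), min(values), max(values), mean(values), pstdev(values)]

def build_reference_set(model_sets):
    """Store one feature vector per identified property (the reference set)."""
    return {prop: numeric_features(entries) for prop, entries in model_sets.items()}

def recognise_property(query_entries, reference_set):
    """Return the property whose reference vector is closest to the query vector.

    Euclidean distance is an assumption made for this sketch.
    """
    query_vector = numeric_features(query_entries)

    def distance(vector):
        return sum((a - b) ** 2 for a, b in zip(query_vector, vector)) ** 0.5

    return min(reference_set, key=lambda prop: distance(reference_set[prop]))

# Hypothetical model sets from a "first data source".
model_sets = {
    "age_years": [23, 35, 41, 58, 62, 19],
    "height_cm": [158, 172, 181, 165, 190, 177],
}
reference_set = build_reference_set(model_sets)

# A query set from a "further data source" with an unknown column meaning.
print(recognise_property([160, 175, 169, 183], reference_set))
```

In this sketch the query column is labelled with the property of the nearest reference vector; a fuller version would first recognise the data type and select the matching set of statistics.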

2. The data set reconciliation apparatus according to claim 1, further comprising: a query processor, the query processor being configured to obtain a query set of data entries for which an identity of a property commonly represented by the individual data entries is sought, to submit the query set of data entries to the feature vector generation processor, to execute comparisons between the output feature vector and the stored reference set of feature vectors to identify a best match feature vector among the stored reference set of feature vectors for the output feature vector, to recognise the identified property stored in association with the best match feature vector as a data property represented by the individual data entries in the query set of data entries, and to output the recognised data property.

3. The data set reconciliation apparatus according to claim 1, wherein the set of statistical characteristics for an input set of data entries recognised as being of the numeric type, comprises two or more from among the following: number of data entries; minimum value; maximum value; first quartile value; third quartile value; median value; mean; standard deviation; variance; and most repeated data entry.
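
As an illustration, the numeric statistics listed in claim 3 might be computed with Python's standard `statistics` module as follows. This is a sketch under assumptions: the function name and dictionary keys are hypothetical, population (rather than sample) variance and standard deviation are chosen arbitrarily because the claim does not specify, and at least two entries are assumed so that quartiles are defined.

```python
import statistics

def numeric_feature_vector(entries):
    """Compute the claim 3 statistics for a set of numeric data entries."""
    values = [float(v) for v in entries]
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "first_quartile": q1,
        "third_quartile": q3,
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "std_dev": statistics.pstdev(values),      # population form (assumption)
        "variance": statistics.pvariance(values),  # population form (assumption)
        "most_repeated": statistics.mode(values),
    }

print(numeric_feature_vector([1, 2, 2, 3, 4]))
```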

4. The data set reconciliation according to claim 1, wherein the set of statistical characteristics for an input set of data entries recognised as being of the string type, comprises two or more from among the following: number of data entries; alphabet of data entries; average number of characters per data entry; average number of white spaces per data entry; average number of full stops per data entry; average number of commas per data entry; average number of semicolons per data entry; most repeated data entry; longest common substring; and percentage of unique entries.
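
The string statistics of claim 4 could be sketched along the following lines. The names are hypothetical and several interpretations are assumptions: "alphabet" is taken to be the sorted set of distinct characters, "white spaces" counts only the space character, "percentage of unique entries" is the share of distinct values among all entries, and the longest common substring uses a brute-force search suitable only for small inputs. A non-empty input is assumed.

```python
from collections import Counter

def longest_common_substring(strings):
    """Longest substring present in every string (brute force, small inputs only)."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

def string_feature_vector(entries):
    """Compute the claim 4 statistics for a non-empty set of string entries."""
    strings = [str(e) for e in entries]
    n = len(strings)

    def avg_count(ch):
        return sum(s.count(ch) for s in strings) / n

    return {
        "count": n,
        "alphabet": "".join(sorted(set("".join(strings)))),
        "avg_chars": sum(len(s) for s in strings) / n,
        "avg_spaces": avg_count(" "),
        "avg_full_stops": avg_count("."),
        "avg_commas": avg_count(","),
        "avg_semicolons": avg_count(";"),
        "most_repeated": Counter(strings).most_common(1)[0][0],
        "longest_common_substring": longest_common_substring(strings),
        "pct_unique": 100.0 * len(set(strings)) / n,
    }

print(string_feature_vector(["10 Main St.", "22 Main St.", "10 Main St."]))
```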

5. The data set reconciliation apparatus according to claim 1, wherein the set of statistical characteristics for an input set of data values recognised as being of the numerical time-series type, comprises two or more from among the following: number of data values; number of entries; minimum numerical value; maximum numerical value; first quartile numerical value; third quartile numerical value; median numerical value; mean of numerical values; standard deviation; variance; covariance; skewness; kurtosis; start date; and end date.
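
A possible sketch of the time-series statistics in claim 5 follows. It assumes each entry is a (date, value) pair, which the claim does not state, and it uses population moments for variance, skewness, and kurtosis. Covariance is deliberately omitted because the claim does not say which two quantities it relates. The function name and keys are hypothetical, and at least two entries are assumed.

```python
import statistics
from datetime import date

def time_series_feature_vector(series):
    """Compute a subset of the claim 5 statistics from (date, value) pairs."""
    ordered = sorted(series)  # chronological order
    dates = [d for d, _ in ordered]
    values = [float(v) for _, v in ordered]
    n = len(values)
    mu = statistics.mean(values)
    var = statistics.pvariance(values, mu)
    sd = var ** 0.5
    # Standardised third and fourth central moments; 0.0 for a constant series.
    skew = (sum((v - mu) ** 3 for v in values) / n / sd ** 3) if sd else 0.0
    kurt = (sum((v - mu) ** 4 for v in values) / n / sd ** 4) if sd else 0.0
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "first_quartile": q1,
        "third_quartile": q3,
        "median": statistics.median(values),
        "mean": mu,
        "std_dev": sd,
        "variance": var,
        "skewness": skew,
        "kurtosis": kurt,
        "start_date": dates[0],
        "end_date": dates[-1],
    }

sample = [(date(2019, 1, 3), 3.0), (date(2019, 1, 1), 1.0), (date(2019, 1, 2), 2.0)]
print(time_series_feature_vector(sample))
```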

6. The data set reconciliation apparatus according to claim 1, wherein the plurality of supported data types comprises numeric, string, and numerical time-series.
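
The claims do not describe how a data type is recognised, so the following is only one plausible heuristic for distinguishing the three supported types. The function names and the rules themselves are assumptions: entries that are all (date, number) pairs are treated as a numerical time-series, entries that all parse as numbers are treated as numeric, and anything else falls back to string.

```python
from datetime import date

def _is_number(x):
    """True for ints, floats, and strings that parse as a float (not booleans)."""
    if isinstance(x, bool):
        return False
    if isinstance(x, (int, float)):
        return True
    if isinstance(x, str):
        try:
            float(x)
            return True
        except ValueError:
            return False
    return False

def _is_timestamped(x):
    """True for a (date, number) pair; datetime is a subclass of date."""
    return (
        isinstance(x, tuple)
        and len(x) == 2
        and isinstance(x[0], date)
        and _is_number(x[1])
    )

def recognise_data_type(entries):
    """Return the data type common to all entries (heuristic, illustrative)."""
    entries = list(entries)
    if entries and all(_is_timestamped(e) for e in entries):
        return "numerical time-series"
    if entries and all(_is_number(e) for e in entries):
        return "numeric"
    return "string"

print(recognise_data_type(["1", "2.5", 3]))
print(recognise_data_type(["London", "Paris"]))
print(recognise_data_type([(date(2019, 1, 1), 1.0)]))
```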

7. The data set reconciliation apparatus according to claim 2, wherein the query processor is configured to submit to the storage unit the query set of data entries and the recognised data property, and the storage unit being configured to store the query set of data entries in association with the recognised data property as a recognised set of data entries.

8. The data set reconciliation apparatus according to claim 7, wherein the storage unit is configured to store the reference set of feature vectors in association with the respective identified property and in association with the respective model set of data entries for which the feature vector was generated.

9. The data set reconciliation according to claim 7, wherein the executed comparisons are between the output feature vector and each of the stored reference set of feature vectors, and the comparison comprises: comparing a data type of the set of data entries represented by the feature vector from the reference set with a data type of the query set of data entries represented by the output feature vector; if the data types are different, excluding the feature vector from the reference set from mathematical comparison; and if the data types are the same, performing a mathematical comparison between the output feature vector and the feature vector from the reference set to obtain a similarity value, the feature vector for which the greatest similarity value is obtained being the best match feature vector.
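
The comparison in claim 9 (skip reference vectors of a different data type, score the rest, keep the highest score) might look like the sketch below. Cosine similarity is an assumed choice of "mathematical comparison"; the claim does not name one. The record layout (`property`, `data_type`, `vector`) and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(query, reference_set):
    """Return (best reference record, similarity), or (None, -inf) if no type matches."""
    best, best_score = None, -math.inf
    for ref in reference_set:
        # Different data types are excluded from mathematical comparison.
        if ref["data_type"] != query["data_type"]:
            continue
        score = cosine_similarity(query["vector"], ref["vector"])
        if score > best_score:
            best, best_score = ref, score
    return best, best_score

references = [
    {"property": "price", "data_type": "numeric", "vector": [10, 1, 100]},
    {"property": "age", "data_type": "numeric", "vector": [10, 18, 90]},
    # Same vector as the query, but excluded because the type differs.
    {"property": "city", "data_type": "string", "vector": [12, 20, 85]},
]
query = {"data_type": "numeric", "vector": [12, 20, 85]}
match, score = best_match(query, references)
print(match["property"], round(score, 3))
```

Filtering by type before scoring matters because vectors built from different sets of statistics are not comparable position by position.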

10. The data set reconciliation apparatus according to claim 8, further comprising: a reference feature vector update processor, configured, upon submission of the recognised set of data entries stored in the storage unit, to compile a composite set of data entries comprising each of the reference set of data entries and each of the other recognised sets of data entries stored in association with the same identified property as the submitted recognised set of data entries, submitting the composite set of data entries to the feature vector generation processor, obtaining the feature vector output by the feature vector generation processor as an updated reference feature vector, and replacing the existing feature vector in the reference set of feature vectors stored in association with the identified property with the updated reference feature vector.
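
The update step of claim 10 can be illustrated as follows: pool the original model set with every recognised set stored under the same property, regenerate the feature vector from the pooled entries, and replace the stored vector. The store layout, the function names, and the simplified four-value feature vector are all hypothetical stand-ins for this sketch.

```python
from statistics import mean

def feature_vector(entries):
    """Hypothetical stand-in for the feature vector generator (numeric only)."""
    values = [float(v) for v in entries]
    return [len(values), min(values), max(values), mean(values)]

def update_reference_vector(store, prop, recognised_entries):
    """Add a recognised set under `prop` and rebuild its reference vector."""
    record = store[prop]
    record["recognised_sets"].append(list(recognised_entries))
    # Composite set: the model set plus every recognised set for this property.
    composite = list(record["model_set"])
    for recognised in record["recognised_sets"]:
        composite.extend(recognised)
    record["vector"] = feature_vector(composite)
    return record["vector"]

store = {
    "age": {
        "model_set": [20, 30, 40],
        "recognised_sets": [],
        "vector": feature_vector([20, 30, 40]),
    }
}
print(update_reference_vector(store, "age", [50, 60]))
```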

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F G06N G06V H04N

Patent Metadata

Filing Date

August 17, 2017

Publication Date

October 1, 2019
