Techniques for generating hash values for instances of distinct data values. In the techniques, each distinct data value is mapped to hash value generation information, which describes how to generate a unique hash value for instances of the distinct data value. The hash value generation information for a distinct data value is then used to generate the hash value for an instance of the distinct data value. The hash value generation information may indicate whether a collision has occurred in generating the hash values for instances of the distinct data values and, if so, how the collision is to be resolved. The techniques are employed to normalize RDF triples by generating the UIDs employed in the normalization from the triples' lexical values.
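The idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the 8-bit truncated SHA-256 "default" hash (kept tiny on purpose so that collisions actually occur), and the salted rehash used as collision resolution are all assumptions made for illustration.

```python
import hashlib


def default_hash(value: str, bits: int = 8) -> int:
    """Assumed default transformation: SHA-256 truncated to `bits` bits."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def rehash(value: str, attempt: int, bits: int = 8) -> int:
    """Hypothetical collision resolution: hash the value salted with an attempt counter."""
    return default_hash(f"{attempt}:{value}", bits)


def build_generation_info(distinct_values, bits: int = 8, max_attempts: int = 10_000):
    """Map each distinct value to (attempt, hash).

    attempt == 0 means the default hash was free (no collision);
    attempt > 0 records which salted rehash finally produced a free hash.
    """
    info = {}
    owner = {}  # hash value -> distinct value that owns it
    for value in distinct_values:
        if value in info:
            continue
        attempt, h = 0, default_hash(value, bits)
        while h in owner:  # collision with a different distinct value
            attempt += 1
            if attempt > max_attempts:
                raise RuntimeError("hash space exhausted in this toy example")
            h = rehash(value, attempt, bits)
        owner[h] = value
        info[value] = (attempt, h)
    return info


def hash_instance(value: str, info, bits: int = 8) -> int:
    """Hash one instance by following the generation information for its distinct value."""
    attempt, _ = info.get(value, (0, None))
    return default_hash(value, bits) if attempt == 0 else rehash(value, attempt, bits)
```

In this sketch the generation information is just the attempt counter: every instance of the same distinct value recomputes the same hash from it, so no per-instance lookup of the stored hash is needed.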
1. A method of making normalized representations of a batch of instances of data values in a database system, the method comprising: using at least one processor in a computing system to perform a process, the process comprising: mapping a distinct data value of a batch of instances of data values into a transformation generation information for the distinct data value in a database system; determining an entry in a first mapping table for the distinct data value, the entry including a normalized representation which is generated by selecting and performing a transformation on the distinct data value based at least in part upon the transformation generation information, in which the transformation generation information indicates how to generate the normalized representation for the distinct data value according to a collision resolution transformation; generating a second mapping table by querying the first mapping table, the second mapping table including a first entry for the distinct data value based at least in part upon the transformation generation information; and transforming each instance of the batch of instances of data values into a transformed value by selecting and performing the transformation on a representation of the each instance based at least in part upon a determination of whether a corresponding distinct data value entry for the each instance exists in the second mapping table.
2. The method set forth in claim 1, wherein the action of transforming the each instance includes: when the instance's distinct data value has an entry in the second mapping table, using the transformation generation information and the collision resolution transformation to generate the normalized representation, in which the corresponding distinct data value entry for the each instance exists in the second mapping table.
3. The method set forth in claim 2, wherein the batch of instances of data includes RDF (Resource Description Framework) triples that include lexical values, and the normalized representation is used to determine a corresponding normalized representation of a corresponding RDF triple in the batch.
4. The method set forth in claim 3, wherein the lexical values include one or more non-canonical literal values that are not yet expressed in a canonical form but have a canonical form; and the method further comprises: for each distinct data value that comprises a non-canonical literal value generating a canonical form for the non-canonical literal value; in the action of determining the entry in the first mapping table, generating one or more entries for both the each distinct data value that comprises the non-canonical literal value and one or more distinct data values each of which comprises a canonical literal value; and the entry for the each distinct data value that comprises the non-canonical literal value includes a transformation indication for a corresponding canonical literal value; and the action of transforming the each instance further comprises transforming the each instance, which comprises a corresponding non-canonical literal value, to generate the normalized representation according to a transformation method specified for a distinct literal value corresponding to the each instance in the entry in the first mapping table; transforming the instance of the non-canonical value into an instance of the non-canonical value's canonical form; and hashing the instance of the value's canonical form according to the hash method specified for the distinct canonical value corresponding to the instance of the non-canonical literal value in the corresponding distinct literal value's entry in the first mapping table.
5. The method set forth in claim 1 wherein: each distinct data value whose normalized representation is in a preexisting table including one or more normalized representations has an entry in the first mapping table; and the method further comprises: for each distinct data value belonging to the batch of instances, generating a corresponding entry in a third mapping table in the database system for the each distinct data value, and generating a normalized representation in the third mapping table transforming the each distinct data value according to the default transformation; querying the third mapping table and the first mapping table to determine whether transforming the each distinct data values according to the default transformation resulted in one or more collision sets of values for the each distinct data value, wherein different values of the one or more collision sets of values are transformed to a same normalized value; for each value of the one or more collision sets of values, resolving collision by transforming all but a single value of the values belonging to the one or more collision sets of values according to collision resolution method, whereby, if a first value of the values is in the first mapping table, the first value is not the single value that is transformed; setting transformation indications in entries of the third mapping table for the single value that is transformed to indicate that the normalized representation has been created according to the collision resolution method; merging the third mapping table into the first mapping table; and for each instance of data value in the batch of instances of data values, adding a corresponding normalized representation of the each instance of data value to a preexisting table.
6. The method set forth in claim 5, wherein the action of adding the corresponding normalized representation of the instance of data value to the preexisting table includes: adding the corresponding normalized representation to a second working table including one or more normalized representations; and adding the second working table including the one or more normalized representations to the preexisting table.
7. The method set forth in claim 6, wherein the batch of instances of data values includes RDF (Resource Description Framework) triples, and instances of data values in the batch comprise lexical values belonging to the RDF triples; entries in the preexisting table include respective normalized representations of the RDF triples; entries in the first mapping table contain a distinct lexical value, the distinct lexical value's normalized representation, and the hash method indication; entries in the third mapping table include a distinct lexical value from the batch of instances of data values, a normalized representation of the distinct lexical value, and a transformation indication; and entries in the second mapping table include one or more normalized representations of the RDF triples in the batch of instances of data values.
8. The method set forth in claim 7, wherein the RDF (Resource Description Framework) triples in the preexisting table belong to one or more RDF models, the RDF triples in the preexisting table are grouped by the one or more RDF models, and each model of the one or more RDF models occupies a separately-indexed partition of the preexisting table, and the RDF triples in the batch belong to a single model and the second mapping table is added to a partition of the single model.
9. The method set forth in claim 8, wherein the second table is added by a technique that does not require moving the one or more normalized representations of the RDF triples in the second table.
10. The method of claim 1, further comprising: performing the transformation on the distinct data value to determine a first transformed distinct data value; determining whether there exists a collision between the first transformed distinct data value and a second transformed distinct data value that is determined by performing the transformation on a second distinct data value of the batch of instances of data values; and determining whether the distinct data value comprises a literal value that is in a canonical form.
11. The method of claim 10, the action of transforming the each instance further comprising: identifying or determining a unique identifier for the each instance, in which the unique identifier uniquely identifies the each instance of the batch of instances of data values.
12. The method of claim 11, the unique identifier comprising the transformed value, in which the collision is determined to exist between the first transformed distinct data value and the second transformed distinct data value.
13. The method of claim 11, the action of identifying or determining the unique identifier comprising: combining the transformed value with the each instance of the batch of instances of data values to determine a combined transformed value; and generating the unique identifier by performing the first transformation on the combined transformed value.
14. The method of claim 1, the transformation generation information indicating what transformation is to be performed on the distinct data value.
15. The method of claim 1, the first entry for the distinct data value in the second mapping table indicating non-collision between the normalized representation of the distinct data value and one or more other normalized representations of data values in the batch.
16. The method of claim 1, the action of transforming the each instance comprising: selecting the default transformation without accounting for collision resolution, where the corresponding distinct data value entry does not exist in the second mapping table.
17. An apparatus, comprising: at least one processor in a computing system that is to map a distinct data value of a batch of instances of data values into a transformation generation information for the distinct data value in a database system; determine an entry in a first mapping table for the distinct data value, the entry including a normalized representation which is generated by performing a transformation on the distinct data value based at least in part upon the transformation generation information, in which the transformation generation information indicates how to generate the normalized representation for the distinct data value according to a collision resolution transformation; generate a second mapping table by querying the first mapping table, the second mapping table including a first entry for the distinct data value based at least in part upon the transformation generation information; and transform each instance of the batch of instances of data values into a transformed value by at least selecting and applying a default transformation on the each instance based at least in part upon a determination of whether a corresponding distinct data value entry for the each instance exists in the second mapping table; and one or more memory elements to store the first mapping table, the second mapping table, and the transformed value for the each instance of the batch of instances of data values.
18. The apparatus of claim 17, in which the at least one processor is further to perform the transformation on the distinct data value to determine a first transformed distinct data value; determine whether there exists a collision between the first transformed distinct data value and a second transformed distinct data value that is determined by performing the transformation on a second distinct data value of the batch of instances of data values; and determine whether the distinct data value comprises a literal value that is in a canonical form.
19. The apparatus of claim 18, the at least one processor that is to transform the each instance is further to identify or determine a unique identifier for the each instance, in which the unique identifier uniquely identifies the each instance of the batch of instances of data values.
20. The apparatus of claim 18, the at least one processor that is to identify or determine the unique identifier is further to combine the transformed value with the each instance of the batch of instances of data values to determine a combined transformed value; and generate the unique identifier by performing the first transformation on the combined transformed value.
21. The apparatus of claim 17, in which the at least one processor that is to transform the each instance is further to select and perform the default transformation without accounting for collision resolution, where the corresponding distinct data value entry does not exist in the second mapping table.
22. An article of manufacture comprising a non-transitory computer accessible storage medium storing thereupon a sequence of instructions which, when executed by at least one processor, causes the at least one processor to perform a method, the method comprising: using the at least one processor in a computing system to perform a process, the process comprising: mapping a distinct data value of a batch of instances of data values into a transformation generation information for the distinct data value in a database system; determining an entry in a first mapping table for the distinct data value, the entry including a normalized representation which is generated by performing a transformation on the distinct data value based at least in part upon the transformation generation information, in which the transformation generation information indicates how to generate the normalized representation for the distinct data value according to a collision resolution transformation; generating a second mapping table by querying the first mapping table, the second mapping table including a first entry for the distinct data value based at least in part upon the transformation generation information; and transforming each instance of the batch of instances of data values into a transformed value by at least selecting and applying a default transformation on the each instance based at least in part upon a determination of whether a corresponding distinct data value entry for the each instance exists in the second mapping table.
23. The article of manufacture of claim 22, the process further comprising: performing the transformation on the distinct data value to determine a first transformed distinct data value; determining whether there exists a collision between the first transformed distinct data value and a second transformed distinct data value that is determined by performing the transformation on a second distinct data value of the batch of instances of data values; and determining whether the distinct data value comprises a literal value that is in a canonical form.
24. The article of manufacture of claim 23, the action of transforming the each instance comprising: identifying or determining a unique identifier for the each instance, in which the unique identifier uniquely identifies the each instance of the batch of instances of data values.
25. The article of manufacture of claim 23, the action of identifying or determining the unique identifier comprising: combining the transformed value with the each instance of the batch of instances of data values to determine a combined transformed value; and generating the unique identifier by performing the first transformation on the combined transformed value.
26. The article of manufacture of claim 22, in which the action of transforming the each instance comprises: selecting and performing the default transformation without accounting for collision resolution, where the corresponding distinct data value entry does not exist in the second mapping table.
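As a reading aid for the table-based flow recited in claims 1, 2, and 16, the following Python sketch uses an in-memory SQLite database. It is one possible interpretation under stated assumptions, not the claimed implementation: the table names `first_map` and `second_map`, the column layout, the deliberately small 8-bit hash, and the salted rehash used for collision resolution are all hypothetical.

```python
import hashlib
import sqlite3


def h(text: str, bits: int = 8) -> int:
    """Assumed default transformation: SHA-256 truncated to `bits` bits."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def normalize_batch(instances, bits: int = 8):
    """Return (uids, collided): one UID per instance, plus the collided distinct values."""
    db = sqlite3.connect(":memory:")
    # First mapping table: one row per distinct value, with its normalized
    # representation (uid) and a transformation indication (attempt).
    db.execute(
        "CREATE TABLE first_map (value TEXT PRIMARY KEY, uid INTEGER UNIQUE, attempt INTEGER)"
    )
    for value in sorted(set(instances)):
        attempt, uid = 0, h(value, bits)
        # Hypothetical collision resolution: salt with an attempt counter and rehash.
        while db.execute("SELECT 1 FROM first_map WHERE uid = ?", (uid,)).fetchone():
            attempt += 1
            uid = h(f"{attempt}:{value}", bits)
        db.execute("INSERT INTO first_map VALUES (?, ?, ?)", (value, uid, attempt))

    # Second mapping table: built by querying the first, keeping only the
    # distinct values whose default hash collided.
    db.execute(
        "CREATE TABLE second_map AS SELECT value, attempt FROM first_map WHERE attempt > 0"
    )
    collided = dict(db.execute("SELECT value, attempt FROM second_map"))
    db.close()

    uids = []
    for value in instances:
        attempt = collided.get(value)
        if attempt is None:
            uids.append(h(value, bits))  # no entry: default transformation
        else:
            uids.append(h(f"{attempt}:{value}", bits))  # entry: collision resolution
    return uids, collided
```

Because the second table holds only the collided values, it stays small when collisions are rare, and most instances in a batch are transformed with the default hash alone.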
August 8, 2008
December 13, 2011