A system, method, and computer-readable medium that facilitate in-database supervised discretization mechanisms which improve data classification are provided. The disclosed mechanisms provide an efficient, automatic, and repeatable way to perform data discretization without human intervention. Large and complex unknown data is processed efficiently, and advantageously the data being analyzed does not need to be processed outside the database. The disclosed mechanisms may use an External Stored Procedure to avoid multiple joins of large tables and minimize the number of full table scans and, consequently, provide better performance than contemporary mechanisms. The disclosed system produces intermediate results in tables which may be conveyed to a visualization subsystem, thereby providing users a better understanding of the data distribution in each category. Further, the disclosed system and method introduce a novel similarity-based solution to merge intervals when chi-square testing is not reliable and thereby improve the quality of the interval merge process.
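The chi-square comparison of adjacent intervals mentioned above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function name `chi_square_adjacent`, the constant `MIN_RELIABLE_EXPECTED`, and the use of an expected count of 5 as a reliability cutoff (a common statistical rule of thumb) are all assumptions made for illustration. The similarity-based fallback itself is not sketched because the abstract does not define it.

```python
from typing import Sequence, Tuple

# Assumption: a common rule of thumb treats the chi-square test as unreliable
# when any expected cell count falls below 5. The source does not state a cutoff.
MIN_RELIABLE_EXPECTED = 5.0


def chi_square_adjacent(row_a: Sequence[int], row_b: Sequence[int]) -> Tuple[float, float]:
    """Return (chi-square statistic, smallest expected count) for two adjacent
    rows of per-class counts."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have one count per class")
    total = sum(row_a) + sum(row_b)
    if total == 0:
        return 0.0, 0.0
    col_totals = [a + b for a, b in zip(row_a, row_b)]
    statistic = 0.0
    min_expected = float("inf")
    for row in (row_a, row_b):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            if col_total == 0:
                continue  # class absent from both rows contributes nothing
            expected = row_total * col_total / total
            min_expected = min(min_expected, expected)
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic, min_expected


# Two rows with opposite class distributions give a large statistic.
statistic, min_expected = chi_square_adjacent([10, 0], [0, 10])
reliable = min_expected >= MIN_RELIABLE_EXPECTED
```

A low statistic suggests the two rows have similar class distributions and are candidates for merging; when `reliable` is false, a mechanism of the kind described would fall back to some other similarity measure.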
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method of performing database analytics that facilitates data discretization, comprising: receiving, by a processing module, a query including a selection applied on a first attribute of a base table and a second attribute of the base table, wherein the first attribute comprises numeric values and the second attribute comprises a plurality of discrete values; generating, by the processing module, a contingency table that includes a respective row having an attribute for each numeric value of the first attribute of the base table and a plurality of corresponding attributes for each respective discrete value of the second attribute of the base table, wherein each of the plurality of corresponding attributes store a count of a number of occurrences of one of the discrete values of the second attribute that are identified in a common row of the base table with the numeric value of the first attribute of the base table; identifying adjacent rows of the contingency table suitable to be merged based on an evaluation of similarities between the plurality of corresponding attributes of the contingency table; and assigning a discretized value to the numeric value of the contingency table.
2. The method of claim 1, further comprising generating, by the processing module, a summary table that includes a respective row having an attribute value that corresponds to the first attribute of the base table and a corresponding attribute that has a value of the second attribute of the base table.
3. The method of claim 2, wherein the summary table includes a frequency attribute in which a count of a number of instances of the value of the second attribute identified in a common row of the base table for the attribute value corresponding to the first attribute of the base table.
4. The method of claim 2, wherein the contingency table is generated from the summary table.
5. The method of claim 1, wherein the evaluation of the similarity between the plurality of corresponding attributes of the contingency table is performed by chi-square testing.
6. The method of claim 1, wherein a minimum value and a maximum value of the first attribute of the base table that is discretized are associated with the discretized value.
7. The method of claim 1, wherein assigning a discretized value to the numeric value of the contingency table comprises assigning a plurality of discretized values to respective numeric values of the contingency table.
8. A computer-readable medium having computer-executable instructions for execution by a processing system, the computer-executable instructions for performing database analytics that facilitate data discretization, the computer-executable instructions, when executed, cause the processing system to: receive, by a processing module, a query including a selection applied on a first attribute of a base table and a second attribute of the base table, wherein the first attribute comprises numeric values and the second attribute comprises a plurality of discrete values; generate, by the processing module, a contingency table that includes a respective row having an attribute for each numeric value of the first attribute of the base table and a plurality of corresponding attributes for each respective discrete value of the second attribute of the base table, wherein each of the plurality of corresponding attributes store a count of a number of occurrences of one of the discrete values of the second attribute that are identified in a common row of the base table with the numeric value of the first attribute of the base table; identify, by the processing module, adjacent rows of the contingency table suitable to be merged based on an evaluation of similarities between the plurality of corresponding attributes of the contingency table; and assign, by the processing module, a discretized value to the numeric value of the contingency table.
9. The computer-readable medium of claim 8, further comprising instructions that, when executed, cause the processing system to generate a summary table that includes a respective row having an attribute value that corresponds to the first attribute of the base table and a corresponding attribute that has a value of the second attribute of the base table.
10. The computer-readable medium of claim 9, wherein the summary table includes a frequency attribute in which a count of a number of instances of the value of the second attribute identified in a common row of the base table for the attribute value corresponding to the first attribute of the base table.
11. The computer-readable medium of claim 9, wherein the contingency table is generated from the summary table.
12. The computer-readable medium of claim 8, wherein the evaluation of the similarity between the plurality of corresponding attributes of the contingency table is performed by chi-square testing.
13. The computer-readable medium of claim 8, wherein a minimum value and a maximum value of the first attribute of the base table that is discretized are associated with the discretized value.
14. The computer-readable medium of claim 8, wherein the instructions that assign a discretized value to the numeric value of the contingency table comprise instructions that, when executed, cause the processing system to assign a plurality of discretized values to respective numeric values of the contingency table.
15. A computer system having a database management system configured to perform database analytics that facilitates data discretization, comprising: at least one storage medium on which the database management system and a base table is stored; and at least one processing module that receives a query including a selection applied on a first attribute of a base table and a second attribute of the base table, wherein the first attribute comprises numeric values and the second attribute comprises a plurality of discrete values, generates a contingency table that includes a respective row having an attribute for each numeric value of the first attribute of the base table and a plurality of corresponding attributes for each respective discrete value of the second attribute of the base table, wherein each of the plurality of corresponding attributes store a count of a number of occurrences of one of the discrete values of the second attribute that are identified in a common row of the base table with the numeric value of the first attribute of the base table, identifies adjacent rows of the contingency table suitable to be merged based on an evaluation of similarities between the plurality of corresponding attributes of the contingency table, and assigns a discretized value to the numeric value of the contingency table.
16. The system of claim 15, wherein the processing module generates a summary table that includes a respective row having an attribute value that corresponds to the first attribute of the base table and a corresponding attribute that has a value of the second attribute of the base table.
17. The system of claim 16, wherein the summary table includes a frequency attribute in which a count of a number of instances of the value of the second attribute identified in a common row of the base table for the attribute value corresponding to the first attribute of the base table.
18. The system of claim 16, wherein the contingency table is generated from the summary table.
19. The system of claim 15, wherein the evaluation of the similarity between the plurality of corresponding attributes of the contingency table is performed by chi-square testing.
20. The system of claim 15, wherein a minimum value and a maximum value of the first attribute of the base table that is discretized are associated with the discretized value.
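The data flow recited in the claims (base table, summary table, contingency table, merged adjacent rows, discretized values with minimum and maximum bounds) can be sketched in Python as follows. This is an illustrative, in-memory sketch only and not the claimed in-database implementation: every function name is hypothetical, the merge is a single left-to-right pass rather than an iterative procedure, and the `same_majority_class` predicate is a placeholder similarity test standing in for whatever evaluation (such as chi-square testing) is actually used.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Sequence, Tuple

BaseRow = Tuple[float, Hashable]  # (numeric first attribute, discrete second attribute)
Interval = Tuple[int, float, float, List[int]]  # (discretized value, min, max, class counts)


def build_summary(base_rows: Iterable[BaseRow]) -> Counter:
    """Summary table: (numeric value, class value) -> frequency."""
    return Counter(base_rows)


def build_contingency(summary: Counter, classes: Sequence[Hashable]) -> List[Tuple[float, List[int]]]:
    """Contingency table: one row per numeric value, one count per class."""
    values = sorted({value for value, _ in summary})
    return [(value, [summary[(value, cls)] for cls in classes]) for value in values]


def merge_adjacent(
    contingency: List[Tuple[float, List[int]]],
    similar: Callable[[List[int], List[int]], bool],
) -> List[Interval]:
    """Merge neighbouring rows in one left-to-right pass while `similar` holds."""
    merged: List[Tuple[float, float, List[int]]] = []
    for value, counts in contingency:
        if merged and similar(merged[-1][2], counts):
            low, _, previous = merged[-1]
            merged[-1] = (low, value, [a + b for a, b in zip(previous, counts)])
        else:
            merged.append((value, value, list(counts)))
    return [(bin_id, low, high, counts)
            for bin_id, (low, high, counts) in enumerate(merged, start=1)]


def discretize(value: float, intervals: List[Interval]) -> int:
    """Look up the discretized value whose [min, max] range contains `value`."""
    for bin_id, low, high, _ in intervals:
        if low <= value <= high:
            return bin_id
    raise ValueError(f"{value} falls outside every interval")


def same_majority_class(a: List[int], b: List[int]) -> bool:
    """Placeholder similarity test (an assumption, not from the claims)."""
    return a.index(max(a)) == b.index(max(b))


base = [(1, "a"), (1, "a"), (2, "a"), (3, "b"), (4, "b"), (4, "b")]
summary = build_summary(base)
contingency = build_contingency(summary, classes=["a", "b"])
intervals = merge_adjacent(contingency, same_majority_class)
```

With this toy data, values 1 and 2 fall into one interval and values 3 and 4 into another, and `discretize(2, intervals)` returns the discretized value of the first interval.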
December 31, 2009
March 13, 2012