Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled in one of many known ways to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data. Further methods involve the use of weighted sampling and weighted selection of outlier values for low selectivity queries, or queries having group by.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A computer-implemented method of estimating a result of a database query for a set of values in a database, the method comprising: identifying outlier values in the database, wherein the outlier values occur in the database with a first frequency; aggregating the outlier values; sampling non-outlier values, wherein the non-outlier values occur in the database with a second frequency greater than the first frequency; aggregating the sampled values; estimating the result of the database query for an aggregation of the set of values including a combination of the aggregated outlier and aggregated sampled values; and storing the estimated result.
2. The method of claim 1 wherein the values are sorted prior to determining the outlier values.
3. The method of claim 2 wherein a window is slid across the multiple sorted values and a variance is determined for values within each position of the window.
4. The method of claim 3 wherein a threshold is used to determine the size of the window.
5. The method of claim 4 wherein the threshold is equal to the number of outliers to be considered.
6. The method of claim 4 wherein the threshold is dependent on the amount of memory available to devote to outliers.
7. A computer-implemented system for estimating a result of a database query for a set of values in a database, the system comprising: means for identifying outlier values in the database, wherein the outlier values occur in the database with a first frequency; means for aggregating the outlier values; means for sampling non-outlier values, wherein the non-outlier values occur in the database with a second frequency greater than the first frequency; means for aggregating the sampled values; means for estimating the result of the database query for an aggregation of the set of values including a combination of the aggregated outlier and aggregated sampled values; and storing the estimated result.
8. A computer-implemented system for estimating a result of a database query for a set of values in a database, the system comprising: a module that identifies outlier values relevant to a query, wherein the outlier values occur in the database with a first frequency; a module that aggregates the outlier values; a module that samples non-outlier values, wherein the non-outlier values occur in the database with a second frequency greater than the first frequency; a module that aggregates the sampled values; a module that estimates the result of the database query for an aggregation of the set of values including a combination of the aggregated outlier and aggregated sampled values; and storing the estimated result.
9. A computer readable medium excluding carrier waves having instructions stored thereon, which when executed by at least one processor, cause the processor to perform a method of estimating a result of a database query for a set of values in a database, the method comprising: identifying outlier values, wherein the outlier values occur in the database with a first frequency; aggregating the outlier values; sampling non-outlier values, wherein the non-outlier values occur in the database with a second frequency greater than the first frequency; aggregating the sampled values; estimating the result of the database query for an aggregation of the set of values including a combination of the aggregated outlier and aggregated sampled values; and storing the estimated result.
Cooperative Patent Classification codes for this invention. Click any code to explore related patents in that topic.
June 22, 2004
March 13, 2007
Browse 5M+ US patents with plain-English claim translations and AI-generated analysis.