Single-Pass Low-Storage Arbitrary Probabilistic Location Estimation for Massive Data Sets

Published: July 11, 2006

Assignee: not available in USPTO data we have

Inventors: John C. Liechty, James P. McDermott, Dennis K.J. Lin

Technical Abstract

Patent Claims

47 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computer-assisted method for providing an estimate of a summary of a data set generated by an unknown distribution, comprising: selecting a subset of data points from the data set; applying a scoring rule to each data point of the subset of data points based on a summary of a set of estimated relative locations and assigned weight for each data point to provide a separate score for each data point; selectively retaining data points to track based on the score for each data point; and providing by a processor an estimate of the summary of the data set based on the retained data points; wherein determining the estimated relative location for each point comprises determining the point's relative location to retained data points, applying a linear interpolation to determine the point's relative location when the point is not at a boundary, and applying a nonlinear interpolation to determine the point's relative location when the point is at a boundary.

2. The method of claim 1 wherein the summary of the data set is selected from the set comprising a cumulative density function, a probability density function, a parametric summary, a semi-parametric summary, and a non-parametric summary of the data set.

3. The method of claim 1 wherein the summary of a set of estimated relative location is a single rank estimate.

4. The method of claim 1 wherein between 20 and 100 data points are tracked.

5. The method of claim 1 wherein the estimate is a set of probabilistic location.

6. The method of claim 5 wherein the estimate is a single point quantile estimate.

7. The method of claim 1 wherein the estimated relative location for each data point is a function of the previous and current relative location and weights for each of the data points.

8. The method of claim 1 wherein the step of selectively retaining a data point includes retaining data points having the smallest individual score and discarding data points having the largest individual scores.

9. The method of claim 1 wherein 100 or less data points are retained.

10. A computer-assisted method for providing an estimate of a summary of a data set generated by an unknown distribution, comprising: (a) inputting m data points in a data set; (b) assigning a relative location to each said m data points; (c) assigning a weight to each said m data points; (d) inputting a subset, n, of the remaining data points; (e) estimating a relative location for each said m and n data points; (f) assigning a weight to each said m and n data points; (g) scoring each said m and n data points based on the relative location and weight for each of the m and n data points to provide an individual score for each of the m and n data points; (h) retaining a subset of said m and n data points, their associated estimated relative locations and weights, the subsets having fewer data points than m and n, the retained data points becoming the m data points; (i) repeating steps (d) through (h) until all data points have been analyzed; (j) providing the estimate of the summary of the data set based on said m data points.

11. The method of claim 10 wherein the summary of the data set is selected from the set comprising a cumulative density function, a probability density function, a parametric summary, a semi-parametric summary of the data set.

12. The method of claim 10 wherein the estimated relative location is a rank estimate.

13. The method of claim 12 , wherein the rank assigned to each said m data point is a function of an actual rank of the m data points after said points have been partially or fully sorted.

14. The method for claim 13 , wherein the rank assigned to each said m data point is the actual rank of the m data points after said points have been sorted.

15. The method of claim 12 , wherein the estimated rank for each said m data point is a function that uses in part or in its entirety any or all of the following as arguments: said n data points, any of the rank estimates for said m data points, the total number of data points that have been inputted and the total number of data points in the data set.

16. The method of claim 15 , wherein the estimated rank for each said m data point is the previous rank estimate for the data point plus the number of said n data points with a value lower than the said data point.
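The rank update recited in claim 16 can be sketched in a few lines of Python. This is only an illustrative reading of the claim language, not the patentees' implementation; the function name `update_tracked_ranks` and its argument names are hypothetical.

```python
from bisect import bisect_left


def update_tracked_ranks(tracked_values, tracked_ranks, new_values):
    """Return updated rank estimates for the tracked (m) data points.

    Each tracked point's new estimate is its previous estimate plus the
    number of newly inputted (n) points with a strictly lower value.
    """
    sorted_new = sorted(new_values)
    # bisect_left gives the count of new values strictly below v.
    return [
        rank + bisect_left(sorted_new, value)
        for value, rank in zip(tracked_values, tracked_ranks)
    ]


print(update_tracked_ranks([10, 20, 30], [1, 2, 3], [5, 25, 20]))
```

New values equal to a tracked value do not raise that tracked point's rank here, because the claim counts only points with a lower value.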

17. The method of claim 12 , wherein the estimated rank for each said n data point is a function that uses in part or in its entirety the previous and current estimated ranks and weights for the said m data points.

18. The method of claim 17 , wherein the estimated rank for one of the said n data points where said point is not a new maximum or minimum with regards to all of the data points considered up to this point, is a function of the current rank estimate and value of the said m data points which are immediately above and below said point.

19. The method of claim 18 , wherein the estimated rank for one of n data points where said point is not immediately adjacent to the largest or smallest of the m data points, is a linear interpolation of the current rank estimate of the m data points which are immediately above and below the said point.

20. The method of claim 18 , wherein the estimated rank for one of n data points where said point is immediately adjacent to either the largest or smallest of the m data points is a non-linear interpolation of the current rank estimate of the m data points which are immediately above and below said point.
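One possible reading of the linear interpolation in claim 19 is sketched below in Python. The claims do not state the interpolation variable, so interpolating on the data values is an assumption, and the name `interpolate_rank` is hypothetical. The non-linear form of claim 20 is not specified in the claims and is therefore not sketched.

```python
from bisect import bisect_left


def interpolate_rank(tracked_values, tracked_ranks, x):
    """Linearly interpolate a rank estimate for x between its two neighbours.

    tracked_values must be sorted ascending, with tracked_ranks aligned to it.
    Assumption: the interpolation is linear in the data values.
    """
    i = bisect_left(tracked_values, x)
    if i == 0 or i == len(tracked_values) or tracked_values[i] == x:
        # New extremes and exact ties are covered by other claims.
        raise ValueError("x must lie strictly between two tracked values")
    v_lo, v_hi = tracked_values[i - 1], tracked_values[i]
    r_lo, r_hi = tracked_ranks[i - 1], tracked_ranks[i]
    return r_lo + (r_hi - r_lo) * (x - v_lo) / (v_hi - v_lo)


print(interpolate_rank([10.0, 20.0, 40.0, 80.0], [1.0, 5.0, 9.0, 12.0], 30.0))
```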

21. The method of claim 17 , wherein the estimated rank for each said n data point is equal to the value 1 where that point is smaller than all of the m data points and the remaining n data points and wherein the estimated rank for each said n data point where that point is smaller than all of the said m data points and larger than at least one of the remaining n data points, is a function of the value of the estimated rank of the smallest of the m data points and the n data points.

22. The method of claim 17 , wherein the estimated rank for each n data point where that point is larger than all of the m data points and the remaining n data points, is equal to the total number of data points that have been inputted and wherein the estimated rank for each n data point where the point is larger than all of the m data points and smaller than at least one of the remaining n data points, is a function of the value of the estimated rank of the largest of the m data points and the n data points.

23. The method of claim 12 , wherein the initial assigned weight to the m data points is a function that uses in part or in its entirety any or all of the following as arguments: the total number of data points in the data set, the total number of data points in the initial group, the number of data points that are to be inputted during each iteration, and the estimated rank of the m data points.

24. The method of claim 12 , wherein the initial assigned weight to said m data points is equal to a constant.

25. The method of claim 12 , wherein the weight assigned to the m and n data points, for all but the initial assigning of weights for the m data points, is a function that uses in part or in its entirety any or all of the previously assigned weights for the m data points, the current weights for the n data points and any of the estimated ranks for the m and n data points.

26. The method of claim 25 , wherein the weight assigned to each of said m data points is equal to the weight initially assigned to that said point.

27. The method of claim 25 , wherein the weight assigned to each of said n data points is equal to a function of the distances, as defined by any metric, between any of the estimated ranks of said m data points and any of the estimated ranks of said remaining n data points and the actual values of said m and n data points.

28. The method of claim 27 , wherein the weight assigned to each of n data points is equal to the smaller of the two distances, as defined by the absolute value of the distance derived by standard subtraction, between the estimated rank of said data point and the estimated rank of m data points which are immediately above and below said point.
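The weight rule of claim 28 reduces to a single expression. The Python sketch below is illustrative only, and the name `new_point_weight` and its parameters are hypothetical.

```python
def new_point_weight(rank, rank_below, rank_above):
    """Weight of a new point: the smaller absolute rank distance to its
    two tracked neighbours (the points immediately below and above it)."""
    return min(abs(rank - rank_below), abs(rank_above - rank))


print(new_point_weight(6.0, 5.0, 9.0))
```

Under this rule, a new point whose estimated rank sits close to a tracked neighbour receives a small weight.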

29. The method of claim 12 , wherein the subset of inputted data points, n, are determined by comparing elements of the n data points and said m data points being tracked.

30. The method of claim 29 , wherein some or all data points in the subset that have a value exactly equal to the value of one of said m data points being tracked are used to calculate the estimated rank and assigned weight for said m data points being tracked and then discarded.

31. The method of claim 30 , wherein all data points to be discarded are discarded by assigning a score of minus infinity to said data points.

32. The method of claim 29 , wherein two or more of the subset of inputted data points have the same value, all of the data points with equal value are used to calculate the estimated rank and assigned weight for m data points being tracked and all but one of said group are discarded, unless the group of data points with equal value are exactly equal to the value of one of said m data points being tracked in which case all of the data points in the group of equal data points will be discarded.

33. The method of claim 32 , wherein all data points to be discarded are discarded by assigning a score of minus infinity to said data points.

34. The method of claim 12 , wherein the subset of data points selected are the next n data points in the data set, as determined by the order in which they were recorded, unless there is less than n data points left in the data set, where the subset will be the remaining data points.

35. The method of claim 34 , wherein the size of the subset of data points being selected is equal to one.

36. The method of claim 29 , wherein the score for each m and n data points is a function that uses in part or in its entirety any or all of the following as arguments: any or all previous weights, ranks or scores, any or all previously assigned weights for said m data points, the current weight for said n data points, any of estimated ranks for m and n data points, actual values of the m and n data points, the total number of data points inputted and the total number of data points in the data set.

37. The method of claim 36 , wherein the score for each m and n data points is a function of the estimated ranks and assigned weights for said m and n data points and the number of data points inputted.

38. The method of claim 37 , wherein the score for each m and n data points is a function of estimated rank, assigned weight and a target rank, the target rank being a fixed proportion of the number of data points inputted, assigned to each said data point where said data point is not the largest or smallest of said m and n data points.

39. The method of claim 38 , wherein the score for each m and n data points, where the said data point does not have the largest or smallest value of the m and n data points is equal to the distance, as defined by any metric, between the estimated rank and the target rank multiplied by any function of the assigned weight for said data point.

40. The method of claim 39 , wherein the score for each m and n data points, where the said data point does not have the largest or smallest value of the m and n said data points is equal to the absolute value of the estimated rank minus the target rank divided by the assigned weight for said data point.

41. The method of claim 37 , wherein the score for each m and n data points is equal to zero, where the data point has the largest or smallest value of the m and n data points.

42. The method of claim 12 , wherein a subset of m and n data points and their associated estimated ranks and weights are retained based on a comparison of the score calculated for each said data point.

43. The method of claim 42 , wherein the m data points with the smallest score of m and n data points are retained along with their associated estimated ranks and assigned weights.
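The scoring and retention rules of claims 38 through 43 can be illustrated together. The Python sketch below is one reading of the claim language under stated assumptions: the target rank is taken as `p * n_seen` for a target proportion `p`, all weights are assumed positive, and the function names `score_points` and `retain_smallest` are hypothetical.

```python
def score_points(values, ranks, weights, p, n_seen):
    """Score each point as |estimated rank - target rank| / weight.

    The smallest and largest values score zero (claim 41), so they are
    always among the lowest-scoring points. Weights are assumed positive.
    """
    target = p * n_seen  # target rank: fixed proportion of points inputted
    lo, hi = min(values), max(values)
    return [
        0.0 if v in (lo, hi) else abs(r - target) / w
        for v, r, w in zip(values, ranks, weights)
    ]


def retain_smallest(values, ranks, weights, scores, m):
    """Keep the m points with the smallest scores, returned in value order."""
    keep = sorted(range(len(values)), key=lambda i: scores[i])[:m]
    keep.sort(key=lambda i: values[i])
    return (
        [values[i] for i in keep],
        [ranks[i] for i in keep],
        [weights[i] for i in keep],
    )


values, ranks, weights = [1, 4, 6, 9, 12], [1, 3, 5, 8, 10], [1, 1, 2, 1, 1]
scores = score_points(values, ranks, weights, p=0.5, n_seen=10)
print(retain_smallest(values, ranks, weights, scores, m=4))
```

In this example the point with value 9 has the largest score and is the one dropped.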

44. The method for claim 10 , wherein the summary of the data set is a cumulative density function estimated as a function of the value and the estimated rank of said m data points.

45. The method for claim 44 , wherein an unknown quantile is estimated as a function of the value and the estimated rank of said m data points.

46. The method for claim 45 , wherein the estimate of an unknown quantile is equal to the value of the data point of said m data points with the smallest distance, as defined by any metric, between the estimated rank and the target rank associated with said quantile.

47. The method for claim 46 , wherein the estimate of an unknown quantile is equal to the value of the data point of said m data points with the smallest distance, using the absolute value of the difference between the estimated rank and the target rank associated with this quantile, as defined by the proportion of the total data set associated with this quantile.
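The final quantile read-out of claims 46 and 47 can be sketched as follows. This is a minimal illustration, assuming the target rank is `p * total` for a quantile proportion `p`; the name `estimate_quantile` and its parameters are hypothetical.

```python
def estimate_quantile(tracked_values, tracked_ranks, p, total):
    """Return the tracked value whose estimated rank is closest, in
    absolute difference, to the target rank p * total."""
    if not tracked_values:
        raise ValueError("no tracked data points")
    target = p * total
    best = min(
        range(len(tracked_values)),
        key=lambda i: abs(tracked_ranks[i] - target),
    )
    return tracked_values[best]


print(estimate_quantile([1, 4, 6, 12], [1, 3, 5, 10], p=0.5, total=10))
```

When two tracked points are equally close to the target rank, this sketch returns the one that appears first; the claims do not say how such a tie should be resolved.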

Patent Metadata

Filing Date

Unknown
