A method and system for predicting customer behavior based on the geography of a data network are provided. Furthermore, a method and system for evaluating the training of a predictive algorithm to determine if the algorithm does not adequately take into consideration the influences of data network geography are also provided. The method and system generate frequency distributions of a customer database data set, training data set and testing data set and compare the frequency distributions of data network geographical characteristics to determine if there are discrepancies. If the discrepancies are above a predetermined tolerance, one or more of the data sets may not be representative of the customer database taking into account data network geographical influences on customer behavior. Thus, recommendations for improving the training data set and/or testing data set are then provided such that the data set is more representative of the data network geographical influences. Once trained, the predictive algorithm may be utilized to predict customer behavior taking into account the influences of data network geography.
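The comparison described in the abstract can be illustrated with a minimal sketch. It assumes each record has already been reduced to a single data network geographical characteristic (here, a hypothetical count of network links between customer and web site), measures discrepancy as the largest gap in relative frequency, and uses an arbitrary tolerance. The function names, the sample data, and the discrepancy measure are illustrative assumptions, not taken from the patent.

```python
from collections import Counter


def frequency_distribution(values):
    """Relative frequency of each observed value (e.g. number of network links)."""
    if not values:
        raise ValueError("cannot build a distribution from an empty data set")
    total = len(values)
    return {value: count / total for value, count in Counter(values).items()}


def max_discrepancy(dist_a, dist_b):
    """Largest absolute difference in relative frequency over all observed values."""
    observed = set(dist_a) | set(dist_b)
    return max(abs(dist_a.get(v, 0.0) - dist_b.get(v, 0.0)) for v in observed)


def is_representative(sample, population, tolerance):
    """True if the sample's distribution stays within tolerance of the population's."""
    gap = max_discrepancy(
        frequency_distribution(sample), frequency_distribution(population)
    )
    return gap <= tolerance


# Hypothetical link counts per customer record (assumed data, for illustration only).
customer_db = [1, 1, 2, 2, 2, 3, 3, 4, 5, 5]
training = [1, 2, 2, 3, 5]   # roughly follows the customer database
testing = [5, 5, 5, 4, 4]    # skewed toward distant customers

for name, data_set in (("training", training), ("testing", testing)):
    ok = is_representative(data_set, customer_db, tolerance=0.15)
    print(name, "representative" if ok else "re-select entries")
```

A data set flagged for re-selection here corresponds to the case in which the abstract's recommendations for improving the training or testing data set would be generated.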
Legal claims defining the scope of protection, as filed with the USPTO.
1. A data processing machine implemented method of selecting data sets for use with a predictive algorithm based on data network geographical information, comprising data processing machine implemented steps of: generating, by the data processing machine, a first statistical distribution of a training data set; generating, by the data processing machine, a second statistical distribution of a testing data set; using, by the data processing machine, the first statistical distribution and the second statistical distribution to identify a discrepancy between the first statistical distribution and the second statistical distribution with respect to the data network geographical information by comparing at least one of the first statistical distribution and the second statistical distribution to a statistical distribution of a customer database to determine if at least one of the training data set and the testing data set are geographically representative of a customer population represented by the customer database; modifying, by the data processing machine, selection of entries in one or more of the training data set and the testing data set based on the discrepancy between the first statistical distribution and the second statistical distribution; and using the modified selection of entries by the predictive algorithm.
2. The method of claim 1, wherein the first statistical distribution and the second statistical distribution are distributions of a number of data network links from a customer data network geographical location to a web site data network geographical location.
3. The method of claim 1, wherein the first statistical distribution and the second statistical distribution are distributions of a size of a click stream for arriving at a web site data network geographical location.
4. The method of claim 1, wherein comparing the first statistical distribution and the second statistical distribution includes comparing one or more of a mean, mode, and standard deviation of the first statistical distribution to one or more of a mean, mode, and standard deviation of the second statistical distribution.
5. The method of claim 1, wherein the first statistical distribution and the second statistical distribution are distributions of a weighted data network geographical distance between a customer data network geographical location and a web site data network geographical location.
6. The method of claim 1, wherein the first statistical distribution and the second statistical distribution are distributions of a weighted click stream for arriving at a web site data network geographical location.
7. The method of claim 1, wherein modifying selection of entries in one or more of the training data set and the testing data set includes generating recommendations for improving selection of entries in one or more of the training data set and the testing data set, and wherein the method of claim 1 further comprises re-generating at least one of the first statistical distribution and the second statistical distribution based upon the recommendations.
8. The method of claim 1, wherein the training data set and the testing data set are selected from a customer information database comprising information with respect to customers who have purchased any of goods and services over a data network, wherein the data network geographic information pertains to geographic information of the data network.
9. The method of claim 1, wherein the first statistical distribution and second statistical distribution are frequency distributions of number of data network links between a customer geographical location and one or more web site data network geographical locations, and size of a click stream for arriving at one or more web site data network geographical locations.
10. The method of claim 1, wherein comparing at least one of the first statistical distribution and the second statistical distribution to a statistical distribution of a customer database includes: generating a composite data set from the training data set and the testing data set; and generating a composite statistical distribution from the composite data set that was generated from the training data set and the testing data set.
11. The method of claim 1, wherein modifying selection of entries in one or more of the training data set and the testing data set includes changing one of a random selection algorithm and a seed value for the random selection algorithm, and then re-comparing the first statistical distribution and the second statistical distribution.
12. The method of claim 1, wherein using the modified selection of entries by the predictive algorithm includes training the predictive algorithm using at least one of the training data set and the testing data set if the discrepancy is within a predetermined tolerance.
13. The method of claim 12, wherein the predictive algorithm is a discovery based data mining algorithm.
14. An apparatus for selecting data sets for use with a predictive algorithm based on data network geographical information, comprising: a statistical engine; a comparison engine coupled to the statistical engine, wherein the statistical engine generates a first statistical distribution of a training data set and a second distribution of a testing data set, the comparison engine uses the first statistical distribution and the second distribution to identify a discrepancy between the first statistical distribution and the second distribution with respect to the data network geographical information by comparing at least one of the first statistical distribution and the second statistical distribution to a statistical distribution of a customer database to determine if at least one of the training data set and the testing data set are geographically representative of a customer population represented by the customer database, modifies selection of entries in one or more of the training data set and the testing data set based on the discrepancy between the first statistical distribution and the second distribution, and provides the modified selection of entries for use by the predictive algorithm; and a predictive algorithm device that uses the modified selection of entries and the predictive algorithm.
15. The apparatus of claim 14, wherein the first statistical distribution and the second statistical distribution are distributions of a number of data network links from a customer data network geographical location to a web site data network geographical location.
16. The apparatus of claim 14, wherein the first statistical distribution and the second statistical distribution are distributions of a size of a click stream to arrive at a web site data network geographical location.
17. The apparatus of claim 14, wherein the comparison engine compares the first statistical distribution and the second statistical distribution by comparing one or more of a mean, mode, and standard deviation of the first statistical distribution to one or more of a mean, mode, and standard deviation of the second statistical distribution.
18. The apparatus of claim 14, wherein the first statistical distribution and the second statistical distribution are distributions of a weighted number of data network links between a customer data network geographical location and a web site data network geographical location.
19. The apparatus of claim 14, wherein the first statistical distribution and the second statistical distribution are distributions of a weighted size of a click stream to arrive at a web site data network geographical location.
20. The apparatus of claim 14, wherein the comparison engine modifies selection of entries in one or more of the training data set and the testing data set by generating recommendations for improving selection of entries in one or more of the training data set and the testing data set, and wherein the statistical engine re-generates at least one of the first statistical distribution and the second statistical distribution based upon the recommendations.
21. The apparatus of claim 14, further comprising a training data set/testing data set selection device that selects the training data set and the testing data set from a customer information database comprising information with respect to customers who have purchased any of goods and services over a data network, wherein the data network geographic information pertains to geographic information of the data network.
22. The apparatus of claim 14, wherein the first statistical distribution and second statistical distribution are frequency distributions of a number of data network links between a customer data network geographical location and one or more web site data network geographical locations, and a size of a click stream to arrive at one or more web site data network geographical locations.
23. The apparatus of claim 14, wherein the comparison engine compares at least one of the first statistical distribution and the second statistical distribution to a statistical distribution of a customer database by: generating a composite data set from the training data set and the testing data set; and generating a composite statistical distribution from the composite data set that was generated from the training data set and the testing data set.
24. The apparatus of claim 14, wherein the comparison engine modifies selection of entries in one or more of the training data set and the testing data set by changing one of a random selection algorithm and a seed value for the random selection algorithm, and then re-comparing the first statistical distribution and the second statistical distribution.
25. The apparatus of claim 14, wherein the predictive algorithm device is trained using at least one of the training data set and the testing data set if the discrepancy is within a predetermined tolerance.
26. The apparatus of claim 25, wherein the predictive algorithm is a discovery based data mining algorithm.
27. A data processing machine implemented method of predicting customer behavior based on data network geographical influences, comprising data processing machine implemented steps of: obtaining data network geographical information regarding a plurality of customers, the data network geographic information comprising frequency distributions of both (i) number of data network links between a customer geographical location and one or more web site data network geographical locations, and (ii) size of a click stream for arriving at the one or more web site data network geographical locations; training a predictive algorithm using the data network geographical information; and using the predictive algorithm to predict customer behavior based on the data network geographical information.
28. An apparatus for predicting customer behavior based on data network geographical influences, comprising: means for obtaining data network geographical information regarding a plurality of customers, the data network geographic information comprising frequency distributions of both (i) number of data network links between a customer geographical location and one or more web site data network geographical locations, and (ii) size of a click stream for arriving at the one or more web site data network geographical locations; means for training a predictive algorithm using the data network geographical information; and means for using the predictive algorithm to predict customer behavior based on the data network geographical information.
29. A computer program product in a computer readable medium comprising instructions for enabling a data processing machine to predict customer behavior based on data network geographical influences, comprising: first instructions for obtaining data network geographical information regarding a plurality of customers, the data network geographic information comprising frequency distributions of both (i) number of data network links between a customer geographical location and one or more web site data network geographical locations, and (ii) size of a click stream for arriving at the one or more web site data network geographical locations; second instructions for training a predictive algorithm using the data network geographical information; and third instructions for using the predictive algorithm to predict customer behavior based on the data network geographical information.
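For illustration only, the comparison of claims 4 and 17 (mean, mode, and standard deviation) and the re-selection of claims 11 and 24 (changing the seed value of a random selection algorithm, then re-comparing) might be sketched as follows. The function names, the 50/50 split, the use of the largest gap across the three statistics as the discrepancy, and the list of candidate seeds are all assumptions made for this sketch; the claims do not prescribe them. The sketch assumes Python 3.8 or later, where `statistics.mode` returns the first mode encountered when several values tie.

```python
import random
import statistics


def summary(values):
    """Mean, mode, and population standard deviation of one data set."""
    return (
        statistics.mean(values),
        statistics.mode(values),
        statistics.pstdev(values),
    )


def discrepancy(first, second):
    """Largest absolute gap between the two data sets' summary statistics."""
    return max(abs(a - b) for a, b in zip(summary(first), summary(second)))


def split(records, seed, train_fraction=0.5):
    """Randomly partition records into training and testing sets for a given seed."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def select_with_reseeding(records, tolerance, seeds):
    """Try each seed in turn; return the first split whose discrepancy is in tolerance.

    Returns (seed, training, testing), or None if no candidate seed qualifies.
    """
    for seed in seeds:
        training, testing = split(records, seed)
        if discrepancy(training, testing) <= tolerance:
            return seed, training, testing
    return None


# Hypothetical click-stream sizes per customer record (assumed data).
click_stream_sizes = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9, 12, 15]
result = select_with_reseeding(click_stream_sizes, tolerance=1.5, seeds=range(20))
if result is None:
    print("no candidate seed produced an acceptable split")
else:
    seed, training, testing = result
    print("seed", seed, "training", training, "testing", testing)
```

Whether any seed succeeds depends on the data and the tolerance, so the sketch reports the no-match case rather than assuming one exists.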
June 12, 2001
February 9, 2010