Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A computer-implemented method for evaluating training data comprising: receiving training data comprising a labeled training set of digital objects, at least some of the digital objects in the labeled training set including a label indicating that the digital object is positive for a respective class selected from a predetermined set of classes with which a classifier is to be trained; grouping the positively labeled digital objects in the labeled training set into positive label groups, one positive label group for each class in the set of classes, each label group comprising digital objects having a label indicating the digital object is positive for the respective class; with a trained categorizer, assigning a score vector to each digital object in the labeled training set of digital objects, the score vector comprising a score for each category of a predetermined set of categories; applying at least one heuristic to the training data to evaluate the training data for training the classifier based on the assigned score vectors and training data labels; and based on the at least one heuristic, providing an evaluation of the training data prior to training the classifier.
A computer program evaluates training data for a classifier. It receives training data containing digital objects, some marked as "positive" for a specific class (e.g., "cat" in an image recognition task). These positive objects are grouped by class. A pre-trained "categorizer" analyzes each object and assigns a score vector indicating the object's similarity to various categories. The system applies heuristics, considering both the score vectors and the object labels, to assess the training data's suitability for training the classifier. Based on these heuristics, it provides an evaluation of the training data before the classifier is trained.
2. The method of claim 1 , wherein applying the at least one heuristic to the training data includes applying a heuristic comprising: for each positive label group, computing a representative score vector based on the score vectors of the digital objects in that positive label group; computing a first distance from a first score vector of a digital object of a first of the positive label groups to the representative score vector of the first of the positive label groups; computing a second distance from the first score vector to the representative score vector of a second of the positive label groups; and comparing the first distance to the second distance.
Building on the previous evaluation method, a specific heuristic is used: For each "positive" class group, a representative score vector is calculated from the score vectors of its members. The distance between a digital object's score vector in one class group and its group's representative vector is computed. Then, the distance between that same digital object's score vector and a *different* class group's representative vector is computed. These two distances are compared to see which class group the digital object is closer to.
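The distance comparison in claims 2 to 4 can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: it assumes the representative vector is a plain (unweighted) mean, and the names `mean_vector`, `euclidean`, `closer_to_other_group`, and the sample score vectors are all hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equal-length score vectors (assumed representative)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    """Straight-line distance between two score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closer_to_other_group(groups, own_label, vector):
    """Return labels of other groups whose representative is nearer than the object's own."""
    reps = {label: mean_vector(vs) for label, vs in groups.items()}
    first = euclidean(vector, reps[own_label])          # first distance (own group)
    return [label for label, rep in reps.items()
            if label != own_label and euclidean(vector, rep) < first]

# Hypothetical score vectors over three categorizer categories.
groups = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.2, 0.7, 0.1]],
    "dog": [[0.1, 0.8, 0.1], [0.2, 0.8, 0.0]],
}
# The third "cat" object scores like a dog, so it is flagged.
flagged = closer_to_other_group(groups, "cat", groups["cat"][2])
```

A non-empty result for an object is the condition under which claim 3 proposes merging the two label groups.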
3. The method of claim 2 , wherein when the first distance is greater than the second distance, the evaluation includes proposing that that the first and second positive label groups be merged into a common label group.
In the distance-comparison heuristic of claim 2, if an object's distance to its own (first) group's representative is *greater* than its distance to the *second* group's representative, the system proposes merging the two label groups into a single group. An object that sits closer to another class than to its own suggests that the two classes overlap or that objects have been mislabeled.
4. The method of claim 2 , wherein the first distance and the second distance are Euclidean distances.
In the heuristic described in claim 2 where distances to different class group representatives are compared, the distances used are Euclidean distances, which are straight-line distances in the multi-dimensional space of the score vectors.
5. The method of claim 1 , wherein each digital object in the labeled training set is labeled as positive for at most one of the classes.
In the training data evaluation method, each digital object is labeled as "positive" for *at most one* class. This means an object cannot belong to multiple positive class groups simultaneously.
6. The method of claim 1 , wherein the digital objects are images.
In the training data evaluation method, the digital objects being analyzed are images.
7. The method of claim 1 , further comprising generating a signature for each digital object in the training set, the assigned score vector being based on the signature.
The training data evaluation method generates a "signature" for each digital object. The score vector assigned by the categorizer is based on this signature, not directly on the digital object itself. The claim does not specify the signature's form; it could be, for example, a feature vector representing key aspects of the object.
8. The method of claim 1 , wherein the assigning of the score vectors further comprises setting initial scores in each score vector that are below a noise threshold to a new value which indicates the value is below the noise threshold.
When assigning score vectors, the categorizer sets initial scores in each vector that are below a noise threshold to a special value, indicating they are insignificant. This helps ignore irrelevant category scores.
9. The method of claim 8 , wherein the evaluation comprises outputting a recommendation when the scores of score vectors of at least a proportion of the digital objects in the labeled training set are below the noise threshold.
If, after the score vector assignment and noise filtering (as previously described), a significant portion of the digital objects have score vectors with values below the noise threshold, the system recommends reviewing or modifying the training data. This suggests the objects may be too ambiguous or of low quality for effective training.
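The noise handling in claims 8 and 9 might look like the sketch below. The threshold, the marker value, and the 50% proportion are hypothetical, and the code assumes an object counts as "below the noise threshold" only when every score in its vector is below it; the claims leave these details open.

```python
NOISE_THRESHOLD = 0.05   # hypothetical threshold
BELOW_NOISE = 0.0        # hypothetical marker meaning "below the noise threshold"

def suppress_noise(vector, threshold=NOISE_THRESHOLD, marker=BELOW_NOISE):
    """Replace every score under the threshold with the marker value."""
    return [marker if s < threshold else s for s in vector]

def fraction_all_noise(vectors, threshold=NOISE_THRESHOLD):
    """Fraction of objects whose every score falls below the threshold."""
    noisy = sum(1 for v in vectors if all(s < threshold for s in v))
    return noisy / len(vectors)

# Hypothetical score vectors for four objects.
vectors = [
    [0.90, 0.02, 0.01],
    [0.01, 0.03, 0.02],
    [0.04, 0.01, 0.02],
    [0.10, 0.60, 0.03],
]
cleaned = [suppress_noise(v) for v in vectors]
# Recommend a review when at least half the objects carry no signal (assumed proportion).
recommend_review = fraction_all_noise(vectors) >= 0.5
```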
10. The method of claim 1 , wherein the applying the at least one heuristic to the training data further comprises: computing a maximum standard deviation (SDmax) based on the score vectors of all the positively labeled digital objects in the labeled training set.
The training data evaluation involves calculating a "maximum standard deviation" (SDmax) based on the score vectors of *all* positively labeled digital objects in the training data. This provides a baseline measure of the overall spread or diversity of the positive data.
11. The method of claim 10 , wherein SDmax is based on distances of the score vectors of all the digital objects in the labeled training set to a same predefined vector.
The maximum standard deviation (SDmax) is based on the distances of the score vectors of all the digital objects in the labeled training set to the *same predefined vector*. This predefined vector serves as a central reference point for calculating the spread of the data.
12. The method of claim 11 , wherein SDmax is a function of the Euclidian distance of each score vector in the labeled training set from the predefined vector.
The maximum standard deviation (SDmax) is calculated as a function of the Euclidean distance of each score vector in the labeled training set from the predefined vector. This means the further a score vector is from the predefined vector, the more it contributes to SDmax.
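One way to read claims 10 to 12 is as a root-mean-square Euclidean distance from a fixed reference vector. The sketch below takes that reading as an assumption; the function name `rms_distance`, the reference point, and the sample vectors are hypothetical, and the claims only require that SDmax be "a function of" these distances.

```python
import math

def rms_distance(vectors, reference):
    """Root-mean-square Euclidean distance of the vectors from a reference vector."""
    squared = [sum((x - r) ** 2 for x, r in zip(v, reference)) for v in vectors]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical score vectors over two categories, and a hypothetical predefined vector.
all_positive = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
reference = [0.5, 0.5]
sd_max = rms_distance(all_positive, reference)   # about 0.707 for this data
```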
13. The method of claim 10 , wherein the applying the at least one heuristic to the training data comprises applying a heuristic which comprises: computing a standard deviation for a given positive label group as a function of the distance of the score vector of each digital object in the given positive label group to a representative score vector for the given positive label group; and comparing the standard deviation of the given positive label group to SDmax; and wherein the evaluation optionally includes outputting a recommendation when the given positive label group has a standard deviation greater than SDmax.
A heuristic is used to compare the standard deviation of a *specific* positive class group to the overall maximum standard deviation (SDmax). The standard deviation for the class group reflects how much the score vectors within that group vary around a representative score vector for that group. If the class group's standard deviation is *greater* than SDmax, the system might recommend investigating the class group for potential issues (e.g., outliers, mislabeled objects).
14. The method of claim 13 , wherein the representative score vector for the given positive label group is computed as a function of the score vectors of the score vectors of the digital objects in the given positive label group.
The representative score vector for a class group is calculated as a function of the score vectors of the individual digital objects within that class group. This could be a simple average or a weighted average, giving more importance to certain objects.
15. The method of claim 13 , wherein the standard deviation of the given positive label group is a function of the Euclidian distance of the score vector of each digital object in the given positive label group to the representative score vector for the given positive label group.
The standard deviation of a class group is calculated as a function of the Euclidean distance of each object's score vector in the class group to the class group's representative score vector. Larger distances contribute more to the standard deviation.
16. The method of claim 10 , wherein the applying the at least one heuristic to the training data comprises applying a heuristic which comprises: for at least one positive label group, computing a standard deviation as a function of distances of the score vectors of the digital objects in the respective positive label group to a representative score vector for the respective positive label group; and comparing the standard deviation of each positive label group to SDmax; and where the evaluation optionally includes outputting a recommendation when at least one of the positive label groups has a standard deviation greater than SDmax.
For at least one positive class group, the system calculates a standard deviation based on the distances of the objects' score vectors to the group's representative score vector. It then compares each such group's standard deviation to the overall SDmax. If *any* group has a standard deviation higher than SDmax, a recommendation to investigate that group may be output.
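The per-group comparison in claims 13 to 16 can be illustrated as follows. As before, this is a sketch under assumptions: the representative vector is an unweighted mean, the standard deviation is a root-mean-square distance to it, and `sd_max` is supplied as a hypothetical constant instead of being computed from a full training set.

```python
import math

def mean_vector(vectors):
    """Component-wise average, used here as the representative score vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def spread(vectors, center):
    """Root-mean-square Euclidean distance of the vectors from the center."""
    squared = [sum((x - c) ** 2 for x, c in zip(v, center)) for v in vectors]
    return math.sqrt(sum(squared) / len(squared))

def groups_exceeding(groups, sd_max):
    """Labels of groups whose spread around their own mean exceeds sd_max."""
    return [label for label, vs in groups.items()
            if spread(vs, mean_vector(vs)) > sd_max]

# Hypothetical label groups with two-category score vectors.
groups = {
    "tight": [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]],
    "loose": [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],
}
sd_max = 0.3  # hypothetical; would normally be derived from the whole training set
too_spread = groups_exceeding(groups, sd_max)
```

Groups returned by `groups_exceeding` are the ones for which a recommendation could be output.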
17. The method of claim 10 , wherein the applying the at least one heuristic to the training data comprises applying a heuristic which comprises: identifying, for at least one of the positive label groups, at least two clusters within the positive label group; computing a standard deviation of each of the two clusters; and comparing the standard deviation of each of the two clusters to SDmax; and where the evaluation optionally includes outputting a recommendation when the standard deviation of at least one of the two clusters is lower than SDmax.
For at least one positive class group, the heuristic identifies at least two clusters within the class group. It calculates the standard deviation of each cluster and compares these standard deviations to the maximum standard deviation (SDmax). The system may recommend action if the standard deviation of at least one cluster is *lower* than SDmax, which suggests the cluster is coherent enough to represent a distinct sub-category within the class.
18. The method of claim 17 , further comprising identifying a centroid of the at least one cluster, and at least one of: identifying at least one digital object of the training data, having a score vector that is close in distance to the centroid of the cluster; and suggesting a name based on a predefined category from the categorizer's categories having a highest score in the score vector of the centroid.
If clustering is performed (as described in the previous heuristic), the system identifies the centroid (center) of at least one cluster. Then it can find training objects with score vectors near the centroid, and/or suggest a name for the cluster based on the category from the categorizer that has the highest score in the centroid's score vector.
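Given a cluster that has already been found, the two optional steps of claim 18 are short to express. In this sketch the category list, the object ids, and the function name `describe_cluster` are hypothetical, and the centroid is assumed to be the unweighted mean of the cluster's score vectors.

```python
import math

CATEGORIES = ["beach", "forest", "city"]  # hypothetical categorizer categories

def describe_cluster(cluster, categories):
    """cluster maps object id -> score vector.

    Returns (suggested name, id of the object nearest the centroid).
    """
    center = [sum(col) / len(cluster) for col in zip(*cluster.values())]
    # Name the cluster after the category with the highest centroid score.
    name = categories[center.index(max(center))]
    # Pick the object whose score vector lies closest to the centroid.
    exemplar = min(cluster, key=lambda oid: math.dist(cluster[oid], center))
    return name, exemplar

cluster = {
    "img1": [0.1, 0.8, 0.1],
    "img2": [0.2, 0.7, 0.1],
    "img3": [0.3, 0.6, 0.1],
}
name, exemplar = describe_cluster(cluster, CATEGORIES)
```

Showing the exemplar alongside the suggested name gives a reviewer a concrete object to inspect before accepting or rejecting the suggestion.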
19. The method of claim 1 , wherein at least some of the digital objects in the labeled training set include a label indicating that the digital object is negative with respect to all classes in the set of classes, and at least one of the at least one heuristic relates to the negatively-labeled objects.
Some digital objects are labeled as "negative" (meaning they do *not* belong to any of the defined classes). At least one of the heuristics specifically considers these negatively labeled objects.
20. The method of claim 19 , wherein at least one of the heuristics which relates to the negatively-labeled objects comprises: grouping the negatively labeled digital objects in the labeled training set into a negative label group; with the trained categorizer, computing a score vector for each of the digital objects in the negative label group; for at least one positive label group, identifying a most distant score vector in the at least one positive label group from the representative score vector of the respective label group; computing a maximum distance based on a distance from the identified most distant score vector to the representative score vector of the at least one positive label group; computing a negative distance based on a distance from a negatively labeled score vector in the negative label group to the representative score vector of the at least one positive label group; and comparing the maximum distance to the negative distance; and wherein the evaluation optionally includes recommending that the negatively labeled score vector be labeled as positive with respect to the at least one positive label group when the negative distance is less than the maximum distance.
A heuristic for negative objects: Negative objects are grouped. For a chosen positive class group, the object with the score vector furthest from its group's representative is found. A maximum distance is calculated from this object to its class's representative. Then, a "negative distance" is computed from a negative object's score vector to the same positive class's representative. If the negative distance is *less* than the maximum distance, the system may recommend relabeling the negative object as positive for that class.
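The comparison in claim 20 amounts to asking whether a negative object falls inside the "radius" of a positive group. A minimal sketch, assuming an unweighted mean as the representative vector and Euclidean distance (the function name and data are hypothetical):

```python
import math

def relabel_candidates(positive_group, negatives):
    """Indices of negative objects lying closer to the group's representative
    than the group's own most distant member."""
    center = [sum(col) / len(positive_group) for col in zip(*positive_group)]
    max_distance = max(math.dist(v, center) for v in positive_group)
    return [i for i, v in enumerate(negatives)
            if math.dist(v, center) < max_distance]

# Hypothetical two-category score vectors.
positive_group = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]]
negatives = [[0.75, 0.25], [0.1, 0.9]]
candidates = relabel_candidates(positive_group, negatives)
```

Only the first negative object sits inside the positive group's radius, so it alone would be proposed for relabeling.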
21. The method of claim 19 , wherein at least one of the heuristics which relates to the negatively-labeled objects comprises: computing a maximum standard deviation (SDmax) based on the score vectors of all the positively labeled digital objects in the labeled training set; identifying at least two clusters within a negative label group, the negative label group comprising objects in the training set that are negatively labeled with respect to all the classes; and computing a standard deviation of each of the two clusters, and comparing the standard deviation of each of the two clusters to SDmax; and, wherein the evaluation optionally includes recommending that a new class be added to the set of classes when the standard deviation of at least one of the two clusters is lower than SDmax.
A heuristic considers negative objects grouped into a negative class group, identifies at least two clusters, and calculates the standard deviation of each cluster. If the standard deviation of at least one cluster is *lower* than the overall maximum standard deviation (SDmax), the system may recommend adding a *new* class to the set of classes. This suggests the negative objects might actually represent a distinct, previously unrecognized category.
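The claim does not name a clustering algorithm, so the sketch below substitutes a tiny deterministic 2-means purely for illustration. The seeding rule, the value of `sd_max`, and all function names are assumptions; any clustering method that yields at least two clusters would fit the claim language.

```python
import math

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def two_means(vectors, iterations=10):
    """Tiny 2-means: seed with the first vector and the vector farthest from it."""
    centers = [vectors[0], max(vectors, key=lambda v: math.dist(v, vectors[0]))]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in vectors:
            nearest = 0 if math.dist(v, centers[0]) <= math.dist(v, centers[1]) else 1
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean_vector(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

def spread(vectors):
    """Root-mean-square distance of the vectors from their own mean."""
    center = mean_vector(vectors)
    return math.sqrt(sum(math.dist(v, center) ** 2 for v in vectors) / len(vectors))

# Hypothetical score vectors of negatively labeled objects.
negatives = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.85], [0.88, 0.12]]
sd_max = 0.3  # hypothetical; would come from the positively labeled objects
clusters = two_means(negatives)
suggest_new_class = any(len(c) > 0 and spread(c) < sd_max for c in clusters)
```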
22. The method of claim 1 , wherein at least some of the digital objects in the labeled training set include a label indicating that the digital object is neutral with respect to one of the classes in the set of classes, the applying of the at least one heuristic comprising applying a heuristic configured for identifying at least one of: a neutral labeled digital object having a score vector that is not sufficiently close to score vectors of other digital objects with the same neutral label; and a neutral labeled digital object having a score vector that is not sufficiently close to score vectors of other digital objects in a positive label group which includes the neutral labeled digital object; and wherein the evaluation optionally includes providing for identifying the neutral labeled digital object and outputting a recommendation based on the applied heuristic.
Some digital objects are labeled as "neutral". A heuristic identifies neutral objects whose score vectors are not close enough to other neutral objects or to positive objects of the expected class. The system may then output a recommendation to review the labeling of these neutral objects.
23. The method of claim 1 , wherein the received training data comprises a set of unlabeled digital objects, the method further comprising: with the categorizer, assigning a score vector to each digital object in the set of unlabeled digital objects, the score vector comprising a score for each category of the set of categories; and wherein the applying of the at least one heuristic comprises applying a heuristic configured for identifying at least one class which is not represented in the set of unlabeled digital objects.
The training data also includes unlabeled objects. The categorizer assigns a score vector to each of them, and a heuristic identifies any class that is *not* represented among the unlabeled objects. The claim does not state what recommendation follows; adding unlabeled objects for the missing classes is one plausible response.
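Claim 23 does not define "not represented", so the following sketch picks one possible test: a class counts as unrepresented when no unlabeled object lies within the positive group's maximum distance from its representative vector. That criterion, the function name, and the data are all assumptions made for illustration.

```python
import math

def unrepresented_classes(groups, unlabeled):
    """Labels of positive groups that no unlabeled score vector falls inside."""
    missing = []
    for label, vectors in groups.items():
        center = [sum(col) / len(vectors) for col in zip(*vectors)]
        radius = max(math.dist(v, center) for v in vectors)
        if not any(math.dist(u, center) <= radius for u in unlabeled):
            missing.append(label)
    return missing

# Hypothetical positive label groups and unlabeled score vectors.
groups = {
    "cat": [[0.9, 0.1], [0.7, 0.3]],
    "dog": [[0.1, 0.9], [0.3, 0.7]],
}
unlabeled = [[0.85, 0.15], [0.75, 0.25]]
missing = unrepresented_classes(groups, unlabeled)
```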
24. The method of claim 1 , wherein the evaluation comprises providing for recommending at least one of: confirming or changing a label of at least one of the digital objects in the labeled training set; removing a digital object from the training data; adding a class to the set of classes; adding digital objects to the training set for at least one of the positive label groups; and labeling an unlabeled digital object in the training data with one of the classes.
The evaluation process provides recommendations like: confirming/changing object labels, removing objects, adding classes, adding objects to class groups, or labeling unlabeled objects.
25. The method of claim 24 , further comprising training the classifier with the training data that has been modified based on at least one of the recommendations.
The classifier is trained using the training data *after* it has been modified based on the recommendations provided by the evaluation process.
26. A system comprising memory which stores instructions for performing the method of claim 1 and a processor in communication with the memory for executing the instructions.
A system implements the training data evaluation method. It includes memory storing instructions and a processor executing those instructions.
27. A computer program product comprising a non-transitory recording medium storing instructions, which when executed by a processor, perform the method of claim 1 .
A computer program product consists of a non-transitory storage medium (e.g., disk, flash drive) containing instructions that, when executed by a processor, perform the training data evaluation method.
28. A system for evaluating training data comprising: memory which receives training data to be evaluated for use in training a classifier, the training data including labeled digital objects that are labeled with respect to a predefined set of classes; a trained categorizer for categorizing the digital objects based on respective representations of the digital objects, the categorizer outputting a score vector comprising a score for each of a predefined set of categories, the set of categories differing from the set of classes; optionally, a clustering component for identifying clusters of the digital objects based on respective score vectors of the digital objects; a metric component which computes metrics for at least one of: label groups, each label group including a set of the digital objects having a common label; category groups, each category group including a set of the digital objects having a common most probable category output by the categorizer; and clusters output by the clustering component; an evaluation component which is configured for applying a set of heuristics to the training data based on the computed metrics, the set of heuristics comprising at least one heuristic selected from the group consisting of: a) a heuristic that identifies overlap between two label groups; b) a heuristic which determines when there is more than one cluster for a label group; c) a heuristic which determines when a label group has a standard deviation which exceeds a standard deviation computed over a plurality of the label groups; d) a heuristic which identifies when a digital object with a label which indicates that the digital object is negative with respect to all classes has a score vector which overlaps at least one label group in which the digital objects are all labeled as positive with respect to a same one of the classes; e) a heuristic which identifies when a digital object with a neutral label with respect to at least one of the classes has a score 
vector which does not overlap a positive label group in which the digital objects are labeled as positive with respect to the same one of the classes; f) a heuristic which identifies when there is insufficient overlap between unlabeled objects in the training data and the label groups that include digital objects which are labeled as positive with respect to one of the classes; g) a heuristic which identifies when there are unlabeled objects in the training data whose score vectors do not overlap any of the label groups that include digital objects which are labeled as positive with respect to one of the classes; h) a heuristic which identifies when there are unlabeled objects in the training data whose score vectors overlap one of the label groups that include digital objects which are labeled as positive with respect to one of the classes; and i) a heuristic which identifies when there are clusters of labeled objects in the training data that are labeled as negative; a recommendation component for outputting recommendations based on results of the applied heuristics to improve the training data; and a processor which implements the categorizer, clustering component, evaluation component, and recommendation component.
A system evaluates training data for a classifier. It includes: memory that receives training data containing digital objects labeled with respect to a predefined set of classes; a trained categorizer that outputs, for each object, a score vector with one score per category in a predefined set of categories, where the set of categories differs from the set of classes; optionally, a clustering component that groups objects by their score vectors; a metric component that computes metrics for at least one of label groups, category groups (objects sharing the same most probable category), and clusters; an evaluation component that applies heuristics to the training data based on those metrics; a recommendation component that outputs recommendations for improving the training data; and a processor that implements these components. At least one heuristic is selected from the following: a) identifying overlap between two label groups; b) determining when there is more than one cluster for a label group; c) determining when a label group's standard deviation exceeds a standard deviation computed over several label groups; d) identifying when an object labeled negative for all classes overlaps a positive label group; e) identifying when a neutral object doesn't overlap the positive label group for the same class; f) identifying insufficient overlap between unlabeled objects and the positive label groups; g) identifying unlabeled objects that don't overlap any positive label group; h) identifying unlabeled objects that overlap one of the positive label groups; and i) identifying clusters among objects labeled as negative.
29. The system of claim 28 , wherein the system includes at least two heuristics selected from the group of heuristics.
The system for evaluating training data uses at least two of the heuristics a) through i) listed in claim 28.
30. The system of claim 28 , wherein in a) the heuristic includes: for each positive label group, computing a representative score vector based on the score vectors of the digital objects in that positive label group; computing a first distance from a first score vector of a digital object of a first of the positive label groups to the representative score vector of the first of the positive label groups; computing a second distance from the first score vector to the representative score vector of a second of the positive label groups; and comparing the first distance to the second distance; in b) the heuristic includes: clustering the score vectors in a label group into a plurality of clusters; and determining whether a standard deviation for at least one of the clusters is less than a maximum standard deviation computed over a set of the label groups; in c) the heuristic includes: computing a standard deviation for a positive label group based on the score vectors of the digital objects in that positive label group; and comparing the standard deviation for the positive label group to a maximum standard deviation computed over a set of the label groups; in d) the heuristic includes: computing a representative score vector for a positive label group based on the score vectors of the digital objects in that positive label group; and determining whether the score vector of an object with a neutral label is closer to the representative score vector for the positive label group than any of the score vectors of the objects in that positive label group; in e) the heuristic includes: computing a representative score vector for the positive label group in which the digital objects are labeled as positive with respect to the same one of the classes based on the score vectors of the digital objects in the positive label group; and determining whether the score vector of the digital object with the neutral label is further from the representative score vector than a 
representative distance of the positive label group, the representative distance based on the score vectors and the representative score vector of the positive label group; in f) the heuristic includes: computing a representative score vector for each of the positive label groups based on the score vectors of the digital objects in the respective positive label group; and for all of the unlabeled objects in the training set, determining whether the score vector of the unlabeled object is further from the representative score vector for all of the positive label groups than an average distance of the score vectors of the objects in that positive label group from the representative score vector; in g) the heuristic includes: computing a representative score vector for each of the positive label groups based on the score vectors of the digital objects in the respective positive label group; and for each of the unlabeled objects in the training set, determining whether the score vector of the unlabeled object is further from the representative score vector for all of the positive label groups than a furthest of the score vectors of the objects in that positive label group; in h) the heuristic includes: computing a representative score vector for each of the positive label groups based on the score vectors of the digital objects in the respective positive label group; and for each of a set of unlabeled objects in the training set, determining whether the score vector of the unlabeled object is closer to the representative score vector for at least one of the positive label groups than an average of the score vectors of the objects in that positive label group; and in i) the heuristic includes: clustering digital objects in the training data that are labeled as negative with respect to all of the classes, based on their score vectors; determining a standard deviation for each of the clusters based on the score vectors of the digital objects in the cluster; and 
determining whether the standard deviation of the cluster has a smaller standard deviation than a maximum standard deviation computed based on positively labeled digital objects.
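The system applies specific calculations for the heuristics: a) comparing an object's distance to its own group's representative vector with its distance to another group's representative; b) clustering the score vectors in a label group and checking whether at least one cluster's standard deviation is less than the maximum standard deviation; c) comparing a positive label group's standard deviation with the maximum standard deviation; d) checking whether an object's score vector is closer to a positive group's representative vector than any of that group's own score vectors (the claim text says "neutral label" here, although heuristic d) in claim 28 concerns negatively labeled objects); e) checking whether a neutral object is further from its positive label group's representative than that group's representative distance; f) checking, for all unlabeled objects, whether each is further from *all* positive groups' representatives than those groups' average member distances; g) checking, for each unlabeled object, whether it is further from *all* positive groups' representatives than those groups' furthest members; h) checking whether an unlabeled object is closer to at least one positive group's representative than that group's average; and i) clustering the negatively labeled objects and checking whether a cluster's standard deviation is smaller than the maximum standard deviation computed from the positively labeled objects.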
31. The system of claim 28 , wherein the metrics include metrics selected from the group consisting of: a representative score vector which is computed for a label group, category group, or cluster as an optionally weighted average of the score vectors in the respective label group, category group, or cluster; a relative distance for a score vector in a label group, category group, or cluster to the representative score vector of that label group, category group, or cluster; a representative distance computed based on the relative distances for the digital objects in label group, category group, or cluster; a maximum distance computed based on a distance from the representative score vector to the most distant score vector in the respective label group, category group, or cluster; a standard deviation computed based on the distances of each score vector in a label group, category group, or cluster to a central vector; and a maximum standard deviation which is a maximum of the standard deviations of each the label groups, category groups, or clusters.
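The metrics are selected from: a representative score vector for each label group, category group, or cluster (an optionally weighted average of its score vectors); the relative distance of each score vector to its group's representative vector; a representative distance derived from those relative distances; a maximum distance from the representative vector to the group's most distant score vector; a standard deviation based on the distances of each score vector to a central vector; and a maximum standard deviation, which is the largest of the groups' standard deviations.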
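The metrics listed in claim 31 can be gathered in one helper. This sketch makes several assumptions that the claim leaves open: the representative distance is taken as the mean of the relative distances, the central vector for the standard deviation is the representative vector itself, and the standard deviation is a root-mean-square distance. The function name `group_metrics` and the data are hypothetical.

```python
import math
from statistics import mean

def group_metrics(vectors, weights=None):
    """Compute claim-31-style metrics for one label group, category group, or cluster."""
    weights = weights or [1.0] * len(vectors)
    total = sum(weights)
    # Optionally weighted average of the score vectors.
    representative = [sum(w * x for w, x in zip(weights, col)) / total
                      for col in zip(*vectors)]
    relative = [math.dist(v, representative) for v in vectors]
    return {
        "representative": representative,
        "relative_distances": relative,
        "representative_distance": mean(relative),
        "max_distance": max(relative),
        "std": math.sqrt(mean(d * d for d in relative)),
    }

# Hypothetical groups of two-category score vectors.
groups = {
    "a": [[1.0, 0.0], [0.0, 0.0]],
    "b": [[0.0, 1.0], [0.0, 0.0], [0.0, 2.0]],
}
metrics = {label: group_metrics(vs) for label, vs in groups.items()}
# Maximum standard deviation: the largest of the per-group values.
sd_max = max(m["std"] for m in metrics.values())
```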
32. A computer-implemented method of generating suggestions for modifying a training set of digital objects, the method comprising: receiving from a submitter, a labeled training set of digital objects, the training set further comprising: a set of classes for identifying content of digital objects; and for each digital object in the training set, a plurality of labels, one label for each class, the one label for each class indicating that the digital object is positive, negative, or neutral for the respective class; grouping the digital objects in the labeled training set into positive label groups, one positive label group for each class in the set of classes, each label group comprising digital objects having a label indicating the digital object is positive for the respective class; with a categorizer which has been trained on a set of multiple categories, assigning a score vector to each digital object in the labeled training set of digital objects, the score vector comprising a score for each category of the set of categories; computing a representative score vector for each positive label group based on the score vectors of the digital objects in the respective positive label group; applying heuristics including at least one of a first heuristic and a second heuristic and making recommendations to the submitter based on the applied at least one of the first and second heuristics, the first heuristic including computing a first distance from a first score vector of a digital object of a first of the positive label groups to the representative score vector of the first of the positive label groups, computing a second distance from the first score vector to the representative score vector of a second of the positive label groups, and comparing the first distance to the second distance; and the second heuristic including computing a maximum standard deviation as a function of the distance of each score vector in the labeled training set from a central vector, 
identifying at least two clusters within the negative label group using a clustering algorithm, computing a standard deviation of each of the two clusters, and comparing the standard deviation of each of the two clusters to the maximum standard deviation; and the recommendations to the submitter comprising: for the first heuristic, if the first distance is greater than the second distance, proposing to the submitter at least one of: merging the first and second positive label groups into a common label group, and labeling the digital objects of the first of the positive label groups as neutral with respect to the second of the positive label groups; for the second heuristic, if the standard deviation of at least one of the two clusters is lower than the maximum standard deviation, suggesting to the submitter that a new class be added to the set of classes.
A method generates suggestions for modifying a training set. It receives from a submitter a labeled training set that comes with a set of classes and, for each object, one label per class marking the object as positive, negative, or neutral for that class. Positive objects are grouped by class. A categorizer trained on multiple categories assigns a score vector to each object, and a representative score vector is computed for each positive label group. The system applies at least one of two heuristics and makes recommendations to the submitter: 1) the distance comparison between class groups described in claim 2, proposing that the two groups be merged, or that the first group's objects be labeled neutral with respect to the second, when the first distance is greater than the second; 2) clustering the negatively labeled objects and, if a cluster's standard deviation is lower than the maximum standard deviation, suggesting that a new class be added.
33. The method of claim 32 , wherein at least one category in the set of categories is not present in the set of classes.
In the suggestion method, at least one category used by the categorizer is *not* present in the set of classes used for labeling the training data. Because the categories need not match the classes, the score vectors can reveal content that the current classes do not cover.
December 30, 2014