Processes and systems are directed to training a neural network of an object recognition system. The processes and systems record video streams of people. Sequences of object images are extracted from each video stream, each sequence of object images corresponding to one of the people. For each sequence of object images, a triplet of feature vectors is formed, comprising an anchor feature vector, a positive feature vector of the same object, and a negative feature vector of a different object. The anchor, positive, and negative feature vectors of each triplet are separately input to the neural network to compute corresponding output anchor, positive, and negative vectors. A triplet loss function value is computed from the output anchor, positive, and negative vectors. When the triplet loss function value is greater than a threshold, the neural network is retrained using the anchor and positive feature vectors of the sequences of object images.
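The abstract does not state the form of the triplet loss function. As a minimal sketch, assuming the commonly used margin-based form (squared Euclidean distance to the positive, minus squared Euclidean distance to the negative, plus a margin, clipped at zero), the loss value and the threshold test could look like the following. The function names, the `margin` default, and the threshold value are illustrative assumptions, not taken from the patent.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # assumed value; the source gives no margin
) -> float:
    """Margin-based triplet loss (an assumed form, not quoted from the source).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def needs_retraining(loss_value: float, threshold: float) -> bool:
    """Retraining is triggered when the loss value exceeds the threshold."""
    return loss_value > threshold
```

For example, `triplet_loss([0, 0], [2, 0], [1, 0], margin=0.5)` is positive because the positive vector lies farther from the anchor than the negative one, which is the situation retraining is meant to correct.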
Legal claims defining the scope of protection, as filed with the USPTO.
1. A process stored in one or more data-storage devices and executed using one or more processors of a computer system to train a neural network of an object recognition system, the process comprising: retrieving one or more video streams from the one or more data-storage devices, each video stream capturing one or more views of one or more objects; forming one or more sequences of object images, each sequence of object images corresponding to one of the one or more objects; forming an object image triplet for each sequence of object images, the object image triplet comprising an anchor object image, a positive object image of the same object and a negative object image of a different object; and retraining the neural network using the object image triplet of the sequences of object images, wherein the retraining the neural network comprises: for each object image triplet, separately inputting the anchor object image, the positive object image, and the negative object image of the object image triplet into the neural network to obtain corresponding output anchor, positive, and negative feature vectors, computing a triplet loss function value based on the anchor, positive, and negative feature vectors, and computing a variance of the triplet loss functions; and when the variance of the triplet loss functions is greater than a threshold, retraining the neural network using the anchor, positive, and negative feature vectors of the sequences of object images.
2. The process of claim 1 further comprising capturing the one or more video streams using one or more video cameras over a same period of time, each video camera capturing views of the one or more objects at a different location.
3. The process of claim 1 wherein forming the one or more sequences of object images comprises, for each video stream: using object detection to identify an object in each video frame of the video stream; performing object tracking to track each object captured in the video stream from one video frame of the video stream to a next video frame of the video stream; for each object captured in the video stream, using object image extraction to extract a cropped object image from each video frame of the video stream; and forming a sequence of cropped object images from the extracted object images for each object captured in the video stream.
4. The process of claim 1 wherein forming the object image triplet for each of the one or more sequences of object images comprises: selecting a first object image from the sequence of object images, wherein the first object image is the anchor object image; computing a distance between the anchor object image and each of the images of the sequence of object images; identifying the image in the sequence of object images with a largest distance from the anchor object image as the positive object image; and forming the negative object image from a second object image of an object that is different from the object in the sequence of object images.
5. The process of claim 1 further comprising initially training the neural network using a labelled set of object images.
6. The process of claim 1 wherein the neural network is retrained using the triplet feature vectors of the sequences of object images for a fixed number of iterations.
7. The process of claim 1 further comprising: inputting object images of objects whose object images are in the labelled object image data set to the neural network to obtain corresponding feature vectors; computing an average fraction of correct matches as a measure of how well the neural network of the object recognition system is performing; computing a variance of the average fraction of correct matches; and when the variance of the average fraction of correct matches is greater than a threshold, retraining the neural network using the object images in the labelled object image data set.
8. An object recognition system, the system comprising: one or more video cameras; one or more processors; one or more data-storage devices; and machine-readable instructions stored in the one or more data-storage devices that when executed using the one or more processors control the system to execute: recording one or more video streams, each video stream capturing views of one or more objects using one of the one or more video cameras; forming one or more sequences of object images, each sequence of object images corresponding to one of the one or more objects; forming an object image triplet for each sequence of object images, the object image triplet comprising an anchor object image and a positive object image of the same object and a negative object image of a different object; and retraining the neural network using the object image triplet of the sequences of object images, wherein the retraining the neural network comprises: for each object image triplet, separately inputting the anchor object image, the positive object image, and the negative object image of the object image triplet into the neural network to obtain corresponding output anchor, positive, and negative feature vectors, computing a triplet loss function value based on the anchor, positive, and negative feature vectors, and computing a variance of the triplet loss functions; and when the variance of the triplet loss functions is greater than a threshold, retraining the neural network using the anchor, positive, and negative feature vectors of the sequences of object images.
9. The system of claim 8 further comprising capturing the one or more video streams using one or more video cameras over a same period of time, each video camera capturing views of the one or more objects at a different location.
10. The system of claim 8 wherein forming the one or more sequences of object images comprises, for each video stream: using object detection to identify an object in each video frame of the video stream; performing object tracking to track each object captured in the video stream from one video frame of the video stream to a next video frame of the video stream; for each object captured in the video stream, using object image extraction to extract a cropped object image from each video frame of the video stream; and forming a sequence of cropped object images from the extracted object images for each object captured in the video stream.
11. The system of claim 8 wherein forming the object image triplet for each of the one or more sequences of object images comprises: selecting a first object image from the sequence of object images, wherein the first object image is the anchor object image; computing a distance between the anchor object image and each of the images of the sequence of object images; identifying the image with a largest distance from the anchor object image as the positive object image; and forming the negative object image from a second object image of an object that is different from the object in the sequence of object images.
12. The system of claim 8 further comprising initially training the neural network using a labelled set of object images.
13. The system of claim 8 wherein the neural network is retrained using the triplet feature vectors of the sequences of object images for a fixed number of iterations.
14. The system of claim 8 further comprising: inputting object images of objects whose object images are in the labelled object image data set to the neural network to obtain corresponding feature vectors; computing an average fraction of correct matches as a measure of how well the neural network of the object recognition system is performing; computing a variance of the average fraction of correct matches; and when the variance of the average fraction of correct matches is greater than a threshold, retraining the neural network using the object images in the labelled object image data set.
15. A non-transitory computer-readable medium encoded with machine-readable instructions that implement a method carried out by one or more processors of a computer system to perform the operations of retrieving one or more video streams from the one or more data-storage devices, each video stream capturing one or more views of one or more objects; forming one or more sequences of object images, each sequence of object images corresponding to one of the one or more objects; forming an object image triplet for each sequence of object images, the object image triplet comprising an anchor object image and a positive object image of the same object and a negative object image of a different object; and retraining the neural network using the object image triplet of the sequences of object images, wherein the retraining the neural network comprises: for each object image triplet, separately inputting the anchor object image, the positive object image, and the negative object image of the object image triplet into the neural network to obtain corresponding output anchor, positive, and negative feature vectors, computing a triplet loss function value based on the anchor, positive, and negative feature vectors, and computing a variance of the triplet loss functions; and when the variance of the triplet loss functions is greater than a threshold, retraining the neural network using the anchor, positive, and negative feature vectors of the sequences of object images.
16. The medium of claim 15 further comprising capturing the one or more video streams using one or more video cameras over a same period of time, each video camera capturing views of the one or more objects at a different location.
17. The medium of claim 15 wherein forming the one or more sequences of object images comprises, for each video stream: using object detection to identify an object in each video frame of the video stream; performing object tracking to track each object captured in the video stream from one video frame of the video stream to a next video frame of the video stream; for each object captured in the video stream, using object image extraction to extract a cropped object image from each video frame of the video stream; and forming a sequence of cropped object images from the extracted object images for each object captured in the video stream.
18. The medium of claim 15 wherein forming the object image triplet for each of the one or more sequences of object images comprises: selecting a first object image from the sequence of object images, wherein the first object image is the anchor object image; computing a distance between the anchor object image and each of the images of the sequence of object images; identifying the image with a largest distance from the anchor object image as the positive object image; and forming the negative object image from a second object image of an object that is different from the object in the sequence of object images.
19. The medium of claim 15 further comprising initially training the neural network using a labelled set of object images.
20. The medium of claim 15 wherein the neural network is retrained using the triplet feature vectors of the sequences of object images for a fixed number of iterations.
21. The medium of claim 15 further comprising: inputting object images of objects whose object images are in the labelled object image data set to the neural network to obtain corresponding feature vectors; computing an average fraction of correct matches as a measure of how well the neural network of the object recognition system is performing; computing a variance of the average fraction of correct matches; and when the variance of the average fraction of correct matches is greater than a threshold, retraining the neural network using the object images in the labelled object image data set.
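The triplet-forming steps of claims 4, 11, and 18 and the variance test of claims 1, 8, and 15 can be sketched as follows. This is an illustrative sketch only, under several assumptions the claims leave open: each object image is represented as a flat list of numbers, the distance is Euclidean, the negative is simply the first image of another object's sequence, and the variance is the population variance. All function and parameter names are hypothetical.

```python
import math
from statistics import pvariance
from typing import List, Sequence, Tuple

Image = Sequence[float]  # assumed representation: a flattened image


def euclidean(u: Image, v: Image) -> float:
    """Euclidean distance between two equal-length images."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def form_triplet(
    sequence: List[Image],
    other_sequence: List[Image],
    anchor_index: int = 0,
) -> Tuple[Image, Image, Image]:
    """Form (anchor, positive, negative) for one sequence of object images.

    The positive is the image in the same sequence with the largest
    distance from the anchor, as the claims recite. The negative is taken
    from a sequence belonging to a different object; choosing its first
    image is an assumption made for this sketch.
    """
    anchor = sequence[anchor_index]
    positive = max(sequence, key=lambda img: euclidean(anchor, img))
    negative = other_sequence[0]
    return anchor, positive, negative


def loss_variance_exceeds(losses: Sequence[float], threshold: float) -> bool:
    """True when the variance of the triplet loss values exceeds the threshold.

    Population variance is assumed; the claims do not say which
    estimator is used.
    """
    return pvariance(losses) > threshold
```

Picking the farthest same-object image as the positive gives the network the hardest within-sequence pair to pull together, for instance two views of one person under very different pose or lighting.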
January 2, 2018
March 30, 2021