Temporally Distributed Neural Networks for Video Semantic Segmentation

Published: June 7, 2022

Assignee: not available in the USPTO data we have

Inventors: Federico Perazzi, Zhe Lin, Ping Hu, Oliver Wang, Fabian David Caba Heilbron

Technical Abstract

Patent Claims

17 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method comprising: extracting, from each video frame in a contiguous sequence of video frames, a group of features using one of a plurality of sub-neural networks, the contiguous sequence of video frames comprising a current video frame and one or more additional video frames occurring in the contiguous sequence prior to the current video frame, wherein the group of features extracted from the current video frame is different from another group of features extracted from the one or more additional video frames in the contiguous sequence of video frames; generating a full feature representation for the current video frame by combining the groups of features extracted from the contiguous sequence of video frames, wherein generating the full feature representation for the current video frame comprises: generating, for each video frame in the one or more additional video frames, an affinity value between pixels of the video frame in the one or more additional video frames and the current video frame; and generating the full feature representation for the current video frame based on the affinity value and the groups of features extracted from the contiguous sequence of video frames; segmenting the current video frame based upon the full feature representation to generate a segmentation result, the segmentation result comprising information identifying, for a pixel in the current video frame, a label selected for the pixel based upon the full feature representation, wherein the label is selected from a plurality of labels; and outputting the segmentation result.
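The final segmenting step of claim 1 selects, for each pixel, one label from a plurality of labels. The sketch below illustrates one way this could look, assuming the full feature representation has already been turned into per-pixel, per-label scores. The claim does not say how labels are chosen; picking the highest score is an assumption, and `LABELS`, `scores`, and `select_labels` are hypothetical names introduced here.

```python
# Minimal sketch of per-pixel label selection (assumption: highest score wins).
# The label names and score values below are illustrative only.

LABELS = ["road", "car", "sky"]  # hypothetical plurality of labels


def select_labels(scores, labels):
    """Return a 2D grid of labels.

    scores[y][x] is a list holding one score per label for pixel (x, y).
    """
    result = []
    for row in scores:
        out_row = []
        for pixel_scores in row:
            # Index of the highest-scoring label for this pixel.
            best = max(range(len(labels)), key=lambda i: pixel_scores[i])
            out_row.append(labels[best])
        result.append(out_row)
    return result


# A 2x2 "current video frame" with three scores per pixel.
scores = [
    [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]],
    [[0.2, 0.5, 0.3], [0.8, 0.1, 0.1]],
]
segmentation = select_labels(scores, LABELS)
```

Here `segmentation` plays the role of the segmentation result: it identifies one selected label for every pixel of the current frame.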

2. The method of claim 1, wherein the groups of features, extracted from the video frames in the contiguous sequence of video frames, together represent a total set of features used for segmenting the current video frame.

3. The method of claim 1, wherein the plurality of sub-neural networks comprises a first sub-neural network and a second sub-neural network, the first sub-neural network trained to extract a first group of features from a first video frame in the contiguous sequence of video frames, the second sub-neural network trained to extract a second group of features from a second video frame in the contiguous sequence of video frames, wherein the first video frame is different from the second video frame and the first group of features is different from the second group of features.

4. The method of claim 1, wherein extracting, from each video frame in the contiguous sequence of video frames, a group of features using a different one of the plurality of sub-neural networks comprises: generating at least one of a value feature map, a query map, or a key map, wherein the value feature map comprises features extracted by a sub-neural network of the plurality of sub-neural networks from the video frame, and the query map and the key map comprise information related to correlations between pixels across the video frames or across adjacent video frames in the contiguous sequence.

5. The method of claim 1, wherein generating the full feature representation for the current video frame further comprises computing a correlation between pixels of a first video frame in the contiguous sequence and a second video frame in the contiguous sequence, where the first video frame is adjacent to the second video frame in the contiguous sequence and occurs before the second video frame in the contiguous sequence.

6. The method of claim 5, wherein generating the full feature representation for the current video frame further comprises: (a) comparing the first video frame in the contiguous sequence with the second video frame in the contiguous sequence by computing an attention value between the pixels of the first video frame and the pixels of the second video frame, wherein the attention value measures the correlation between the pixels of the first video frame and the pixels of the second video frame; (b) obtaining a value feature map of the first video frame and a value feature map of the second video frame; and (c) updating the value feature map of the second video frame based on the attention value, the value feature map of the first video frame and the value feature map of the second video frame.

7. The method of claim 1, further comprising: (a) comparing a first video frame in the contiguous sequence with a second video frame in the contiguous sequence by computing an attention value between pixels of the first video frame and pixels of the second video frame, wherein the attention value measures a correlation between the pixels of the first video frame and the pixels of the second video frame; (b) obtaining a value feature map of the first video frame and a value feature map of the second video frame; (c) updating the value feature map of the second video frame based on the attention value, the value feature map of the first video frame and the value feature map of the second video frame; (d) updating the contiguous sequence of video frames by removing the first video frame from the contiguous sequence of video frames; and repeating (a), (b), (c) and (d) until only the current video frame is left in the contiguous sequence of video frames.

8. The method of claim 7, further comprising: determining that only the current video frame is left in the contiguous sequence of video frames; and based on the determining, outputting the value feature map for the current video frame, wherein the value feature map represents the full feature representation for the current video frame.
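Steps (a) through (d) of claim 7, together with the stopping condition of claim 8, can be sketched as a loop over the frame sequence. This is only an illustration under stated assumptions: the claims do not fix a formula for the attention value or for the update, so the softmax over query-key dot products and the additive update below are assumptions. The frame layout (a dict with `query`, `key`, and `value` lists holding one vector per pixel) and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(query_map, key_map):
    """Row i: weights of pixel i in the later frame over the earlier frame's pixels.

    Assumption: softmax of dot products; the claims only require an
    attention value that measures correlation between pixels.
    """
    return [softmax([dot(q, k) for k in key_map]) for q in query_map]


def update_values(attn, earlier_values, later_values):
    """Assumption: add the attention-weighted earlier values to the later values."""
    updated = []
    for weights, own in zip(attn, later_values):
        carried = [
            sum(w * v[c] for w, v in zip(weights, earlier_values))
            for c in range(len(own))
        ]
        updated.append([o + c for o, c in zip(own, carried)])
    return updated


def propagate(frames):
    """Repeat (a)-(d) until only the current (last) frame is left."""
    frames = list(frames)  # shallow copy so the caller's list is not modified
    while len(frames) > 1:
        first, second = frames[0], frames[1]
        attn = attention(second["query"], first["key"])           # step (a)
        new_values = update_values(attn, first["value"],          # steps (b), (c)
                                   second["value"])
        frames[1] = {**second, "value": new_values}
        frames.pop(0)                                             # step (d)
    return frames[0]["value"]  # claim 8: value map of the current frame


# Three frames, two pixels each, one feature channel per pixel.
# Zero queries make the attention uniform, which keeps the numbers easy to follow.
zero = [[0.0], [0.0]]
frames = [
    {"query": zero, "key": zero, "value": [[2.0], [4.0]]},
    {"query": zero, "key": zero, "value": [[1.0], [1.0]]},
    {"query": zero, "key": zero, "value": [[0.0], [10.0]]},  # current frame
]
full_features = propagate(frames)
```

With uniform attention, each pixel of a later frame receives the mean of the earlier frame's values, so the earlier frames' features accumulate into the current frame's value map as the sequence shrinks.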

9. The method of claim 1, wherein the segmentation result comprises an image of the current video frame, wherein each pixel in the image of the current video frame is colored using a color corresponding to the label associated with the pixel.
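The colored output described in claim 9 amounts to a lookup from label to color. A minimal sketch follows, assuming the labels are strings and the image is a nested list of RGB tuples; the `PALETTE` values and label names are hypothetical and not taken from the patent.

```python
# Hypothetical palette: one RGB color per label.
PALETTE = {
    "road": (128, 64, 128),
    "car": (0, 0, 142),
    "sky": (70, 130, 180),
}


def colorize(label_map, palette):
    """Replace each pixel's label with the color assigned to that label."""
    return [[palette[label] for label in row] for row in label_map]


label_map = [
    ["sky", "sky"],
    ["car", "road"],
]
colored = colorize(label_map, PALETTE)
```

A label missing from the palette raises `KeyError`, which makes an incomplete color table visible early instead of silently producing a wrong image.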

10. The method of claim 1, wherein a feature space representing a plurality of features to be used for segmenting video frames in the contiguous sequence of video frames is divided into a number of groups of features, wherein a number of sub-neural networks in the plurality of sub-neural networks is equal to a number of the groups of features.

11. The method of claim 10, wherein the number of groups of features is four.
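Claims 10 and 11 tie the number of sub-neural networks to the number of feature groups (four in claim 11). One way to make every frame in a four-frame window use a different sub-network is to cycle through them by frame index. The cyclic schedule is an assumption made for this sketch; the claims only require that the groups extracted from the frames differ. `NUM_GROUPS`, `subnetwork_index`, and `window_assignments` are hypothetical names.

```python
NUM_GROUPS = 4  # claim 11: four groups, hence four sub-networks (claim 10)


def subnetwork_index(frame_index, num_groups=NUM_GROUPS):
    """Assumed cyclic schedule: frame t is handled by sub-network t mod num_groups."""
    return frame_index % num_groups


def window_assignments(current_index, num_groups=NUM_GROUPS):
    """Pair each frame in the window ending at current_index with its sub-network."""
    start = max(0, current_index - num_groups + 1)
    return [
        (t, subnetwork_index(t, num_groups))
        for t in range(start, current_index + 1)
    ]


assignments = window_assignments(5)
```

Under this schedule, any window of four consecutive frames touches each sub-network exactly once, so the groups gathered across the window together cover the whole feature space.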

12. The method of claim 1, wherein a number of layers in each sub-neural network from the plurality of sub-neural networks is the same.

13. The method of claim 12, wherein: a number of layers in each sub-neural network from the plurality of sub-neural networks is the same; and a number of nodes in each sub-neural network from the plurality of sub-neural networks is the same.

14. A system comprising: a memory storing segmented video frames corresponding to a video signal; and one or more processors configured to perform processing comprising: extracting, from each video frame in a contiguous sequence of video frames, a group of features using one of a plurality of sub-neural networks, the contiguous sequence of video frames comprising a current video frame and one or more additional video frames occurring in the contiguous sequence prior to the current video frame, and wherein the group of features extracted from the current video frame is different from another group of features extracted from the one or more additional video frames in the contiguous sequence of video frames; generating a full feature representation for the current video frame by combining the groups of features extracted from the contiguous sequence of video frames, wherein generating the full feature representation for the current video frame comprises: generating, for each video frame in the one or more additional video frames, an affinity value between pixels of the video frame in the one or more additional video frames and the current video frame; and generating the full feature representation for the current video frame based on the affinity value and the groups of features extracted from the contiguous sequence of video frames; segmenting the current video frame based upon the full feature representation to generate a segmentation result, the segmentation result comprising information identifying, for a pixel in the current video frame, a label selected for the pixel based upon the full feature representation, wherein the label is selected from a plurality of labels; and outputting the segmentation result.

15. The system of claim 14, wherein the groups of features, extracted from the video frames in the contiguous sequence of video frames, together represent a total set of features used for segmenting the current video frame.

16. The system of claim 14, wherein the plurality of sub-neural networks comprises a first sub-neural network and a second sub-neural network, the first sub-neural network trained to extract a first group of features from a first video frame in the contiguous sequence of video frames, the second sub-neural network trained to extract a second group of features from a second video frame in the contiguous sequence of video frames, wherein the first video frame is different from the second video frame and the first group of features is different from the second group of features.

17. A non-transitory computer-readable medium having program code that is stored thereon, the program code executable by one or more processing devices for performing operations comprising: extracting, from each video frame in a contiguous sequence of video frames, a group of features using one of a plurality of sub-neural networks, the contiguous sequence of video frames comprising a current video frame and one or more additional video frames occurring in the contiguous sequence prior to the current video frame, and wherein the group of features extracted from the current video frame is different from another group of features extracted from the one or more additional video frames in the contiguous sequence of video frames; generating a full feature representation for the current video frame based upon the groups of features extracted from the contiguous sequence of video frames, wherein generating the full feature representation comprises computing a correlation between pixels of a first video frame in the contiguous sequence and a second video frame in the contiguous sequence, where the first video frame is adjacent to the second video frame in the contiguous sequence and occurs before the second video frame in the contiguous sequence; segmenting the current video frame based upon the full feature representation to generate a segmentation result, the segmentation result comprising information identifying, for a pixel in the current video frame, a label selected for the pixel based upon the full feature representation, wherein the label is selected from a plurality of labels; and outputting the segmentation result.

Patent Metadata

Filing Date

Unknown

Publication Date

June 7, 2022

Inventors

Federico Perazzi

Zhe Lin

Ping Hu

Oliver Wang

Fabian David Caba Heilbron
