Image Recognition Method and Apparatus, Terminal, and Storage Medium

Published: July 19, 2022

Assignee: not available in the USPTO data we have

Inventors: Kaihao ZHANG, Wenhan LUO, Lin MA, Wei LIU

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. An image recognition method, performed by a terminal, and comprising: obtaining a target video comprising a target object; extracting a target video frame image from the target video; generating a key point video frame sequence comprised of a plurality of key point video frames according to key point information of the target object and a plurality of video frames in the target video; extracting dynamic timing feature information of the key point video frame sequence by using an RNN model; extracting static structural feature information of the target video frame image describing the structure of the target object by using a convolutional neural network model; and recognizing an attribute type corresponding to a motion or an expression of the target object presented in the target video according to the dynamic timing feature information of the key point video frame sequence and the static structural feature information of the target video frame image.
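As an illustration of the first steps of claim 1, the following minimal sketch builds the two inputs the claim works with: a key point video frame sequence and a single target video frame image. It is not the patented implementation. The names `key_point_frame` and `build_inputs`, the rendering of key points as 1.0 pixels on a blank image, and the choice of the last frame as the target frame are all assumptions made for the example; the claim does not specify any of them.

```python
from typing import List, Tuple

Frame = List[List[float]]          # grayscale image as rows of pixel values
KeyPoints = List[Tuple[int, int]]  # (row, col) positions of key points


def key_point_frame(frame: Frame, points: KeyPoints) -> Frame:
    """Return an image of the same size with 1.0 at each key point, 0.0 elsewhere."""
    height, width = len(frame), len(frame[0])
    out = [[0.0] * width for _ in range(height)]
    for row, col in points:
        if 0 <= row < height and 0 <= col < width:
            out[row][col] = 1.0
    return out


def build_inputs(video: List[Frame], points_per_frame: List[KeyPoints]):
    """Build the key point frame sequence and pick one target frame."""
    if not video or len(video) != len(points_per_frame):
        raise ValueError("need one key point list per video frame")
    sequence = [key_point_frame(f, p) for f, p in zip(video, points_per_frame)]
    target_frame = video[-1]  # assumption: use the last frame as the target image
    return sequence, target_frame


# Toy video: three 2x3 frames, one tracked key point per frame.
video = [[[0.2, 0.4, 0.6], [0.1, 0.3, 0.5]] for _ in range(3)]
points = [[(0, 0)], [(0, 1)], [(1, 2)]]
sequence, target = build_inputs(video, points)
```

In this sketch the sequence would feed the recurrent branch and the target frame would feed the convolutional branch.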

2. The method according to claim 1, wherein extracting dynamic timing feature information of the key point video frame sequence comprises: extracting key marker areas from the key point video frame sequence; obtaining unit key point video frame sequences according to the key marker areas in the plurality of key point video frames; inputting the unit key point video frame sequences into the RNN model, to extract dynamic timing feature information; and connecting the dynamic timing feature information, to obtain the dynamic timing feature information of the key point video frame sequence.
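The per-area processing in claim 2 can be sketched roughly as follows. This is a toy example under several assumptions: each key point frame is represented as a flat list of coordinates, the key marker areas (`eyes`, `mouth`) are hypothetical index groups, the recurrent model is a plain Elman-style RNN with random, untrained weights, and each area gets its own weight matrices because the areas have different input sizes. The claim itself only requires that unit sequences be passed through the RNN model and that the results be connected.

```python
import math
import random
from typing import Dict, List


def rnn_last_state(seq: List[List[float]], w_in, w_rec) -> List[float]:
    """Run a simple tanh RNN over the sequence and return the final hidden state."""
    hidden = [0.0] * len(w_rec)
    for x in seq:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
    return hidden


def random_matrix(rows: int, cols: int, rng: random.Random):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dynamic_timing_features(frames: List[List[float]],
                            areas: Dict[str, List[int]],
                            hidden_size: int = 4,
                            seed: int = 0) -> List[float]:
    """Split frames into per-area unit sequences, encode each, and concatenate."""
    rng = random.Random(seed)
    features: List[float] = []
    for indices in areas.values():
        # Unit sequence: only the coordinates that belong to this area.
        unit_seq = [[frame[i] for i in indices] for frame in frames]
        w_in = random_matrix(hidden_size, len(indices), rng)  # untrained weights
        w_rec = random_matrix(hidden_size, hidden_size, rng)
        features.extend(rnn_last_state(unit_seq, w_in, w_rec))  # "connecting"
    return features


frames = [
    [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
    [0.12, 0.21, 0.33, 0.41, 0.55, 0.58],
    [0.15, 0.22, 0.36, 0.43, 0.60, 0.55],
]
areas = {"eyes": [0, 1, 2, 3], "mouth": [4, 5]}
features = dynamic_timing_features(frames, areas)
```

With two areas and a hidden size of 4, the connected feature vector has 8 entries.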

3. The method according to claim 1, wherein extracting static structural feature information of the target video frame image comprises: inputting the target video frame image into the convolutional neural network model; and extracting the static structural feature information of the target video frame image through convolution processing and pooling processing of the convolutional neural network model.
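To make "convolution processing and pooling processing" concrete, here is a minimal pure-Python sketch of one convolution layer followed by one max-pooling layer. The hand-written edge kernel, the ReLU activation, and the 2x2 pooling window are assumptions for illustration; a real convolutional neural network would learn many kernels across several layers.

```python
from typing import List

Image = List[List[float]]


def conv2d_valid(image: Image, kernel: Image) -> Image:
    """Valid (no padding) 2D cross-correlation followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            max(0.0, sum(image[r + i][c + j] * kernel[i][j]
                         for i in range(kh) for j in range(kw)))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map: Image, size: int = 2) -> Image:
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# 5x5 image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Kernel that responds to dark-to-bright transitions from left to right.
kernel = [[-1.0, 1.0], [-1.0, 1.0]]

pooled = max_pool(conv2d_valid(image, kernel))
static_features = [value for row in pooled for value in row]
```

The flattened `static_features` list plays the role of the static structural feature information in this toy setting.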

4. The method according to claim 1, wherein recognizing the attribute type comprises: recognizing, according to a classifier in the RNN model, matching degrees between the dynamic timing feature information of the key point video frame sequence and a plurality of attribute type features in the RNN model, and associating the matching degrees obtained through the dynamic timing feature information of the key point video frame sequence with label information corresponding to the plurality of attribute type features in the RNN model, to obtain a first label information set; recognizing, according to a classifier in the convolutional neural network model, matching degrees between the static structural feature information of the target video frame image and a plurality of attribute type features in the convolutional neural network model, and associating the matching degrees obtained through the static structural feature information of the target video frame image with label information corresponding to the plurality of attribute type features in the convolutional neural network model, to obtain a second label information set; and fusing the first label information set and the second label information set, to obtain the attribute type corresponding to the target object in the target video.
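One way to read the "label information sets" in claim 4 is as mappings from each label to a matching degree produced by a classifier. The sketch below assumes a linear layer followed by softmax as the classifier, which the claim does not specify; the labels, feature vectors, and weights are made-up values, and `label_information_set` is a hypothetical helper name.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; outputs sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_information_set(feature: List[float],
                          class_weights: List[List[float]],
                          labels: List[str]) -> Dict[str, float]:
    """Score the feature against each attribute type and attach its label."""
    scores = [sum(w * x for w, x in zip(row, feature)) for row in class_weights]
    return dict(zip(labels, softmax(scores)))


labels = ["happy", "sad", "surprised"]

# Hypothetical outputs of the two branches and their classifier weights.
dynamic_feature = [0.5, -0.2]
static_feature = [0.1, 0.9, 0.3]
rnn_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
cnn_weights = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

first_set = label_information_set(dynamic_feature, rnn_weights, labels)
second_set = label_information_set(static_feature, cnn_weights, labels)
```

Both sets share the same labels, so they can later be combined label by label to pick the final attribute type.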

5. The method according to claim 1, wherein recognizing the attribute type comprises: fusing the dynamic timing feature information of the key point video frame sequence and the static structural feature information of the target video frame image, to obtain fused feature information; recognizing, according to a classifier in the RNN model, matching degrees between the fused feature information and a plurality of attribute type features in the RNN model, and associating the matching degrees obtained through the RNN model with label information corresponding to the plurality of attribute type features in the RNN model, to obtain a first label information set; recognizing, according to a classifier in the convolutional neural network model, matching degrees between the fused feature information and a plurality of attribute type features in the convolutional neural network model, and associating the matching degrees obtained through the convolutional neural network model with label information corresponding to the plurality of attribute type features in the convolutional neural network model, to obtain a second label information set; and fusing the first label information set and the second label information set, to obtain the attribute type corresponding to the target object in the target video.
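Claim 5 differs from claim 4 in that the two feature vectors are fused before classification, so both classifiers see the same fused input. The claim does not say how the fusion is done. The sketch below assumes L2 normalization of each vector followed by concatenation, which is one common choice; `fuse_features` is a hypothetical name.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale the vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fuse_features(dynamic: List[float], static: List[float]) -> List[float]:
    """Assumed fusion: normalize each branch, then concatenate."""
    return l2_normalize(dynamic) + l2_normalize(static)


fused = fuse_features([3.0, 4.0], [0.0, 5.0, 0.0])
```

Normalizing first keeps one branch from dominating the fused vector simply because its features have a larger scale.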

6. The method according to claim 4, wherein fusing the first label information set and the second label information set, to obtain the attribute type comprises: performing weighted averaging on the matching degrees associated with the first label information set and the second label information set to obtain a target label information set; and extracting label information from the target label information set to obtain extracted label information, and using the extracted label information as the attribute type.
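The weighted averaging in claim 6 can be sketched in a few lines. The weight value and the rule for extracting label information (taking the label with the highest averaged matching degree) are assumptions for the example, as are the function names and the sample matching degrees.

```python
from typing import Dict


def fuse_label_sets(first: Dict[str, float],
                    second: Dict[str, float],
                    first_weight: float = 0.5) -> Dict[str, float]:
    """Weighted average of the matching degrees, label by label."""
    if set(first) != set(second):
        raise ValueError("both label sets must cover the same labels")
    return {
        label: first_weight * first[label] + (1.0 - first_weight) * second[label]
        for label in first
    }


def extract_attribute_type(target_set: Dict[str, float]) -> str:
    """Assumed extraction rule: the label with the largest matching degree."""
    return max(target_set, key=target_set.get)


first_set = {"happy": 0.6, "sad": 0.3, "surprised": 0.1}   # e.g. from the RNN
second_set = {"happy": 0.3, "sad": 0.1, "surprised": 0.6}  # e.g. from the CNN

equal_weights = fuse_label_sets(first_set, second_set, first_weight=0.5)
cnn_favored = fuse_label_sets(first_set, second_set, first_weight=0.2)
```

With equal weights the fused set favors "happy" (0.45 against 0.35 for "surprised"); shifting the weight toward the second set flips the result to "surprised", which shows how the weights trade off the two branches.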

7. The method according to claim 3, further comprising: obtaining a first sample image and a second sample image; extracting static structural feature information of the first sample image; extracting static structural feature information of the second sample image; and determining a model loss value according to the static structural feature information of the first sample image and the static structural feature information of the second sample image.

8. The method according to claim 7, wherein determining the model loss value comprises: generating a first recognition loss value of the first sample image; generating a second recognition loss value of the second sample image; generating a verification loss value according to the static structural feature information of the first sample image and the static structural feature information of the second sample image; and generating the model loss value according to the first recognition loss value of the first sample image, the second recognition loss value of the second sample image, and the verification loss value.
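Claims 7 and 8 describe a training loss built from two recognition losses and one verification loss over a pair of sample images. The claims do not give formulas. The sketch below assumes cross-entropy for the recognition losses and a contrastive loss with a margin for the verification loss, combined as a weighted sum; these are common choices, not necessarily the ones used in the patent, and every name and parameter here is hypothetical.

```python
import math
from typing import List


def recognition_loss(probs: List[float], true_index: int) -> float:
    """Cross-entropy of the predicted class probabilities for one sample."""
    return -math.log(max(probs[true_index], 1e-12))


def verification_loss(feat_a: List[float], feat_b: List[float],
                      same_class: bool, margin: float = 1.0) -> float:
    """Contrastive loss: pull same-class features together, push others apart."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
    if same_class:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2


def model_loss(probs_a: List[float], label_a: int, feat_a: List[float],
               probs_b: List[float], label_b: int, feat_b: List[float],
               verification_weight: float = 1.0) -> float:
    """Sum of both recognition losses plus the weighted verification loss."""
    return (
        recognition_loss(probs_a, label_a)
        + recognition_loss(probs_b, label_b)
        + verification_weight
        * verification_loss(feat_a, feat_b, label_a == label_b)
    )


# Two sample images of the same class whose features are 5.0 apart.
loss = model_loss([0.5, 0.5], 0, [0.0, 0.0],
                  [0.5, 0.5], 0, [3.0, 4.0])
```

In this example each recognition loss is ln 2 and the verification loss is 0.5 * 5.0^2 = 12.5, so the pair is penalized mainly for having distant features despite sharing a class.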

9. An image recognition apparatus, comprising: a memory and a processor coupled to the memory, the processor being configured to: obtain a target video comprising a target object; extract a target video frame image from the target video; generate a key point video frame sequence comprised of a plurality of key point video frames according to key point information of the target object and a plurality of video frames in the target video; extract dynamic timing feature information of the key point video frame sequence by using an RNN model; extract static structural feature information of the target video frame image describing the structure of the target object by using a convolutional neural network model; and recognize an attribute type corresponding to a motion or an expression of the target object presented in the target video according to the dynamic timing feature information of the key point video frame sequence and the static structural feature information of the target video frame image.

10. The apparatus according to claim 9, wherein the processor is further configured to: extract key marker areas from the key point video frame sequence; obtain unit key point video frame sequences according to the key marker areas in the plurality of key point video frames; input the unit key point video frame sequences into the RNN model, to extract dynamic timing feature information; and connect the dynamic timing feature information to obtain the dynamic timing feature information of the key point video frame sequence.

11. The apparatus according to claim 9, wherein the processor is further configured to: input the target video frame image into the convolutional neural network model; and extract the static structural feature information of the target video frame image through convolution processing and pooling processing of the convolutional neural network model.

12. The apparatus according to claim 9, wherein the processor is further configured to: recognize, according to a classifier in the RNN model, matching degrees between the dynamic timing feature information of the key point video frame sequence and a plurality of attribute type features in the RNN model, and associate the matching degrees obtained through the dynamic timing feature information of the key point video frame sequence with label information corresponding to the plurality of attribute type features in the RNN model, to obtain a first label information set; recognize, according to a classifier in the convolutional neural network model, matching degrees between the static structural feature information of the target video frame image and a plurality of attribute type features in the convolutional neural network model, and associate the matching degrees obtained through the static structural feature information of the target video frame image with label information corresponding to the plurality of attribute type features in the convolutional neural network model, to obtain a second label information set; and fuse the first label information set and the second label information set, to obtain the attribute type corresponding to the target object in the target video.

13. The apparatus according to claim 9, wherein the processor is further configured to: fuse the dynamic timing feature information of the key point video frame sequence and the static structural feature information of the target video frame image, to obtain fused feature information; recognize, according to a classifier in the RNN model, matching degrees between the fused feature information and a plurality of attribute type features in the RNN model, and associate the matching degrees obtained through the RNN model with label information corresponding to the plurality of attribute type features in the RNN model, to obtain a first label information set; recognize, according to a classifier in the convolutional neural network model, matching degrees between the fused feature information and a plurality of attribute type features in the convolutional neural network model, and associate the matching degrees obtained through the convolutional neural network model with label information corresponding to the plurality of attribute type features in the convolutional neural network model, to obtain a second label information set; and fuse the first label information set and the second label information set, to obtain the attribute type corresponding to the target object in the target video.

14. The apparatus according to claim 12, wherein the processor is further configured to: perform weighted averaging on the matching degrees associated with the first label information set and the second label information set to obtain a target label information set; and extract label information from the target label information set to obtain extracted label information, and use the extracted label information as the attribute type.

15. The apparatus according to claim 11, wherein the processor is further configured to: obtain a first sample image and a second sample image; extract static structural feature information of the first sample image; extract static structural feature information of the second sample image; and determine a model loss value according to the static structural feature information of the first sample image and the static structural feature information of the second sample image.

16. The apparatus according to claim 15, wherein the processor is further configured to: generate a first recognition loss value of the first sample image; generate a second recognition loss value of the second sample image; generate a verification loss value according to the static structural feature information of the first sample image and the static structural feature information of the second sample image; and generate the model loss value according to the first recognition loss value of the first sample image, the second recognition loss value of the second sample image, and the verification loss value.

17. A non-transitory computer-readable storage medium, storing a computer-readable instruction, the computer-readable instruction, when executed by one or more processors, causing the one or more processors to perform: obtaining a target video comprising a target object; extracting a target video frame image from the target video; generating a key point video frame sequence comprised of a plurality of key point video frames according to key point information of the target object and a plurality of video frames in the target video; extracting dynamic timing feature information of the key point video frame sequence by using an RNN model; extracting static structural feature information of the target video frame image describing the structure of the target object by using a convolutional neural network model; and recognizing an attribute type corresponding to the target object in the target video according to the dynamic timing feature information of the key point video frame sequence and the static structural feature information of the target video frame image.

18. The storage medium according to claim 17, wherein the computer-readable instruction further causes the one or more processors to perform: extracting key marker areas from the key point video frame sequence; obtaining unit key point video frame sequences according to the key marker areas in the plurality of key point video frames; inputting the unit key point video frame sequences into the RNN model, to extract dynamic timing feature information; and connecting the dynamic timing feature information to obtain the dynamic timing feature information of the key point video frame sequence.

19. The storage medium according to claim 17, wherein recognizing the attribute type comprises: recognizing, according to a classifier in the RNN model, matching degrees between the dynamic timing feature information of the key point video frame sequence and a plurality of attribute type features in the RNN model, and associating the matching degrees obtained through the dynamic timing feature information of the key point video frame sequence with label information corresponding to the plurality of attribute type features in the RNN model, to obtain a first label information set; recognizing, according to a classifier in the convolutional neural network model, matching degrees between the static structural feature information of the target video frame image and a plurality of attribute type features in the convolutional neural network model, and associating the matching degrees obtained through the static structural feature information of the target video frame image with label information corresponding to the plurality of attribute type features in the convolutional neural network model, to obtain a second label information set; and fusing the first label information set and the second label information set, to obtain the attribute type corresponding to the target object in the target video.

20. The storage medium according to claim 17, wherein recognizing the attribute type comprises: fusing the dynamic timing feature information of the key point video frame sequence and the static structural feature information of the target video frame image, to obtain fused feature information; recognizing, according to a classifier in the RNN model, matching degrees between the fused feature information and a plurality of attribute type features in the RNN model, and associating the matching degrees obtained through the RNN model with label information corresponding to the plurality of attribute type features in the RNN model, to obtain a first label information set; recognizing, according to a classifier in the convolutional neural network model, matching degrees between the fused feature information and a plurality of attribute type features in the convolutional neural network model, and associating the matching degrees obtained through the convolutional neural network model with label information corresponding to the plurality of attribute type features in the convolutional neural network model, to obtain a second label information set; and fusing the first label information set and the second label information set, to obtain the attribute type corresponding to the target object in the target video.

Patent Metadata

Filing Date: Unknown

Publication Date: July 19, 2022

Inventors: Kaihao ZHANG, Wenhan LUO, Lin MA, Wei LIU
