Method and Apparatus and Computer Device for Automatic Semantic Annotation for an Image

Published: August 17, 2021

Assignee: not available in the USPTO data

Inventors: Xiao LIU, Jiang WANG, Shilei WEN, Errui DING

Patent Claims

12 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of training an image semantic annotation apparatus, comprising:
a. providing a plurality of training images, wherein semantics and visual attribute descriptions of respective training images are known, by automatically parsing, through a computer device, webpages including images to obtain therefrom the plurality of training images, the known semantics, and the known visual attribute descriptions of respective ones of the training images, the known semantics include coarse-grained semantics and fine-grained semantics that are not completely identical to the corresponding coarse-grained semantics for a respective one of the plurality of training images, wherein the coarse-grained semantics corresponds to a coarse-grained classification object, and the different fine-grained semantics corresponds to different fine-grained classification objects belonging to the same coarse-grained classification object, each of the fine-grained classification objects including a plurality of feature parts, the visual attribute descriptions being divided into different groups based on their corresponding feature parts, each of the visual attribute descriptions expressing a local visual appearance of corresponding ones of the feature parts, wherein the webpages include textual data relating to the images, and wherein the known semantics and known visual attribute descriptions result from capturing the textual data;
b. inputting the training images each including a given fine-grained classification object to a locator of the image semantic annotation apparatus, wherein the locator is configured to determine a coordinate on the training image, determine a local area on the training image based on the coordinate, and determine a part of the fine-grained classification object within the local area as the feature part;
c. determining, by the locator, a plurality of local areas of each input training image, wherein a location of the at least one local area on the input training image is determined by probability distribution sampling of the locator outputs of the plurality of feature parts, wherein each of the determined local areas of each input training image comprises one feature part of the given fine-grained classification object included in the each input training image, and the different determined local areas in the each input training image comprise different feature parts of the given fine-grained classification object, and inputting the determined respective local areas into an attribute predictor of the image semantic annotation apparatus, the plurality of local areas including a coordinate on the input training image and having a size less than a size of the input training image;
d. obtaining a visual attribute prediction result of each input local area from the attribute predictor, wherein the visual attribute prediction result of each input local area comprises a visual attribute description of the feature part located in the each input local area;
e. training the locator and the attribute predictor according to the obtained visual attribute prediction result of each local area and a known visual attribute description of the corresponding training image, and for each of the feature parts to be located on the corresponding training image, repeating steps a to e until convergence to complete training of the locator and of the attribute predictor, wherein the known visual attribute description of the corresponding training image comprises known visual attribute descriptions of the feature parts of the given fine-grained classification object in the corresponding training image;
f. selecting at least part of training images from the plurality of training images;
g. by the trained locator, locating, on the training image, the plurality of feature parts of the given fine-grained classification object corresponding to each of the selected training images by processing each of the selected training images, wherein the locating comprises determining the coordinate on the each training image, and determining the feature part of the given fine-grained classification object based on the coordinate on the each training image;
h. inputting the feature parts located for the given fine-grained classification object of each of the selected training images and the known fine-grained semantic of the given fine-grained classification object in the each training image into a classifier of the image semantics annotation apparatus to train the classifier.
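Step c of claim 1 places each local area by probability distribution sampling of the locator outputs. The claim does not name the distribution, so the following Python sketch is only one possible reading: it assumes the locator emits a mean coordinate per feature part and that each location is drawn from an isotropic Gaussian around that mean. The function name `sample_part_locations`, the `sigma` parameter, and the part names are hypothetical.

```python
import random

def sample_part_locations(locator_outputs, image_size, sigma=5.0, rng=None):
    """Sample one (x, y) coordinate per feature part.

    locator_outputs: dict of part name -> (mean_x, mean_y); this output
        format is an assumption, not something the claim specifies.
    image_size: (width, height) of the training image in pixels.
    """
    rng = rng or random.Random()
    width, height = image_size
    samples = {}
    for part, (mean_x, mean_y) in locator_outputs.items():
        # Assumed distribution: Gaussian centred on the locator output.
        x = rng.gauss(mean_x, sigma)
        y = rng.gauss(mean_y, sigma)
        # Clamp so the sampled coordinate stays on the image.
        x = min(max(x, 0.0), width - 1)
        y = min(max(y, 0.0), height - 1)
        samples[part] = (x, y)
    return samples

if __name__ == "__main__":
    outputs = {"head": (10.0, 12.0), "wing": (60.0, 30.0)}
    print(sample_part_locations(outputs, (100, 50), rng=random.Random(0)))
```

Sampling rather than taking the locator output directly lets training explore nearby locations for each feature part; how wide that exploration is (here `sigma`) is a design choice the claim leaves open.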

2. The method according to claim 1, wherein the step e comprises: for each of the local areas, calculating a loss function according to the visual attribute prediction result of the local area and the visual attribute description of the corresponding training image, for training the locator and the attribute predictor.
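Claim 2 computes a loss per local area from the predicted and the known visual attribute description. The claim does not fix the form of the loss; the sketch below assumes a cross-entropy over a predicted probability distribution, which is a common choice but not one the source confirms. The names `attribute_loss` and `image_loss` and the example attribute values are hypothetical.

```python
import math

def attribute_loss(predicted_probs, known_attribute):
    """Cross-entropy for one local area (the loss choice is an assumption).

    predicted_probs: dict of attribute value -> predicted probability.
    known_attribute: the attribute value taken from the known description.
    """
    eps = 1e-12  # guards against log(0)
    return -math.log(max(predicted_probs[known_attribute], eps))

def image_loss(predictions_per_area, known_attribute_per_part):
    """Sum the per-area losses over all feature parts of one training image."""
    return sum(
        attribute_loss(predictions_per_area[part], known_attribute_per_part[part])
        for part in predictions_per_area
    )

if __name__ == "__main__":
    preds = {
        "head": {"red": 0.5, "black": 0.5},
        "wing": {"red": 0.25, "black": 0.75},
    }
    known = {"head": "red", "wing": "black"}
    print(image_loss(preds, known))
```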

3. The method according to claim 2, wherein the step e further comprises: calculating gradients of the locator and the attribute predictor according to a reverse propagation algorithm to determine or update parameters of the locator and the attribute predictor.
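The gradient update in claim 3 can be illustrated on a deliberately tiny model. The sketch below assumes the attribute predictor is a single linear layer followed by a softmax and trained with a cross-entropy loss; the claim does not describe the network architecture, and the locator's gradients are not covered here. All function names and the learning rate are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def forward(W, x):
    # logits[k] = sum_j W[k][j] * x[j], followed by a softmax
    return softmax([sum(w * xj for w, xj in zip(row, x)) for row in W])

def loss(W, x, y):
    # Cross-entropy against the known attribute index y (assumed loss)
    return -math.log(forward(W, x)[y])

def sgd_step(W, x, y, lr=0.5):
    """One backpropagation step for the linear layer.

    For softmax + cross-entropy, dL/dlogit_k = p_k - 1[k == y], and by the
    chain rule dL/dW[k][j] = dL/dlogit_k * x[j].
    """
    p = forward(W, x)
    return [
        [w - lr * (p[k] - (1.0 if k == y else 0.0)) * xj for w, xj in zip(row, x)]
        for k, row in enumerate(W)
    ]

if __name__ == "__main__":
    W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 attribute values, 2 features
    x, y = [1.0, 2.0], 1
    print(loss(W, x, y), loss(sgd_step(W, x, y), x, y))
```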

4. The method according to claim 1, wherein the step h comprises: calculating a convolution feature of each located feature part for each of the selected training images; generating a vector for the training image according to the calculated convolution feature of each feature part; and training the classifier by a support vector machine according to the generated vector.
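The classifier stage of claim 4 can be sketched as follows, under loud assumptions: the per-part feature values are toy numbers standing in for convolution features, the image vector is formed by simple concatenation, and the support vector machine is a binary linear SVM trained by subgradient descent on the hinge loss. The claim specifies none of these details, and every name here is hypothetical.

```python
def image_vector(part_features):
    """Concatenate the feature vectors of all located feature parts."""
    vec = []
    for feature in part_features:
        vec.extend(feature)
    return vec

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Binary linear SVM (labels +1 / -1) via hinge-loss subgradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin: regulariser plus hinge subgradient
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Outside the margin: only the regulariser acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

if __name__ == "__main__":
    # Two feature parts per image, two toy feature values per part.
    images = [
        [[1.0, 0.0], [0.9, 0.1]],
        [[0.8, 0.2], [1.0, 0.0]],
        [[0.0, 1.0], [0.1, 0.9]],
        [[0.2, 0.8], [0.0, 1.0]],
    ]
    labels = [1, 1, -1, -1]  # two fine-grained classes
    X = [image_vector(parts) for parts in images]
    w, b = train_linear_svm(X, labels)
    print([predict(w, b, x) for x in X])
```

A multi-class fine-grained classifier would need several such binary machines (for example one per class); that extension is omitted here.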

5. The method according to claim 4, wherein the step h further comprises: for each selected training image, calculating an overall convolution feature of the training image; the step of generating a vector for the image further comprises: generating the vector for the training image according to the calculated overall convolution feature of the image and the calculated convolution feature of each feature part of the image.
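To make the combination in claim 5 concrete, the sketch below computes a toy "convolution feature" for the whole image and for each feature part, then concatenates them. It assumes hand-written kernels with global max pooling in place of a trained convolutional network, and rectangular part boxes given as (top, left, height, width); none of this comes from the claims, and all names are hypothetical.

```python
def conv2d_valid(img, kernel):
    """Valid-mode 2D cross-correlation, as used in convolutional networks."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(img) - kh + 1):
        row = []
        for c in range(len(img[0]) - kw + 1):
            row.append(sum(img[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

def conv_feature(img, kernels):
    """One value per kernel: global max over that kernel's feature map."""
    return [max(max(row) for row in conv2d_valid(img, k)) for k in kernels]

def crop(img, top, left, height, width):
    return [row[left:left + width] for row in img[top:top + height]]

def image_vector_with_overall(img, part_boxes, kernels):
    """Overall feature first, then the feature of each located part."""
    vec = conv_feature(img, kernels)
    for top, left, height, width in part_boxes:
        vec.extend(conv_feature(crop(img, top, left, height, width), kernels))
    return vec

if __name__ == "__main__":
    img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    kernels = [[[1, 0], [0, 1]]]
    print(image_vector_with_overall(img, [(0, 0, 2, 2), (2, 2, 2, 2)], kernels))
```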

6. The method according to claim 1, wherein a size of each local area is preset, and the locator is configured to determine the local area on the training image based on the coordinate and the size of the local area.
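A minimal sketch of claim 6 follows, assuming the coordinate is the centre of the local area and that the area is shifted to stay inside the image. The claim only says the area is determined from the coordinate and the preset size, so both choices, and the name `local_area_from_center`, are assumptions.

```python
def local_area_from_center(coord, area_size, image_size):
    """Return (left, top, right, bottom) for a preset-size local area.

    coord: (x, y) from the locator, treated here as the area centre.
    area_size: preset (width, height) of the local area.
    image_size: (width, height) of the training image.
    """
    cx, cy = coord
    w, h = area_size
    img_w, img_h = image_size
    if w > img_w or h > img_h:
        raise ValueError("local area must fit inside the image")
    left = int(round(cx - w / 2))
    top = int(round(cy - h / 2))
    # Shift the box back onto the image when the coordinate is near an edge.
    left = min(max(left, 0), img_w - w)
    top = min(max(top, 0), img_h - h)
    return (left, top, left + w, top + h)

if __name__ == "__main__":
    print(local_area_from_center((50, 40), (20, 10), (100, 80)))
```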

7. The method according to claim 1, wherein the locator is configured to determine a plurality of coordinates on the training image, and determine the local area by using positions of the plurality of coordinates as corners of the local area.
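The corner-based variant of claim 7 could look like the sketch below, assuming an axis-aligned rectangular local area spanned by the given corner coordinates. The claim does not restrict the shape, and `local_area_from_corners` is a hypothetical name.

```python
def local_area_from_corners(coords):
    """Return (left, top, right, bottom) spanned by the corner coordinates.

    coords: list of (x, y) positions output by the locator; two opposite
    corners are enough for an axis-aligned rectangle (assumed shape).
    """
    if len(coords) < 2:
        raise ValueError("at least two corner coordinates are required")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

if __name__ == "__main__":
    print(local_area_from_corners([(30, 10), (12, 44)]))
```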

8. The method according to claim 1, wherein the locator is configured to determine a plurality of coordinates on the training image, and determine the local area by using a position of one of the coordinates as a center point of the local area, and using positions of the other coordinates as a boundary.
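One possible reading of claim 8 is sketched below: the first coordinate is the centre, and the remaining coordinates set how far the area extends horizontally and vertically, giving a rectangle symmetric about the centre. The claim does not say how boundary coordinates map to a shape, so this interpretation and the name `local_area_from_center_and_boundary` are assumptions.

```python
def local_area_from_center_and_boundary(coords):
    """Return (left, top, right, bottom) around a centre point.

    coords[0] is treated as the centre; every other coordinate lies on the
    boundary and fixes the half-width or half-height (assumed reading).
    """
    if len(coords) < 2:
        raise ValueError("need a centre and at least one boundary coordinate")
    (cx, cy), boundary = coords[0], coords[1:]
    half_w = max(abs(bx - cx) for bx, _ in boundary)
    half_h = max(abs(by - cy) for _, by in boundary)
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

if __name__ == "__main__":
    print(local_area_from_center_and_boundary([(50, 40), (60, 40), (50, 25)]))
```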

9. A computer device that can train itself, comprising: a processor and a memory, the processor being configured to:
provide a plurality of training images, wherein semantics and visual attribute descriptions of respective training images are known, by automatically parsing, through the computer device that can train itself, webpages including images to obtain therefrom the plurality of training images, the known semantics, and the known visual attribute descriptions of respective ones of the training images, the known semantics include coarse-grained semantics and fine-grained semantics that are not completely identical to the corresponding coarse-grained semantics for a respective one of the plurality of training images, wherein the coarse-grained semantics corresponds to a coarse-grained classification object, and the different fine-grained semantics corresponds to different fine-grained classification objects belonging to the same coarse-grained classification object, each of the fine-grained classification objects including a plurality of feature parts, the visual attribute descriptions being divided into different groups based on their corresponding feature parts, each of the visual attribute descriptions expressing a local visual appearance of corresponding ones of the feature parts, wherein the webpages include textual data relating to the images, and wherein the known semantics and known visual attribute descriptions result from capturing the textual data;
input the training images each including a given fine-grained classification object to a locator, wherein the locator is configured to determine a coordinate on the training image, determine a local area on the training image based on the coordinate, and determine a part of the fine-grained classification object within the local area as the feature part;
determine, by the locator, a plurality of local areas of each input training image, wherein a location of the at least one local area on the input training image is determined by probability distribution sampling of the locator outputs of the plurality of feature parts, wherein each of the determined local areas of each input training image comprises one feature part of the given fine-grained classification object, and the different determined local areas in the each input training image comprise different feature parts of the given fine-grained classification object, the plurality of local areas including a coordinate on the input training image and having a size less than a size of the input training image;
obtain, with the each local area determined, a visual attribute prediction result of each input local area, wherein the visual attribute prediction result of each input local area comprises a visual attribute description of the feature part located in the each input local area;
train the computer device according to the obtained visual attribute prediction result of each local area and a known visual attribute description of the corresponding training image, and for each of the feature parts to be located on the corresponding training image, perform the aforementioned operations until convergence to complete training of the computer device, wherein the known visual attribute description of the corresponding training image comprises known visual attribute descriptions of the feature parts of the given fine-grained classification object in the corresponding training image;
select at least part of training images from the plurality of training images;
locate, on the training image, the plurality of feature parts of the given fine-grained classification object corresponding to each of the selected training images by processing each of the selected training images, wherein the locating comprises determining the coordinate on the each training image, and determining the feature part of the given fine-grained classification object based on the coordinate on the each training image; and
input the feature parts located for the given fine-grained classification object of each of the selected training images and the known fine-grained semantic of the given fine-grained classification object in the each training image into a classifier to train the computer device.
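The first operation of claim 9 obtains training images and their textual data by automatically parsing webpages. A minimal sketch using Python's standard `html.parser` module follows. It assumes, purely for illustration, that the relevant textual data sits in each image's `alt` attribute; the claim does not say where on the page the text is found. `ImageTextParser` and `parse_training_candidates` are hypothetical names.

```python
from html.parser import HTMLParser

class ImageTextParser(HTMLParser):
    """Collect image sources together with their accompanying text."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:  # skip images without a usable source
            # Assumption: the alt text carries the textual data to capture.
            self.records.append({"src": src, "text": attributes.get("alt") or ""})

def parse_training_candidates(html):
    parser = ImageTextParser()
    parser.feed(html)
    parser.close()
    return parser.records

if __name__ == "__main__":
    page = '<p><img src="a.jpg" alt="red-headed woodpecker, black wings"></p>'
    print(parse_training_candidates(page))
```

Turning the captured text into coarse-grained semantics, fine-grained semantics, and per-part attribute descriptions is a separate step that this sketch does not attempt.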

10. The computer device according to claim 9, wherein the processor is further configured to: for each of the local areas, calculate a loss function according to the visual attribute prediction result of the local area and the visual attribute description of the corresponding training image, for training the computer device.

11. The computer device according to claim 10, wherein the processor is further configured to: calculate gradients according to a reverse propagation algorithm to determine or update corresponding parameters.

12. A non-transitory computer readable non-volatile memory storing a computer program, the computer program configured to, when executed by a computer device, cause the computer device to perform a method comprising the following steps:
a. providing a plurality of training images, wherein semantics and visual attribute descriptions of respective training images are known, by automatically parsing, through a computer device, webpages including images to obtain therefrom the plurality of training images, the known semantics, and the known visual attribute descriptions of respective ones of the training images, the known semantics include coarse-grained semantics and fine-grained semantics that are not completely identical to the corresponding coarse-grained semantics for a respective one of the plurality of training images, wherein the coarse-grained semantics corresponds to a coarse-grained classification object, and the different fine-grained semantics corresponds to different fine-grained classification objects belonging to the same coarse-grained classification object, each of the fine-grained classification objects including a plurality of feature parts, the visual attribute descriptions being divided into different groups based on their corresponding feature parts, each of the visual attribute descriptions expressing a local visual appearance of corresponding ones of the feature parts, wherein the webpages include textual data relating to the images, and wherein the known semantics and known visual attribute descriptions result from capturing the textual data;
b. inputting the training images each including a given fine-grained classification object to a locator of the image semantic annotation apparatus, wherein the locator is configured to determine a coordinate on the training image, determine a local area on the training image based on the coordinate, and determine a part of the fine-grained classification object within the local area as the feature part;
c. determining, by the locator, a plurality of local areas of each input training image, wherein a location of the at least one local area on the input training image is determined by probability distribution sampling of the locator outputs of the plurality of feature parts, wherein each of the determined local areas of each input training image comprises one feature part of the given fine-grained classification object included in the each input training image, and the different determined local areas in the each input training image comprise different feature parts of the given fine-grained classification object, and inputting the determined respective local areas into an attribute predictor of the image semantic annotation apparatus, the plurality of local areas including a coordinate on the input training image and having a size less than a size of the input training image;
d. obtaining a visual attribute prediction result of each input local area from the attribute predictor, wherein the visual attribute prediction result of each input local area comprises a visual attribute description of the feature part located in the each input local area;
e. training the locator and the attribute predictor according to the obtained visual attribute prediction result of each local area and a known visual attribute description of the corresponding training image, and for each of the feature parts to be located on the corresponding training image, repeating steps a to e until convergence to complete training of the locator and of the attribute predictor, wherein the known visual attribute description of the corresponding training image comprises known visual attribute descriptions of the feature parts of the given fine-grained classification object in the corresponding training image;
f. selecting at least part of training images from the plurality of training images;
g. via the trained locator, locating, on the training image, the plurality of feature parts of the given fine-grained classification object corresponding to each of the selected training images by processing each of the selected training images, wherein the locating comprises determining the coordinate on the each training image, and determining the feature part of the given fine-grained classification object based on the coordinate on the each training image;
h. inputting the feature parts located for the given fine-grained classification object of each of the selected training images and the known fine-grained semantic of the fine-grained classification object in the training image into a classifier of the image semantics annotation apparatus to train the classifier.
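The training data described in step a (coarse-grained semantics, fine-grained semantics, and visual attribute descriptions divided into groups by feature part) can be pictured as a small record type. The sketch below is illustrative only: the bird example, the field names, and the keyword-matching rule used to group descriptions by part are all assumptions that do not appear in the claims.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRecord:
    image_path: str
    coarse_semantics: str          # e.g. the coarse-grained object class
    fine_semantics: str            # e.g. a specific kind within that class
    attributes_by_part: dict = field(default_factory=dict)

def group_attributes(descriptions, part_vocabulary):
    """Group attribute descriptions by the feature part they mention.

    Assumed rule: a description belongs to a part if the part name occurs
    as a whole word in the description.
    """
    groups = {part: [] for part in part_vocabulary}
    for description in descriptions:
        words = description.lower().split()
        for part in part_vocabulary:
            if part in words:
                groups[part].append(description)
    return groups

if __name__ == "__main__":
    record = TrainingRecord(
        image_path="a.jpg",
        coarse_semantics="bird",
        fine_semantics="red-headed woodpecker",
        attributes_by_part=group_attributes(
            ["red head", "black wing", "white wing bars"], ["head", "wing"]
        ),
    )
    print(record)
```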

Patent Metadata

Filing Date: Unknown

