US-20260072979-A1

Image Encoder Training Method and Apparatus, Device, and Medium

Published: March 12, 2026

Assignee: not available in USPTO data we have

Inventors: Sen YANG, Jinxi Xiang, Jun Zhang, Xiao Han

Technical Abstract

Provided is a method for searching for a whole slide image performed by a computer device, which relates to the field of artificial intelligence. The method includes: cropping a whole slide image into a plurality of tissue images; generating, through an image encoder, image feature vectors respectively corresponding to the plurality of tissue images; clustering the image feature vectors respectively corresponding to the plurality of tissue images, to determine at least one key image from the plurality of tissue images; querying, based on image feature vectors respectively corresponding to the at least one key image, a database to obtain at least one target image package corresponding to the at least one key image; and determining a whole slide image to which at least one candidate tissue image comprised in the at least one target image package respectively belongs as a final search result.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for training an image encoder performed by a computer device, the method comprising: obtaining a first sample tissue image and a plurality of second sample tissue images, the second sample tissue images being negative samples in contrastive learning; performing data enhancement on the first sample tissue image, to obtain a first image; and inputting the first image into a first image encoder, to obtain a first feature vector, the first image being a positive sample in the contrastive learning; performing data enhancement on the first sample tissue image, to obtain a second image; and inputting the second image into a second image encoder, to obtain a second feature vector, the second image being an anchor image in the contrastive learning, and the second image being different from the first image; inputting the plurality of second sample tissue images into the first image encoder, to obtain feature vectors respectively corresponding to the plurality of second sample tissue images; clustering the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers; and generating a plurality of weights based on similarity values between the plurality of clustering centers and the first feature vector; generating, based on the first feature vector and the second feature vector, a first sub-function configured for representing an error between the anchor image and the positive sample; based on the second feature vector and the feature vectors respectively corresponding to the plurality of second sample tissue images, generating, in combination with the plurality of weights, a second sub-function configured for representing an error between the anchor image and the negative sample; and generating a first weight loss function based on the first sub-function and the second sub-function; and training the first image encoder and the second image encoder based on the first weight loss function.

2. The method according to claim 1, wherein the clustering the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers comprises: when the first sample tissue image belongs to a first sample tissue image in a first training batch, clustering the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers corresponding to the first training batch; and when the first sample tissue image belongs to a first sample tissue image in an nth training batch, updating a plurality of clustering centers corresponding to an (n−1)th training batch to a plurality of clustering centers corresponding to the nth training batch, n being a positive integer greater than 1.

3. The method according to claim 2, wherein the updating a plurality of clustering centers corresponding to an (n−1)th training batch to a plurality of clustering centers corresponding to the nth training batch comprises: for a jth clustering center in the plurality of clustering centers in the (n−1)th training batch, updating the jth clustering center in the (n−1)th training batch based on a first sample tissue image of a jth category in the nth training batch, to obtain a jth clustering center corresponding to the nth training batch, j being a positive integer.
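As an illustration only, the per-batch update of a clustering center might be sketched as follows. The claim does not fix the update rule; the moving-average form, the cosine-based choice of the category j, the `momentum` value, and the function names are all assumptions introduced for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def update_centers(prev_centers, feature, momentum=0.9):
    """Derive the centers for batch n from those of batch n-1 and one new feature.

    Assumption: the feature's category j is the most similar previous center,
    and only that center is moved toward the feature (moving average).
    """
    j = max(range(len(prev_centers)), key=lambda i: cosine(feature, prev_centers[i]))
    centers = [list(c) for c in prev_centers]  # copy; batch n-1 centers stay intact
    centers[j] = [momentum * c + (1.0 - momentum) * f
                  for c, f in zip(prev_centers[j], feature)]
    return j, centers
```

Only the center of the matched category changes; the other centers carry over unchanged from the previous batch.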

4. The method according to claim 1, wherein values of the weights are in a negative correlation with the similarity values between the clustering centers and the first feature vector; and for the jth clustering center in the plurality of clustering centers, feature vectors comprised in a category of the jth clustering center correspond to the same weight.

5. The method according to claim 1, wherein the inputting the second image into a second image encoder, to obtain a second feature vector comprises: inputting the second image into the second image encoder, to obtain a first intermediate feature vector; and inputting the first intermediate feature vector into a first multilayer perceptron (MLP), to obtain the second feature vector.

6. The method according to claim 1, wherein after the training the first image encoder and the second image encoder based on the first weight loss function, the method further comprises: updating, according to a parameter of the second image encoder, a parameter of the first image encoder in a weighting manner.
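One possible reading of updating a parameter "in a weighting manner" is a weighted average of the two encoders' parameters, as is common for momentum encoders in contrastive learning. The sketch below assumes that reading; the flat parameter lists, the `momentum` coefficient, and the function name are hypothetical and are not taken from the claim.

```python
def weighted_update(first_params, second_params, momentum=0.99):
    """Return new first-encoder parameters as a weighted average.

    Assumed form: new_first = momentum * first + (1 - momentum) * second.
    """
    return [momentum * p1 + (1.0 - momentum) * p2
            for p1, p2 in zip(first_params, second_params)]
```

With a momentum close to 1, the first image encoder changes slowly and follows the second image encoder, which is the encoder directly trained on the loss.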

7. The method according to claim 1, further comprising: performing data enhancement on the first sample tissue image, to obtain a third image; and inputting the third image into a third image encoder, to obtain a third feature vector, the third image being an anchor image in the contrastive learning, and the third image being different from the first image and the second image; generating, based on the first feature vector and the third feature vector, a third sub-function configured for representing an error between the anchor image and the positive sample; based on the third feature vector and the feature vectors respectively corresponding to the plurality of second sample tissue images, generating, in combination with the plurality of weights, a fourth sub-function configured for representing an error between the anchor image and the negative sample; and generating a second weight loss function based on the third sub-function and the fourth sub-function; and training the first image encoder and the third image encoder based on the second weight loss function.

8. The method according to claim 7, wherein the inputting the third image into a third image encoder, to obtain a third feature vector comprises: inputting the third image into the third image encoder, to obtain a second intermediate feature vector; and inputting the second intermediate feature vector into a second MLP, to obtain the third feature vector.

9. The method according to claim 7, wherein after the training the first image encoder and the second image encoder based on the first weight loss function, the method further comprises: updating, according to a parameter shared between the second image encoder and the third image encoder, a parameter of the first image encoder in a weighting manner.

10. A computer device comprising a processor and a memory, the memory having a computer program stored therein, the computer program being loaded and executed by the processor and causing the computer device to implement a method for training an image encoder including: obtaining a first sample tissue image and a plurality of second sample tissue images, the second sample tissue images being negative samples in contrastive learning; performing data enhancement on the first sample tissue image, to obtain a first image; and inputting the first image into a first image encoder, to obtain a first feature vector, the first image being a positive sample in the contrastive learning; performing data enhancement on the first sample tissue image, to obtain a second image; and inputting the second image into a second image encoder, to obtain a second feature vector, the second image being an anchor image in the contrastive learning, and the second image being different from the first image; inputting the plurality of second sample tissue images into the first image encoder, to obtain feature vectors respectively corresponding to the plurality of second sample tissue images; clustering the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers; and generating a plurality of weights based on similarity values between the plurality of clustering centers and the first feature vector; generating, based on the first feature vector and the second feature vector, a first sub-function configured for representing an error between the anchor image and the positive sample; based on the second feature vector and the feature vectors respectively corresponding to the plurality of second sample tissue images, generating, in combination with the plurality of weights, a second sub-function configured for representing an error between the anchor image and the negative sample; and generating a first weight loss function based on the first sub-function and the second sub-function; and training the first image encoder and the second image encoder based on the first weight loss function.

11. The computer device according to claim 10, wherein the clustering the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers comprises: when the first sample tissue image belongs to a first sample tissue image in a first training batch, clustering the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers corresponding to the first training batch; and when the first sample tissue image belongs to a first sample tissue image in an nth training batch, updating a plurality of clustering centers corresponding to an (n−1)th training batch to a plurality of clustering centers corresponding to the nth training batch, n being a positive integer greater than 1.

12. The computer device according to claim 11, wherein the updating a plurality of clustering centers corresponding to an (n−1)th training batch to a plurality of clustering centers corresponding to the nth training batch comprises: for a jth clustering center in the plurality of clustering centers in the (n−1)th training batch, updating the jth clustering center in the (n−1)th training batch based on a first sample tissue image of a jth category in the nth training batch, to obtain a jth clustering center corresponding to the nth training batch, j being a positive integer.

13. The computer device according to claim 10, wherein values of the weights are in a negative correlation with the similarity values between the clustering centers and the first feature vector; and for the jth clustering center in the plurality of clustering centers, feature vectors comprised in a category of the jth clustering center correspond to the same weight.

14. The computer device according to claim 10, wherein the inputting the second image into a second image encoder, to obtain a second feature vector comprises: inputting the second image into the second image encoder, to obtain a first intermediate feature vector; and inputting the first intermediate feature vector into a first multilayer perceptron (MLP), to obtain the second feature vector.

15. The computer device according to claim 10, wherein after the training the first image encoder and the second image encoder based on the first weight loss function, the method further comprises: updating, according to a parameter of the second image encoder, a parameter of the first image encoder in a weighting manner.

16. The computer device according to claim 10, wherein the method further comprises: performing data enhancement on the first sample tissue image, to obtain a third image; and inputting the third image into a third image encoder, to obtain a third feature vector, the third image being an anchor image in the contrastive learning, and the third image being different from the first image and the second image; generating, based on the first feature vector and the third feature vector, a third sub-function configured for representing an error between the anchor image and the positive sample; based on the third feature vector and the feature vectors respectively corresponding to the plurality of second sample tissue images, generating, in combination with the plurality of weights, a fourth sub-function configured for representing an error between the anchor image and the negative sample; and generating a second weight loss function based on the third sub-function and the fourth sub-function; and training the first image encoder and the third image encoder based on the second weight loss function.

17. A non-transitory computer-readable storage medium, having a computer program stored therein, the computer program being loaded and executed by a processor of a computer device and causing the computer device to implement a method for training an image encoder including: obtaining a first sample tissue image and a plurality of second sample tissue images, the second sample tissue images being negative samples in contrastive learning; performing data enhancement on the first sample tissue image, to obtain a first image; and inputting the first image into a first image encoder, to obtain a first feature vector, the first image being a positive sample in the contrastive learning; performing data enhancement on the first sample tissue image, to obtain a second image; and inputting the second image into a second image encoder, to obtain a second feature vector, the second image being an anchor image in the contrastive learning, and the second image being different from the first image; inputting the plurality of second sample tissue images into the first image encoder, to obtain feature vectors respectively corresponding to the plurality of second sample tissue images; clustering the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers; and generating a plurality of weights based on similarity values between the plurality of clustering centers and the first feature vector; generating, based on the first feature vector and the second feature vector, a first sub-function configured for representing an error between the anchor image and the positive sample; based on the second feature vector and the feature vectors respectively corresponding to the plurality of second sample tissue images, generating, in combination with the plurality of weights, a second sub-function configured for representing an error between the anchor image and the negative sample; and generating a first weight loss function based on the first sub-function and the second sub-function; and training the first image encoder and the second image encoder based on the first weight loss function.

18. The non-transitory computer-readable storage medium according to claim 17, wherein the inputting the second image into a second image encoder, to obtain a second feature vector comprises: inputting the second image into the second image encoder, to obtain a first intermediate feature vector; and inputting the first intermediate feature vector into a first multilayer perceptron (MLP), to obtain the second feature vector.

19. The non-transitory computer-readable storage medium according to claim 17, wherein after the training the first image encoder and the second image encoder based on the first weight loss function, the method further comprises: updating, according to a parameter of the second image encoder, a parameter of the first image encoder in a weighting manner.

20. The non-transitory computer-readable storage medium according to claim 17, wherein the method further comprises: performing data enhancement on the first sample tissue image, to obtain a third image; and inputting the third image into a third image encoder, to obtain a third feature vector, the third image being an anchor image in the contrastive learning, and the third image being different from the first image and the second image; generating, based on the first feature vector and the third feature vector, a third sub-function configured for representing an error between the anchor image and the positive sample; based on the third feature vector and the feature vectors respectively corresponding to the plurality of second sample tissue images, generating, in combination with the plurality of weights, a fourth sub-function configured for representing an error between the anchor image and the negative sample; and generating a second weight loss function based on the third sub-function and the fourth sub-function; and training the first image encoder and the third image encoder based on the second weight loss function.

Detailed Description

Complete technical specification and implementation details from the patent document.

This application is a continuation application of U.S. patent application Ser. No. 18/642,807, entitled “IMAGE ENCODER TRAINING METHOD AND APPARATUS, DEVICE, AND MEDIUM” filed on Apr. 22, 2024, which is a continuation application of PCT Patent Application No. PCT/CN2023/088875, entitled “IMAGE ENCODER TRAINING METHOD AND APPARATUS, DEVICE, AND MEDIUM” filed on Apr. 18, 2023, which claims priority to Chinese Patent Application No. 202210531184.9, entitled “IMAGE ENCODER TRAINING METHOD AND APPARATUS, DEVICE, AND MEDIUM” filed on May 16, 2022, all of which are incorporated herein by reference in their entirety.

The application relates to U.S. patent application Ser. No. 18/642,802, entitled “IMAGE ENCODER TRAINING METHOD AND APPARATUS, DEVICE, AND MEDIUM” filed on Apr. 22, 2024, which is incorporated herein by reference in its entirety.

This application relates to the field of artificial intelligence, and in particular, to an image encoder training method and apparatus, a device, and a medium.

In the medical field, there is a scenario in which a whole slide image (WSI) is used as a query to search for similar whole slide images. Each whole slide image (a large image) includes a large quantity of tissue pathological images (small images).

In the related art, a small image with the best representative capability in the large image is used to represent an entire large image, a database is searched according to a feature vector of the small image with the best representative capability for a target small image most similar to the small image, and a large image corresponding to the target small image is used as a final search result. In the foregoing process, an image encoder is required for extracting the feature vector of the small image. In the related art, when the image encoder is trained, training is performed through contrastive learning. Contrastive learning aims to learn a common feature between an anchor image and a positive sample, and distinguish different features between the anchor image and a negative sample (which is often referred to as pulling the anchor image and the positive sample close to each other, and pulling the anchor image and the negative sample far away from each other).
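For orientation, the pulling and pushing described above is often expressed as an InfoNCE-style loss. The source does not name the specific loss used in the related art, so the formulation below, the temperature value, and the function names are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Unweighted contrastive loss for one anchor (assumed InfoNCE form).

    The loss is small when the anchor is similar to the positive sample and
    dissimilar to every negative sample; all negatives count equally.
    """
    q = normalize(anchor)
    pos = math.exp(dot(q, normalize(positive)) / temperature)
    neg = sum(math.exp(dot(q, normalize(n)) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

Because every negative sample contributes with the same strength, a negative that is in fact similar to the anchor is pushed away as hard as a genuinely different one.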

In the related art, when the image encoder is trained through contrastive learning, for an image X, an image X1 and an image X2 obtained by separately performing data enhancement on the image X twice are used as a pair of positive samples, and the image X and an image Y are used as a pair of negative samples. However, this assumption about positive and negative samples is inappropriate in some special scenarios. In one scenario, when a tissue region to which a small image selected from one WSI belongs is the same as a tissue region to which a small image selected from another WSI belongs, the two small images are considered as a pair of negative samples. In another scenario, when two small images whose locations are adjacent are selected from the same WSI, the two small images are also considered as a pair of negative samples. Clearly, the two small images selected in each of the two scenarios should form a positive sample pair, and during training of the image encoder in the related art, such positive samples may be mistakenly pushed apart. Due to wrong selection of positive and negative sample pairs, a training direction of the image encoder may be wrong. In this case, the image encoder trained on wrong positive and negative samples is used to extract features from an image, and consequently, precision of an extracted image feature is relatively low, which is not conducive to a downstream search task.

This application provides an image encoder training method and apparatus, a device, and a medium, which can improve precision of an image feature extracted by an image encoder. The technical solutions are as follows.

According to one aspect of this application, a method for searching for a whole slide image is performed by a computer device, the method including: cropping the whole slide image into a plurality of tissue images; generating, through an image encoder, image feature vectors respectively corresponding to the plurality of tissue images; clustering the image feature vectors respectively corresponding to the plurality of tissue images, to determine at least one key image from the plurality of tissue images; querying, based on image feature vectors respectively corresponding to the at least one key image, a database to obtain at least one target image package corresponding to the at least one key image; and determining a whole slide image to which at least one candidate tissue image included in the at least one target image package respectively belongs as a final search result.

According to another aspect of this application, a computer device is provided, including a processor and a memory, the memory having a computer program stored therein, the computer program being loaded and executed by the processor and causing the computer device to implement the method for searching for a whole slide image.

According to yet another aspect of this application, a non-transitory computer-readable storage medium is provided, having a computer program stored therein, the computer program being loaded and executed by a processor of a computer device and causing the computer device to implement the method for searching for a whole slide image.

The technical solutions provided in the embodiments of this application produce at least the following beneficial effects:

Weights are assigned to negative samples identified in the related art, and a “negative degree” of the negative samples is further distinguished between the negative samples, so that a loss value used in contrastive learning (also referred to as a contrastive learning paradigm) can more accurately pull anchor images and the negative samples away from each other, and the impact of potential false negative samples is reduced, thereby better training an image encoder. In this way, the image encoder obtained through training can better distinguish different features between the anchor images and the negative samples, thereby improving precision of an image feature extracted by the image encoder, and further improving accuracy of a result of a downstream search task.

To make objectives, technical solutions, and advantages of this application clearer, the following further describes implementations of this application in detail with reference to the accompanying drawings.

First, terms involved in the embodiments of this application are briefly introduced:

Whole slide image (WSI): The WSI is a visual digital image obtained by using a digital scanner to scan a conventional slide image to acquire a high-resolution image, and then using a computer to seamlessly splice acquired fragmented images. Through specific software, the WSI may be zoomed in and out in any ratio, movably browsed in any direction, and the like. A data volume of one WSI generally ranges from hundreds of megabytes (MB) to even several gigabytes (GB). In this application, the WSI is usually referred to as a large image for short. In the related art, processing on the WSI focuses on selection and analysis on a local tissue region in the WSI. In this application, the local tissue region in the WSI is generally referred to as a small image.

Comparison learning (also referred to as contrastive learning): Referring to FIG. 1, deep learning may be classified into supervised learning and unsupervised learning according to whether data labeling is performed. Massive data needs to be labeled in the supervised learning, while autonomous discovery of a latent structure is allowed in the unsupervised learning. The unsupervised learning may be further classified into generative learning and comparison learning. The generative learning is represented by a method such as a self-encoder (for example, a GAN, a VAE, and the like), and data is generated by using data, so that the data is similar to training data on the whole or in terms of high-level semantics. For example, a plurality of images of horses in a training set are used to learn features of horses through a generative model, so that a new image of horses may be generated.

The comparison learning focuses on learning a common feature between samples of the same type and distinguishing different features between samples of different types. In the contrastive learning, the encoder is usually trained through a triplet (anchor image, negative sample, and positive sample). As shown in FIG. 1, a circle A is an anchor image in the contrastive learning, a circle A1 is a positive sample in the contrastive learning, and a square B is a negative sample in the contrastive learning. The contrastive learning aims to use an encoder obtained through training to pull the circle A and the circle A1 close to each other and push the circle A and the square B far away from each other. In other words, the encoder obtained through training supports similar encoding of data of the same type, and makes encoding results of data of different types as different as possible. In this application, a method for training an image encoder through contrastive learning is described.

Next, an implementation environment of this application is described below.

FIG. 2 is a schematic diagram of a computer system according to an exemplary embodiment. As shown in FIG. 2, a training device 21 of an image encoder is configured to train the image encoder. The training device 21 of the image encoder sends the image encoder to a use device 22 of the image encoder, and the use device 22 of the image encoder searches for a whole slide image through the image encoder.

Image encoder training phase: As shown in FIG. 2, the image encoder is trained through contrastive learning, and a distance between an anchor image 210 and a positive sample is less than a distance between the anchor image 210 and a negative sample, where in FIG. 2, the positive sample includes a positive sample class cluster 211 and a positive sample class cluster 212 that are obtained through clustering, the negative sample includes a negative sample class cluster 213 and a negative sample class cluster 214 that are obtained through clustering, a distance between a clustering center of the positive sample class cluster 211 and the anchor image 210 is L1, a distance between a clustering center of the positive sample class cluster 212 and the anchor image 210 is L2, a distance between a clustering center of the negative sample class cluster 213 and the anchor image 210 is L3, and a distance between a clustering center of the negative sample class cluster 214 and the anchor image 210 is L4.

In this application, a plurality of positive samples are clustered to obtain a plurality of positive sample class clusters, a distance between the anchor image and a clustering center of a class cluster most similar to the anchor image is set to L2, a distance between the anchor image and another positive sample in the plurality of positive samples is set to L1 (it is to be noted that L2 shown in FIG. 2 is merely the distance between the clustering center of the positive sample class cluster 212 and the anchor image, and a distance between another positive sample of the positive sample class cluster 212 and the anchor image is L1), and the anchor image and the plurality of positive samples are pulled close to each other according to redefined distances between the plurality of positive samples and the anchor image. In the related art, it is considered that distances between all positive samples and the anchor image are the same.

In this application, a plurality of negative samples are clustered to obtain a plurality of negative sample class clusters, a weight is assigned to each class cluster based on a similarity between a clustering center of each class cluster and the anchor image, and the anchor image and the negative samples are pushed away from each other according to the weight of the class cluster. L3 and L4 shown in FIG. 2 are weighted distances. In the related art, it is considered that distances between all negative samples and the anchor image are the same.
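The clustering and weighting of negative samples can be sketched as follows. This is a minimal sketch under stated assumptions: a small k-means with cosine similarity and naive initialization stands in for the unspecified clustering algorithm, and the weight form `1 - similarity` is one hypothetical choice that satisfies the negative correlation stated in the claims. All function names are introduced for this example.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def assign(vectors, centers):
    """Index of the most similar center for each vector."""
    return [max(range(len(centers)), key=lambda j: cosine(v, centers[j]))
            for v in vectors]

def kmeans(vectors, k, iters=10):
    """Tiny k-means; centers start at the first k vectors (naive choice)."""
    centers = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        labels = assign(vectors, centers)
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(vectors, centers)

def negative_weights(negatives, anchor, k):
    """One weight per negative sample; samples in one cluster share a weight.

    Assumed form: weight = 1 - max(0, cosine(center, anchor)), so clusters
    that resemble the anchor (likely false negatives) get small weights.
    """
    centers, labels = kmeans(negatives, k)
    cluster_w = [1.0 - max(0.0, cosine(c, anchor)) for c in centers]
    return [cluster_w[l] for l in labels], labels
```

Negative samples whose cluster lies close to the anchor receive a weight near zero, so they contribute little when the anchor and the negative samples are pushed apart.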

Image encoder use phase: As shown in FIG. 2, the image encoder use phase in this application is a search process of a whole slide image.

First, one WSI is cropped into a plurality of tissue images (small images). Then, the plurality of tissue images are clustered to obtain a plurality of key images, where the plurality of key images are jointly configured for representing one WSI. Next, for one key image (a small image A), the small image A is inputted into the image encoder, to obtain an image feature vector of the small image A. Finally, a database is queried according to the image feature vector of the small image A to obtain small images A1 to AN, WSIs corresponding to the small images A1 to AN are used as search results, and the plurality of key images are used as queried images to determine the WSI from the database.
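The search process above can be sketched in a few lines. The sketch assumes that the feature vectors of the tissue images have already been produced by the image encoder, that key images are the tissue images nearest to k-means centers, and that the database is a plain list of `(wsi_id, feature)` pairs searched by Euclidean distance; these choices and all names are hypothetical.

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def key_indices(features, k, iters=10):
    """Cluster the features; return the index of the tile nearest each center."""
    centers = [list(f) for f in features[:k]]  # naive initialization
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in features:
            groups[min(range(k), key=lambda j: dist(f, centers[j]))].append(f)
        centers = [[sum(c) / len(g) for c in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return [min(range(len(features)), key=lambda i: dist(features[i], c))
            for c in centers]

def search_wsi(tile_features, database, k=2, top_n=1):
    """Query the database with each key image; collect the matching WSI ids."""
    results = []
    for i in key_indices(tile_features, k):
        ranked = sorted(database, key=lambda item: dist(tile_features[i], item[1]))
        for wsi_id, _ in ranked[:top_n]:
            if wsi_id not in results:  # report each WSI once
                results.append(wsi_id)
    return results
```

A production system would replace the linear scan with an indexed nearest-neighbor search, but the order of steps (cluster, pick key images, query, map small images back to WSIs) is the same.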

In some embodiments, the training device 21 of the image encoder and the use device 22 of the image encoder may be a computer device having a machine learning capability. For example, the computer device may be a terminal or a server.

In some embodiments, the training device 21 of the image encoder and the use device 22 of the image encoder may be the same computer device, or the training device 21 of the image encoder and the use device 22 of the image encoder may be different computer devices. In addition, when the training device 21 of the image encoder and the use device 22 of the image encoder are different devices, the training device 21 of the image encoder and the use device 22 of the image encoder may be devices of the same type. For example, both the training device 21 of the image encoder and the use device 22 of the image encoder may be servers. Alternatively, the training device 21 of the image encoder and the use device 22 of the image encoder may be devices of different types. The server may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides a basic cloud computing service such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a content delivery network (CDN), big data, and an artificial intelligence platform. The terminal may be a smartphone, an in-vehicle terminal, a smart television, a wearable device, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smartwatch, or the like, but is not limited thereto. The terminal and the server may be directly or indirectly connected in a wired or wireless communication manner. This is not limited in this embodiment of this application.

Detailed descriptions are provided below in the following order:
1: image encoder training phase;
1-1: related content of pulling an anchor image and a negative sample away from each other;
1-1-1: related content of a first weight loss function;
1-1-2: related content of a second weight loss function;
1-2: related content of pulling the anchor image and a positive sample close to each other;
1-2-1: related content of a first group loss function;
1-2-2: related content of a second group loss function;
1-3: related content of a complete loss function;
2: image encoder use phase (search process of a whole slide image).

FIG. 3 shows an image encoder training framework according to an exemplary embodiment. Descriptions are provided by using an example in which the framework is applied to the training device 21 of the image encoder shown in FIG. 2.

In FIG. 3, feature vectors respectively corresponding to a plurality of second sample tissue images 301 are generated based on the plurality of second sample tissue images 301 through a first image encoder 305, that is, a plurality of feature vectors 307 are generated; data enhancement is performed on a first sample tissue image 302 to obtain a first image 303, and a first feature vector 308 is generated based on the first image 303 through the first image encoder 305; data enhancement is performed on the first sample tissue image 302 to obtain a second image 304, and a second feature vector 309 is generated based on the second image 304 through the second image encoder 306; a first sub-function 310 is generated based on the first feature vector 308 and the second feature vector 309; a second sub-function 311 is generated based on the plurality of feature vectors 307 and the second feature vector 309; and a first weight loss function 312 is generated based on the first sub-function 310 and the second sub-function 311.

The first weight loss function 312 is configured for pulling the anchor image and the negative sample away from each other.

FIG. 4 shows a flowchart of an image encoder training method according to an exemplary embodiment. Descriptions are provided by using an example in which the method is applied to the image encoder training framework shown in FIG. 3. The method is performed by a training device of an image encoder in the training framework. The method includes at least one of step 401 to step 408.

Step 401: Obtain a first sample tissue image and a plurality of second sample tissue images, the second sample tissue images being negative samples in contrastive learning.

The first sample tissue image and the second sample tissue image are both images configured for training the image encoder in this application. The first sample tissue image and the second sample tissue image are different small images. That is, they are not a small image $X_1$ and a small image $X_2$ obtained by performing data enhancement on the same small image X, but are respectively a small image X and a small image Y, where the small image X and the small image Y are small images in different large images, or are different small images in the same large image.

In some embodiments, the second sample tissue image is used as a negative sample in contrastive learning, and the contrastive learning aims to pull the anchor image and the positive sample close to each other and pull the anchor image and the negative sample away from each other.

With reference to FIG. 5, an image X is the first sample tissue image, and a sub-container of the negative sample is a container that contains the feature vectors respectively corresponding to the plurality of second sample tissue images.

Step 402: Perform data enhancement on the first sample tissue image, to obtain a first image; and input the first image into a first image encoder, to obtain a first feature vector, the first image being a positive sample in the contrastive learning.

Data enhancement is also referred to as data augmentation and aims to generate more data from limited data without substantively adding new data. In some embodiments, a data enhancement method includes but is not limited to at least one of the following:
rotation/reflection transformation: randomly rotating an image by an angle, to change an orientation of content of the image;
flipping transformation: flipping the image horizontally or vertically;
zoom transformation: zooming in or out the image in a ratio;
translation transformation: translating the image within the image plane, where a translation range and a translation step are specified in a random or manually defined manner, and translation is performed horizontally or vertically to change a location of the content of the image;
scale transformation: zooming in or out the image according to a specified scale factor; or filtering, with reference to a SIFT feature extraction idea and using the specified scale factor, the image to construct a scale space, to change a size or a blur degree of the content of the image;
contrast ratio transformation: in an HSV color space of the image, changing the saturation component S and the luminance component V while keeping the hue H unchanged, by performing an exponential operation on the S and V components of each pixel (where an exponential factor is in a range of 0.25 to 4), to increase illumination changes;
noise disturbance: performing random disturbance on RGB of each pixel of the image, where commonly used noise modes include salt-and-pepper noise and Gaussian noise;
color change: adding random disturbance to an image channel; and
inputting the image and randomly selecting a region for blackening.
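As a minimal sketch of how two different enhanced views of one small image might be produced, the following uses three of the listed operations (flipping, Gaussian noise disturbance, and blackening a region) on a grayscale image stored as a nested list. The function names, the noise level, and the 2x2 blackened region are illustrative assumptions, not values taken from this application.

```python
import random

def flip_horizontal(image):
    """Flipping transformation: mirror each row left to right."""
    return [list(reversed(row)) for row in image]

def add_gaussian_noise(image, sigma, rng):
    """Noise disturbance: add Gaussian noise to every pixel, clamped to [0, 255]."""
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]

def blacken_region(image, top, left, height, width):
    """Set a rectangular region to 0; the input image is not modified."""
    out = [list(row) for row in image]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = 0
    return out

def two_views(image, rng):
    """Produce two different enhanced views of the same small image."""
    view_a = flip_horizontal(image)
    top, left = rng.randrange(len(image)), rng.randrange(len(image[0]))
    view_b = blacken_region(add_gaussian_noise(image, 5.0, rng), top, left, 2, 2)
    return view_a, view_b

image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
view_a, view_b = two_views(image, random.Random(0))
```

Because the two views come from different operations, they differ from each other while still depicting the same underlying small image, which is the property the contrastive learning steps rely on.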

In some embodiments, data enhancement is performed on the first sample tissue image to obtain a first image, and the first image is used as a positive sample in the contrastive learning.

With reference to FIG. 5, data enhancement is performed on the image X to obtain an image $X_k$, and then the image $X_k$ is converted to a high-level semantic space $\mathbb{R}^d$ through an encoder f, so that a first feature vector $f_k$ is obtained.

Step 403: Perform data enhancement on the first sample tissue image, to obtain a second image; and input the second image into a second image encoder, to obtain a second feature vector, the second image being an anchor image in the contrastive learning, and the second image being different from the first image.

In some embodiments, data enhancement is performed on the first sample tissue image to obtain a second image, and the second image is used as an anchor image in the contrastive learning. In some embodiments, the first image and the second image are different images obtained in different data enhancement manners.

In some embodiments, the second image is inputted into the second image encoder, to obtain a first intermediate feature vector; and the first intermediate feature vector is inputted into a first MLP (multilayer perceptron), to obtain the second feature vector. The first MLP plays a transition role and is configured to improve an expression capability of the second image.

With reference to FIG. 5, data enhancement is performed on the image X to obtain an image $X_p$, the image $X_p$ is converted to the high-level semantic space $\mathbb{R}^d$ through an encoder h, that is, a first intermediate feature vector $h_p$ is obtained, and the first intermediate feature vector $h_p$ is inputted into the first MLP, to obtain a second feature vector $g_{p2}$.

Step 404: Input the plurality of second sample tissue images into the first image encoder, to obtain feature vectors respectively corresponding to the plurality of second sample tissue images; cluster the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers; and generate a plurality of weights based on similarity values between the plurality of clustering centers and the first feature vector.

In some embodiments, the second sample tissue image is a negative sample in the contrastive learning, the feature vectors respectively corresponding to the plurality of second sample tissue images are clustered, and weights are respectively assigned to a plurality of feature vectors according to similarity values between the plurality of clustering centers and the first feature vector.

With reference to FIG. 5, the sub-container of the negative sample contains the feature vectors respectively corresponding to the plurality of second sample tissue images, that is, the plurality of feature vectors. The plurality of second sample tissue images are inputted into the encoder f, and the resulting feature vectors are stored in a storage queue by performing a push operation. K-means clustering is performed on the storage queue to cluster the plurality of feature vectors in the queue into Q categories, to further construct Q sub-queues, where a clustering center of each sub-queue is represented as $c_j$ (j=1, . . . , Q), Q being a positive integer. Then, a similarity score between each clustering center and the first feature vector $f_k$ is calculated, to determine potential false negative samples. Finally, a weight $\phi(f_k^-)$ of each feature vector in the storage queue is obtained, which is calculated as follows:

δ( ) is a discriminant function. If its two inputs are consistent, δ( ) outputs 1; otherwise, δ( ) outputs 0. In some embodiments, δ( ) is configured for determining whether a j-th clustering center $c_j$ is similar to $f_k$, where w is an assigned weight, and w ∈ [0,1]. This application does not limit a manner for calculating a similarity, which includes but is not limited to calculating a cosine similarity, a Euclidean distance, and the like.
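The clustering of the storage queue into Q sub-queues can be sketched with a small k-means routine. This is a minimal illustration only: the initial centers are simply the first Q vectors of the queue, the iteration count is fixed, and the names `kmeans`, `queue`, and `sub_queues` are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, q, iterations=10):
    """Cluster vectors into q categories; returns (centers, labels).

    The initial centers are the first q vectors, an arbitrary deterministic
    choice made only for this sketch.
    """
    centers = [list(v) for v in vectors[:q]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        labels = [min(range(q), key=lambda j: squared_distance(v, centers[j]))
                  for v in vectors]
        # Update step: each center becomes the mean of its members.
        for j in range(q):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a category is empty
                centers[j] = mean_vector(members)
    return centers, labels

queue = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [5.1, 4.9], [0.1, 0.3], [4.8, 5.2]]
centers, labels = kmeans(queue, q=2)
sub_queues = [[v for v, lab in zip(queue, labels) if lab == j] for j in range(2)]
```

Each entry of `sub_queues` plays the role of one sub-queue, and each entry of `centers` plays the role of one clustering center $c_j$.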

In some embodiments, values of the weights respectively corresponding to the plurality of clustering centers are in a negative correlation with the similarity values between the clustering centers and the first feature vector; and for the j-th clustering center in the plurality of clustering centers, feature vectors included in a category of the j-th clustering center correspond to the same weight. In formula (1), smaller weights w are assigned to the feature vectors corresponding to categories of clustering centers more similar to $f_k$, and greater weights w are assigned to the feature vectors corresponding to categories of clustering centers less similar to $f_k$.

For example, the plurality of feature vectors corresponding to the plurality of second sample tissue images are clustered into 3 categories, and the clustering centers are $c_1$, $c_2$, and $c_3$. A category of the clustering center $c_1$ includes a feature vector 1, a feature vector 2, and a feature vector 3. A category of the clustering center $c_2$ includes a feature vector 4, a feature vector 5, and a feature vector 6. A category of the clustering center $c_3$ includes a feature vector 7, a feature vector 8, and a feature vector 9.

If similarity values between the clustering centers $c_1$, $c_2$, and $c_3$ and $f_k$ are sorted in descending order, weights corresponding to the categories of the clustering centers $c_1$, $c_2$, and $c_3$ are sorted in ascending order. In addition, the feature vectors 1, 2, and 3 correspond to the same weight, the feature vectors 4, 5, and 6 correspond to the same weight, and the feature vectors 7, 8, and 9 correspond to the same weight.
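The three-category example can be sketched as follows. The text only requires that weights decrease as similarity increases and that all feature vectors of a category share one weight; the specific rule used here (cosine similarity, then weights spaced evenly by rank between hypothetical bounds `w_min` and `w_max`) is an assumption made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_weights(centers, f_k, w_min=0.2, w_max=1.0):
    """One weight per clustering center, smaller for centers more similar to f_k."""
    sims = [cosine(c, f_k) for c in centers]
    # Rank centers from most similar (rank 0) to least similar.
    order = sorted(range(len(centers)), key=lambda j: sims[j], reverse=True)
    steps = max(len(centers) - 1, 1)
    weights = [0.0] * len(centers)
    for rank, j in enumerate(order):
        weights[j] = w_min + (w_max - w_min) * rank / steps
    return weights

def vector_weights(labels, weights):
    """Every feature vector inherits the weight of its category."""
    return [weights[lab] for lab in labels]

centers = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]  # c_1, c_2, c_3
f_k = [1.0, 0.1]                                # first feature vector
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]            # categories of feature vectors 1..9
w = cluster_weights(centers, f_k)
per_vector = vector_weights(labels, w)
```

Here $c_1$ is the center most similar to $f_k$, so feature vectors 1 to 3 receive the smallest weight, which limits the influence of potential false negative samples.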

In some embodiments, when the first sample tissue image belongs to a first sample tissue image in a first training batch, the feature vectors respectively corresponding to the plurality of second sample tissue images are clustered, to obtain a plurality of clustering centers corresponding to the first training batch.

In some embodiments, when the first sample tissue image belongs to a first sample tissue image in an n-th training batch, a plurality of clustering centers corresponding to an (n−1)-th training batch are updated to a plurality of clustering centers corresponding to the n-th training batch, n being a positive integer greater than 1.

In some embodiments, for a j-th clustering center in the plurality of clustering centers in the (n−1)-th training batch, the j-th clustering center in the (n−1)-th training batch is updated based on a first sample tissue image of a j-th category in the n-th training batch, to obtain a j-th clustering center corresponding to the n-th training batch, i and j being positive integers.

With reference to FIG. 5, a j-th clustering center $c_{j*}$ in the n-th training batch is updated according to a j-th clustering center $c_j$ in the (n−1)-th training batch, and a formula is expressed as follows:

$c_{j*}$ represents the updated j-th clustering center in the n-th training batch, and $m_c$ represents a weight used for the update, where $m_c \in [0,1]$. $B_j$ represents a feature set of the j-th category in a plurality of first feature vectors (a plurality of $f_k$) of a plurality of first sample tissue images (a plurality of images X) in the n-th training batch.

The per-vector term of the formula represents an i-th feature vector in the plurality of first feature vectors (the plurality of $f_k$) of the j-th category in the n-th training batch.

The averaging term of the formula is configured for calculating a mean value of features of the plurality of first feature vectors (the plurality of $f_k$) of the j-th category in the n-th training batch.
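A minimal sketch of the clustering center update follows. It assumes the update is a weighted blend of the previous center and the mean of the feature set of the j-th category, with the update weight applied to the previous center. That form is one reading consistent with the symbol definitions above, not a confirmed reproduction of formula (2). The name `update_center` is hypothetical.

```python
def update_center(c_prev, batch_features, m_c=0.9):
    """Blend the previous center with the mean of the current batch's features.

    c_prev: j-th clustering center from the (n-1)-th training batch.
    batch_features: first feature vectors of the j-th category in the n-th batch.
    m_c: update weight in [0, 1]; 1.0 keeps the previous center unchanged.
    """
    if not batch_features:
        return list(c_prev)  # no feature of this category in the batch
    n = len(batch_features)
    mean = [sum(col) / n for col in zip(*batch_features)]
    return [m_c * c + (1.0 - m_c) * mu for c, mu in zip(c_prev, mean)]

c_j = [1.0, 0.0]
B_j = [[0.0, 1.0], [0.0, 3.0]]
c_j_star = update_center(c_j, B_j, m_c=0.5)
```

With a value of the update weight close to 1, each batch moves the center only slightly, so the correspondence between centers in consecutive batches is preserved.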

In some embodiments, during each training period, all clustering centers may be updated by re-clustering all negative sample feature vectors in a repository.

It may be understood that an objective of updating the plurality of clustering centers in the (n−1)-th training batch to the plurality of clustering centers in the n-th training batch is to prevent negative sample feature vectors in a negative sample container from drifting farther away from the inputted first sample tissue image.

With continuous training of the image encoder, the image encoder becomes better at pulling the anchor image and the negative sample away from each other. Assume that the image encoder pulls an image X in a previous training batch away from the negative sample to a first distance, pulls an image X in a current training batch away from the negative sample to a second distance greater than the first distance, and pulls an image X in a next training batch away from the negative sample to a third distance greater than the second distance. If the negative sample image (that is, the clustering center) is not updated, the increase from the second distance to the third distance may be less than the increase from the first distance to the second distance, and the training effect of the image encoder may gradually become worse. If the negative sample image is updated (that is, the clustering center is updated), the distance between the updated negative sample image and the image X is properly reduced, which balances the gradually improving pulling-away effect of the image encoder, so that the image encoder can continue to benefit from training for a longer time. In this way, the image encoder obtained through final training has a better capability to extract image features, making the extracted image features more accurate. Moreover, the clustering center is determined according to a category of the sample tissue image, which helps ensure a correspondence between clustering centers in consecutive batches and avoids a correspondence error, thereby improving accuracy of determining the clustering center. In addition, the feature vectors are clustered, and all feature vectors of a category are assigned the same weight, which helps classify the feature vectors and improves the training effect through the weighting.

Step 405: Generate, based on the first feature vector and the second feature vector, a first sub-function configured for representing an error between the anchor image and the positive sample.

In some embodiments, a first sub-function is generated according to the first feature vector and the second feature vector, and the first sub-function is configured for representing an error between the anchor image and the positive sample.

With reference to FIG. 5, the first sub-function may be represented as $\exp(g_{p2} \cdot f_k / \tau)$, and it can be seen that the first sub-function includes the first feature vector $f_k$ and the second feature vector $g_{p2}$.

Step 406: Based on the second feature vector and the plurality of feature vectors, generate, in combination with a plurality of weights, a second sub-function configured for representing an error between the anchor image and the negative sample.

In some embodiments, according to the second feature vector and the plurality of feature vectors corresponding to the plurality of second sample tissue images, a second sub-function is generated in combination with a plurality of weights, where the second sub-function is configured for representing an error between the anchor image and the negative sample.

With reference to FIG. 5, the second sub-function may be represented by an expression over the negative sample container, in which one symbol represents a weight of an i-th negative sample feature vector (that is, a feature vector of a second sample tissue image), another symbol represents the i-th negative sample feature vector, and $g_{p2}$ represents a feature vector (that is, the second feature vector) of the anchor image. The negative sample container includes K negative sample feature vectors in total, K being a positive integer.

Step 407: Generate a first weight loss function based on the first sub-function and the second sub-function.

With reference to FIG. 5, the first weight loss function may be expressed as follows:

WeightedNCE1 represents the first weight loss function, and log represents a logarithm operation. In some embodiments, weighted summation is performed on the first sub-function and the second sub-function, to obtain the first weight loss function. In some embodiments, weight values respectively corresponding to the first sub-function and the second sub-function are not limited in this application. In some embodiments, the weight value is a preset hyperparameter.
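One way to picture how the two sub-functions and the logarithm could fit together is the weighted InfoNCE-style sketch below. The first sub-function follows the exponential form given for FIG. 5. Two things are assumptions borrowed from standard contrastive losses rather than confirmed forms of the formulas in this application: the second sub-function as a weighted sum of the same exponential over the K negative samples, and the way the two are combined inside the logarithm. The function name `weighted_nce` and the temperature default are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def weighted_nce(anchor, positive, negatives, weights, tau=0.07):
    """Weighted contrastive loss for one anchor (assumed form).

    anchor: second feature vector (anchor image).
    positive: first feature vector (positive sample).
    negatives: K negative sample feature vectors.
    weights: one weight per negative sample feature vector.
    """
    # First sub-function: anchor versus positive sample.
    first = math.exp(dot(anchor, positive) / tau)
    # Second sub-function: weighted anchor versus negative samples.
    second = sum(w * math.exp(dot(anchor, neg) / tau)
                 for w, neg in zip(weights, negatives))
    return -math.log(first / (first + second))

anchor = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.9, 0.1], [0.0, 1.0]]
loss_uniform = weighted_nce(anchor, positive, negatives, [1.0, 1.0])
loss_weighted = weighted_nce(anchor, positive, negatives, [0.1, 1.0])
```

In this toy case the first negative sample is very close to the anchor, so it behaves like a potential false negative; giving it a small weight lowers its contribution to the loss.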

Step 408: Train the first image encoder and the second image encoder based on the first weight loss function.

The first image encoder and the second image encoder are trained based on the first weight loss function. A manner for training the first image encoder and the second image encoder by using the first weight loss function is not limited in the embodiments of this application. In some embodiments, parameters in the first image encoder and the second image encoder are updated by using the first weight loss function in a forward gradient update or reverse gradient update manner.

In some embodiments, after step 408, step 409 (not shown in the figure) is further included.

Step 409: Update the first image encoder based on the second image encoder.

The first image encoder is updated based on the second image encoder. In some embodiments, a parameter of the first image encoder is updated in a weighting manner according to a parameter of the second image encoder.

For example, a formula for updating the parameter of the first image encoder is expressed as follows:

θ′ on a left side of formula (4) represents the parameter of the first image encoder after the update, θ′ on a right side of formula (4) represents the parameter of the first image encoder before the update, and θ represents the parameter of the second image encoder, m being a constant. In some embodiments, m is equal to 0.99.
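The weighted parameter update can be sketched as below, assuming the common momentum form in which the updated parameter equals m times the previous parameter of the first image encoder plus (1 − m) times the parameter of the second image encoder. This form is an assumption consistent with the description of θ′, θ, and m above. The function name `momentum_update` and the flat parameter lists are hypothetical.

```python
def momentum_update(theta_first, theta_second, m=0.99):
    """Return updated parameters of the first image encoder.

    theta_first: parameters of the first image encoder before the update.
    theta_second: parameters of the second image encoder.
    m: constant weight; values near 1 make the first encoder change slowly.
    """
    return [m * a + (1.0 - m) * b for a, b in zip(theta_first, theta_second)]

theta_first = [0.0, 1.0]
theta_second = [1.0, 1.0]
updated = momentum_update(theta_first, theta_second)
```

With m equal to 0.99, each update moves the first image encoder only one percent of the way toward the second image encoder.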

Based on the above, weights are assigned to negative samples identified in the related art, and a “negative degree” of the negative samples is further distinguished between the negative samples, so that a loss value used in contrastive learning (also referred to as a contrastive learning paradigm) can more accurately pull anchor images and the negative samples away from each other, and the impact of potential false negative samples is reduced, thereby better training an image encoder. In this way, the image encoder obtained through training can better distinguish different features between the anchor images and the negative samples, thereby improving precision of an image feature extracted by the image encoder, and further improving accuracy of a result of a downstream search task.

Further, in addition to jointly training the first image encoder and the second image encoder, the second image encoder further updates the parameter of the first image encoder, which helps increase a convergence speed of the loss function and improves training efficiency of the image encoder. Moreover, in addition to using the second image encoder to train the first image encoder, parameters shared between the second image encoder and the third image encoder are also used to train the first image encoder, which trains the first image encoder from different dimensions while enriching training manners of the image encoder. In this way, image features extracted through the trained first image encoder can be more accurate.

FIG. 3 and FIG. 4 show that the first image encoder is trained through one sample triple (anchor image, positive sample, negative sample). In another embodiment, the first image encoder may also be trained through a plurality of sample triples. The following describes training on the first image encoder through two sample triples (anchor image 1, positive sample, negative sample) and (anchor image 2, positive sample, negative sample). The anchor image 1 and the anchor image 2 are images obtained by separately performing data enhancement on the same small image. A quantity of sample triples specifically constructed is not limited in this application.

FIG. 6 shows an image encoder training framework according to an exemplary embodiment. Descriptions are provided by using an example in which the framework is applied to the training device 21 of the image encoder shown in FIG. 1.

In FIG. 6, a plurality of feature vectors 307 are generated based on a plurality of second sample tissue images 301 through a first image encoder 305; data enhancement is performed on a first sample tissue image 302 to obtain a first image 303, and a first feature vector 308 is generated based on the first image 303 through the first image encoder 305; data enhancement is performed on the first sample tissue image 302 to obtain a second image 304, and a second feature vector 309 is generated based on the second image 304 through the second image encoder 306; a first sub-function 310 is generated based on the first feature vector 308 and the second feature vector 309; a second sub-function 311 is generated based on the plurality of feature vectors 307 and the second feature vector 309; and a first weight loss function 312 is generated based on the first sub-function 310 and the second sub-function 311.

Different from the training framework shown in FIG. 3, in FIG. 6, data enhancement is performed on the first sample tissue image 302 to obtain a third image 313, and a third feature vector 315 is obtained based on the third image 313 through the third image encoder 314; a third sub-function 316 is generated based on the third feature vector 315 and the first feature vector 308; a fourth sub-function 317 is generated based on the third feature vector 315 and the plurality of feature vectors 307; and a second weight loss function 318 is generated based on the third sub-function 316 and the fourth sub-function 317.

The second weight loss function 318 is configured for pulling the anchor image and the negative sample away from each other.

Based on the image encoder training method shown in FIG. 4, FIG. 7 further provides step 410 to step 414 based on the method steps in FIG. 4, and descriptions are provided by using an example in which the method in FIG. 7 is applied to the image encoder training framework shown in FIG. 6. The method is performed by a training device of an image encoder in the training framework. The method includes the following steps:

Step 410: Perform data enhancement on the first sample tissue image, to obtain a third image; and input the third image into a third image encoder, to obtain a third feature vector, the third image being an anchor image in the contrastive learning.

In some embodiments, data enhancement is performed on the first sample tissue image to obtain a third image, and the third image is used as an anchor image in the contrastive learning. In some embodiments, the first image, the second image, and the third image are different images obtained in different data enhancement manners.

In some embodiments, the third image is inputted into the third image encoder, to obtain a second intermediate feature vector; and the second intermediate feature vector is inputted into a second MLP, to obtain the third feature vector. The second MLP plays a transition role and is configured to improve an expression capability of the third image.

With reference to FIG. 5, data enhancement is performed on the image X to obtain an image $X_q$, the image $X_q$ is converted to the high-level semantic space $\mathbb{R}^d$ through the encoder h, that is, a second intermediate feature vector $h_q$ is obtained, and the second intermediate feature vector $h_q$ is inputted into the second MLP, to obtain a third feature vector $g_{q1}$.

Step 411: Generate, based on the first feature vector and the third feature vector, a third sub-function configured for representing an error between the anchor image and the positive sample.

In some embodiments, a third sub-function is generated according to the first feature vector and the third feature vector, and the third sub-function is configured for representing an error between the anchor image and the positive sample.

With reference to FIG. 5, the third sub-function may be represented as $\exp(g_{q1} \cdot f_k / \tau)$, and it can be seen that the third sub-function includes the first feature vector $f_k$ and the third feature vector $g_{q1}$.

Step 412: Based on the third feature vector and the feature vectors respectively corresponding to the plurality of second sample tissue images, generate, in combination with the plurality of weights, a fourth sub-function configured for representing an error between the anchor image and the negative sample.

In some embodiments, according to the third feature vector and the plurality of feature vectors corresponding to the plurality of second sample tissue images, a fourth sub-function is generated in combination with the plurality of weights, where the fourth sub-function is configured for representing an error between the anchor image and the negative sample.

With reference to FIG. 5, the fourth sub-function may be represented by an expression over the negative sample container, in which one symbol represents a weight of an i-th negative sample feature vector (that is, a feature vector of a second sample tissue image), another symbol represents the i-th negative sample feature vector, and $g_{q1}$ represents a feature vector (that is, the third feature vector) of the anchor image. The negative sample container includes K negative sample feature vectors in total.

Step 413: Generate a second weight loss function based on the third sub-function and the fourth sub-function.

With reference to FIG. 5, the second weight loss function may be expressed as follows:

WeightedNCE2 represents the second weight loss function, and log represents a logarithm operation.

Step 414: Train the first image encoder and the third image encoder based on the second weight loss function.

The first image encoder and the third image encoder are trained based on the second weight loss function.

In some embodiments, a complete weight loss function may be constructed with reference to the first weight loss function obtained in step 308, which is calculated as follows:

WeightedNCE is the complete weight loss function. The first image encoder, the second image encoder, and the third image encoder are trained according to the complete weight loss function. A manner for training the first image encoder, the second image encoder, and the third image encoder by using the complete weight loss function is not limited in the embodiments of this application. In some embodiments, parameters in the first image encoder, the second image encoder, and the third image encoder are updated by using the complete weight loss function in a forward gradient update or reverse gradient update manner.

In some embodiments, the step of updating the first image encoder based on the second image encoder in step 409 may be replaced with updating, according to a parameter shared between the second image encoder and the third image encoder, a parameter of the first image encoder in a weighting manner. That is, θ in formula (4) in step 409 represents the parameter shared between the second image encoder and the third image encoder, and the first image encoder is slowly updated by using the shared parameter.

Based on the above, two sample triples (first image, second image, a plurality of second sample tissue images) and (third image, second image, a plurality of second sample tissue images) are constructed in the foregoing solution. The first image is an anchor image 1, and the third image is an anchor image 2. In this way, a coding effect of an image encoder obtained through training is further improved, and the complete weight loss function constructed is more robust than the first weight loss function or the second weight loss function.

The content of training the image encoder based on the weight loss function has been completely described above. The image encoder includes the first image encoder, the second image encoder, and the third image encoder. The following describes training on the image encoder based on a group loss function.

FIG. 8 shows an image encoder training framework according to an exemplary embodiment. Descriptions are provided by using an example in which the framework is applied to the training device 21 of the image encoder shown in FIG. 1.

In FIG. 8, data enhancement is performed on a first sample tissue image 801 to obtain a second image 802, a fourth feature vector 806 is obtained based on the second image 802 through a second image encoder 804, and when a plurality of first sample tissue images 801 are simultaneously inputted, a positive sample vector 807 in a plurality of fourth feature vectors and a negative sample vector 808 in the plurality of fourth feature vectors are distinguished from the plurality of fourth feature vectors; data enhancement is performed on the first sample tissue image 801 to obtain a third image 803, and a fifth feature vector 809 is obtained based on the third image 803 through the third image encoder 805; a fifth sub-function 810 is generated based on the positive sample vector 807 in the plurality of fourth feature vectors and the fifth feature vector 809; a sixth sub-function 811 is generated based on the negative sample vector 808 in the plurality of fourth feature vectors and the fifth feature vector 809; and a first group loss function 812 is constructed based on the fifth sub-function 810 and the sixth sub-function 811.

The first group loss function 812 is configured for pulling the anchor image and the positive sample close to each other.

FIG. 9 shows a flowchart of an image encoder training method according to an exemplary embodiment. Descriptions are provided by using an example in which the method is applied to the image encoder training framework shown in FIG. 8. The method includes the following steps:

Step 901: Obtain a first sample tissue image.

The first sample tissue image is an image configured for training the image encoder in this application, that is, a local region image (a small image) in the WSI.

With reference to FIG. 10, an image X is the first sample tissue image.

Step 902: Perform data enhancement on the first sample tissue image, to obtain a second image; and input the second image into a second image encoder, to obtain a fourth feature vector.

In some embodiments, data enhancement is performed on the first sample tissue image, to obtain a second image, and feature extraction is performed on the second image through the second image encoder, to obtain a fourth feature vector.

In some embodiments, the second image is inputted into the second image encoder, to obtain a first intermediate feature vector; and the first intermediate feature vector is inputted into a third MLP, to obtain the fourth feature vector. The third MLP plays a transition role and is configured to improve an expression capability of the second image.

With reference to FIG. 10, data enhancement is performed on the image X to obtain an image X_p, the image X_p is converted to the high-level semantic space R^d through an encoder h, that is, a first intermediate feature vector h_p is obtained, and the first intermediate feature vector h_p is inputted into the third MLP, to obtain a fourth feature vector g_p1.

Step 903: Perform data enhancement on the first sample tissue image, to obtain a third image; and input the third image into a third image encoder, to obtain a fifth feature vector.

In some embodiments, the third image is inputted into the third image encoder, to obtain a second intermediate feature vector; and the second intermediate feature vector is inputted into a fourth MLP, to obtain the fifth feature vector. The fourth MLP plays a transition role and is configured to improve an expression capability of the third image.

With reference to FIG. 10, data enhancement is performed on the image X to obtain an image X_q, the image X_q is converted to the high-level semantic space R^d through the encoder h, that is, a second intermediate feature vector h_q is obtained, and the second intermediate feature vector h_q is inputted into the fourth MLP, to obtain a fifth feature vector g_q2.

Step 904: Determine the fourth feature vector as a contrastive vector for contrastive learning, and determine the fifth feature vector as an anchor vector for the contrastive learning.

In some embodiments, the fourth feature vector is determined as a contrastive vector for contrastive learning, and the fifth feature vector is determined as an anchor vector for the contrastive learning. The contrastive vector in the contrastive learning may be a positive sample vector, or may be a negative sample vector.

Step 905: Cluster a plurality of fourth feature vectors of different first sample tissue images, to obtain a plurality of first clustering centers.

In some embodiments, a plurality of different first sample tissue images are simultaneously inputted, and a plurality of fourth feature vectors of the plurality of first sample tissue images are clustered, to obtain a plurality of first clustering centers. In some embodiments, the plurality of different first sample tissue images are sample tissue images in the same training batch.

In some embodiments, fourth feature vectors of different first sample tissue images are clustered into S categories, and S first clustering centers of the S categories are represented as

where j∈[1, . . . , S], S being a positive integer.

FIG. 10 shows a first clustering center in a plurality of first clustering centers of fourth feature vectors of different first sample tissue images.

Step 906: Determine a feature vector that is in the plurality of first clustering centers and that has a maximum similarity value with the fifth feature vector as a positive sample vector in the plurality of fourth feature vectors.

In some embodiments, a first clustering center that is in the S first clustering centers and that is closest to the fifth feature vector is used as a positive sample vector, and is represented as

Step 907: Determine a first remaining feature vector as a negative sample vector in the plurality of fourth feature vectors.

The first remaining feature vector is a feature vector other than the feature vector having the maximum similarity value with the fifth feature vector in the plurality of fourth feature vectors.
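For illustration only, the selection of the positive sample vector and the negative sample vectors described in step 906 and step 907 may be sketched as follows. The sketch assumes that the first clustering centers have already been computed and that cosine similarity serves as the similarity value; the names cosine and split_centers and the toy vectors are hypothetical and are not part of this application.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_centers(anchor, centers):
    """Return (positive, negatives): the clustering center most similar to the
    anchor vector, and all remaining clustering centers."""
    sims = [cosine(anchor, c) for c in centers]
    best = max(range(len(centers)), key=lambda i: sims[i])
    positive = centers[best]
    negatives = [c for i, c in enumerate(centers) if i != best]
    return positive, negatives

# Toy example: three clustering centers and one anchor (fifth feature vector).
centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
anchor = [0.9, 0.1]
positive, negatives = split_centers(anchor, centers)
```

In this toy case the center pointing in nearly the same direction as the anchor is taken as the positive sample vector, and the other two centers are taken as negative sample vectors.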

In some embodiments, a feature vector other than

in the S first clustering centers is used as a negative sample vector, and is represented as

Step 908: Generate a fifth sub-function based on the fifth feature vector and the positive sample vector in the plurality of fourth feature vectors.

In some embodiments, the fifth sub-function is represented as

g_q2 is used as an anchor vector in the contrastive learning, and

is used as a positive sample vector in the contrastive learning.

Step 909: Generate a sixth sub-function based on the fifth feature vector and the negative sample vector in the plurality of fourth feature vectors.

In some embodiments, the sixth sub-function is represented as

The fifth feature vector g_q2 is used as an anchor vector in the contrastive learning, and

is used as a negative sample vector in the contrastive learning.

Step 910: Generate a first group loss function based on the fifth sub-function and the sixth sub-function.

In some embodiments, the first group loss function is expressed as follows:

L_GroupNCE1 represents the first group loss function, and log represents a logarithm operation.
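The formula of the first group loss function is not reproduced in this text. For illustration only, one common contrastive form, an InfoNCE-style loss, may be sketched as follows, under the assumption that the fifth sub-function is an exponentiated similarity between the anchor vector and the positive sample vector and that the sixth sub-function is the sum of such terms over the negative sample vectors. The dot-product similarity, the temperature tau, and the name group_nce are assumptions and are not taken from this application.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity measure)."""
    return sum(a * b for a, b in zip(u, v))

def group_nce(anchor, positive, negatives, tau=0.07):
    """Assumed InfoNCE-style loss:
    -log( exp(a.p / tau) / (exp(a.p / tau) + sum_n exp(a.n / tau)) )."""
    pos_term = math.exp(dot(anchor, positive) / tau)                      # positive-pair term
    neg_term = sum(math.exp(dot(anchor, n) / tau) for n in negatives)     # negative-pair term
    return -math.log(pos_term / (pos_term + neg_term))

anchor = [1.0, 0.0]
loss_aligned = group_nce(anchor, [1.0, 0.0], [[0.0, 1.0]])     # positive matches the anchor
loss_misaligned = group_nce(anchor, [0.0, 1.0], [[1.0, 0.0]])  # negative matches the anchor
```

Under this assumed form, the loss is small when the anchor vector is close to the positive sample vector and far from the negative sample vectors, which is consistent with the stated purpose of pulling the anchor image and the positive sample close to each other.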

Step 911: Train the second image encoder and the third image encoder based on the first group loss function; and determine the third image encoder as an image encoder obtained through final training.

The second image encoder and the third image encoder may be trained according to the first group loss function.

In some embodiments, the third image encoder is determined as an image encoder obtained through final training.

Based on the above, the positive samples identified in the related art are further distinguished according to their “positive degree”, so that a loss value used in contrastive learning (also referred to as a contrastive learning paradigm) can more accurately pull anchor images and the positive samples close to each other, thereby better training an image encoder. In this way, the image encoder obtained through training can better learn a common feature between the anchor image and the positive sample, so that the trained image encoder performs feature extraction on the image more accurately.

FIG. 8 and FIG. 9 further show that the image encoder is trained through a feature vector sample triple, where the sample triple in the contrastive learning is (anchor vector, positive vector, negative vector). In another embodiment, the first image encoder may also be trained through a plurality of feature vector sample triples. The following describes training on the image encoder through two feature vector sample triples (anchor vector 1, positive vector 1, negative vector 1) and (anchor vector 2, positive vector 2, negative vector 2). The anchor vector 1 and the anchor vector 2 are different vectors obtained by performing data enhancement on the first sample tissue image and processing through different image encoders and different MLPs. A quantity of feature vector sample triples specifically constructed is not limited in this application.

FIG. 11 shows an image encoder training framework according to an exemplary embodiment. Descriptions are provided by using an example in which the framework is applied to the training device 21 of the image encoder shown in FIG. 1.

In FIG. 11, data enhancement is performed on a first sample tissue image 801 to obtain a second image 802, a fourth feature vector 806 is obtained based on the second image 802 through a second image encoder 804, and when a plurality of first sample tissue images 801 are simultaneously inputted, a positive sample vector 807 in a plurality of fourth feature vectors and a negative sample vector 808 in the plurality of fourth feature vectors are distinguished from the plurality of fourth feature vectors; data enhancement is performed on the first sample tissue image 801 to obtain a third image 803, and a fifth feature vector 809 is obtained based on the third image 803 through the third image encoder 805; a fifth sub-function 810 is generated based on the positive sample vector 807 in the plurality of fourth feature vectors and the fifth feature vector 809; a sixth sub-function 811 is generated based on the negative sample vector 808 in the plurality of fourth feature vectors and the fifth feature vector 809; and a first group loss function 812 is constructed based on the fifth sub-function 810 and the sixth sub-function 811.

Different from the training framework shown in FIG. 8, in FIG. 11, when a plurality of first sample tissue images 801 are simultaneously inputted, a positive sample vector 813 in a plurality of fifth feature vectors and a negative sample vector 814 in the plurality of fifth feature vectors are distinguished from the plurality of fifth feature vectors; a seventh sub-function 815 is generated based on the positive sample vector 813 in the plurality of fifth feature vectors and the fourth feature vector 806; an eighth sub-function 816 is generated based on the negative sample vector 814 in the plurality of fifth feature vectors and the fourth feature vector 806; and a second group loss function 817 is constructed based on the seventh sub-function 815 and the eighth sub-function 816.

The second group loss function 817 is configured for pulling the anchor image and the positive sample close to each other.

Based on the image encoder training method shown in FIG. 9, FIG. 12 further provides step 912 to step 919 based on the method steps in FIG. 8, and descriptions are provided by using an example in which the method in FIG. 11 is applied to the image encoder training framework shown in FIG. 10. The method is performed by a training device of an image encoder in the training framework. The method includes the following steps:

Step 912: Determine the fifth feature vector as a contrastive vector for contrastive learning, and determine the fourth feature vector as an anchor vector for the contrastive learning.

In some embodiments, the fifth feature vector is determined as a contrastive vector for contrastive learning, and the fourth feature vector is determined as an anchor vector for the contrastive learning. The contrastive vector in the contrastive learning may be a positive sample vector, or may be a negative sample vector.

Step 913: Cluster a plurality of fifth feature vectors of different first sample tissue images, to obtain a plurality of second clustering centers.

In some embodiments, a plurality of different first sample tissue images are simultaneously inputted, and the feature vectors corresponding to the plurality of first sample tissue images, that is, a plurality of fifth feature vectors, are clustered to obtain a plurality of second clustering centers. In some embodiments, the plurality of different first sample tissue images are sample tissue images in the same training batch.

In some embodiments, fifth feature vectors of different first sample tissue images are clustered into S categories, and S second clustering centers of the S categories are represented as

where j∈[1, . . . , S].

FIG. 10 shows a second clustering center in a plurality of second clustering centers of fifth feature vectors of different first sample tissue images.

Step 914: Determine a feature vector that is in the plurality of second clustering centers and that has a maximum similarity value with the fourth feature vector as a positive sample vector in the plurality of fifth feature vectors.

In some embodiments, a second clustering center that is in the S second clustering centers and that is closest to the fourth feature vector is used as a positive sample vector, and is represented as

Step 915: Determine a second remaining feature vector as a negative sample vector in the plurality of fifth feature vectors.

The second remaining feature vector is a feature vector other than the feature vector having the maximum similarity value with the fourth feature vector in the plurality of fifth feature vectors.

In some embodiments, a feature vector other than

in the S second clustering centers is used as a negative sample vector, and is represented as

Step 916: Generate a seventh sub-function based on the fourth feature vector and the positive sample vector in the plurality of fifth feature vectors.

In some embodiments, the seventh sub-function is represented as

Step 917: Generate an eighth sub-function based on the fourth feature vector and the negative sample vector in the plurality of fifth feature vectors.

In some embodiments, the eighth sub-function is represented as

The fourth feature vector g_p1 is used as an anchor vector in the contrastive learning, and

is used as a negative sample vector in the contrastive learning.

Step 918: Generate a second group loss function based on the seventh sub-function and the eighth sub-function.

In some embodiments, the second group loss function is expressed as follows:

L_GroupNCE2 represents the second group loss function, and log represents a logarithm operation.

Step 919: Train the second image encoder and the third image encoder based on the second group loss function; and determine the second image encoder as an image encoder obtained through final training.

The second image encoder and the third image encoder are trained according to the second group loss function; and the second image encoder is determined as an image encoder obtained through final training.

In some embodiments, a complete group loss function may be constructed with reference to the first group loss function obtained in step 910, which is calculated as follows:

L_GroupNCE is the complete group loss function. The second image encoder and the third image encoder are trained according to the complete group loss function. The second image encoder and the third image encoder are determined as image encoders obtained through final training.

In some embodiments, after step 919, the method further includes step 920: Update, according to a parameter shared between the second image encoder and the third image encoder, a parameter of the first image encoder in a weighting manner.

For example, a formula for updating the parameter of the first image encoder is expressed as follows:

θ′ on a left side of formula (10) represents the parameter of the first image encoder after the update, θ′ on a right side of formula (10) represents the parameter of the first image encoder before the update, and θ represents the parameter shared between the second image encoder and the third image encoder, m being a constant. In some embodiments, m is equal to 0.99.
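Formula (10) is not reproduced in this text. For illustration only, the weighted update may be sketched as follows, assuming the commonly used momentum form θ′ ← m·θ′ + (1 − m)·θ, which is consistent with the symbol definitions above. The name momentum_update and the representation of parameters as flat lists of numbers are hypothetical.

```python
def momentum_update(theta_prime, theta, m=0.99):
    """Assumed momentum form: return m * theta_prime + (1 - m) * theta, element-wise.

    theta_prime: parameters of the first image encoder before the update.
    theta: parameters shared between the second and third image encoders.
    """
    return [m * tp + (1.0 - m) * t for tp, t in zip(theta_prime, theta)]

theta_prime = [0.0, 1.0]
theta = [1.0, 1.0]
updated = momentum_update(theta_prime, theta)
```

With m equal to 0.99, each update moves the parameter of the first image encoder only 1% of the way toward the shared parameter, so the first image encoder changes slowly.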

Based on the above, two feature vector sample triples (fifth feature vector, positive vector in a plurality of fourth feature vectors, negative vector in a plurality of fourth feature vectors) and (fourth feature vector, positive vector in a plurality of fifth feature vectors, negative vector in a plurality of fifth feature vectors) are constructed, so that a coding effect of an image encoder obtained through training is further improved, precision of an image feature obtained through the image encoder is improved, and the complete group loss function constructed is more robust than the first group loss function or the second group loss function.

Based on FIG. 3 to FIG. 7, the first image encoder may be trained by using the weight loss function; and based on FIG. 8 to FIG. 12, the first image encoder may be trained by using the group loss function.

In some embodiments, the first image encoder may be trained jointly by using the weight loss function and the group loss function. FIG. 13 shows a schematic diagram of a training architecture of a first image encoder according to an exemplary embodiment of this application.

Data enhancement is performed on the image X to obtain the image X_k, and the first feature vector f_k is obtained based on the image X_k through the encoder f; data enhancement is performed on the image X to obtain the image X_p, the first intermediate feature vector h_p is obtained based on the image X_p through the encoder h, and the second feature vector g_p2 is obtained based on the first intermediate feature vector h_p through the first MLP; data enhancement is performed on the image X to obtain the image X_q, the second intermediate feature vector h_q is obtained based on the image X_q through the encoder h, and a third feature vector g_p1 is obtained based on the second intermediate feature vector h_q through the second MLP; and

the plurality of second sample tissue images are inputted into the encoder f and then are stored in a storage queue by performing a stack pushing operation, and k-means clustering is performed on the storage queue to cluster negative sample feature vectors in the queue into Q categories, to further construct Q sub-queues. A weight is assigned to each clustering center based on a similarity value between each clustering center and f_k; a sub-function configured for representing the negative sample and the anchor image is constructed based on Q clustering centers and the second feature vector g_p2; a sub-function configured for representing the positive sample and the anchor image is constructed based on the first feature vector f_k and the second feature vector g_p2; the first weight loss function is formed by combining the two sub-functions; a sub-function configured for representing the negative sample and the anchor image is constructed based on Q clustering centers and the third feature vector g_p1; a sub-function configured for representing the positive sample and the anchor image is constructed based on the first feature vector f_k and the third feature vector g_p1; a second weight loss function is formed by combining the two sub-functions; and the first image encoder, the second image encoder, and the third image encoder are trained based on the weight loss function obtained by combining the first weight loss function and the second weight loss function, and the parameter of the first image encoder is slowly updated by using the parameter shared between the second image encoder and the third image encoder.

Data enhancement is performed on the image X to obtain the image X_p, the first intermediate feature vector h_p is obtained based on the image X_p through the encoder h, and the fourth feature vector g_p1 is obtained based on the first intermediate feature vector h_p through the third MLP. Data enhancement is performed on the image X to obtain the image X_q, the second intermediate feature vector h_q is obtained based on the image X_q through the encoder h, and a fifth feature vector g_q2 is obtained based on the second intermediate feature vector h_q through the fourth MLP.

In the same training batch, a plurality of fourth feature vectors g_p1 of a plurality of first sample tissue images are clustered to obtain a plurality of first clustering centers; a first clustering center that is in the plurality of first clustering centers and that is closest to a fifth feature vector g_q2 of one first sample tissue image is determined as a positive sample vector; a remaining feature vector in the plurality of first clustering centers is determined as a negative sample vector; a sub-function configured for representing an error between the positive sample vector and the anchor vector is constructed based on the positive sample vector and the fifth feature vector g_q2; a sub-function configured for representing an error between the negative sample vector and the anchor vector is constructed based on the negative sample vector and the fifth feature vector g_q2; and a first group loss function is formed by combining the two sub-functions.

In the same training batch, a plurality of fifth feature vectors g_q2 of a plurality of first sample tissue images are clustered to obtain a plurality of second clustering centers; a second clustering center that is in the plurality of second clustering centers and that is closest to a fourth feature vector g_p1 of one first sample tissue image is determined as a positive sample vector; a remaining feature vector in the plurality of second clustering centers is determined as a negative sample vector; a sub-function configured for representing an error between the positive sample vector and the anchor vector is constructed based on the positive sample vector and the fourth feature vector g_p1; a sub-function configured for representing an error between the negative sample vector and the anchor vector is constructed based on the negative sample vector and the fourth feature vector g_p1; and a second group loss function is formed by combining the two sub-functions.

The second image encoder and the third image encoder are trained based on a group loss function obtained by combining the first group loss function and the second group loss function.

It may be understood that training the image encoder based on the weight loss function and training the image encoder based on the group loss function both determine a similarity value based on clustering and re-assign the positive and negative sample assumption. The weight loss function is configured for correcting the positive and negative sample assumption of the negative sample in the related art, and the group loss function is configured for correcting the positive and negative sample assumption of the positive sample in the related art.

In the training architecture shown in FIG. 13, the weight loss function and the group loss function are combined through a hyperparameter, which is expressed as follows:

The left side of formula (11) is the final loss function, L_WeightedNCE is the weight loss function, L_GroupNCE is the group loss function, and λ is used as a hyperparameter to adjust contributions of the two loss functions.
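Formula (11) is not reproduced in this text. For illustration only, one plausible reading that is consistent with the description, in which a single hyperparameter λ scales the contribution of the group loss function relative to the weight loss function, is given below. The placement of λ on the group loss term and the symbol used for the final loss function are assumptions.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{WeightedNCE}} + \lambda \, \mathcal{L}_{\mathrm{GroupNCE}}
```

Under this assumed form, λ equal to 0 reduces the final loss function to the weight loss function alone, and a larger λ gives the group loss function a greater contribution.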

Based on the above, a final loss function is constructed through both the weight loss function and the group loss function. Compared with a single weight loss function or a single group loss function, the final loss function may be more robust, an image encoder obtained through final training may have a better coding effect, and a feature of a small image extracted through the image encoder can better represent the small image.

The image encoder training phase is introduced above. The image encoder use phase is introduced below. In an embodiment of this application, the image encoder may be used in a WSI image search scenario. FIG. 14 shows a flowchart of a method for searching for a whole slide image according to an exemplary embodiment of this application. Descriptions are provided by using an example in which the method is applied to the use device 22 of the image encoder shown in FIG. 1. In this case, the use device 22 of the image encoder may also be referred to as a device for searching for a whole slide image.

Step 1401: Obtain a whole slide image, and crop the whole slide image into a plurality of tissue images.

The whole slide image (WSI) is a visual digital image obtained by using a digital scanner to scan a conventional slide image to acquire a high-resolution image, and then using a computer to seamlessly splice acquired fragmented images. In this application, the WSI is usually referred to as a large image.

The tissue image is a local tissue region in the WSI, and the tissue image is usually referred to as a small image in this application.

In some embodiments, in a WSI preprocessing phase, a foreground tissue region in the WSI is extracted by using a threshold technology, and then the foreground tissue region in the WSI is cropped into a plurality of tissue images based on a sliding window technology.
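For illustration only, the threshold-based foreground extraction and the sliding-window cropping may be sketched as follows. The sketch assumes a grayscale image in which the background is bright and tissue is darker than a fixed threshold; the name crop_tissue_images, the threshold value, and the minimum tissue fraction min_tissue are hypothetical and are not specified in this application.

```python
def crop_tissue_images(gray, patch, stride, threshold=200, min_tissue=0.5):
    """Return the top-left (row, col) of each window that contains enough tissue.

    gray: 2D list of grayscale values (0-255). Pixels darker than `threshold`
    are treated as foreground tissue (assumption: the slide background is bright).
    """
    rows, cols = len(gray), len(gray[0])
    kept = []
    for r in range(0, rows - patch + 1, stride):          # slide the window vertically
        for c in range(0, cols - patch + 1, stride):      # slide the window horizontally
            tissue = sum(
                1
                for i in range(r, r + patch)
                for j in range(c, c + patch)
                if gray[i][j] < threshold
            )
            if tissue / (patch * patch) >= min_tissue:
                kept.append((r, c))
    return kept

# Toy 4x4 "slide": the left half is dark tissue, the right half is bright background.
slide = [[50, 50, 255, 255] for _ in range(4)]
coords = crop_tissue_images(slide, patch=2, stride=2)
```

In this toy case only the two windows in the left half are kept as tissue images, and the two background windows are discarded.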

Step 1402: Generate, through an image encoder, image feature vectors respectively corresponding to the plurality of tissue images.

In some embodiments, image feature vectors respectively corresponding to the plurality of tissue images are generated through the first image encoder obtained through training in the method embodiment shown in FIG. 4. In this case, the first image encoder is obtained through training based on the first weight loss function; or

image feature vectors respectively corresponding to the plurality of tissue images are generated through the first image encoder obtained through training in the method embodiment shown in FIG. 7. In this case, the first image encoder is obtained through training based on the first weight loss function and the second weight loss function; or

image feature vectors respectively corresponding to the plurality of tissue images are generated through the third image encoder obtained through training in the method embodiment shown in FIG. 9. In this case, the third image encoder is obtained through training based on the first group loss function; or

image feature vectors respectively corresponding to the plurality of tissue images are generated through the second image encoder or the third image encoder obtained through training in the method embodiment shown in FIG. 12. In this case, both the second image encoder and the third image encoder are obtained through training based on the first group loss function and the second group loss function; or

image feature vectors respectively corresponding to the plurality of tissue images are generated through the first image encoder obtained through training in the embodiment shown in FIG. 13. In this case, the first image encoder is obtained through training based on the weight loss function and the group loss function.

Step 1403: Cluster the image feature vectors respectively corresponding to the plurality of tissue images, to determine at least one key image from the plurality of tissue images.

In some embodiments, the image feature vectors respectively corresponding to the plurality of tissue images are clustered, to obtain a plurality of first class clusters; and clustering centers respectively corresponding to the plurality of first class clusters are determined as image feature vectors respectively corresponding to the plurality of key images, that is, the plurality of key images are determined from the plurality of tissue images.

In another embodiment, the plurality of image feature vectors corresponding to the plurality of tissue images are clustered, to obtain a plurality of first class clusters, and then clustering is performed again. For a target first class cluster in the plurality of first class clusters, clustering is performed based on location features, in the whole slide images to which they respectively belong, of the plurality of tissue images corresponding to the target first class cluster, to obtain a plurality of second class clusters. In this case, determining the clustering centers respectively corresponding to the plurality of first class clusters as image feature vectors respectively corresponding to the plurality of key images includes: for the target first class cluster in the plurality of first class clusters, determining clustering centers respectively corresponding to the plurality of second class clusters included in the target first class cluster as the image feature vectors of the key images, where the target first class cluster is any one of the plurality of first class clusters.

For example, clustering is performed in a K-means clustering manner. During the first clustering, a plurality of image feature vectors f_all are clustered to obtain K_1 different categories, K_1 being a positive integer and represented as F_i, i=1, 2, . . . , K_1. During the second clustering, in each class cluster F_i, spatial coordinate information of the plurality of tissue images is used as a feature, clustering is further performed to obtain K_2 categories, K_2 is a positive integer, and K_2=round(R·N). R is a proportional parameter. In some embodiments, R is equal to 20%; and N is a quantity of small images in the class cluster F_i. Based on double clustering, K_1*K_2 clustering centers may be finally obtained, tissue images corresponding to the K_1*K_2 clustering centers are used as K_1*K_2 key images, and the K_1*K_2 key images are used as a global representation of the WSI. In some embodiments, the key image is usually referred to as a mosaic image.
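For illustration only, the double clustering may be sketched as follows in plain Python. The sketch assumes a simple Lloyd-style K-means with deterministic initialization, takes the tissue image nearest to each spatial clustering center as the key image, and uses at least one second class cluster per first class cluster; these choices and the names dist, kmeans, and select_key_images are assumptions that are not specified in this application.

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; returns (labels, centers).
    Centers are initialized from evenly spaced points so the result is deterministic."""
    k = max(1, min(k, len(points)))
    step = len(points) / k
    centers = [list(points[int(i * step)]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centers

def select_key_images(features, coords, k1, ratio=0.2):
    """Two-stage clustering; returns the indices of the key (mosaic) images."""
    key_indices = []
    labels, _ = kmeans(features, k1)                      # first clustering: image features
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        k2 = max(1, round(ratio * len(idx)))              # K_2 = round(R * N), at least 1
        sub_labels, sub_centers = kmeans([coords[i] for i in idx], k2)  # second: coordinates
        for s in set(sub_labels):
            members = [i for i, lab in zip(idx, sub_labels) if lab == s]
            # Key image: the tissue image nearest to the spatial clustering center.
            key_indices.append(min(members, key=lambda i: dist(coords[i], sub_centers[s])))
    return sorted(key_indices)

features = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
            [10, 10], [10, 11], [11, 10], [11, 11], [10.5, 10.5]]
coords = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1),
          (5, 5), (5, 7), (7, 5), (7, 7), (6, 6)]
keys = select_key_images(features, coords, k1=2)
```

In this toy case each first class cluster contains N=5 tissue images, so K_2=round(0.2·5)=1, and one key image is selected per first class cluster.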

Step 1404: Query, based on image feature vectors respectively corresponding to the at least one key image, a database to obtain at least one candidate image package respectively corresponding to the at least one key image, the candidate image package including at least one candidate tissue image.

Based on step 1403, it may be obtained that WSI={P_1, P_2, . . . , P_i, . . . , P_k}, where P_i and k respectively represent a feature vector of an i-th key image and a total quantity of key images in the WSI, both i and k being positive integers. When searching for the WSI, the key images are used as query images one by one to generate candidate image packages, and a total of k candidate image packages are generated, which is expressed as Bag={B_1, B_2, . . . , B_i, . . . , B_k}, where an i-th candidate image package B_i={b_i1, b_i2, . . . , b_ij, . . . , b_it}, and b_ij and t respectively represent a j-th candidate tissue image and a total quantity of candidate tissue images in B_i, j being a positive integer.
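For illustration only, generating one candidate image package per key image may be sketched as follows. How the database is queried is not specified at this point in the text; the sketch assumes a nearest-neighbour lookup that returns the t database entries with the highest cosine similarity to each key image feature. The dictionary fields "feature" and "wsi_id" and the name build_candidate_packages are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_candidate_packages(key_features, database, t):
    """For each key image feature, return a package of the t most similar
    database entries (assumed retrieval rule)."""
    bag = []
    for p in key_features:
        ranked = sorted(database, key=lambda e: cosine(p, e["feature"]), reverse=True)
        bag.append(ranked[:t])
    return bag

# Hypothetical database of tissue images, each tagged with the WSI it comes from.
database = [
    {"wsi_id": "A", "feature": [1.0, 0.0]},
    {"wsi_id": "B", "feature": [0.0, 1.0]},
    {"wsi_id": "C", "feature": [0.7, 0.7]},
]
key_features = [[1.0, 0.1], [0.1, 1.0]]
bag = build_candidate_packages(key_features, database, t=2)
```

Here k=2 key images yield k=2 candidate image packages, each containing t=2 candidate tissue images.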

Step 1405: Determine, according to an attribute of the at least one candidate image package, at least one target image package from the at least one candidate image package.

As can be seen from step 1404, a total of k candidate image packages are generated. To increase a WSI search speed and optimize final search results, the k candidate image packages further need to be screened. In some embodiments, the k candidate image packages are screened according to similarities between the candidate image packages and the WSI and/or diagnostic categories of the candidate image packages, to obtain a plurality of target image packages. Specific screening steps are described in detail below.

Step 1406: Determine a whole slide image to which at least one candidate tissue image included in the at least one target image package respectively belongs as a final search result.

After the plurality of target image packages are screened, whole slide images to which a plurality of target tissue images in the target image packages belong are determined as final search results. In some embodiments, the plurality of target tissue images in the target image packages may be from the same whole slide image, or may come from a plurality of different whole slide images.
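For illustration only, collecting the final search results from the target image packages may be sketched as follows. The sketch assumes that each candidate tissue image records the identifier of the whole slide image it belongs to; the field name "wsi_id" and the function name final_search_result are hypothetical.

```python
def final_search_result(target_packages):
    """Return the distinct WSI identifiers of all tissue images in the target
    image packages, in first-seen order."""
    seen = []
    for package in target_packages:
        for image in package:
            if image["wsi_id"] not in seen:
                seen.append(image["wsi_id"])
    return seen

target_packages = [
    [{"wsi_id": "A"}, {"wsi_id": "C"}],
    [{"wsi_id": "C"}, {"wsi_id": "B"}],
]
result = final_search_result(target_packages)
```

Tissue images from the same whole slide image contribute a single entry, and tissue images from different whole slide images contribute one entry each.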

Based on the above, first, the WSI is cropped into a plurality of small images, and the plurality of small images are inputted into the image encoder to obtain image feature vectors respectively corresponding to the plurality of small images, that is, a plurality of image feature vectors are obtained. The plurality of image feature vectors are clustered, and small images corresponding to clustering centers are used as key images. Next, the key images are queried to obtain candidate image packages. Then, the candidate image packages are screened to obtain the target image package. Finally, a WSI corresponding to at least one small image in the target image package is used as the final search result. This method provides a manner for searching for a WSI (a large image) with a WSI (a large image), and the clustering step and the screening step mentioned can greatly reduce an amount of data processed and improve search efficiency. In addition, the manner for searching for the WSI (the large image) with the WSI (the large image) provided in this embodiment does not require a training process, and can achieve fast search and matching. Specifically, the key images are determined from the plurality of small images in a clustering manner, and clustering centers are further determined as image features of the key images, which is beneficial to improving accuracy of the determined key images and image features corresponding to the key images.

In addition, the determining the target image package from the candidate image packages includes at least two manners, one is based on a quantity of diagnostic categories of the candidate image packages, and the other is based on similarities between the image features in the candidate image packages and the image features of the key images. Therefore, this embodiment of this application enriches a manner for determining search results, and further improves accuracy of the search results.

On the one hand, when the target image package is determined based on the quantity of diagnostic categories of the candidate image packages, the candidate image packages are screened based on the diagnostic categories, which is closer to the actual situation. Furthermore, entropy values corresponding to the candidate image packages are determined from a plurality of dimensions according to the diagnostic categories, so that the target image package is screened more intuitively.

On the other hand, when the target image package is determined based on the similarities between the image features in the candidate image packages and the image features of the key images, cosine similarities between the candidate tissue images in the candidate image packages and the key images are respectively calculated, and the first m cosine similarity values are taken to determine a mean value, to further screen the target image package according to the mean value. In this case, instead of a cosine similarity of a single feature, m similarities are comprehensively considered. Therefore, this solution has better fault tolerance.

In the related art, a manner for using a small image to represent a large image often adopts manual selection. Pathologists use a color and a texture feature of each small image in the WSI (for example, histogram statistical information from various color spaces) to select a core small image. Then, features of the core small images are accumulated into a global representation of the WSI, and then a support vector machine (SVM) is used to classify the WSI global representations of a plurality of WSIs into two main disease types. In a search phase, once a disease type for which the WSI is to be searched is determined, image search may be performed on a WSI library with the same disease type.

Based on the embodiment shown in FIG. 14, step 1405 may be replaced with step 1405-1.

Step 1405-1: Screen the at least one candidate image package according to a quantity of diagnostic categories respectively included in the at least one candidate image package, to obtain the at least one target image package.

In some embodiments, for a first candidate image package that is in the at least one candidate image package and that corresponds to a first key image in the at least one key image, an entropy value of the candidate image package is calculated based on a cosine similarity between at least one candidate tissue image in the first candidate image package and the first key image, an occurrence probability of at least one diagnostic category in the database, and a diagnostic category of the at least one candidate tissue image, where the entropy value is configured for measuring a quantity of diagnostic categories corresponding to the first candidate image package, and the first candidate image package is any one of the at least one candidate image package; and the at least one candidate image package is screened, to obtain the at least one target image package whose entropy value is less than an entropy value threshold.

For example, a formula for calculating the entropy value is as follows:

$$\mathrm{Ent}_i = -\sum_{m=1}^{u_i} p_m \log p_m \qquad (12)$$

$\mathrm{Ent}_i$ represents an entropy value of the $i$-th candidate image package, $u_i$ represents a total quantity of diagnostic categories in the $i$-th candidate image package, and $p_m$ represents an occurrence probability of an $m$-th diagnosis type in the $i$-th candidate image package, $m$ being a positive integer.

It may be understood that the entropy value is configured for representing a degree of uncertainty of the $i$-th candidate image package. A greater entropy value indicates a higher degree of uncertainty of the $i$-th candidate image package and more disordered distribution of the candidate tissue images in the $i$-th candidate image package on a dimension of the diagnostic categories. In other words, the higher the degree of uncertainty of the $i$-th key image, the less the $i$-th key image can be used to represent the WSI. If a plurality of candidate tissue images in the $i$-th candidate image package have the same diagnosis result, the entropy value of the candidate image package may be equal to 0, and the $i$-th key image has the best representation of the WSI.

In formula (12), $p_m$ is calculated as follows:

$$p_m = \frac{\sum_{j} \delta(y_j, m)\, w_{y_j}\, (d_j + 1)/2}{\sum_{j} w_{y_j}\, (d_j + 1)/2} \qquad (13)$$

$y_j$ represents a diagnostic category of a $j$-th candidate tissue image in the $i$-th candidate image package; $\delta(\cdot)$ is a discriminant function configured for determining whether the diagnostic category of the $j$-th candidate tissue image is consistent with the $m$-th diagnostic category, where 1 is outputted if the two are consistent and 0 is outputted otherwise; $w_{y_j}$ is a weight of the $j$-th candidate tissue image, and $w_{y_j}$ is calculated based on the occurrence probability of at least one diagnostic category in the database; and $d_j$ represents a cosine similarity between the $j$-th candidate tissue image in the $i$-th candidate image package and the $i$-th key image, and $(d_j+1)/2$ is used to ensure that the value is in a range of 0 to 1.

For ease of understanding, in formula (13), $w_{y_j} \cdot (d_j+1)/2$ may be regarded as a weight score $v_j$, which is configured for representing the $j$-th candidate tissue image in the $i$-th candidate image package. A denominator of formula (13) represents a total score of the $i$-th candidate image package, and a numerator of formula (13) represents a sum of scores of the $m$-th diagnostic category of the $i$-th candidate image package.

Based on formula (12) and formula (13), a plurality of candidate image packages may be screened, and a candidate image package whose entropy value is not lower than a preset threshold of the entropy value is removed. A plurality of target image packages may be screened out from the plurality of candidate image packages, which is expressed as $\mathrm{Bag} = \{b_1, b_2, \ldots, b_i, \ldots, b_{k'}\}$, where $k'$ is a quantity of the plurality of target image packages, $k'$ being a positive integer.

Based on the above, a candidate image package whose entropy value is not lower than the preset threshold of the entropy value is removed, that is, a candidate image package with higher stability is retained, which further reduces the amount of data processed in the process of searching for the WSI with the WSI, thereby improving search efficiency.
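The entropy-based screening can be illustrated with a short sketch. It follows the embodiment in which a target image package is one whose entropy value is less than the entropy value threshold. The natural logarithm, the `category_weight` dictionary, and the function names are assumptions made for illustration; the text only states that the weight is derived from the occurrence probability of the diagnostic category in the database.

```python
import math
from collections import defaultdict

def package_entropy(candidates, category_weight):
    """Entropy of one candidate image package, in the spirit of formulas (12) and (13).

    candidates: list of (diagnostic_category, cosine_similarity) pairs.
    category_weight: hypothetical dict mapping a diagnostic category to its weight.
    """
    scores = defaultdict(float)
    for category, d in candidates:
        # Weight score of one candidate: weight * (d + 1) / 2, with (d + 1) / 2 in [0, 1].
        scores[category] += category_weight[category] * (d + 1.0) / 2.0
    total = sum(scores.values())
    if total == 0.0:
        return 0.0  # degenerate package with no score mass (assumed convention)
    entropy = 0.0
    for score in scores.values():
        p = score / total  # share of the package score held by one category
        if p > 0.0:
            entropy -= p * math.log(p)  # natural logarithm is an assumption
    return entropy

def screen_by_entropy(packages, category_weight, threshold):
    """Keep the packages whose entropy value is less than the threshold."""
    return [pkg for pkg in packages if package_entropy(pkg, category_weight) < threshold]

weights = {"A": 1.0, "B": 1.0}
pure = [("A", 0.9), ("A", 0.8)]
mixed = [("A", 1.0), ("B", 1.0)]
kept = screen_by_entropy([pure, mixed], weights, threshold=0.5)
```

Here the package with a single diagnostic category has an entropy value of 0 and is kept, while the evenly mixed package has an entropy value of ln 2 and is removed.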

Based on the embodiment shown in FIG. 14, step 1405 may be replaced with step 1405-2.

Step 1405-2: Screen the at least one candidate image package according to similarities between the at least one candidate tissue image and the key images, to obtain the at least one target image package.

In some embodiments, for a first candidate image package that is in the at least one candidate image package and that corresponds to a first key image in the at least one key image, the at least one candidate tissue image in the first candidate image package is sorted in descending order of cosine similarities with the first key image; first m candidate tissue images of the first candidate image package are obtained; cosine similarities respectively corresponding to the first m candidate tissue images are calculated; a mean value of cosine similarities respectively corresponding to first m candidate tissue images of the plurality of candidate image packages is determined as a first mean value; and a candidate image package that includes the at least one candidate tissue image whose mean value of cosine similarities is greater than the first mean value is determined as the target image package, to obtain the at least one target image package, where the first candidate image package is any one of the at least one candidate image package, and m is a positive integer.

For example, the plurality of candidate image packages are represented as $\mathrm{Bag} = \{b_1, b_2, \ldots, b_i, \ldots, b_k\}$, candidate tissue images in each candidate image package are sorted in descending order of the cosine similarities, and the first mean value may be expressed as follows:

$$\eta = \frac{1}{k} \sum_{i=1}^{k} \mathrm{AveTop}(b_i)$$

$b_i$ and $k$ respectively represent the $i$-th candidate image package and a total quantity of the plurality of candidate image packages, AveTop represents the mean value of the first m cosine similarities in the $i$-th candidate image package, $\eta$ is the first mean value, and $\eta$ is used as an evaluation criterion to remove a candidate image package whose average cosine similarity is less than $\eta$, so that a plurality of target image packages may be obtained. The plurality of target image packages are expressed as $\mathrm{Bag} = \{b_1, b_2, \ldots, b_i, \ldots, b_{k''}\}$, where $b_i$ and $k''$ respectively represent the $i$-th target image package and a total quantity of the plurality of target image packages, $k''$ being a positive integer.

Based on the above, the candidate image package whose similarity to the key image is less than the first mean value is removed, that is, a candidate image package with a high similarity between the candidate tissue image and the key image is retained, which further reduces the amount of data processed in the process of searching for the WSI with the WSI, thereby improving search efficiency.
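The similarity-based screening can be sketched as follows. Each candidate image package is reduced here to a non-empty list of cosine similarities with its key image. The function names, and the choice to keep a package whose mean is exactly equal to the first mean value, are assumptions made for illustration.

```python
def ave_top(similarities, m):
    """Mean of the m largest cosine similarities in one non-empty package."""
    top = sorted(similarities, reverse=True)[:m]
    return sum(top) / len(top)

def screen_by_similarity(packages, m):
    """Return the indices of the retained packages and the first mean value.

    A package is removed when its top-m mean is less than the mean of all
    top-m means; keeping the equality case is an assumption.
    """
    means = [ave_top(pkg, m) for pkg in packages]
    eta = sum(means) / len(means)
    kept = [i for i, value in enumerate(means) if value >= eta]
    return kept, eta

packages = [[0.9, 0.8, 0.1], [0.5, 0.4, 0.3], [0.2, 0.1, 0.0]]
kept, eta = screen_by_similarity(packages, m=2)
```

With m = 2 the top-m means are roughly 0.85, 0.45, and 0.15, so only the first package exceeds their mean and is retained. Because m similarities are averaged, a single outlying similarity has limited influence on the outcome.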

Steps 1405-1 and 1405-2 may be separately performed to screen the plurality of candidate image packages, or may be jointly performed to screen the plurality of candidate image packages. In this case, step 1405-1 may be performed before step 1405-2, or step 1405-2 may be performed before step 1405-1. This is not limited in this application.

Based on the method embodiment shown in FIG. 14, step 1404 involves querying candidate image packages through the database, and a process for constructing the database is introduced below. FIG. 15 shows a schematic diagram of a database construction framework according to an exemplary embodiment of this application. The method is performed by a use device of an image encoder, or may be performed by another computer device other than the use device of the image encoder. This is not limited in this application.

Descriptions are provided by using one WSI as an example.

First, a WSI 1501 is cropped into a plurality of tissue images 1502. In some embodiments, the cropping method includes the following steps: In a WSI preprocessing phase, a foreground tissue region in the WSI is extracted by using a threshold technology, and then the foreground tissue region in the WSI is cropped into a plurality of tissue images based on a sliding window technology.
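The threshold and sliding window steps can be sketched on a small grayscale grid. The specific rule used here (a pixel darker than `threshold` counts as tissue, and a window is kept when at least `min_fraction` of its pixels are tissue) and all parameter names are assumptions; this application does not specify the threshold technology.

```python
def crop_tissue_windows(gray, window, stride, threshold=200, min_fraction=0.5):
    """Return the (top, left) corners of the sliding windows that cover tissue.

    gray: 2D list of grayscale values, with a bright background assumed.
    """
    height, width = len(gray), len(gray[0])
    kept = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            # Count the foreground (tissue) pixels inside the current window.
            foreground = sum(
                1
                for r in range(top, top + window)
                for c in range(left, left + window)
                if gray[r][c] < threshold
            )
            if foreground / (window * window) >= min_fraction:
                kept.append((top, left))
    return kept

# Toy 4x4 image: left half is dark tissue, right half is bright background.
gray = [[0, 0, 255, 255] for _ in range(4)]
corners = crop_tissue_windows(gray, window=2, stride=2)
```

Only the two windows on the left half are kept, so background regions never reach the image encoder.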

Then, the plurality of tissue images 1502 are inputted into an image encoder 1503, and feature extraction is performed on the plurality of tissue images 1502, to obtain image feature vectors 1505 respectively corresponding to the plurality of tissue images.

Finally, selection on the plurality of tissue images 1502 is performed (that is, small image selection 1506 is performed) based on the plurality of image feature vectors 1505 corresponding to the plurality of tissue images. In some embodiments, small image selection 1506 includes double clustering, where the first clustering is feature-based clustering 1506-1, and the second clustering is coordinate-based clustering 1506-2.

In feature-based clustering 1506-1, the plurality of image feature vectors 1505 of the plurality of tissue images are clustered into $K_1$ categories in a K-means clustering manner, and correspondingly, $K_1$ clustering centers are obtained. FIG. 15 shows a small image corresponding to one of the clustering centers.

In coordinate-based clustering 1506-2, for any one of the $K_1$ categories, a plurality of feature vectors included in the category are clustered into $K_2$ categories in the K-means clustering manner, and correspondingly, $K_2$ clustering centers are obtained. FIG. 15 shows a small image corresponding to one of the clustering centers.

Small images corresponding to the $K_1 \cdot K_2$ clustering centers obtained through double clustering are used as representative small images 1506-3. FIG. 15 shows a small image corresponding to one of the clustering centers.

All representative small images are used as the small images that represent the WSI. Based on this, a plurality of small images of one WSI are constructed.
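The double clustering used for small image selection can be sketched as follows. Several choices here are assumptions made for illustration: the second clustering operates on the coordinates of the small images (as the name coordinate-based clustering suggests), the K-means initialization is deterministic, and the small image nearest to each second-level clustering center is taken as the representative small image.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain K-means; returns (labels, centers)."""
    # Deterministic initialization (an assumption): k evenly spaced input points.
    centers = [list(points[(i * len(points)) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def double_cluster(features, coords, k1, k2):
    """Return the indices of at most k1 * k2 representative small images."""
    labels, _ = kmeans(features, k1)  # first clustering: feature-based
    representatives = set()
    for c in range(k1):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if not idx:
            continue
        sub = [coords[i] for i in idx]
        _, sub_centers = kmeans(sub, min(k2, len(sub)))  # second clustering: coordinate-based
        for center in sub_centers:
            # Representative: the small image closest to the second-level center.
            representatives.add(min(idx, key=lambda i: sq_dist(coords[i], center)))
    return sorted(representatives)

features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
coords = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 0), (0, 1), (5, 5), (5, 6)]
selected = double_cluster(features, coords, k1=2, k2=2)
```

In this toy example two feature groups, each split into two spatial groups, give four representative small images out of eight.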

Based on the above, construction of the database is similar to the process of searching for the WSI with the WSI, which is intended to determine a plurality of small images configured for representing one WSI, to support matching of large images by matching small images in the search process.

In some embodiments, the training idea of the image encoder may further be applied to another image field, and the image encoder is trained through a sample star field image (a small image). The star field image comes from a starry sky image (a large image), and the star field image indicates a local region in the starry sky image. For example, the starry sky image is an image of a first range of starry sky, and the star field image is an image of a subrange within the first range.

The image encoder training phase includes the following steps:

obtaining a first sample star field image and a plurality of second sample star field images, the second sample star field images being negative samples in contrastive learning;
performing data enhancement on the first sample star field image, to obtain a first image; and inputting the first image into a first image encoder, to obtain a first feature vector, the first image being a positive sample in the contrastive learning;
performing data enhancement on the first sample star field image, to obtain a second image; and inputting the second image into a second image encoder, to obtain a second feature vector, the second image being an anchor image in the contrastive learning;
inputting the plurality of second sample star field images into the first image encoder, to obtain feature vectors respectively corresponding to the plurality of second sample star field images; clustering the feature vectors respectively corresponding to the plurality of second sample star field images, to obtain a plurality of clustering centers; and generating a plurality of weights based on similarity values between the plurality of clustering centers and the first feature vector;
generating, based on the first feature vector and the second feature vector, a first sub-function configured for representing an error between the anchor image and the positive sample; based on the second feature vector and the feature vectors respectively corresponding to the plurality of second sample star field images, generating, in combination with the plurality of weights, a second sub-function configured for representing an error between the anchor image and the negative sample; and generating a first weight loss function based on the first sub-function and the second sub-function; and
training the first image encoder and the second image encoder based on the first weight loss function.

The first image encoder is updated based on the second image encoder.

Similarly, an image encoder of the star field image may also adopt another training method similar to the method performed by the image encoder of the sample tissue image. Details are not described herein again.

The image encoder use phase includes the following steps:

obtaining a starry sky image and cropping the starry sky image into a plurality of star field images;
generating, through an image encoder, image feature vectors respectively corresponding to the plurality of star field images;
clustering the image feature vectors respectively corresponding to the plurality of star field images, to determine at least one key image from the plurality of star field images;
querying, based on image feature vectors respectively corresponding to the at least one key image, a database to obtain at least one candidate image package respectively corresponding to the at least one key image, the candidate image package including at least one candidate star field image;
determining, according to an attribute of the at least one candidate image package, at least one target image package from the at least one candidate image package; and
determining a starry sky image to which at least one candidate star field image included in the at least one target image package respectively belongs as a final search result.

In some other embodiments, the training idea of the image encoder may further be applied to the field of geographical images, and the image encoder is trained through a sample terrain image (a small image). The terrain image comes from a landform image (a large image), and the terrain image indicates a local region in the landform image. For example, the landform image is an image of a second range of terrain captured by a satellite, and the terrain image is an image of a subrange within the second range.

The image encoder training phase includes the following steps:

obtaining a first sample terrain image and a plurality of second sample terrain images, the second sample terrain images being negative samples in contrastive learning;
performing data enhancement on the first sample terrain image, to obtain a first image; and inputting the first image into a first image encoder, to obtain a first feature vector, the first image being a positive sample in the contrastive learning;
performing data enhancement on the first sample terrain image, to obtain a second image; and inputting the second image into a second image encoder, to obtain a second feature vector, the second image being an anchor image in the contrastive learning;
inputting the plurality of second sample terrain images into the first image encoder, to obtain feature vectors respectively corresponding to the plurality of second sample terrain images; clustering the feature vectors respectively corresponding to the plurality of second sample terrain images, to obtain a plurality of clustering centers; and generating a plurality of weights based on similarity values between the plurality of clustering centers and the first feature vector;
generating, based on the first feature vector and the second feature vector, a first sub-function configured for representing an error between the anchor image and the positive sample; based on the second feature vector and the feature vectors respectively corresponding to the plurality of second sample terrain images, generating, in combination with the plurality of weights, a second sub-function configured for representing an error between the anchor image and the negative sample; and generating a first weight loss function based on the first sub-function and the second sub-function; and
training the first image encoder and the second image encoder based on the first weight loss function.

The first image encoder is updated based on the second image encoder.

Similarly, an image encoder of the terrain image may also adopt another training method similar to the method performed by the image encoder of the sample tissue image. Details are not described herein again.

The image encoder use phase includes the following steps:

obtaining a landform image and cropping the landform image into a plurality of terrain images;
generating, through an image encoder, image feature vectors respectively corresponding to the plurality of terrain images;
clustering the image feature vectors respectively corresponding to the plurality of terrain images, to determine at least one key image from the plurality of terrain images;
querying, based on image feature vectors respectively corresponding to the at least one key image, a database to obtain at least one candidate image package respectively corresponding to the at least one key image, the candidate image package including at least one candidate terrain image;
determining, according to an attribute of the at least one candidate image package, at least one target image package from the at least one candidate image package; and
determining a landform image to which at least one candidate terrain image included in the at least one target image package respectively belongs as a final search result.

FIG. 16 is a structural block diagram of an image encoder training apparatus according to an exemplary embodiment of this application. The apparatus includes:

an obtaining module 1601, configured to obtain a first sample tissue image and a plurality of second sample tissue images, the second sample tissue images being negative samples in contrastive learning;
a processing module 1602, configured to perform data enhancement on the first sample tissue image, to obtain a first image; and input the first image into a first image encoder, to obtain a first feature vector, the first image being a positive sample in the contrastive learning;
the processing module 1602 being further configured to perform data enhancement on the first sample tissue image, to obtain a second image; and input the second image into a second image encoder, to obtain a second feature vector, the second image being an anchor image in the contrastive learning, and the second image being different from the first image;
the processing module 1602 being further configured to input the plurality of second sample tissue images into the first image encoder, to obtain feature vectors respectively corresponding to the plurality of second sample tissue images; cluster the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers; and generate a plurality of weights based on similarity values between the plurality of clustering centers and the first feature vector;
a generation module 1603, configured to generate, based on the first feature vector and the second feature vector, a first sub-function configured for representing an error between the anchor image and the positive sample; based on the second feature vector and the feature vectors respectively corresponding to the plurality of second sample tissue images, generate, in combination with the plurality of weights, a second sub-function configured for representing an error between the anchor image and the negative sample; and generate a first weight loss function based on the first sub-function and the second sub-function; and
a training module 1604, configured to train the first image encoder and the second image encoder based on the first weight loss function.

In some embodiments, the processing module 1602 is further configured to: when the first sample tissue image belongs to a first sample tissue image in a first training batch, cluster the feature vectors respectively corresponding to the plurality of second sample tissue images, to obtain a plurality of clustering centers corresponding to the first training batch.

In some embodiments, the processing module 1602 is further configured to: when the first sample tissue image belongs to a first sample tissue image in an $n$-th training batch, update a plurality of clustering centers corresponding to an $(n-1)$-th training batch to a plurality of clustering centers corresponding to the $n$-th training batch, $n$ being a positive integer greater than 1.

In some embodiments, the processing module 1602 is further configured to: for a $j$-th clustering center in the plurality of clustering centers in the $(n-1)$-th training batch, update the $j$-th clustering center in the $(n-1)$-th training batch based on a first sample tissue image of a $j$-th category in the $n$-th training batch, to obtain a $j$-th clustering center corresponding to the $n$-th training batch, $j$ being a positive integer.

In some embodiments, values of the weights are in a negative correlation with the similarity values between the clustering centers and the first feature vector; and for the $j$-th clustering center in the plurality of clustering centers, feature vectors included in a category of the $j$-th clustering center correspond to the same weight.

In some embodiments, the processing module 1602 is further configured to input the second image into the second image encoder, to obtain a first intermediate feature vector; and input the first intermediate feature vector into a first multilayer perceptron (MLP), to obtain the second feature vector.

In some embodiments, the training module 1604 is further configured to update a parameter of the first image encoder in a weighting manner according to a parameter of the second image encoder.
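The weighted parameter update can be illustrated with a momentum-style (exponential moving average) rule, which is one common way to update one encoder from another. Whether this application uses exactly this rule is an assumption, as are the function name and the `momentum` coefficient. Parameters are represented as flat lists of floats for simplicity.

```python
def momentum_update(first_params, second_params, momentum=0.99):
    """Blend the second encoder's parameters into the first encoder's parameters."""
    return [
        momentum * p1 + (1.0 - momentum) * p2
        for p1, p2 in zip(first_params, second_params)
    ]

updated = momentum_update([1.0, 0.0], [0.0, 1.0], momentum=0.5)
```

With a momentum close to 1, the first image encoder changes slowly and stays a smoothed copy of the second image encoder.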

In some embodiments, the processing module 1602 is further configured to perform data enhancement on the first sample tissue image, to obtain a third image; and input the third image into a third image encoder, to obtain a third feature vector, the third image being an anchor image in the contrastive learning, and the third image being different from the first image and the second image.

In some embodiments, the generation module 1603 is further configured to generate, based on the first feature vector and the third feature vector, a third sub-function configured for representing an error between the anchor image and the positive sample; based on the third feature vector and the feature vectors respectively corresponding to the plurality of second sample tissue images, generate, in combination with the plurality of weights, a fourth sub-function configured for representing an error between the anchor image and the negative sample; and generate a second weight loss function based on the third sub-function and the fourth sub-function.

In some embodiments, the training module 1604 is further configured to train the first image encoder and the third image encoder based on the second weight loss function.

In some embodiments, the processing module 1602 is further configured to input the third image into the third image encoder, to obtain a second intermediate feature vector; and input the second intermediate feature vector into a second MLP, to obtain the third feature vector.

In some embodiments, the training module 1604 is further configured to update, according to a parameter shared between the second image encoder and the third image encoder, a parameter of the first image encoder in a weighting manner.

Based on the above, weights are assigned to negative samples identified in the related art, and a “negative degree” of the negative samples is further distinguished between the negative samples, so that a loss value used in contrastive learning (also referred to as a contrastive learning paradigm) can more accurately pull anchor images and the negative samples away from each other, and the impact of potential false negative samples is reduced, thereby better training an image encoder. In this way, the image encoder obtained through training can better distinguish different features between the anchor images and the negative samples, and a feature of a small image obtained through the image encoder can better represent the small image.
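The effect of weighting the negative samples can be illustrated with an InfoNCE-style loss. This passage does not give the loss formula, so the sketch is an assumption throughout: the linear mapping from similarity to weight, the temperature `tau`, and all function names are hypothetical. It only shows how one weight per cluster, decreasing with the similarity between the clustering center and the positive sample, reduces the contribution of likely false negatives.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cluster_weights(positive, centers):
    """One weight per clustering center, decreasing with similarity to the positive.

    Hypothetical mapping: the weight falls linearly from 1 to 0 as the cosine
    similarity rises from -1 to 1.
    """
    return [1.0 - (cosine(positive, c) + 1.0) / 2.0 for c in centers]

def weighted_contrastive_loss(anchor, positive, negatives, neg_cluster, weights, tau=0.07):
    """InfoNCE-style loss in which each negative term is scaled by its cluster weight."""
    pos_term = math.exp(cosine(anchor, positive) / tau)  # anchor versus positive sample
    neg_term = sum(  # anchor versus weighted negative samples
        weights[neg_cluster[i]] * math.exp(cosine(anchor, n) / tau)
        for i, n in enumerate(negatives)
    )
    return -math.log(pos_term / (pos_term + neg_term))

weights = cluster_weights((1.0, 0.0), [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)])
full = weighted_contrastive_loss((1.0, 0.0), (1.0, 0.0), [(0.0, 1.0)], [0], [1.0])
half = weighted_contrastive_loss((1.0, 0.0), (1.0, 0.0), [(0.0, 1.0)], [0], [0.5])
```

All negatives assigned to the same cluster share one weight, so a cluster whose center lies close to the positive sample is pushed away less strongly than a clearly dissimilar cluster.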

FIG. 17 is a structural block diagram of an image encoder training apparatus according to an exemplary embodiment of this application. The apparatus includes:

an obtaining module 1701, configured to obtain a first sample tissue image;
a processing module 1702, configured to perform data enhancement on the first sample tissue image, to obtain a second image; and input the second image into a second image encoder, to obtain a fourth feature vector;
the processing module 1702 being further configured to perform data enhancement on the first sample tissue image, to obtain a third image; and input the third image into a third image encoder, to obtain a fifth feature vector;
a determining module 1703, configured to determine the fourth feature vector as a contrastive vector for contrastive learning, and determine the fifth feature vector as an anchor vector for the contrastive learning;
a clustering module 1704, configured to cluster a plurality of fourth feature vectors of different first sample tissue images, to obtain a plurality of first clustering centers; determine a feature vector that is in the plurality of first clustering centers and that has a maximum similarity value with the fifth feature vector as a positive sample vector in the plurality of fourth feature vectors; and determine a first remaining feature vector as a negative sample vector in the plurality of fourth feature vectors, where the first remaining feature vector is a feature vector other than the feature vector having the maximum similarity value with the fifth feature vector in the plurality of fourth feature vectors;
a generation module 1705, configured to generate a fifth sub-function based on the fifth feature vector and the positive sample vector in the plurality of fourth feature vectors; generate a sixth sub-function based on the fifth feature vector and the negative sample vector in the plurality of fourth feature vectors; and generate a first group loss function based on the fifth sub-function and the sixth sub-function; and
a training module 1706, configured to train the second image encoder and the third image encoder based on the first group loss function; and determine the third image encoder as an image encoder obtained through final training.
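The selection of the positive sample vector and the negative sample vectors performed by the clustering module can be sketched as follows. As a simplifying assumption, the candidates are taken to be the first clustering centers themselves: the center most similar to the anchor vector becomes the positive sample vector and the remaining centers become negative sample vectors. The function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def split_positive_negative(anchor, centers):
    """Return (positive, negatives) chosen among the clustering centers."""
    sims = [cosine(anchor, c) for c in centers]
    best = max(range(len(centers)), key=sims.__getitem__)  # maximum similarity value
    negatives = [c for i, c in enumerate(centers) if i != best]
    return centers[best], negatives

positive, negatives = split_positive_negative(
    (1.0, 0.0), [(0.0, 1.0), (0.8, 0.6), (-1.0, 0.0)]
)
```

The positive and negative vectors selected in this way would then feed the two sub-functions from which the group loss function is generated.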

In some embodiments, the processing module 1702 is further configured to input the second image into the second image encoder, to obtain a first intermediate feature vector; and input the first intermediate feature vector into a third MLP, to obtain the fourth feature vector.

In some embodiments, the processing module 1702 is further configured to input the third image into the third image encoder, to obtain a second intermediate feature vector; and input the second intermediate feature vector into a fourth MLP, to obtain the fifth feature vector.

In some embodiments, the determining module 1703 is further configured to determine the fifth feature vector as a contrastive vector for contrastive learning, and determine the fourth feature vector as an anchor vector for the contrastive learning.

In some embodiments, the clustering module 1704 is further configured to cluster a plurality of fifth feature vectors of different first sample tissue images, to obtain a plurality of second clustering centers; determine a feature vector that is in the plurality of second clustering centers and that has a maximum similarity value with the fourth feature vector as a positive sample vector in the plurality of fifth feature vectors; and determine a second remaining feature vector as a negative sample vector in the plurality of fifth feature vectors, where the second remaining feature vector is a feature vector other than the feature vector having the maximum similarity value with the fourth feature vector in the plurality of fifth feature vectors.

In some embodiments, the generation module 1705 is further configured to generate a seventh sub-function based on the fourth feature vector and the positive sample vector in the plurality of fifth feature vectors; generate an eighth sub-function based on the fourth feature vector and the negative sample vector in the plurality of fifth feature vectors; and generate a second group loss function based on the seventh sub-function and the eighth sub-function.

In some embodiments, the training module 1706 is further configured to train the second image encoder and the third image encoder based on the second group loss function; and determine the second image encoder as an image encoder obtained through final training.

In some embodiments, the training module 1706 is further configured to update, according to a parameter shared between the second image encoder and the third image encoder, a parameter of the first image encoder in a weighting manner.

FIG. 18 is a structural block diagram of an apparatus for searching for a whole slide image according to an exemplary embodiment of this application. The apparatus includes:

an obtaining module 1801, configured to obtain a whole slide image, and crop the whole slide image into a plurality of tissue images;
a generation module 1802, configured to generate, through an image encoder, image feature vectors respectively corresponding to the plurality of tissue images;
a clustering module 1803, configured to cluster the image feature vectors respectively corresponding to the plurality of tissue images, to determine at least one key image from the plurality of tissue images;
a query module 1804, configured to query, based on image feature vectors respectively corresponding to the at least one key image, a database to obtain at least one candidate image package respectively corresponding to the at least one key image, the candidate image package including at least one candidate tissue image;
a screening module 1805, configured to determine, according to an attribute of the at least one candidate image package, at least one target image package from the at least one candidate image package; and
a determining module 1806, configured to determine a whole slide image to which at least one candidate tissue image included in the at least one target image package respectively belongs as a final search result.

In some embodiments, the clustering module 1803 is further configured to cluster the image feature vectors respectively corresponding to the plurality of tissue images, to obtain a plurality of first class clusters; and determine clustering centers respectively corresponding to the plurality of first class clusters as image feature vectors respectively corresponding to the plurality of key images.
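As an illustration of taking clustering centers as the feature vectors of the key images, the following is a minimal sketch using plain k-means. The choice of k-means, the initialization from the first k vectors, the fixed iteration count, and all function names are assumptions made for the example; this passage does not name a clustering algorithm.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def kmeans_centers(vectors, k, iterations=20):
    """Return k clustering centers for the given image feature vectors.

    Assumption: the initial centers are simply the first k vectors.
    """
    centers = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: squared_distance(v, centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            mean_vector(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers


# Usage: four toy 2-D feature vectors forming two groups.
features = [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]]
key_image_vectors = kmeans_centers(features, k=2)
```

Each returned center then stands in for one key image when the database is queried.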

In some embodiments, the clustering module 1803 is further configured to: for a target first class cluster in the plurality of first class clusters, cluster based on location features of whole slide images to which a plurality of tissue images corresponding to the target first class cluster respectively belong, to obtain a plurality of second class clusters. In this case, determining the clustering centers respectively corresponding to the plurality of first class clusters as the image feature vectors respectively corresponding to the plurality of key images includes: for the target first class cluster in the plurality of first class clusters, determining clustering centers respectively corresponding to the plurality of second class clusters included in the target first class cluster as the image feature vectors of the key images, where the target first class cluster is any one of the plurality of first class clusters.

In some embodiments, the screening module 1805 is further configured to screen the at least one candidate image package according to a quantity of diagnostic categories respectively included in the at least one candidate image package, to obtain the at least one target image package.

In some embodiments, the screening module 1805 is further configured to: for a first candidate image package that is in the at least one candidate image package and that corresponds to a first key image in the at least one key image, calculate an entropy value of the first candidate image package based on a cosine similarity between at least one candidate tissue image in the first candidate image package and the first key image, an occurrence probability of at least one diagnostic category in the database, and a diagnostic category of the at least one candidate tissue image, where the entropy value is configured for measuring a quantity of diagnostic categories corresponding to the first candidate image package, and the first candidate image package is any one of the at least one candidate image package; and screen the at least one candidate image package, to obtain the at least one target image package whose entropy value is less than an entropy value threshold.
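The entropy-based screening can be sketched as follows. This passage names the inputs to the entropy value but not its formula, so the form used here is only one plausible assumption: each candidate tissue image contributes a weight equal to its cosine similarity mapped to [0, 1] and divided by the occurrence probability of its diagnostic category, the weights are summed per category and normalized, and a Shannon entropy (natural logarithm) is taken. The names `package_entropy` and `screen_by_entropy` are hypothetical.

```python
import math


def package_entropy(similarities, categories, category_prob):
    """Entropy of the diagnostic categories in one candidate image package.

    similarities:  cosine similarity of each candidate tissue image to the key image
    categories:    diagnostic category of each candidate tissue image
    category_prob: occurrence probability (> 0) of each category in the database
    Assumed weighting: ((similarity + 1) / 2) / category_prob[category].
    """
    weight_per_category = {}
    for similarity, category in zip(similarities, categories):
        weight = ((similarity + 1.0) / 2.0) / category_prob[category]
        weight_per_category[category] = weight_per_category.get(category, 0.0) + weight
    total = sum(weight_per_category.values())
    if total == 0.0:
        return 0.0
    entropy = 0.0
    for weight in weight_per_category.values():
        p = weight / total
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy


def screen_by_entropy(packages, category_prob, threshold):
    """Keep the packages whose entropy value is less than the threshold."""
    return [
        package_id
        for package_id, (similarities, categories) in packages.items()
        if package_entropy(similarities, categories, category_prob) < threshold
    ]


# Usage: a single-category package has entropy 0 and passes the screening.
probabilities = {"a": 0.5, "b": 0.5}
packages = {
    "pure": ([0.9, 0.8], ["a", "a"]),
    "mixed": ([0.9, 0.9], ["a", "b"]),
}
targets = screen_by_entropy(packages, probabilities, threshold=0.5)
```

Under this assumed form, a lower entropy value indicates that the package is dominated by fewer diagnostic categories, which matches the stated purpose of the entropy value.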

In some embodiments, the screening module 1805 is further configured to screen the at least one candidate image package according to similarities between the at least one candidate tissue image and the key images, to obtain the at least one target image package.

In some embodiments, the screening module 1805 is further configured to: for a first candidate image package that is in the at least one candidate image package and that corresponds to a first key image in the at least one key image, sort the at least one candidate tissue image in the first candidate image package in descending order of cosine similarities with the first key image; obtain first m candidate tissue images of the first candidate image package; calculate cosine similarities respectively corresponding to the first m candidate tissue images; determine a mean value of cosine similarities respectively corresponding to first m candidate tissue images of the plurality of candidate image packages as a first mean value; and determine a candidate image package that includes the at least one candidate tissue image whose mean value of cosine similarities is greater than the first mean value as the target image package, to obtain the at least one target image package, where the first candidate image package is any one of the at least one candidate image package, and m is a positive integer.
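The similarity-based screening can be sketched as below. The sketch assumes that the cosine similarities to the key image are already available for each package, and it computes the first mean value as the mean of the per-package top-m means, which equals the mean over all top-m similarities when every package holds at least m candidate tissue images. The function names and the dictionary layout are hypothetical.

```python
def top_m_mean(similarities, m):
    """Mean of the m largest cosine similarities in one package."""
    top = sorted(similarities, reverse=True)[:m]
    return sum(top) / len(top)


def screen_by_similarity(packages, m):
    """Return the ids of packages whose top-m mean exceeds the first mean value.

    packages: mapping of package id -> list of cosine similarities between the
              package's candidate tissue images and its key image.
    """
    per_package = {
        package_id: top_m_mean(similarities, m)
        for package_id, similarities in packages.items()
        if similarities  # skip empty packages
    }
    if not per_package:
        return []
    first_mean = sum(per_package.values()) / len(per_package)
    return [
        package_id
        for package_id, mean in per_package.items()
        if mean > first_mean
    ]


# Usage: only the package with the higher top-2 mean is kept.
candidate_packages = {
    "p1": [0.9, 0.8, 0.1],
    "p2": [0.5, 0.4, 0.3],
}
target_ids = screen_by_similarity(candidate_packages, m=2)
```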

Based on the above, first, the WSI is cropped into a plurality of small images, and the plurality of small images are inputted into the image encoder to obtain a plurality of image feature vectors of the plurality of small images. The plurality of image feature vectors are clustered, and small images corresponding to clustering centers are used as key images. Next, the key images are queried to obtain candidate image packages. Then, the candidate image packages are screened to obtain the target image package. Finally, a WSI corresponding to at least one small image in the target image package is used as the final search result. The apparatus supports searching for a WSI (a large image) with a WSI (a large image), and the clustering module and the screening module described above can greatly reduce the amount of data processed and improve search efficiency. In addition, the apparatus for searching for the WSI (the large image) with the WSI (the large image) provided in this embodiment does not require a training process, and can achieve fast search and matching.

FIG. 19 is a schematic structural diagram of a computer device according to an exemplary embodiment. The computer device 1900 may be the training device 21 of the image encoder in FIG. 2, or may be the use device 22 of the image encoder in FIG. 2. The computer device 1900 includes a central processing unit (CPU) 1901, a system memory 1904 including a random access memory (RAM) 1902 and a read-only memory (ROM) 1903, and a system bus 1905 connecting the system memory 1904 to the CPU 1901. The computer device 1900 further includes a basic input/output system (I/O system) 1906 configured to transmit information between components in the computer device, and a mass storage device 1907 configured to store an operating system 1913, an application 1914, and other program modules 1915.

The basic I/O system 1906 includes a display 1908 configured to display information and an input device 1909, such as a mouse or a keyboard, that is configured for a user to input information. The display 1908 and the input device 1909 are both connected to the CPU 1901 by an input/output (I/O) controller 1910 connected to the system bus 1905. The basic I/O system 1906 may further include the I/O controller 1910 for receiving and processing an input from a plurality of other devices such as a keyboard, a mouse, or an electronic stylus. Similarly, the I/O controller 1910 further provides an output to a display screen, a printer, or another type of output device.

The mass storage device 1907 is connected to the CPU 1901 by using a mass storage controller (not shown) connected to the system bus 1905. The mass storage device 1907 and a computer device-readable medium associated with the mass storage device provide non-volatile storage to the computer device 1900. That is, the mass storage device 1907 may include a computer device-readable medium (not shown) such as a hard disk or a compact disc ROM (CD-ROM) drive.

In general, the computer device-readable medium may include a computer device storage medium and a communication medium. The computer device storage medium includes volatile and non-volatile, removable and non-removable media implemented by using any method or technology for storing information such as computer device-readable instructions, data structures, program modules, or other data. The computer device storage medium includes a RAM, a ROM, an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a CD-ROM, a digital video disc (DVD) or another optical memory, a magnetic cassette, a magnetic tape, a magnetic disk memory, or another magnetic storage device. Certainly, it is known to a person skilled in the art that the computer device storage medium is not limited to the foregoing types. The system memory 1904 and the mass storage device 1907 may be collectively referred to as a memory.

According to the various embodiments of the present disclosure, the computer device 1900 may further be connected, through a network such as the Internet, to a remote computer device on the network for running. That is, the computer device 1900 may be connected to a network 1911 by a network interface unit 1912 connected to the system bus 1905, or may be connected to another type of network or remote computer device system (not shown) by the network interface unit 1912.

The memory further includes one or more programs. The one or more programs are stored in the memory. The central processing unit 1901 executes the one or more programs to implement all or some steps of the image encoder training method described above.

This application further provides a non-transitory computer-readable storage medium, the storage medium storing at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set being loaded and executed by the processor to implement the image encoder training method provided in the foregoing method embodiments.

This application provides a computer program product or a computer program. The computer program product or the computer program includes computer instructions, and the computer instructions are stored in a non-transitory computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, to cause the computer device to perform the image encoder training method provided in the foregoing method embodiments.

The sequence numbers of the foregoing embodiments of this application are merely for description purposes and do not imply any preference among the embodiments. In this application, the term “module” refers to a computer program or part of the computer program that has a predefined function and works together with other related parts to achieve a predefined goal, and may be all or partially implemented by using software, hardware (e.g., processing circuitry and/or memory configured to perform the predefined functions), or a combination thereof. Each module can be implemented using one or more processors (or processors and memory). Likewise, a processor (or processors and memory) can be used to implement one or more modules. Moreover, each module can be part of an overall module that includes the functionalities of the module.

A person of ordinary skill in the art may understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware. The program may be stored in a non-transitory computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing descriptions are merely embodiments of this application, but are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principle of this application shall fall within the protection scope of this application.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F G06F16/535 G06F16/55 G06F16/56 G06V G06V10/26 G06V10/46 G06V10/761 G06V10/762 G06V10/764 G06V10/774 G06V2201/3

Patent Metadata

Filing Date

November 10, 2025

Publication Date

March 12, 2026

Inventors

Sen YANG

Jinxi Xiang

Jun Zhang

Xiao Han
