An example system includes a processor to receive a randomly generated alpha-map, a pair of training images, and a pair of training texts associated with the pair of training images. The processor is to generate a blended image based on the randomly generated alpha-map and the pair of training images. The processor is to train a visual language grounding model to separate the blended image into a pair of heatmaps identifying portions of the blended image corresponding to each of the training images using a separation loss.
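The blending step can be illustrated with a minimal sketch. It assumes the alpha-map holds one weight in [0, 1] per pixel and that blending is a per-pixel convex combination; the abstract specifies neither the random generator nor the blend formula, so `random_alpha_map` and `blend` are hypothetical helpers, and plain nested lists stand in for image tensors.

```python
import random

def random_alpha_map(height, width, rng):
    """Hypothetical generator: one independent uniform weight in [0, 1] per pixel."""
    return [[rng.random() for _ in range(width)] for _ in range(height)]

def blend(image_a, image_b, alpha):
    """Assumed blend rule: alpha * a + (1 - alpha) * b at every pixel."""
    return [
        [al * a + (1.0 - al) * b for a, b, al in zip(row_a, row_b, row_al)]
        for row_a, row_b, row_al in zip(image_a, image_b, alpha)
    ]

rng = random.Random(0)                # fixed seed so the sketch is repeatable
image_a = [[1.0, 1.0], [1.0, 1.0]]    # toy 2x2 grayscale image, all white
image_b = [[0.0, 0.0], [0.0, 0.0]]    # toy 2x2 grayscale image, all black
alpha = random_alpha_map(2, 2, rng)
blended = blend(image_a, image_b, alpha)
```

With an all-white and an all-black image, the blended result equals the alpha-map itself, which makes it easy to see that the alpha-map records how much each pixel of the blend comes from each source image.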
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
2. The system of claim 1, wherein the training texts comprise natural free-form texts.
This claim narrows the system of claim 1 by specifying the form of the training texts. As the abstract summarizes, the system receives a randomly generated alpha-map, a pair of training images, and a pair of training texts associated with those images, blends the images, and trains a visual language grounding model to separate the blend into one heatmap per image. Claim 2 adds that the training texts are natural free-form texts: ordinary language descriptions rather than entries from a fixed vocabulary or a rigid template. The practical consequence is that the model learns to ground unconstrained phrases in image regions instead of a closed set of labels.
3. The system of claim 1, wherein the visual language grounding model comprises an encoder to generate image encodings based on the alpha-map and the pair of training images, a text conditioner to generate a plurality of text attenuated image encodings based on the image encodings and the pair of training texts, and a decoder to convert the text attenuated image encodings into heatmaps.
This claim describes the internal structure of the visual language grounding model, which localizes the image regions that a text refers to. The model comprises an encoder, a text conditioner, and a decoder. The encoder generates image encodings based on the pair of training images and the alpha-map, that is, the randomly generated map used to blend the two images. The text conditioner combines these encodings with the pair of training texts and produces a plurality of text attenuated image encodings, in which the image features are modulated according to the texts. Finally, the decoder converts the text attenuated image encodings into heatmaps that highlight the regions of the image most relevant to each text. This arrangement is useful in applications where precise localization of textual concepts within images matters, such as automated image annotation or interactive visual search.
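The data flow from image encodings through the text conditioner to the decoder can be sketched as a toy example. Everything concrete here is an assumption made for illustration: the claim does not say how attenuation works, so the sketch uses a per-channel gate vector for each text, and the decoder is a simple channel sum followed by a sigmoid. `attenuate`, `decode`, and the gate values are hypothetical.

```python
import math

def attenuate(encodings, text_gate):
    """Scale each channel of every pixel's encoding by the text's gate value."""
    return [[[f * g for f, g in zip(pixel, text_gate)] for pixel in row]
            for row in encodings]

def decode(encodings):
    """Toy decoder: sum the channels per pixel and squash to (0, 1) with a sigmoid."""
    return [[1.0 / (1.0 + math.exp(-sum(pixel))) for pixel in row]
            for row in encodings]

# Toy encoding of a 1x2 image with 2 feature channels per pixel.
encodings = [[[4.0, -4.0], [-4.0, 4.0]]]
gate_text_1 = [1.0, 0.0]  # hypothetical gate: text 1 keeps only channel 0
gate_text_2 = [0.0, 1.0]  # hypothetical gate: text 2 keeps only channel 1

# One text attenuated encoding, and therefore one heatmap, per text.
heatmaps = [decode(attenuate(encodings, gate))
            for gate in (gate_text_1, gate_text_2)]
```

In this toy setup the first heatmap is high on the left pixel and low on the right, and the second heatmap is the reverse, which mirrors the idea that the same image encodings yield different heatmaps depending on the conditioning text.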
4. The system of claim 3, wherein the text conditioner comprises a Bidirectional Encoder Representations from Transformers (BERT) model.
This claim narrows the text conditioner of claim 3. The text conditioner, which takes the image encodings and the pair of training texts and produces text attenuated image encodings, comprises a Bidirectional Encoder Representations from Transformers (BERT) model. BERT is a transformer-based language model that encodes text into contextualized embeddings, so that each word's representation depends on the words around it in both directions. In this system those embeddings supply the textual signal used to condition the image encodings before the decoder converts them into heatmaps.
5. The system of claim 4, wherein the text conditioner comprises a plurality of projection modules coupled to the BERT model.
This claim further narrows the text conditioner of claim 4. In addition to the BERT model, the text conditioner comprises a plurality of projection modules coupled to the BERT model. The claim does not state what the projection modules compute. A plausible reading is that they map the BERT text embeddings into the form needed to modulate the image encodings, but that detail is an interpretation rather than claim language.
6. The system of claim 1, wherein the visual language grounding model is trained using an unconditioned adversary loss.
This claim adds a training objective to the system of claim 1. In addition to the separation loss named in the abstract, the visual language grounding model is trained using an unconditioned adversary loss. The related method claim 13 states the purpose of this loss: to decrease overfitting on artifacts. The claim does not specify the adversary's architecture or define "unconditioned"; one plausible reading is that the adversary is not conditioned on the training texts.
7. The system of claim 1, comprising a separately trained detector-based weak supervised grounding network, wherein the separately trained detector-based WSG network is to generate bounding boxes scores based on a received image and the trained visual language grounding model is to generate a first heatmap based on the received image, wherein the bounding box scores are converted to a second heatmap using assignment of the bounding box scores to pixels of the bounding box, and wherein the first heatmap and the second heatmap are averaged to generate a combined heatmap.
This invention relates to visual language grounding, a technology that aligns textual descriptions with corresponding regions in images. The problem addressed is improving the accuracy of identifying and localizing objects or regions in images based on natural language queries. Traditional methods often struggle with precise localization due to limitations in combining visual detection with language understanding. The system includes a detector-based weakly supervised grounding (WSG) network and a visual language grounding model. The WSG network is separately trained to analyze an input image and generate bounding box scores, which indicate potential regions of interest. The visual language grounding model processes the same image to produce a first heatmap, highlighting areas relevant to the given language query. The bounding box scores are converted into a second heatmap by assigning scores to the pixels within each bounding box. These two heatmaps are then averaged to create a combined heatmap, which provides a more accurate and refined localization of the queried region. This approach leverages the strengths of both detection-based and language-grounded methods to enhance precision in visual grounding tasks. The system is particularly useful in applications requiring precise image region identification, such as autonomous systems, augmented reality, and image retrieval.
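The fusion described above can be sketched in a few lines. The claim says only that bounding box scores are assigned to the pixels of each box and that the two heatmaps are averaged; how overlapping boxes are resolved is not stated, so keeping the maximum score is an assumption here, as are the helper names and the box format.

```python
def boxes_to_heatmap(boxes, height, width):
    """Assign each box's score to the pixels it covers.

    Boxes are ((x0, y0, x1, y1), score) with exclusive x1 and y1.
    Overlapping boxes keep the maximum score (an assumption).
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for (x0, y0, x1, y1), score in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                heatmap[y][x] = max(heatmap[y][x], score)
    return heatmap

def average_heatmaps(first, second):
    """Pixel-wise mean of two heatmaps of equal size."""
    return [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first, second)]

# Toy inputs: a flat heatmap from the grounding model and two scored boxes
# standing in for the output of the detector-based WSG network.
first = [[0.2] * 3 for _ in range(3)]
boxes = [((0, 0, 2, 2), 0.8), ((1, 1, 3, 3), 0.4)]
second = boxes_to_heatmap(boxes, 3, 3)
combined = average_heatmaps(first, second)
```

Pixels covered by the high-scoring box end up with the largest combined values, while pixels outside every box are pulled toward half of the grounding model's value.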
10. The computer-implemented method of claim 8, wherein training the visual language grounding model comprises calculating a separation loss for each of the pair of training images as a main training objective.
This claim narrows the method of claim 8 by naming the main training objective. Training the visual language grounding model comprises calculating a separation loss for each of the pair of training images. As the abstract describes, the model receives an image blended from the two training images according to a randomly generated alpha-map and must separate it into a pair of heatmaps, each identifying the portions of the blended image that correspond to one training image. The separation loss scores that separation, with one loss term computed per training image. The claim does not specify the functional form of the loss. Because the alpha-map is generated by the system itself, the regions contributed by each image are known during training, which suggests that the target can be derived without manual region annotations.
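One way a per-image separation loss could look is sketched below. This is not the patent's formula, which the claim leaves unspecified. The sketch assumes the target for the first image's heatmap is the alpha-map, the target for the second is its complement, and the distance is mean squared error; all function names are hypothetical.

```python
def mean_squared_error(pred, target):
    """Average squared difference over all pixels."""
    diffs = [(p - t) ** 2
             for row_p, row_t in zip(pred, target)
             for p, t in zip(row_p, row_t)]
    return sum(diffs) / len(diffs)

def separation_loss(heatmap_a, heatmap_b, alpha):
    """One term per training image: heatmap A vs alpha, heatmap B vs (1 - alpha)."""
    inverse_alpha = [[1.0 - a for a in row] for row in alpha]
    loss_a = mean_squared_error(heatmap_a, alpha)
    loss_b = mean_squared_error(heatmap_b, inverse_alpha)
    return loss_a + loss_b

alpha = [[1.0, 0.0], [0.5, 0.25]]
inverse = [[0.0, 1.0], [0.5, 0.75]]

perfect = separation_loss(alpha, inverse, alpha)   # heatmaps match their targets
swapped = separation_loss(inverse, alpha, alpha)   # heatmaps assigned to the wrong image
```

Under these assumptions a perfect separation gives zero loss, and swapping the two heatmaps gives a strictly positive loss, so minimizing it pushes each heatmap toward its own image's share of the blend.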
11. The computer-implemented method of claim 10, wherein training the visual language grounding model comprises calculating an image-to-text loss for text and image feature distribution alignment.
This invention relates to training visual language grounding models, which align visual and textual representations to enable tasks like image captioning or visual question answering. The challenge addressed is ensuring that the model accurately maps between images and their corresponding textual descriptions, which is critical for applications requiring cross-modal understanding. The method involves training a visual language grounding model by calculating an image-to-text loss. This loss function measures the discrepancy between the distributions of image features and text features, ensuring they are properly aligned in a shared embedding space. The alignment process helps the model learn meaningful correspondences between visual and linguistic representations, improving its ability to generate accurate text descriptions for images or answer questions about visual content. The training process may include extracting features from both images and text, then optimizing the model to minimize the image-to-text loss. This ensures that the learned representations are semantically consistent across modalities. The method may also incorporate additional techniques, such as contrastive learning or attention mechanisms, to further refine the alignment between visual and textual data. The result is a model that effectively bridges the gap between images and language, enabling robust cross-modal applications.
12. The computer-implemented method of claim 8, wherein training the visual language grounding model comprises calculating a negative texts loss based on a third received training text that is unrelated to the pair of training images.
This invention relates to training a visual language grounding model, which aligns visual and textual representations to enable tasks like image captioning or visual question answering. The problem addressed is improving model robustness by incorporating negative text samples that are unrelated to the input images, reducing false associations between irrelevant text and visual content. The method involves training a visual language grounding model using pairs of training images and corresponding training text. During training, a third, unrelated training text is received, which is not associated with the image pair. A negative text loss is calculated based on this unrelated text to penalize incorrect alignments between the visual and textual representations. This loss function helps the model distinguish between relevant and irrelevant text, improving its ability to ground language in visual context accurately. The model is trained by processing the image pair and the corresponding training text to generate visual and textual embeddings. The unrelated text is also processed to generate a textual embedding. The negative text loss is computed by comparing the similarity between the visual embeddings and the unrelated text embedding, ensuring the model learns to minimize incorrect associations. This approach enhances the model's performance in tasks requiring precise visual-language alignment, such as image captioning, visual question answering, and cross-modal retrieval.
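A negative texts loss could, for instance, penalize any heatmap response produced for the unrelated text. The sketch below assumes the model outputs a heatmap when conditioned on the third text and that the desired output is all zeros, measured by mean absolute value. The claim does not give the formula, so this is only one possible form, and `negative_text_loss` is a hypothetical name.

```python
def negative_text_loss(heatmap_for_negative_text):
    """Mean absolute activation: zero only when the unrelated text lights up nothing."""
    values = [v for row in heatmap_for_negative_text for v in row]
    return sum(abs(v) for v in values) / len(values)

silent = [[0.0, 0.0], [0.0, 0.0]]     # model correctly ignores the unrelated text
confused = [[0.5, 0.5], [0.5, 0.5]]   # model localizes the unrelated text everywhere

loss_silent = negative_text_loss(silent)
loss_confused = negative_text_loss(confused)
```

A term like this would be added to the other training losses so that the model is rewarded for producing an empty heatmap when the text matches neither image in the blend.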
13. The computer-implemented method of claim 8, wherein training the visual language grounding model comprises calculating an unconditioned adversary loss to decrease overfitting on artifacts.
This claim adds a training objective to the method of claim 8. Training the visual language grounding model comprises calculating an unconditioned adversary loss, and the claim states its purpose: to decrease overfitting on artifacts. The claim does not say which artifacts are meant. Since the model is trained on synthetically blended images, a plausible reading is that they are blending artifacts that a model could exploit to separate the images without relying on the texts. Reducing reliance on such cues should improve generalization to ordinary, unblended images. The claim likewise leaves the adversary's architecture and the exact meaning of "unconditioned" unspecified.
17. The computer program product of claim 15, further comprising program code executable by the processor to calculate a separation loss for each of the pair of training images as a main training objective.
This claim is the computer program product counterpart of method claim 10. The program code of claim 15 is extended with code, executable by the processor, that calculates a separation loss for each of the pair of training images as the main training objective. As the abstract describes, the two training images are blended according to a randomly generated alpha-map, and the visual language grounding model is trained to separate the blended image into a pair of heatmaps, each identifying the portions of the blend that correspond to one training image. The separation loss scores that separation, with one loss term per training image. The claim does not specify the functional form of the loss.
18. The computer program product of claim 15, further comprising program code executable by the processor to calculate an image-to-text loss for text and image feature distribution alignment.
The invention relates to a computer program product for aligning text and image feature distributions in machine learning models, particularly for tasks involving multimodal data processing. The core problem addressed is the misalignment between text and image representations in multimodal models, which can degrade performance in applications like image captioning, visual question answering, or cross-modal retrieval. The solution involves a method to train a model by calculating an image-to-text loss that measures the discrepancy between the learned distributions of text and image features. This loss function is designed to enforce consistency between the two modalities, ensuring that the model generates text representations that are semantically aligned with corresponding image features. The approach may include extracting features from both text and images, computing statistical measures of their distributions, and optimizing the model to minimize the divergence between these distributions. By incorporating this loss function during training, the model improves its ability to generate coherent and contextually relevant text for given images or vice versa. The invention is particularly useful in applications requiring precise alignment between textual and visual data, such as automated content generation or multimodal search systems.
19. The computer program product of claim 15, further comprising program code executable by the processor to calculate a negative texts loss based on a third received training text that is unrelated to the pair of training images.
This claim is the computer program product counterpart of method claim 12. The program code of claim 15 is extended with code, executable by the processor, that calculates a negative texts loss based on a third received training text that is unrelated to the pair of training images. Because the third text describes neither image, it serves as a negative example: the model should not localize it anywhere in the blended image. Penalizing responses to unrelated text teaches the model to distinguish relevant from irrelevant text and reduces false associations between text and image regions. The claim does not specify the functional form of the loss.
20. The computer program product of claim 15, further comprising program code executable by the processor to calculate an unconditioned adversary loss to decrease overfitting on artifacts.
This claim is the computer program product counterpart of method claim 13. The program code of claim 15 is extended with code, executable by the processor, that calculates an unconditioned adversary loss to decrease overfitting on artifacts. Overfitting on artifacts means the model learns cues that exist only in the training data rather than the correspondence between text and image regions, which hurts performance on unseen images. The claim does not describe the adversary's architecture, how the loss is computed, or what "unconditioned" excludes; it states only that the loss is calculated and what it is for.
August 26, 2021
April 9, 2024