News Release

Adequate alignment and interaction for cross-modal retrieval

Peer-Reviewed Publication

Beijing Zhongke Journal Publishing Co. Ltd.

General architecture of the proposed framework


hv(·) and ht(·) represent the image encoder and text encoder, respectively, which are used to extract features from images and texts; their corresponding momentum encoders are employed to provide rich negative samples. The extracted features are then fed into GPO aggregators to obtain the holistic embeddings. gv(·) and gt(·) denote the GPO aggregators for the image and text modalities, respectively, each paired with a corresponding momentum aggregator. To learn adequate alignment relationships between different modalities, we aligned the aggregated features in the alignment module, which contains three objectives: image-text contrastive learning (ITC), intra-modal separability (IMS), and local mutual information maximization (LMIM). Finally, we incorporated a multimodal fusion encoder hf(·) at the end of our model to explore the interaction information between different modalities. Details of the image encoder, text encoder, and fusion encoder are described on the right side. A simplified code sketch of this pipeline is given below the figure credit.


Credit: Beijing Zhongke Journal Publishing Co. Ltd.
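
The caption above describes the overall pipeline: modality-specific encoders, momentum copies that supply negative samples, GPO aggregation into holistic embeddings, alignment objectives, and a fusion encoder for cross-modal interaction. The PyTorch skeleton below is only a minimal sketch of that structure; the class names, feature dimensions, and the use of mean pooling in place of the learned GPO aggregator are illustrative assumptions, not the authors' implementation.

```python
# Minimal structural sketch (assumed, not the authors' code) of the framework
# in the figure: encoders h_v/h_t, their momentum copies, aggregators g_v/g_t,
# and a fusion encoder h_f. Mean pooling stands in for the learned GPO.
import copy
import torch
import torch.nn as nn


class MeanPoolAggregator(nn.Module):
    """Placeholder for GPO: averages local features into a holistic embedding."""
    def forward(self, x):                  # x: (batch, num_locals, dim)
        return x.mean(dim=1)               # -> (batch, dim)


class CrossModalFramework(nn.Module):
    def __init__(self, dim=256, momentum=0.995):
        super().__init__()
        # h_v / h_t: image and text encoders (linear layers stand in for
        # e.g. a vision transformer and a BERT-style text encoder).
        self.h_v = nn.Linear(dim, dim)
        self.h_t = nn.Linear(dim, dim)
        # g_v / g_t: aggregators that pool local features into one embedding.
        self.g_v = MeanPoolAggregator()
        self.g_t = MeanPoolAggregator()
        # Momentum copies of the encoders provide rich negative samples.
        self.h_v_m = copy.deepcopy(self.h_v)
        self.h_t_m = copy.deepcopy(self.h_t)
        # h_f: multimodal fusion encoder (a cross-attention block).
        self.h_f = nn.TransformerDecoderLayer(d_model=dim, nhead=4,
                                              batch_first=True)
        self.momentum = momentum

    @torch.no_grad()
    def momentum_update(self):
        """Exponential-moving-average update of the momentum encoders."""
        for online, target in ((self.h_v, self.h_v_m), (self.h_t, self.h_t_m)):
            for p, p_m in zip(online.parameters(), target.parameters()):
                p_m.data.mul_(self.momentum).add_(p.data, alpha=1 - self.momentum)

    def forward(self, image_feats, text_feats):
        # Local features: (batch, regions, dim) and (batch, tokens, dim).
        v_local, t_local = self.h_v(image_feats), self.h_t(text_feats)
        # Holistic embeddings used by the alignment objectives (ITC/IMS/LMIM).
        v_global, t_global = self.g_v(v_local), self.g_t(t_local)
        # Fusion encoder: text tokens attend to image regions.
        fused = self.h_f(t_local, v_local)
        return v_global, t_global, fused
```

In a full system, the momentum aggregators would be updated in the same exponential-moving-average fashion, and the alignment losses would be computed on the holistic embeddings before the fused output is used for matching.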

With the popularization of social networks, different modalities of data such as images, text, and audio are growing rapidly on the Internet. Consequently, cross-modal retrieval has emerged as a fundamental task in various applications and has received significant attention in recent years. The core idea of cross-modal retrieval is to learn an accurate and generalizable alignment between different modalities (e.g., visual and textual data) such that semantically similar objects can be correctly retrieved in one modality with a query from another modality.
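
Once both modalities are mapped into a shared embedding space, retrieval itself reduces to ranking candidates by their similarity to the query embedding. The snippet below is a generic illustration of this step; the function name, embedding dimension, and choice of cosine similarity are assumptions for the example, not details taken from the article.

```python
# Generic retrieval step: rank gallery embeddings (e.g., images) by cosine
# similarity to a query embedding from the other modality (e.g., a caption).
import torch
import torch.nn.functional as F


def retrieve(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """Return the indices of the k gallery items most similar to the query."""
    query = F.normalize(query_emb, dim=-1)       # (dim,)
    gallery = F.normalize(gallery_embs, dim=-1)  # (num_items, dim)
    scores = gallery @ query                     # cosine similarities
    return scores.topk(k).indices


# Example: a text query embedding retrieving from 1000 image embeddings.
text_query = torch.randn(256)
image_gallery = torch.randn(1000, 256)
print(retrieve(text_query, image_gallery, k=5))
```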

This article proposes a novel framework for cross-modal retrieval that performs adequate alignment and interaction on the aggregated features of different modalities to effectively bridge the modality gap. The proposed framework contains two key components: a well-designed alignment module and a novel multimodal fusion encoder. Specifically, we leveraged an image/text encoder to extract a set of features from the input image/text, and maintained a momentum encoder corresponding to the image/text encoder to provide rich negative samples for model training. Inspired by recent work on feature aggregation, we adopted the design of a generalized pooling operator (GPO) to improve the quality of the global representations.

To ensure that the model learns adequately aligned relationships, we introduced an alignment module with three objectives: image-text contrastive learning (ITC), intra-modal separability (IMS), and local mutual information maximization (LMIM). ITC encourages the model to separate unmatched image-text pairs and pull together the embeddings of matched image-text pairs. IMS enables the model to learn representations that can distinguish different objects within the same modality, which alleviates the problem of representation degradation to a certain extent. LMIM encourages the model to maximize the mutual information between the global representation (aggregated features) and local regional features (e.g., image patches or text tokens), aiming to capture information shared among all regions rather than being affected by certain noisy regions.

Finally, to endow the model with the capability to explore the interaction information between different modalities, we incorporated a multimodal fusion encoder at the end of our model to perform interactions between different modalities after cross-modal alignment.
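
As an illustration of the contrastive part of the alignment module, the sketch below implements a generic ITC-style InfoNCE loss in which matched image-text pairs are pulled together and unmatched pairs, including extra negatives drawn from momentum-encoder queues, are pushed apart. The temperature, queue size, and function signature are assumed for the example and are not values reported by the authors; the IMS and LMIM objectives are omitted here.

```python
# Hedged sketch of an image-text contrastive (ITC) objective with
# momentum-queue negatives; hyperparameters are illustrative only.
import torch
import torch.nn.functional as F


def itc_loss(img_emb, txt_emb, img_queue, txt_queue, temperature=0.07):
    """InfoNCE-style loss over a batch plus momentum-queue negatives.

    img_emb, txt_emb:     (batch, dim) embeddings from the online encoders.
    img_queue, txt_queue: (queue_size, dim) embeddings from momentum encoders.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    img_all = torch.cat([img, F.normalize(img_queue, dim=-1)], dim=0)
    txt_all = torch.cat([txt, F.normalize(txt_queue, dim=-1)], dim=0)

    # Similarity of each image to every text candidate, and vice versa.
    logits_i2t = img @ txt_all.t() / temperature   # (batch, batch + queue)
    logits_t2i = txt @ img_all.t() / temperature

    # The matching pair for sample i sits at column i.
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits_i2t, targets) +
                  F.cross_entropy(logits_t2i, targets))


# Example with random embeddings and queues of 4096 momentum negatives.
loss = itc_loss(torch.randn(32, 256), torch.randn(32, 256),
                torch.randn(4096, 256), torch.randn(4096, 256))
print(loss.item())
```

In the full framework, a loss of this kind would be combined with the IMS and LMIM terms described above before the fused representation is used for retrieval.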

