[TOIS 24] Breaking Through the Noisy Correspondence: A Robust Model for Image-Text Matching

DSDMR is a robust cross-modal retrieval framework that effectively handles noisy image-text correspondence through similarity distribution modeling and calibrated similarity learning, achieving state-of-the-art performance on major benchmarks.

Authors

Haitao Shi¹, Meng Liu²*, Xiaoxuan Mu¹, Xuemeng Song¹, Yupeng Hu¹, Liqiang Nie³*

¹ School of Software, Shandong University, Jinan, China
² School of Computer Science and Technology, Shandong Jianzhu University, Jinan, China
³ School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
* Corresponding authors

Links

Paper: ACM DL Link
Code Repository: GitHub

Updates

[04/2026] Initial release of code and documentation.

Introduction

This project is the official implementation of the paper Breaking Through the Noisy Correspondence: A Robust Model for Image-Text Matching, published in ACM TOIS.

Problem Addressed

Figure 1. Examples of noisy correspondence from the Flickr30K, MS-COCO, and Conceptual Captions (CC) datasets. Noisy correspondence was injected artificially into Flickr30K and MS-COCO, whereas the noisy correspondence in CC is authentic.

Traditional image-text matching models are highly sensitive to Noisy Correspondence (misaligned image-text pairs). DSDMR addresses the performance degradation and gradient misguidance caused by these noisy samples in large-scale datasets.

Core Idea

DSDMR enhances noise robustness through Similarity Distribution Modeling. It transforms noise filtering into a parameter estimation problem using a bimodal Gaussian Mixture Model (GMM) to explicitly separate "clean" and "noisy" distributions.
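The idea of treating noise filtering as parameter estimation can be illustrated with a small, self-contained sketch. It fits a two-component 1-D GMM to similarity scores with plain EM and keeps pairs whose posterior for the higher-mean ("clean") component exceeds 0.5. The function names, the 0.5 threshold, and the synthetic scores are assumptions made for illustration; they are not taken from this repository, which may use a different fitting procedure.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_bimodal_gmm(scores, n_iter=100, min_var=1e-6):
    """Fit a two-component 1-D GMM with EM. Returns (weights, means, variances)."""
    lo, hi = min(scores), max(scores)
    # Start the two means at the lower and upper quarter of the score range.
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    spread = max((hi - lo) ** 2 / 16.0, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            p = [weights[k] * gaussian_pdf(s, means[k], variances[k]) for k in range(2)]
            total = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, min_var)
            weights[k] = nk / len(scores)
    return weights, means, variances


def clean_posterior(score, weights, means, variances):
    """Posterior probability that a score comes from the higher-mean component."""
    clean = 0 if means[0] > means[1] else 1
    p = [weights[k] * gaussian_pdf(score, means[k], variances[k]) for k in range(2)]
    total = p[0] + p[1]
    return p[clean] / total if total > 0 else 0.0


# Synthetic scores standing in for image-text similarities (illustrative values).
random.seed(0)
scores = [random.gauss(0.35, 0.03) for _ in range(800)]   # "clean" pairs
scores += [random.gauss(0.15, 0.03) for _ in range(200)]  # "noisy" pairs
weights, means, variances = fit_bimodal_gmm(scores)
kept = [s for s in scores if clean_posterior(s, weights, means, variances) > 0.5]
print(f"kept {len(kept)} of {len(scores)} pairs")
```

In practice the scores would come from the pretrained model rather than from a random generator, and the threshold would be a tunable choice.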

Key Characteristics

Separation Mechanism: Effectively distinguishes reliable pairs from noisy ones.
Dynamic Optimization: Features a specialized loss function that adjusts margins to enhance distribution separability.
Plug-and-Play Framework: As a post-processing module, it can be seamlessly integrated into any pretrained cross-modal model (e.g., CLIP) without modifying the original architecture.
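One way to read the plug-and-play property: the filtering step only needs one similarity score per image-text pair, so any encoder that produces image and text embeddings could supply its input. The sketch below computes paired cosine similarities from precomputed embeddings. Plain Python lists stand in for encoder outputs, and both function names are hypothetical rather than part of this repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention for a zero vector: treat as no similarity.
    return dot / (norm_u * norm_v)


def paired_similarities(image_embeddings, text_embeddings):
    """Similarity of the i-th image with the i-th text, for every annotated pair."""
    if len(image_embeddings) != len(text_embeddings):
        raise ValueError("expected one text embedding per image embedding")
    return [cosine_similarity(i, t) for i, t in zip(image_embeddings, text_embeddings)]


# Toy embeddings; real ones would come from a pretrained model such as CLIP.
images = [[1.0, 0.0], [1.0, 1.0]]
texts = [[2.0, 0.0], [0.0, 3.0]]
print(paired_similarities(images, texts))
```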

Highlights

Similarity Distribution Modeling: Transforms noise filtering into a parameter estimation problem of a bimodal GMM.
DSDMR Loss Function: Dynamically adjusts margins to enhance the separability of distributions and mitigate gradient misguidance.
Plug-and-Play: Seamlessly integrated into any pretrained cross-modal model (e.g., CLIP).
State-of-the-Art: Demonstrates superior robustness under various noise rates across three major benchmarks.

Method / Framework

Figure 2. Schematic representation of our proposed methodology. (1) Similarity Sampling: Leveraging the CLIP model, image-text similarity samples are obtained from pairs exhibiting noisy correspondence. (2) Noisy Correspondence Filtering: A bimodal GMM segregates similarity samples into "clean" and "noisy" distributions, facilitating the filtering out of noisy correspondence. (3) DSDMR Loss: By discerning the data distribution, the DSDMR loss dynamically modulates the margin, further mitigating the detrimental influence of noisy correspondence.
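To make step (3) concrete, here is a minimal sketch of a hinge-based matching loss whose margin depends on how likely a pair is to be clean. The scaling rule (margin proportional to the clean probability), the `max_margin` value, and all function names are assumptions chosen for illustration. This is not the DSDMR loss as defined in the paper; it only shows the general pattern of a distribution-aware margin.

```python
def adaptive_margin(clean_prob, max_margin=0.2):
    """Illustrative rule: the margin shrinks as the pair looks less likely to be clean."""
    if not 0.0 <= clean_prob <= 1.0:
        raise ValueError("clean_prob must lie in [0, 1]")
    return max_margin * clean_prob


def hinge_loss(pos_sim, neg_sim, margin):
    """Triplet-style hinge: penalize negatives within `margin` of the positive."""
    return max(0.0, margin - pos_sim + neg_sim)


def batch_loss(pos_sims, neg_sims, clean_probs, max_margin=0.2):
    """Mean hinge loss over a batch, with one margin per pair."""
    losses = [
        hinge_loss(p, n, adaptive_margin(c, max_margin))
        for p, n, c in zip(pos_sims, neg_sims, clean_probs)
    ]
    return sum(losses) / len(losses)


# Two pairs with identical similarities: one judged clean, one judged noisy.
print(batch_loss([0.5, 0.5], [0.4, 0.4], [1.0, 0.0]))
```

With this rule, a pair judged noisy receives a margin near zero, so it pushes the embeddings apart less strongly than a pair judged clean. That is one simple way to limit the influence of misleading gradients.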

Experimental Results

Table 3. Performance Comparison Between the Proposed Method and the State-of-the-Art Baselines at Different Noise Rates on Flickr30K and MS-COCO 1K.

Table 4. Performance Comparison Between Our Proposed Method and the State-of-the-Art Baselines on CC152K.

Table 5. Ablation Studies of Our Model on the MS-COCO Dataset with 20% Noise.

Figure 3. Comparison between initial and post-filtered noise rates across various noise rate settings for the Flickr30K and MS-COCO datasets. Blue indicates the initial noise rate, while green signifies the noise rate after the filtering process.

Figure 4. Illustration of similarity distributions before and after noise filtering on the Flickr30K dataset at noise rates of 20% and 40%. The left represents the distribution before filtering, the middle represents the distribution after filtering with a fixed margin loss, and the right represents the distribution after filtering with DSDMR loss. The red delineates the noisy distribution, while the blue depicts the clean distribution.

Project Structure

.
├── CLIPFinetune/          # Scripts and modules for CLIP fine-tuning
├── NCR/                   # Implementation of Noisy Correspondence Robustness modules
├── assets/                # Framework diagrams and visualization results
├── cc152k/                # Dataset-specific processing or configuration for CC152K
├── changedCLIPmodal/      # Modified CLIP model architectures
├── test CLIP/             # Testing and evaluation scripts for CLIP-based models
├── LICENSE
├── README.md
└── __init__.py

Installation

1. Clone the repository

git clone https://github.com/shinian-023/DSDMR.git
cd DSDMR

2. Create environment

conda create -n dsdmr python=3.8 -y
conda activate dsdmr

3. Install dependencies

pip install -r requirements.txt

Usage

Training

# Example training command
python CLIPFinetune/train.py

Evaluation

# Example evaluation command
python "test CLIP/eval.py"

Citation

@article{DSDMR_TOIS2024,
  author    = {Haitao Shi and
               Meng Liu and
               Xiaoxuan Mu and
               Xuemeng Song and
               Yupeng Hu and
               Liqiang Nie},
  title     = {Breaking Through the Noisy Correspondence: {A} Robust Model for Image-Text Matching},
  journal   = {{ACM} Trans. Inf. Syst.},
  volume    = {42},
  number    = {6},
  pages     = {149:1--149:26},
  year      = {2024}
}

Acknowledgement

Thanks to our supervisors and collaborators for their valuable support.
Thanks to the open-source community for providing useful baselines and cross-modal tools.

License

This project is released under the Apache License 2.0.
