† Corresponding author
Accepted by ICASSP 2026: A novel contextualized network tackling the neglect of contextual information in Composed Image Retrieval (CIR) by amplifying similarity differences between matching and non-matching samples.
HINT (dual-patH composItional coNtextualized neTwork) is our proposed framework for Composed Image Retrieval (CIR), accepted by ICASSP 2026. Although existing methods have made significant progress, they often neglect contextual information when discriminating matching samples. To capture implicit cross-modal dependencies and supply the missing mechanism for amplifying similarity differences, HINT systematically models contextual structure, raising the performance ceiling of CIR models in complex scenarios.
- [2026-03-26] 🎉 Initial setup of the HINT repository. Source code is scheduled for release in April 2026.
- [2026-01-18] 🔥 Our paper "HINT: COMPOSED IMAGE RETRIEVAL WITH DUAL-PATH COMPOSITIONAL CONTEXTUALIZED NETWORK" has been accepted by ICASSP 2026!
- 🧠 Dual Context Extraction (DCE): Extracts both intra-modal context and cross-modal context, enhancing the joint semantic representation by integrating multimodal contextual information.
- 📊 Quantification of Contextual Relevance (QCR): Evaluates the relevance between cross-modal contextual information and the target image semantics, enabling the quantification of implicit dependencies.
- 🛡️ Dual-Path Consistency Constraints (DPCC): Optimizes the training process by constraining the representation consistency between multimodal fusion features and the target, ensuring a stable increase in similarity for matching instances while lowering the similarity for non-matching instances.
- 🚀 Outstanding Performance: Achieves competitive results on major metrics across two CIR benchmark datasets, FashionIQ and CIRR, demonstrating strong cross-domain generalization ability.
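The DPCC idea of raising similarity for matching pairs while suppressing it for non-matching pairs follows the standard contrastive pattern. Below is a minimal NumPy sketch of a symmetric, dual-direction InfoNCE-style loss under generic assumptions (unit-normalized embeddings, in-batch negatives); all names are illustrative and this is not the paper's actual formulation:

```python
import numpy as np

def info_nce(a, b, tau=0.07):
    """InfoNCE-style loss: row i of `a` should match row i of `b`;
    every other row in the batch serves as a negative."""
    sim = (a @ b.T) / tau  # (B, B) scaled cosine similarities
    # Row-wise log-softmax; the diagonal holds the matching pairs.
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))

def dual_path_loss(fusion, target, tau=0.07):
    # Symmetric constraint over both retrieval directions.
    return 0.5 * (info_nce(fusion, target, tau) + info_nce(target, fusion, tau))

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 8))
f /= np.linalg.norm(f, axis=1, keepdims=True)  # unit-normalize embeddings

matched = dual_path_loss(f, f)                         # fusion aligned with its target
mismatched = dual_path_loss(f, np.roll(f, 1, axis=0))  # targets shuffled within batch
```

As expected for a contrastive objective, the loss is near zero when each fusion feature aligns with its own target and grows large when targets are shuffled.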
Table of Contents

- 📖 Introduction
- 📢 News
- ✨ Key Features
- 🏗️ Architecture
- 🏃 Experiment Results
- 📦 Install
- 📁 Data Preparation
- 🚀 Quick Start
- 📂 Project Structure
- 🤝 Acknowledgement
- ✉️ Contact
- 🔗 Related Projects
- 📝 Citation
1. Clone the repository

```bash
git clone https://github.com/zh-mingyu.github.io/HINT.git
cd HINT
```

2. Setup Python Environment
The code is evaluated on Python 3.8.10 and CUDA 12.6. We recommend using Anaconda:
```bash
conda create -n habit python=3.8
conda activate habit

# Install PyTorch (the evaluated environment uses Torch 2.1.0 built against CUDA 12.1)
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

# Install core dependencies
pip install open-clip-torch==2.24.0 scikit-learn==1.3.2 transformers==4.25.0 salesforce-lavis==1.0.2 timm==0.9.16
```

We evaluated our framework on two standard datasets: FashionIQ and CIRR. Please download the datasets first.
Click to expand: FashionIQ Dataset Directory Structure
Please follow the official instructions to download the FashionIQ dataset. Once downloaded, ensure the folder structure looks like this:
```
FashionIQ
├── captions
│   ├── cap.dress.[train | val | test].json
│   ├── cap.toptee.[train | val | test].json
│   └── cap.shirt.[train | val | test].json
├── image_splits
│   ├── split.dress.[train | val | test].json
│   ├── split.toptee.[train | val | test].json
│   └── split.shirt.[train | val | test].json
├── dress
│   └── [B000ALGQSY.jpg | B000AY2892.jpg | B000AYI3L4.jpg | ...]
├── shirt
│   └── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]
└── toptee
    └── [B0000DZQD6.jpg | B000A33FTU.jpg | B000AS2OVA.jpg | ...]
```
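Before training, the layout above can be sanity-checked with a short script. This is an illustrative helper (not part of the released code) that lists any expected FashionIQ annotation file or image folder missing under a given root:

```python
import tempfile
from pathlib import Path

CATEGORIES = ("dress", "shirt", "toptee")
SPLITS = ("train", "val", "test")

def missing_fashioniq_files(root):
    """List expected FashionIQ files/folders (per the tree above)
    that are absent under `root`."""
    root = Path(root)
    expected = [root / "captions" / f"cap.{c}.{s}.json"
                for c in CATEGORIES for s in SPLITS]
    expected += [root / "image_splits" / f"split.{c}.{s}.json"
                 for c in CATEGORIES for s in SPLITS]
    expected += [root / c for c in CATEGORIES]  # per-category image folders
    return [str(p) for p in expected if not p.exists()]

# Demo on an empty directory: all 9 + 9 + 3 expected entries are reported missing.
demo = missing_fashioniq_files(tempfile.mkdtemp())
```

Running it against your real dataset root (e.g. `missing_fashioniq_files("/path/to/FashionIQ")`) should return an empty list once the download is complete.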
Click to expand: CIRR Dataset Directory Structure
Please follow the official instructions to download the CIRR dataset. Once downloaded, ensure the folder structure looks like this:
```
CIRR
├── train
│   └── [0 | 1 | 2 | ...]
│       └── [train-10108-0-img0.png | train-10108-0-img1.png | ...]
├── dev
│   └── [dev-0-0-img0.png | dev-0-0-img1.png | ...]
├── test1
│   └── [test1-0-0-img0.png | test1-0-0-img1.png | ...]
└── cirr
    ├── captions
    │   └── cap.rc2.[train | val | test1].json
    └── image_splits
        └── split.rc2.[train | val | test1].json
```
Our model is trained using the AdamW optimizer; the hyper-parameters we use are shown in the commands below.
Training on FashionIQ:

```bash
python train.py \
    --dataset fashioniq \
    --fashioniq_path "/path/to/FashionIQ/" \
    --model_dir "./checkpoints/fashioniq_hint" \
    --batch_size 128 \
    --num_epochs 10 \
    --lr 2e-5
```

Training on CIRR:
```bash
python train.py \
    --dataset cirr \
    --cirr_path "/path/to/CIRR/" \
    --model_dir "./checkpoints/cirr_hint" \
    --batch_size 128 \
    --num_epochs 10 \
    --lr 2e-5
```

💡 Tips:
> - Our model is based on the powerful BLIP-2 architecture. We highly recommend running training on GPUs with sufficient memory (e.g., NVIDIA A40 48G / V100 32G).
> - The best model weights and evaluation metrics generated during training are automatically saved as `best_model.pt` and `metrics_best.json` within your specified `--model_dir`.
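For reference, the flags used in the training commands can be mirrored with a minimal argparse setup. This is a hypothetical reconstruction of train.py's interface based only on the commands shown above; the released script may define it differently:

```python
import argparse

def build_parser():
    # Hypothetical CLI mirroring the flags in the training commands above.
    p = argparse.ArgumentParser(description="Train HINT (illustrative CLI)")
    p.add_argument("--dataset", choices=["fashioniq", "cirr"], required=True)
    p.add_argument("--fashioniq_path", type=str, help="FashionIQ root directory")
    p.add_argument("--cirr_path", type=str, help="CIRR root directory")
    p.add_argument("--model_dir", type=str, default="./checkpoints")
    p.add_argument("--batch_size", type=int, default=128)
    p.add_argument("--num_epochs", type=int, default=10)
    p.add_argument("--lr", type=float, default=2e-5)
    return p

# Parse the CIRR example command's arguments programmatically.
args = build_parser().parse_args(["--dataset", "cirr", "--cirr_path", "/path/to/CIRR/"])
```

Unspecified flags fall back to the defaults used in the example commands (batch size 128, 10 epochs, learning rate 2e-5).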
To generate the prediction files on the CIRR dataset for submission to the CIRR Evaluation Server, run the testing script:
```bash
python src/cirr_test_submission.py checkpoints/cirr_hint/
```

(The script automatically outputs the .json prediction files for online evaluation, based on the best checkpoint found in the folder.)
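If you want to confirm which checkpoint the submission step will pick up, the `best_model.pt` naming mentioned in the training tips can be located as follows. This is an illustrative snippet; the actual script's discovery logic may differ:

```python
import tempfile
from pathlib import Path

def find_best_checkpoint(model_dir):
    """Return the best_model.pt path under `model_dir`, or None if absent.
    Assumes the naming convention described in the training tips."""
    ckpt = Path(model_dir) / "best_model.pt"
    return ckpt if ckpt.is_file() else None

# Demo on a throwaway directory containing an empty checkpoint file.
d = Path(tempfile.mkdtemp())
(d / "best_model.pt").write_bytes(b"")
found = find_best_checkpoint(d)
```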
Our code is a deep customization of the LAVIS framework. The core implementations are centralized in the following files:
```
HINT/
├── lavis/
│   └── models/
│       └── blip2_models/
│           └── HINT.py         # 🧠 Core model implementation: DCE, QCR and DPCC modules
├── train.py                    # 🚀 Training entry point: controls noise_ratio injection and training loops
├── datasets.py
├── test.py
├── utils.py
├── data_utils.py
├── cirr_test_submission.py     # Auxiliary script
├── datasets/                   # Dataset loading and processing logic
└── README.md
```
The implementation of this project utilizes the pre-trained vision-language features from BLIP-2 and references the LAVIS framework. We express our sincere gratitude to these open-source contributions!
For any questions, issues, or feedback, please open an issue on GitHub or reach out to us at mingyuzhang@mail.sdu.edu.cn.
Ecosystem & Other Works from our Team
- ConeSep (CVPR'26): Web | Code
- Air-Know (CVPR'26): Web | Code
- ReTrack (AAAI'26): Web | Code | Paper
- INTENT (AAAI'26): Web | Code | Paper
- HUD (ACM MM'25): Web | Code | Paper
- OFFSET (ACM MM'25): Web | Code | Paper
- ENCODER (AAAI'25): Web | Code | Paper
- HABIT (AAAI'26): Web | Code | Paper
If you find our work or this code useful in your research, please consider leaving a Star ⭐ or citing 📝 our paper 🥰. Your support is our greatest motivation!
```bibtex
@inproceedings{HINT2026,
  title={HINT: COMPOSED IMAGE RETRIEVAL WITH DUAL-PATH COMPOSITIONAL CONTEXTUALIZED NETWORK},
  author={Zhang, Mingyu and Li, Zixu and Chen, Zhiwei and Fu, Zhiheng and Zhu, Xiaowei and Nie, Jiajia and Wei, Yinwei and Hu, Yupeng},
  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}
```








