# FTI4CIR

Mapping reference images into subject-oriented and attribute-oriented pseudo-word tokens for zero-shot composed image retrieval, without requiring any annotated training triplets.
Haoqiang Lin1, Haokun Wen2, Xuemeng Song1*, Meng Liu3, Yupeng Hu1, Liqiang Nie2
1 Shandong University, Qingdao / Jinan, China
2 Harbin Institute of Technology (Shenzhen), Shenzhen, China
3 Shandong Jianzhu University, Jinan, China
* Corresponding author: sxmustc@gmail.com
- Paper: SIGIR '24
- Pre-trained Model: huggingface
- Updates
- Introduction
- Highlights
- Method / Framework
- Project Structure
- Installation
- Dataset / Benchmark
- Pre-trained Model
- Usage
- Citation
- Acknowledgement
- License
## Updates

- [07/2024] Paper accepted and presented at SIGIR 2024 (Washington, DC, USA)
- [07/2024] Initial code release
## Introduction

This repository contains the official implementation of the paper *Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval*, published at SIGIR 2024.
**Background.** Composed Image Retrieval (CIR) enables users to retrieve target images using a multimodal query consisting of a reference image and a modification text describing the desired changes. However, existing supervised CIR methods rely heavily on expensive annotated training triplets in the form of `<reference image, modification text, target image>`, which limits dataset scale and model generalization.
**Method.** We propose FTI4CIR, which addresses CIR without requiring any annotated triplets. Unlike prior methods that map an image to a single coarse-grained pseudo-word token, FTI4CIR maps each image into:
- One subject-oriented pseudo-word token, capturing the primary subject(s) of the image.
- Several attribute-oriented pseudo-word tokens, capturing contextual attributes such as sleeve length, color, and background.
Each image is represented by the sentence "a photo of [S*] with [A1*, ..., Ar*]", which is concatenated with the modification text to form a unified text query, reducing CIR to a standard text-to-image retrieval task.
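As an illustration, the unified text query can be built by splicing the pseudo-word sentence with the modification text. The helper below is a minimal sketch (the function name is hypothetical, and in the actual model the pseudo-words are learned embeddings rather than literal strings):

```python
def build_text_query(subject_token: str, attribute_tokens: list, modification: str) -> str:
    """Compose the unified text query from pseudo-word tokens and the modification text.

    `subject_token` and `attribute_tokens` stand in for the learned pseudo-words
    [S*] and [A1*, ..., Ar*]; here they are plain strings for illustration only.
    """
    pseudo_sentence = f"a photo of {subject_token} with {', '.join(attribute_tokens)}"
    return f"{pseudo_sentence}, {modification}"

# build_text_query("[S*]", ["[A1*]", "[A2*]"], "make the sleeves longer")
# → "a photo of [S*] with [A1*], [A2*], make the sleeves longer"
```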
**Key Components.**
- Fine-grained Pseudo-word Token Mapping: Generates subject pseudo-words from global image features, and leverages a Transformer with a local-global relevance-based filtering strategy to dynamically extract attribute-oriented features.
- Tri-wise Caption-based Semantic Regularization: Employs BLIP to generate natural language descriptions and aligns pseudo-word tokens to the real-word embedding space via three complementary objectives: subject-wise, attribute-wise, and whole-wise.
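The regularization idea can be sketched as follows. This is a minimal illustration with plain-list embeddings and simple cosine-distance terms, not the paper's actual contrastive objectives over CLIP features; all names are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tri_wise_loss(subj_tok, attr_tok, whole_tok, subj_cap, attr_cap, whole_cap):
    """Sum of three alignment terms: subject-wise, attribute-wise, and whole-wise.

    Each *_tok is an embedding derived from the pseudo-word tokens, and each *_cap
    the embedding of the matching part of a BLIP-generated caption; aligning them
    pulls the pseudo-words toward the real-word embedding space.
    """
    pairs = [(subj_tok, subj_cap), (attr_tok, attr_cap), (whole_tok, whole_cap)]
    return sum(1.0 - cosine(t, c) for t, c in pairs)
```

Perfectly aligned token/caption pairs give a loss of zero; each misaligned level adds up to 2.0 to the total.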
**Results.** FTI4CIR achieves state-of-the-art zero-shot performance on FashionIQ, CIRR, and CIRCO.
This repository provides: full training code, inference and evaluation scripts, and pre-trained model weights.
## Highlights

- **First fine-grained textual inversion approach:** Maps images into both subject-oriented and attribute-oriented pseudo-word tokens, surpassing single pseudo-word representations.
- **Dynamic local attribute feature extraction:** Adaptively handles diverse attribute types across different domains (fashion, animals, open-domain).
- **Tri-wise semantic regularization:** Aligns pseudo-word tokens to the real-word embedding space via subject-wise, attribute-wise, and whole-wise objectives using BLIP-generated captions.
- **Fully zero-shot:** Training relies solely on unlabeled open-domain images; no annotated CIR triplets are required at any stage.
- **State-of-the-art on three benchmarks:** Outperforms all zero-shot baselines on FashionIQ (R@10/50), CIRR (R@1/5/10/50), and CIRCO.
## Method / Framework

Figure 1. Overall framework of FTI4CIR, consisting of (a) Fine-grained Pseudo-word Token Mapping and (b) Tri-wise Caption-based Semantic Regularization.

```
Reference Image
      │
      ▼
Fine-grained Textual Inversion
      │
      ▼
"A photo of [S*] with [A1*, ..., Ar*]."
      │
      + Modification Text
      │
      ▼
Pure Text Query → Text-to-Image Retrieval → Target Image
```
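Once the pure text query is encoded, retrieval reduces to ranking the gallery by similarity to the query embedding. A minimal sketch with plain-list embeddings (the real pipeline uses CLIP text and image features; the function name is hypothetical):

```python
import math

def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    sims = [(i, cos(query_emb, g)) for i, g in enumerate(gallery_embs)]
    return [i for i, _ in sorted(sims, key=lambda x: -x[1])]

# rank_gallery([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]) → [1, 2, 0]
```

The top-ranked index is the predicted target image; recall metrics (R@K) check whether the true target appears in the first K positions.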
## Project Structure

```
.
├── assets/               # Figures and visualizations
├── src/
│   ├── train.py          # Training script
│   ├── evaluate.py       # Validation evaluation script
│   └── test.py           # Test-split prediction generation script
├── README.md
├── requirements.txt
└── LICENSE
```
## Installation

```shell
git clone https://github.com/ZiChao111/FTI4CIR.git
cd FTI4CIR
```

Tested environment:

- Platform: NVIDIA A100 40G
- Python: 3.9.12
- PyTorch: 2.2.0

```shell
pip install -r requirements.txt
```

## Dataset / Benchmark

Please download each dataset following the instructions in the respective official repositories and organize them as described below.
### ImageNet

Download the ImageNet1K (ILSVRC2012) test set from the official site.

```
├── ImageNet
│   └── test
│       └── [ILSVRC2012_test_00000001.JPEG | ... | ILSVRC2012_test_00100000.JPEG]
```
### FashionIQ

Download from the official repository.

```
├── FashionIQ
│   ├── captions
│   │   ├── cap.dress.[train | val | test].json
│   │   ├── cap.toptee.[train | val | test].json
│   │   └── cap.shirt.[train | val | test].json
│   ├── image_splits
│   │   ├── split.dress.[train | val | test].json
│   │   ├── split.toptee.[train | val | test].json
│   │   └── split.shirt.[train | val | test].json
│   └── dress / shirt / toptee
│       └── [*.jpg]
```
### CIRR

Download from the official repository.

```
├── CIRR
│   ├── train / dev / test1
│   │   └── [*.png]
│   └── cirr
│       ├── captions
│       │   └── cap.rc2.[train | val | test1].json
│       └── image_splits
│           └── split.rc2.[train | val | test1].json
```
### CIRCO

Download from the official repository.

```
├── CIRCO
│   ├── annotations
│   │   └── [val | test].json
│   └── COCO2017_unlabeled
│       ├── annotations
│       │   └── image_info_unlabeled2017.json
│       └── unlabeled2017
│           └── [*.jpg]
```
## Pre-trained Model

Pre-trained model weights are available on Google Drive.

After downloading, place the checkpoint under `model_save/`, or specify the path via the `--model_path` argument at evaluation time.
## Usage

### Training

Image captions used during training are generated by BLIP. Please refer to the BLIP repository for caption generation details.

```shell
python src/train.py \
  --save_frequency 1 \
  --batch_size=256 \
  --lr=4e-5 \
  --wd=0.01 \
  --epochs=60 \
  --model_dir="./model_save" \
  --workers=8 \
  --model ViT-L/14
```

### Evaluation

```shell
python src/evaluate.py \
  --dataset='cirr' \
  --save_path='' \
  --model_path=""
```

| Argument | Description |
|---|---|
| `--dataset` | Dataset to evaluate on. Options: `fashioniq` / `cirr` / `circo` |
| `--model_path` | Path to the pre-trained model checkpoint |
| `--save_path` | Path to save the prediction results |
### Testing

Generate prediction files for submission to the official evaluation servers:

```shell
python src/test.py \
  --dataset='cirr' \
  --save_path='' \
  --model_path=""
```

| Argument | Description |
|---|---|
| `--dataset` | Options: `cirr` / `circo` |
| `--model_path` | Path to the pre-trained model checkpoint |
| `--save_path` | Path to save the prediction file |
## Citation

If you find this work useful for your research, please consider citing:

```bibtex
@inproceedings{FTI4CIR,
  author    = {Haoqiang Lin and Haokun Wen and Xuemeng Song and Meng Liu and Yupeng Hu and Liqiang Nie},
  title     = {Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval},
  booktitle = {Proceedings of the International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval},
  pages     = {240--250},
  publisher = {{ACM}},
  year      = {2024}
}
```

## Acknowledgement

We thank the open-source projects whose valuable components and implementations this work builds on.

## License

This project is released under the Apache License 2.0.