# FTI4CIR

Mapping reference images into subject-oriented and attribute-oriented pseudo-word tokens for zero-shot composed image retrieval, without requiring any annotated training triplets.
Haoqiang Lin1, Haokun Wen2, Xuemeng Song1*, Meng Liu3, Yupeng Hu1, Liqiang Nie2
1 Shandong University, Qingdao / Jinan, China
2 Harbin Institute of Technology (Shenzhen), Shenzhen, China
3 Shandong Jianzhu University, Jinan, China
* Corresponding author: sxmustc@gmail.com
- Paper: SIGIR '24
- Pre-trained Model: huggingface
- Updates
- Introduction
- Highlights
- Method / Framework
- Project Structure
- Installation
- Dataset / Benchmark
- Pre-trained Model
- Usage
- Citation
- Acknowledgement
- License
## Updates

- [07/2024] Paper accepted and presented at SIGIR 2024 (Washington, DC, USA)
- [07/2024] Initial code release
## Introduction

This repository contains the official implementation of the paper *Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval*, published at SIGIR 2024.
**Background.** Composed Image Retrieval (CIR) enables users to retrieve target images using a multimodal query consisting of a reference image and a modification text describing the desired changes. However, existing supervised CIR methods rely heavily on expensive annotated training triplets in the form of `<reference image, modification text, target image>`, which limits dataset scale and model generalization.
**Method.** We propose FTI4CIR, which addresses CIR without requiring any annotated triplets. Unlike prior methods that map an image to a single coarse-grained pseudo-word token, FTI4CIR maps each image into:
- One subject-oriented pseudo-word token, capturing the primary subject(s) of the image.
- Several attribute-oriented pseudo-word tokens, capturing contextual attributes such as sleeve length, color, and background.
Each image is represented by the sentence "a photo of [S*] with [A1*, ..., Ar*]", which is concatenated with the modification text to form a unified text query, reducing CIR to a standard text-to-image retrieval task.
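As an illustration, the unified text query can be built by splicing the pseudo-word sentence with the modification text. The helper below is a minimal sketch (the function name is hypothetical, and in the actual model the pseudo-words are learned embeddings rather than literal strings):

```python
def build_text_query(subject_token: str, attribute_tokens: list, modification: str) -> str:
    """Compose the unified text query from pseudo-word tokens and the modification text.

    `subject_token` and `attribute_tokens` stand in for the learned pseudo-words
    [S*] and [A1*, ..., Ar*]; here they are plain strings for illustration only.
    """
    pseudo_sentence = f"a photo of {subject_token} with {', '.join(attribute_tokens)}"
    return f"{pseudo_sentence}, {modification}"

# build_text_query("[S*]", ["[A1*]", "[A2*]"], "make the sleeves longer")
# → "a photo of [S*] with [A1*], [A2*], make the sleeves longer"
```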
**Key Components.**
- Fine-grained Pseudo-word Token Mapping: Generates subject pseudo-words from global image features, and leverages a Transformer with a local-global relevance-based filtering strategy to dynamically extract attribute-oriented features.
- Tri-wise Caption-based Semantic Regularization: Employs BLIP to generate natural language descriptions and aligns pseudo-word tokens to the real-word embedding space via three complementary objectives: subject-wise, attribute-wise, and whole-wise.
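The regularization idea can be sketched as follows. This is a minimal illustration with plain-list embeddings and simple cosine-distance terms, not the paper's actual contrastive objectives over CLIP features; all names are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tri_wise_loss(subj_tok, attr_tok, whole_tok, subj_cap, attr_cap, whole_cap):
    """Sum of three alignment terms: subject-wise, attribute-wise, and whole-wise.

    Each *_tok is an embedding derived from the pseudo-word tokens, and each *_cap
    the embedding of the matching part of a BLIP-generated caption; aligning them
    pulls the pseudo-words toward the real-word embedding space.
    """
    pairs = [(subj_tok, subj_cap), (attr_tok, attr_cap), (whole_tok, whole_cap)]
    return sum(1.0 - cosine(t, c) for t, c in pairs)
```

Perfectly aligned token/caption pairs give a loss of zero; each misaligned level adds up to 2.0 to the total.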
**Results.** FTI4CIR achieves state-of-the-art zero-shot performance on FashionIQ, CIRR, and CIRCO.
This repository provides: full training code, inference and evaluation scripts, and pre-trained model weights.
## Highlights

- **First fine-grained textual inversion approach:** Maps images into both subject-oriented and attribute-oriented pseudo-word tokens, surpassing single pseudo-word representations.
- **Dynamic local attribute feature extraction:** Adaptively handles diverse attribute types across different domains (fashion, animals, open-domain).
- **Tri-wise semantic regularization:** Aligns pseudo-word tokens to the real-word embedding space via subject-wise, attribute-wise, and whole-wise objectives using BLIP-generated captions.
- **Fully zero-shot:** Training relies solely on unlabeled open-domain images; no annotated CIR triplets are required at any stage.
- **State-of-the-art on three benchmarks:** Outperforms all zero-shot baselines on FashionIQ (R@10/50), CIRR (R@1/5/10/50), and CIRCO.
## Method / Framework

Figure 1. Overall framework of FTI4CIR, consisting of (a) Fine-grained Pseudo-word Token Mapping and (b) Tri-wise Caption-based Semantic Regularization.

```
Reference Image
      │
      ▼
Fine-grained Textual Inversion
      │
      ▼
"A photo of [S*] with [A1*, ..., Ar*]."
      │
      + Modification Text
      │
      ▼
Pure Text Query → Text-to-Image Retrieval → Target Image
```
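Once the pure text query is encoded, retrieval reduces to ranking the gallery by similarity to the query embedding. A minimal sketch with plain-list embeddings (the real pipeline uses CLIP text and image features; the function name is hypothetical):

```python
import math

def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    sims = [(i, cos(query_emb, g)) for i, g in enumerate(gallery_embs)]
    return [i for i, _ in sorted(sims, key=lambda x: -x[1])]

# rank_gallery([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]) → [1, 2, 0]
```

The top-ranked index is the predicted target image; recall metrics (R@K) check whether the true target appears in the first K positions.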
## Project Structure

```
.
├── assets/               # Figures and visualizations
├── src/
│   ├── train.py          # Training script
│   ├── evaluate.py       # Validation evaluation script
│   └── test.py           # Test-split prediction generation script
├── README.md
├── requirements.txt
└── LICENSE
```
## Installation

```shell
git clone https://github.com/ZiChao111/FTI4CIR.git
cd FTI4CIR
```

Tested environment:

- Platform: NVIDIA A100 40G
- Python: 3.9.12
- PyTorch: 2.2.0

```shell
pip install -r requirements.txt
```

## Dataset / Benchmark

Please download each dataset following the instructions in the respective official repositories and organize them as described below.
### ImageNet

Download the ImageNet1K (ILSVRC2012) test set from the official site.

```
├── ImageNet
│   └── test
│       └── [ILSVRC2012_test_00000001.JPEG | ... | ILSVRC2012_test_00100000.JPEG]
```
### FashionIQ

Download from the official repository.

```
├── FashionIQ
│   ├── captions
│   │   ├── cap.dress.[train | val | test].json
│   │   ├── cap.toptee.[train | val | test].json
│   │   └── cap.shirt.[train | val | test].json
│   ├── image_splits
│   │   ├── split.dress.[train | val | test].json
│   │   ├── split.toptee.[train | val | test].json
│   │   └── split.shirt.[train | val | test].json
│   └── dress / shirt / toptee
│       └── [*.jpg]
```
### CIRR

Download from the official repository.

```
├── CIRR
│   ├── train / dev / test1
│   │   └── [*.png]
│   └── cirr
│       ├── captions
│       │   └── cap.rc2.[train | val | test1].json
│       └── image_splits
│           └── split.rc2.[train | val | test1].json
```
### CIRCO

Download from the official repository.

```
├── CIRCO
│   ├── annotations
│   │   └── [val | test].json
│   └── COCO2017_unlabeled
│       ├── annotations
│       │   └── image_info_unlabeled2017.json
│       └── unlabeled2017
│           └── [*.jpg]
```
## Pre-trained Model

Pre-trained model weights are available on Google Drive.

After downloading, place the checkpoint under `model_save/`, or specify the path via the `--model_path` argument at evaluation time.
## Usage

### Training

Image captions used during training are generated by BLIP. Please refer to the BLIP repository for caption generation details.

```shell
python src/train.py \
  --save_frequency 1 \
  --batch_size=256 \
  --lr=4e-5 \
  --wd=0.01 \
  --epochs=60 \
  --model_dir="./model_save" \
  --workers=8 \
  --model ViT-L/14
```

### Evaluation

```shell
python src/evaluate.py \
  --dataset='cirr' \
  --save_path='' \
  --model_path=""
```

| Argument | Description |
|---|---|
| `--dataset` | Dataset to evaluate on. Options: `fashioniq` / `cirr` / `circo` |
| `--model_path` | Path to the pre-trained model checkpoint |
| `--save_path` | Path to save the prediction results |
### Testing

Generate prediction files for submission to the official evaluation servers:

```shell
python src/test.py \
  --dataset='cirr' \
  --save_path='' \
  --model_path=""
```

| Argument | Description |
|---|---|
| `--dataset` | Options: `cirr` / `circo` |
| `--model_path` | Path to the pre-trained model checkpoint |
| `--save_path` | Path to save the prediction file |
## Citation

If you find this work useful for your research, please consider citing:

```bibtex
@inproceedings{FTI4CIR,
  author    = {Haoqiang Lin and Haokun Wen and Xuemeng Song and Meng Liu and Yupeng Hu and Liqiang Nie},
  title     = {Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval},
  booktitle = {Proceedings of the International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval},
  pages     = {240--250},
  publisher = {{ACM}},
  year      = {2024}
}
```

## Acknowledgement

We thank the open-source projects whose valuable components and implementations this work builds on.

## License

This project is released under the Apache License 2.0.