
Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

Mapping reference images into subject-oriented and attribute-oriented pseudo-word tokens for zero-shot composed image retrieval, without requiring any annotated training triplets.

Authors

Haoqiang Lin1, Haokun Wen2, Xuemeng Song1*, Meng Liu3, Yupeng Hu1, Liqiang Nie2

1 Shandong University, Qingdao / Jinan, China
2 Harbin Institute of Technology (Shenzhen), Shenzhen, China
3 Shandong Jianzhu University, Jinan, China
* Corresponding author: sxmustc@gmail.com

Updates

  • [07/2024] Paper accepted and presented at SIGIR 2024 (Washington, DC, USA)
  • [07/2024] Initial code release

Introduction

This repository contains the official implementation of the paper Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval, published at SIGIR 2024.

Background. Composed Image Retrieval (CIR) enables users to retrieve target images using a multimodal query consisting of a reference image and a modification text describing the desired changes. However, existing supervised CIR methods rely heavily on expensive annotated training triplets in the form of <reference image, modification text, target image>, which limits dataset scale and model generalization.

Method. We propose FTI4CIR, which addresses CIR without requiring any annotated triplets. Unlike prior methods that map an image to a single coarse-grained pseudo-word token, FTI4CIR maps each image into:

  • One subject-oriented pseudo-word token β€” capturing the primary subject(s) of the image.
  • Several attribute-oriented pseudo-word tokens β€” capturing contextual attributes such as sleeve length, color, and background.

Each image is represented by the sentence "a photo of [S*] with [A1*, ..., Ar*]", which is concatenated with the modification text to form a unified text query, reducing CIR to a standard text-to-image retrieval task.
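The composition of the unified text query described above can be sketched as simple string templating. This is an illustrative sketch only; the token placeholders, separator, and function name are assumptions, not the exact format used in the released code:

```python
# Hypothetical sketch of building the unified text query.
# "[S*]" and "[A1*]", "[A2*]", ... stand for the learned subject- and
# attribute-oriented pseudo-word tokens; names are illustrative.
def build_text_query(subject_token: str, attr_tokens: list, mod_text: str) -> str:
    """Compose "a photo of [S*] with [A1*, ..., Ar*]" and append the
    modification text to obtain a pure text query."""
    attrs = ", ".join(attr_tokens)
    image_sentence = f"a photo of {subject_token} with {attrs}"
    return f"{image_sentence}, {mod_text}"

query = build_text_query("[S*]", ["[A1*]", "[A2*]"], "make it sleeveless")
print(query)  # a photo of [S*] with [A1*], [A2*], make it sleeveless
```

The resulting string can be fed to any text encoder, which is what reduces CIR to text-to-image retrieval.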

Key Components.

  1. Fine-grained Pseudo-word Token Mapping: Generates subject pseudo-words from global image features, and leverages a Transformer with a local-global relevance-based filtering strategy to dynamically extract attribute-oriented features.
  2. Tri-wise Caption-based Semantic Regularization: Employs BLIP to generate natural language descriptions and aligns pseudo-word tokens to the real-word embedding space via three complementary objectives: subject-wise, attribute-wise, and whole-wise.
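One way to picture the tri-wise regularization is as three cosine-alignment terms between pseudo-word-based sentence embeddings and BLIP-caption embeddings, one per level (subject-wise, attribute-wise, whole-wise). The sketch below is a conceptual illustration under that assumption, not the paper's actual loss; all names and the exact formulation are hypothetical:

```python
# Conceptual sketch (NOT the official implementation): sum of
# (1 - mean cosine similarity) over three alignment levels between
# pseudo-word sentence embeddings and BLIP-caption embeddings.
import torch
import torch.nn.functional as F

def tri_wise_loss(subj_emb, attr_emb, whole_emb,
                  cap_subj, cap_attr, cap_whole):
    """Return the summed misalignment over the three levels."""
    pairs = [(subj_emb, cap_subj), (attr_emb, cap_attr), (whole_emb, cap_whole)]
    return sum(1 - F.cosine_similarity(a, b, dim=-1).mean() for a, b in pairs)

e = torch.randn(4, 512)
loss = tri_wise_loss(e, e, e, e, e, e)  # identical inputs -> loss near zero
```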

Results. FTI4CIR achieves state-of-the-art zero-shot performance on FashionIQ, CIRR, and CIRCO.

This repository provides: full training code, inference and evaluation scripts, and pre-trained model weights.


Highlights

  • πŸ†• First fine-grained textual inversion approach: Maps images into both subject-oriented and attribute-oriented pseudo-word tokens, surpassing single pseudo-word representations.
  • πŸ”§ Dynamic local attribute feature extraction: Adaptively handles diverse attribute types across different domains (fashion, animals, open-domain).
  • πŸ“ Tri-wise semantic regularization: Aligns pseudo-word tokens to the real-word embedding space via subject-wise, attribute-wise, and whole-wise objectives using BLIP-generated captions.
  • 🚫 Fully zero-shot: Training relies solely on unlabeled open-domain images; no annotated CIR triplets are required at any stage.
  • πŸ“Š State-of-the-art on three benchmarks: Outperforms all zero-shot baselines on FashionIQ (R@10/50), CIRR (R@1/5/10/50), and CIRCO.

Method / Framework

[Framework overview figure — see assets/]

Figure 1. Overall framework of FTI4CIR, consisting of (a) Fine-grained Pseudo-word Token Mapping and (b) Tri-wise Caption-based Semantic Regularization.

Inference Pipeline

Reference Image
      β”‚
      β–Ό
Fine-grained Textual Inversion
      β”‚
      β–Ό
"A photo of [S*] with [A1*, ..., Ar*]."
      β”‚
      + Modification Text
      β”‚
      β–Ό
Pure Text Query β†’ Text-to-Image Retrieval β†’ Target Image
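The final retrieval step in the pipeline above is standard embedding-space ranking: encode the pure text query, then sort gallery images by cosine similarity. The toy sketch below uses random-looking vectors in place of real CLIP embeddings; it is a minimal illustration, not the project's evaluation code:

```python
# Minimal sketch of the text-to-image retrieval step: rank gallery
# images by cosine similarity to the query embedding.
# Toy 2-D vectors stand in for real text/image encoder outputs.
import numpy as np

def rank_candidates(query_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by descending cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    g = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

query = np.array([1.0, 0.0])
gallery = np.array([[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]])
order = rank_candidates(query, gallery)
print(order)  # [1 2 0] — index 1 is the best match
```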

Project Structure

.
β”œβ”€β”€ assets/               # Figures and visualizations
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ train.py          # Training script
β”‚   β”œβ”€β”€ evaluate.py       # Validation evaluation script
β”‚   └── test.py           # Test-split prediction generation script
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
└── LICENSE

Installation

1. Clone the repository

git clone https://github.com/ZiChao111/FTI4CIR.git
cd FTI4CIR

2. Running Environment

Platform : NVIDIA A100 40G
Python   : 3.9.12
PyTorch  : 2.2.0

3. Install dependencies

pip install -r requirements.txt

Dataset / Benchmark

Please download each dataset following the instructions in the respective official repositories and organize them as described below.

ImageNet

Download the ImageNet1K (ILSVRC2012) test set from the official site.

β”œβ”€β”€ ImageNet
β”‚   └── test
β”‚       └── [ILSVRC2012_test_00000001.JPEG | ... | ILSVRC2012_test_00100000.JPEG]

FashionIQ

Download from the official repository.

β”œβ”€β”€ FashionIQ
β”‚   β”œβ”€β”€ captions
β”‚   β”‚   β”œβ”€β”€ cap.dress.[train | val | test].json
β”‚   β”‚   β”œβ”€β”€ cap.toptee.[train | val | test].json
β”‚   β”‚   └── cap.shirt.[train | val | test].json
β”‚   β”œβ”€β”€ image_splits
β”‚   β”‚   β”œβ”€β”€ split.dress.[train | val | test].json
β”‚   β”‚   β”œβ”€β”€ split.toptee.[train | val | test].json
β”‚   β”‚   └── split.shirt.[train | val | test].json
β”‚   β”œβ”€β”€ dress / shirt / toptee
β”‚   β”‚   └── [*.jpg]

CIRR

Download from the official repository.

β”œβ”€β”€ CIRR
β”‚   β”œβ”€β”€ train / dev / test1
β”‚   β”‚   └── [*.png]
β”‚   └── cirr
β”‚       β”œβ”€β”€ captions
β”‚       β”‚   └── cap.rc2.[train | val | test1].json
β”‚       └── image_splits
β”‚           └── split.rc2.[train | val | test1].json

CIRCO

Download from the official repository.

β”œβ”€β”€ CIRCO
β”‚   β”œβ”€β”€ annotations
β”‚   β”‚   └── [val | test].json
β”‚   └── COCO2017_unlabeled
β”‚       β”œβ”€β”€ annotations
β”‚       β”‚   └── image_info_unlabeled2017.json
β”‚       └── unlabeled2017
β”‚           └── [*.jpg]

Pre-trained Model

Pre-trained model weights are available on Google Drive.

After downloading, place the checkpoint under model_save/, or specify the path via the --model_path argument at evaluation time.

Image captions used during training are generated by BLIP. Please refer to the BLIP repository for caption generation details.


Usage

Training

python src/train.py \
    --save_frequency 1 \
    --batch_size=256 \
    --lr=4e-5 \
    --wd=0.01 \
    --epochs=60 \
    --model_dir="./model_save" \
    --workers=8 \
    --model ViT-L/14

Validation (split=val)

python src/evaluate.py \
    --dataset='cirr' \
    --save_path='' \
    --model_path=""

Argument       Description
--dataset      Dataset to evaluate on. Options: fashioniq / cirr / circo
--model_path   Path to the pre-trained model checkpoint
--save_path    Path to save the prediction results

Test (split=test)

Generate prediction files for submission to the official evaluation servers:

python src/test.py \
    --dataset='cirr' \
    --save_path='' \
    --model_path=""

Argument       Description
--dataset      Dataset to generate predictions for. Options: cirr / circo
--model_path   Path to the pre-trained model checkpoint
--save_path    Path to save the prediction file

Citation

If you find this work useful for your research, please consider citing:

@inproceedings{FTI4CIR,
  author    = {Haoqiang Lin and Haokun Wen and Xuemeng Song and Meng Liu and Yupeng Hu and Liqiang Nie},
  title     = {Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval},
  booktitle = {Proceedings of the International {ACM} SIGIR Conference on Research and Development in Information Retrieval},
  pages     = {240--250},
  publisher = {{ACM}},
  year      = {2024}
}

Acknowledgement

We thank the following open-source projects for providing valuable components and implementations:


License

This project is released under the Apache License 2.0.
