Multi-View Captioning with Semantic Delta Re-Ranking for Zero-Shot Composed Video Retrieval

This is the code repository of the paper "Multi-View Captioning with Semantic Delta Re-Ranking for Zero-Shot Composed Video Retrieval", which aims to provide a comprehensive overview of MCSD.

News 🚀🚀🚀

2025.11.01: 🏆 Congratulations to us for winning the Best Paper Award at ICIG 2025! Thank you all for your support. Stay tuned!!!
2025.10.22: 🎉 We've open-sourced MCSD! Get started with composed video retrieval today!
2025.08.08: 📢 Great news! Our paper has been accepted as an oral presentation at ICIG 2025. Check out our work and project!

Overview

Motivation: Video content is inherently dense in semantic information. A single caption often fails to capture the full semantics of a target video, whereas captions generated from multiple perspectives can provide more comprehensive coverage of its potential meanings.

Abstract

Composed Video Retrieval (CVR) aims to retrieve video relevant to a query video while incorporating specific changes described in modification text. For Zero-Shot Composed Video Retrieval (ZS-CVR), current methods utilize vision-language models to convert the query video into a single caption, subsequently merged with modification text to generate an edited caption for retrieval. However, the modification text doesn't clearly specify which elements to preserve from the query video, leading to possible misalignment between edited caption and target video. Additionally, the final retrieval result should not be determined solely by the similarity between edited caption and candidate videos but also incorporate the semantic delta arising from the modification text. To address these issues, we propose Multi-View Captioning with Semantic Delta Re-Ranking (MCSD) method for ZS-CVR. Specifically, the Multi-View Captioning Module to generate captions covering potential semantics of the target video, the Semantic Delta Re-Ranking Module that computes the semantic delta between the original and edited captions, to adjust similarity scores and re-ranks the retrieval results. Extensive experiments on two benchmarks demonstrate that the proposed MCSD method achieves state-of-the-art performance in ZS-CVR.

Getting Started

Python 3.9+
CUDA-enabled GPU (recommended)

Installation

Clone the repository

# Clone the repository
git clone https://github.com/yzy-bupt/MCSD.git
cd MCSD

Install Python dependencies

# Create and activate conda environment
conda create -n MCSD -y python=3.9.20
conda activate MCSD

# Install PyTorch and dependencies
conda install -y -c pytorch pytorch=1.11.0 torchvision=0.12.0
pip install -r requirements.txt

Data Preparation

WebVid-CoVR

Download the WebVid-CoVR dataset following the instructions in the official web. Place the data in data/webvid-covr/.

EgoCVR

Download the EgoCVR dataset following the instructions in the official web. Place the data in data/egocvr/.

MCSD

To address these issues, we propose Multi-View Captioning with Semantic Delta Re-Ranking (MCSD) for ZS-CVR. Our method features:

(1) Multi-View Captioning Module to generate captions covering potential semantics of the target video;

(2) Semantic Delta Re-Ranking Module that computes the semantic delta between original and edited captions to adjust similarity scores and re-rank retrieval results.

1. Step 1

Extract frames from the videos in WebVid-COVR and EgoCVR

python code/tools/extract_frames.py

then extract the corresponding video features.

python code/tools/video_feature.py

2. Step 2

Run this script to generate diverse captions

python code/generate_captions.py

then run this script to generate the edited captions.

python code/generate_edit_captions.py

3. Step 3

To evaluate ours method, please run the following command:

python code/retrieval_webcovr.py
python code/retrieval_egocvr_global.py

Citation

If you find our data useful, please consider citing our work!

@inproceedings{ding2025multi,
  title={Multi-view Captioning with Semantic Delta Re-ranking for Zero-Shot Composed Video Retrieval},
  author={Ding, Zhixiang and Liu, Lilong and Yang, Zhenyu and Qian, Shengsheng},
  booktitle={International Conference on Image and Graphics},
  pages={80--91},
  year={2025},
  organization={Springer}
}

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
assets		assets
code		code
.gitignore		.gitignore
README.md		README.md
egocvr_20_descriptions.jsonl		egocvr_20_descriptions.jsonl
egocvr_20_edit_descriptions.jsonl		egocvr_20_edit_descriptions.jsonl
requirements.txt		requirements.txt
web_covr_20_descriptions.jsonl		web_covr_20_descriptions.jsonl
web_covr_20_edit_descriptions.jsonl		web_covr_20_edit_descriptions.jsonl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multi-View Captioning with Semantic Delta Re-Ranking for Zero-Shot Composed Video Retrieval

News 🚀🚀🚀

Overview

Abstract

Getting Started

Installation

Data Preparation

WebVid-CoVR

EgoCVR

MCSD

1. Step 1

2. Step 2

3. Step 3

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Multi-View Captioning with Semantic Delta Re-Ranking for Zero-Shot Composed Video Retrieval

News 🚀🚀🚀

Overview

Abstract

Getting Started

Installation

Data Preparation

WebVid-CoVR

EgoCVR

MCSD

1. Step 1

2. Step 2

3. Step 3

Citation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages