Vision-Controllable Language Model for Image-guided Story Ending Generation

Dizhan Xue, Shengsheng Qian, and Changsheng Xu.

State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences

📖 About

This repository contains the official implementation of our paper "Vision-Controllable Language Model for Image-guided Story Ending Generation", published in IEEE Transactions on Multimedia (TMM), 2026.

🎯 Examples

📝 Introduction

Image-guided Story Ending Generation (IgSEG) aims to continue natural language generation (NLG) following a perceived visual control.
Vision-Controllable Language Model (VCLM) aligns a frozen visual encoder from BLIP, a frozen textual encoder BERT, and a trained-from-scratch or pretrained generative language model (LM).
VCLM adopts (optional) multimodal-contextual cloud knowledge retrieval to improve edge computing AI when additional knowledge is needed.
VCLM adopts vision-controlled reinforcement learning to constrain the trained model to follow visual controls.

🚀 Getting Started

1. Prepare the code and the environment

Clone our repository, create a Python environment, and activate it with the following commands:

git clone https://github.com/LivXue/VCNLG.git
cd VCNLG
conda env create -f environment.yml
conda activate vcnlg

We adopt ViT pretrained by BLIP to extract visual features. Download the weights of BLIP w/ ViT-L and save the file to visual_feature_extraction/checkpoints/model_large.pth.

2. Prepare the datasets

VIST-E [Link]

Download SIS-with-labels.tar.gz, train_split.(0-12).tar.gz, val_images.tar.gz, test_images.tar.gz and unzip them into data/VIST-E.

NOTE: There should be train.story-in-sequence.json, val.story-in-sequence.json, test.story-in-sequence.json in data/VIST-E/ and image_id.jpg/png in data/VIST-E/images/.
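Before extracting features, it can help to confirm that the layout described in the note above is in place. The following is a minimal sketch using only the Python standard library; the helper name check_vist_layout is hypothetical and not part of this repository, and the file names are taken from the note above.

```python
from pathlib import Path

# File names taken from the note above; the helper itself is hypothetical.
REQUIRED_JSON = (
    "train.story-in-sequence.json",
    "val.story-in-sequence.json",
    "test.story-in-sequence.json",
)
IMAGE_SUFFIXES = {".jpg", ".png"}


def check_vist_layout(root="data/VIST-E"):
    """Return a list of problems; an empty list means the layout looks right."""
    root = Path(root)
    problems = [
        f"missing {root / name}"
        for name in REQUIRED_JSON
        if not (root / name).is_file()
    ]
    image_dir = root / "images"
    if not image_dir.is_dir():
        problems.append(f"missing directory {image_dir}")
    elif not any(p.suffix.lower() in IMAGE_SUFFIXES for p in image_dir.iterdir()):
        problems.append(f"no .jpg/.png files in {image_dir}")
    return problems


if __name__ == "__main__":
    # Print one line per problem; prints nothing when everything is in place.
    for problem in check_vist_layout():
        print(problem)
```

The check only looks at file names and extensions; it does not validate the JSON contents or the images themselves.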

Then, run

python visual_feature_extraction/extract_fea_img.py --input_dir data/VIST-E/images --output_dir data/VIST-E/ViT_features --device <your device>

to extract the ViT features of images.

Then, run

python data/VIST-E/prepare_data.py --images_directory data/VIST-E/ViT_features --device <your device>

to generate the story files.

Finally, run

python data/VIST-E/extract_clip_feature.py --input_dir data/VIST-E/images --output_dir data/VIST-E/clip_features

to generate the CLIP features.

NOTE: There should be story_train.json, story_val.json, story_test.json in data/VIST-E/, <image_id>.npy in data/VIST-E/ViT_features/, and <image_id>.npy in data/VIST-E/clip_features/.
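To spot images whose features were not written, the image ids can be compared with the .npy file names in both feature directories. This is a minimal sketch under the assumption that every image in data/VIST-E/images/ is expected to yield one <image_id>.npy file per feature directory; the function names are hypothetical and not part of this repository.

```python
from pathlib import Path


def feature_ids(directory):
    """Collect the stems of all .npy files in a directory (empty set if it is missing)."""
    directory = Path(directory)
    if not directory.is_dir():
        return set()
    return {p.stem for p in directory.glob("*.npy")}


def missing_features(root="data/VIST-E"):
    """Map each feature directory to the image ids that have no .npy file in it."""
    root = Path(root)
    image_ids = {
        p.stem
        for p in (root / "images").glob("*")
        if p.suffix.lower() in {".jpg", ".png"}
    }
    return {
        name: sorted(image_ids - feature_ids(root / name))
        for name in ("ViT_features", "clip_features")
    }
```

Calling missing_features() returns a dictionary with one list per feature directory; two empty lists mean every image has both feature files.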

LSMDC-E [Link]

Download LSMDC 2021 version (task1_2021.zip, MPIIMD_downloadLinks.txt, MVADaligned_downloadLinks.txt) and unzip them into data/LSMDC-E.

Note: Due to the LSMDC agreement, we cannot share the data with any third party.

Note: There should be LSMDC16_annos_training_someone.csv, LSMDC16_annos_val_someone.csv, LSMDC16_annos_test_someone.csv, MPIIMD_downloadLinks.txt, MVADaligned_downloadLinks.txt in data/LSMDC-E/.

Then, merge MPIIMD_downloadLinks.txt and MVADaligned_downloadLinks.txt into a single download_video_urls.txt file, set your LSMDC user name and password in data/LSMDC-E/generate_clips.py, and run:

python data/LSMDC-E/generate_clips.py --output_path data/LSMDC-E/videos --user_name <your user name to LSMDC> --password <your password to LSMDC>

to download videos and save resampled frames into data/LSMDC-E/videos.
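The merge of the two link files, which has to happen before generate_clips.py is run, can be sketched as follows. This is an illustration under assumptions: it treats both files as plain lists with one URL per line, drops blank lines and repeated URLs (a choice made here, not a documented requirement), and the function name merge_link_files is hypothetical. The location where generate_clips.py expects download_video_urls.txt is not stated above, so the output path in the usage comment is an assumption as well.

```python
from pathlib import Path


def merge_link_files(sources, destination):
    """Concatenate URL lists into one file, skipping blank lines and repeated URLs."""
    seen = set()
    merged = []
    for source in sources:
        for line in Path(source).read_text(encoding="utf-8").splitlines():
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    Path(destination).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return len(merged)


# Example call (the output location is an assumption):
# merge_link_files(
#     ["data/LSMDC-E/MPIIMD_downloadLinks.txt",
#      "data/LSMDC-E/MVADaligned_downloadLinks.txt"],
#     "data/LSMDC-E/download_video_urls.txt",
# )
```

The function returns the number of URLs written, which gives a quick sanity check against the sizes of the two input files.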

Then, run:

python visual_feature_extraction/extract_fea_video.py --input_dir data/LSMDC-E/videos --output_dir data/LSMDC-E/ViT_features --device <your device>

to extract the ViT features of video frames.

Then, run:

python data/LSMDC-E/prepare_data.py --input_path data/LSMDC-E

to generate the story files.

Finally, run:

python data/LSMDC-E/extract_clip_feature_video.py --input_dir data/LSMDC-E/videos --output_dir data/LSMDC-E/clip_features

to generate the CLIP features.

Note: There should be story_train.json, story_val.json, story_test.json in data/LSMDC-E/, <video_id>.npy in data/LSMDC-E/ViT_features/, and <video_id>.npy in data/LSMDC-E/clip_features/.
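A quick way to confirm that the three story files were generated is to load each one and count its top-level entries. The sketch below is illustrative only: the schema of the story files is not described here, so it counts top-level items without interpreting them, and the function name summarize_story_files is hypothetical.

```python
import json
from pathlib import Path


def summarize_story_files(root="data/LSMDC-E"):
    """Return the number of top-level entries per split, or None if a file is missing."""
    summary = {}
    for split in ("train", "val", "test"):
        path = Path(root) / f"story_{split}.json"
        if not path.is_file():
            summary[split] = None
            continue
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Lists and dicts report their length; any other JSON value counts as one entry.
        summary[split] = len(data) if isinstance(data, (list, dict)) else 1
    return summary
```

The same helper works for VIST-E by passing root="data/VIST-E", since both datasets use the same three story file names.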

3. (Optional) Fetch Textual Knowledge

Download the code and pretrained checkpoints of mPLUG-Owl.

Then, run our script:

python mPLUG-Owl/test_onshot.py

to retrieve knowledge for the datasets.

🏋️ Training and Test

Check the configs in utils/opts.py and run:

python train.py --dataset <dataset>

to train the model.

Then, run:

python eval.py --dataset <dataset>

to test the model.
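The two commands above can be chained so that evaluation only starts if training finishes successfully. The following is a minimal sketch, assuming it is run from the repository root; the function names are hypothetical, and the dataset name is passed through unchanged because the accepted values for <dataset> are not listed here.

```python
import subprocess
import sys


def build_commands(dataset):
    """Build the train and eval command lines for one dataset name."""
    return [
        [sys.executable, "train.py", "--dataset", dataset],
        [sys.executable, "eval.py", "--dataset", dataset],
    ]


def train_then_eval(dataset, runner=subprocess.run):
    """Run training, then evaluation; check=True raises if a step exits with an error."""
    for command in build_commands(dataset):
        runner(command, check=True)
```

Using sys.executable keeps both steps on the interpreter of the activated vcnlg environment, and the runner parameter makes it possible to dry-run the sequence by passing a function that only prints the commands.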

🎮 Launching Demo Locally

Coming soon...

📊 Our Results

We provide the results generated by VCLM on the VIST-E and LSMDC-E test sets in results/.

📝 Citation

If you find our work or the code useful, please consider citing our paper:

@article{xue2026vision,
  title={Vision-Controllable Language Model for Image-guided Story Ending Generation},
  author={Xue, Dizhan and Qian, Shengsheng and Xu, Changsheng},
  journal={IEEE Transactions on Multimedia},
  year={2026},
  publisher={IEEE},
  doi={10.1109/TMM.2026.3679122}
}

Paper Link: https://ieeexplore.ieee.org/document/11458711

📄 License

This repository is released under the BSD 3-Clause License.
