Vision-Controllable Language Model for Image-guided Story Ending Generation

Dizhan Xue, Shengsheng Qian, and Changsheng Xu.

State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences


📖 About

This repository contains the official implementation of our paper "Vision-Controllable Language Model for Image-guided Story Ending Generation", published in IEEE Transactions on Multimedia (TMM), 2026.

🎯 Examples

[Figures: example1 to example8]

📝 Introduction

  • Image-guided Story Ending Generation (IgSEG) aims to continue natural language generation (NLG) following a perceived visual control.
  • Vision-Controllable Language Model (VCLM) aligns a frozen visual encoder from BLIP, a frozen textual encoder BERT, and a trained-from-scratch or pretrained generative language model (LM).
  • VCLM adopts (optional) multimodal-contextual cloud knowledge retrieval to improve edge computing AI when additional knowledge is needed.
  • VCLM adopts vision-controlled reinforcement learning to constrain the trained model to follow visual controls.

[Figure: overview]


🚀 Getting Started

1. Prepare the code and the environment

Clone our repository, create a Python environment, and activate it with the following commands:

git clone https://github.com/LivXue/VCNLG.git
cd VCNLG
conda env create -f environment.yml
conda activate vcnlg

We use the ViT pretrained by BLIP to extract visual features. Download the weights of BLIP w/ ViT-L and save the file as visual_feature_extraction/checkpoints/model_large.pth.


2. Prepare the datasets

VIST-E [Link]

Download SIS-with-labels.tar.gz, train_split.(0-12).tar.gz, val_images.tar.gz, test_images.tar.gz and unzip them into data/VIST-E.

NOTE: There should be train.story-in-sequence.json, val.story-in-sequence.json, and test.story-in-sequence.json in data/VIST-E/, and <image_id>.jpg/png files in data/VIST-E/images/.
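
The layout described in the note above can be checked before running feature extraction. The following is a minimal sketch, not part of the repository: the helper name check_vist_layout is hypothetical, and the file names are taken from the note.

```python
from pathlib import Path

# File names taken from the note above; the helper itself is hypothetical.
REQUIRED_JSON = [
    "train.story-in-sequence.json",
    "val.story-in-sequence.json",
    "test.story-in-sequence.json",
]
IMAGE_SUFFIXES = {".jpg", ".png"}


def check_vist_layout(root):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = [f"missing {name}" for name in REQUIRED_JSON if not (root / name).is_file()]
    image_dir = root / "images"
    if not image_dir.is_dir():
        problems.append("missing images/ directory")
    elif not any(p.suffix.lower() in IMAGE_SUFFIXES for p in image_dir.iterdir()):
        problems.append("images/ contains no .jpg or .png files")
    return problems


if __name__ == "__main__":
    # Prints nothing when the layout matches the note.
    for problem in check_vist_layout("data/VIST-E"):
        print(problem)
```

An empty result means the three annotation files and at least one image are in place; it does not verify that every image referenced by the annotations was downloaded.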

Then, run

python visual_feature_extraction/extract_fea_img.py --input_dir data/VIST-E/images --output_dir data/VIST-E/ViT_features --device <your device>

to extract the ViT features of images.

Then, run

python data/VIST-E/prepare_data.py --images_directory data/VIST-E/ViT_features --device <your device>

to generate the story files.

Finally, run

python data/VIST-E/extract_clip_feature.py --input_dir data/VIST-E/images --output_dir data/VIST-E/clip_features

to generate CLIP features.

NOTE: There should be story_train.json, story_val.json, story_test.json in data/VIST-E/, <image_id>.npy in data/VIST-E/ViT_features/, and <image_id>.npy in data/VIST-E/clip_features/.
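
To confirm that every image received a feature file, the image IDs can be compared against the .npy file names. This is a sketch under the assumption, taken from the note above, that each feature file is named <image_id>.npy after the image's file name without its extension; the function name missing_features is hypothetical.

```python
from pathlib import Path


def missing_features(image_dir, feature_dir):
    """Return sorted image IDs in `image_dir` that lack `<image_id>.npy` in `feature_dir`."""
    image_ids = {
        p.stem for p in Path(image_dir).glob("*") if p.suffix.lower() in {".jpg", ".png"}
    }
    feature_ids = {p.stem for p in Path(feature_dir).glob("*.npy")}
    return sorted(image_ids - feature_ids)


images = Path("data/VIST-E/images")
if images.is_dir():  # only report when the dataset is actually present
    for name in ("ViT_features", "clip_features"):
        missing = missing_features(images, Path("data/VIST-E") / name)
        print(f"{name}: {len(missing)} image(s) without features")
```

A non-zero count usually points to images that failed to load during extraction; rerunning the corresponding extraction script is the first thing to try.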

LSMDC-E [Link]

Download LSMDC 2021 version (task1_2021.zip, MPIIMD_downloadLinks.txt, MVADaligned_downloadLinks.txt) and unzip them into data/LSMDC-E.

Note: Under the LSMDC agreement, we cannot share the data with any third party.

Note: There should be LSMDC16_annos_training_someone.csv, LSMDC16_annos_val_someone.csv, LSMDC16_annos_test_someone.csv, MPIIMD_downloadLinks.txt, MVADaligned_downloadLinks.txt in data/LSMDC-E/.

Then, merge MPIIMD_downloadLinks.txt and MVADaligned_downloadLinks.txt into a single download_video_urls.txt file, set your LSMDC user name and password in data/LSMDC-E/generate_clips.py, and run:

python data/LSMDC-E/generate_clips.py --output_path data/LSMDC-E/videos --user_name <your user name to LSMDC> --password <your password to LSMDC>

to download videos and save resampled frames into data/LSMDC-E/videos.
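
The merge of the two link files, which has to happen before the download command above, can be done by hand or with a few lines of Python. The sketch below assumes one URL per line and that download_video_urls.txt belongs in data/LSMDC-E/; neither is stated explicitly in this README, and dropping duplicate lines is our own choice. The function name merge_link_files is hypothetical.

```python
from pathlib import Path


def merge_link_files(sources, destination):
    """Write the non-empty lines of `sources` to `destination`, keeping order and dropping duplicates."""
    seen = set()
    merged = []
    for source in sources:
        for line in Path(source).read_text().splitlines():
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    Path(destination).write_text("\n".join(merged) + "\n")
    return len(merged)


root = Path("data/LSMDC-E")  # assumed location of the merged file
sources = [root / "MPIIMD_downloadLinks.txt", root / "MVADaligned_downloadLinks.txt"]
if all(s.is_file() for s in sources):  # skip silently when the link files are absent
    count = merge_link_files(sources, root / "download_video_urls.txt")
    print(f"wrote {count} URLs")
```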

Then, run:

python visual_feature_extraction/extract_fea_video.py --input_dir data/LSMDC-E/videos --output_dir data/LSMDC-E/ViT_features --device <your device>

to extract the ViT features of video frames.

Then, run:

python data/LSMDC-E/prepare_data.py --input_path data/LSMDC-E

to generate the story files.

Finally, run:

python data/LSMDC-E/extract_clip_feature_video.py --input_dir data/LSMDC-E/videos --output_dir data/LSMDC-E/clip_features

to generate CLIP features.

Note: There should be story_train.json, story_val.json, story_test.json in data/LSMDC-E/, <video_id>.npy in data/LSMDC-E/ViT_features/, and <video_id>.npy in data/LSMDC-E/clip_features/.


3. (Optional) Fetch Textual Knowledge

Download the code and pretrained checkpoints of mPLUG-Owl.

Then, run our script:

python mPLUG-Owl/test_onshot.py

to retrieve knowledge for the datasets.


🏋️ Training and Testing

Check the configs in utils/opts.py and run:

python train.py --dataset <dataset>

to train the model.

Then, run:

python eval.py --dataset <dataset>

to test the model.
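
When both steps are run back to back, a small wrapper can chain them and stop if training fails. This is only a sketch: it uses the two commands shown above with no extra flags, the function names are hypothetical, and the accepted values for --dataset must be looked up in utils/opts.py.

```python
import subprocess
import sys


def build_commands(dataset):
    """Return the train and eval commands from this README for one dataset."""
    return [
        [sys.executable, "train.py", "--dataset", dataset],
        [sys.executable, "eval.py", "--dataset", dataset],
    ]


def run_pipeline(dataset, dry_run=True):
    """Print each command and, unless `dry_run` is set, execute it from the repository root."""
    for command in build_commands(dataset):
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError, so evaluation never runs after a failed training.
            subprocess.run(command, check=True)
```

Calling run_pipeline("<dataset>", dry_run=False) from the repository root with a real dataset name runs training followed by evaluation.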


🎮 Launching Demo Locally

Coming soon...


📊 Our Results

We provide the results generated by VCLM on the VIST-E and LSMDC-E test sets in results/.


📝 Citation

If you find our work or the code useful, please consider citing our paper:

@article{xue2026vision,
  title={Vision-Controllable Language Model for Image-guided Story Ending Generation},
  author={Xue, Dizhan and Qian, Shengsheng and Xu, Changsheng},
  journal={IEEE Transactions on Multimedia},
  year={2026},
  publisher={IEEE},
  doi={10.1109/TMM.2026.3679122}
}

Paper Link: https://ieeexplore.ieee.org/document/11458711


📄 License

This repository is released under the BSD 3-Clause License.

About

[TMM 2026] Vision-Controllable Language Model for Image-guided Story Ending Generation
