Disclaimer: This repository is a research-purpose re-implementation for reproducing the results presented in InfiniPot-V. It does not include all components of the full methodology described in the paper.
(260125) For comparison between code output and reported number in paper, please see accuracy.json number in results folder (which is averaged subtask accuracy)
pip install -r requirements.txtTested with torch==2.6 and CUDA 12.4 environment.
bash scripts/run_sample.sh 0 # GPU IDSupported Models: Qwen2-VL, Qwen2.5-VL series
Setup: 220 seconds surveillance video (~32K context length) compressed to 4K tokens using InfiniPot-V
(Sample video from MLVU: samples/video/6_anomaly_reco/surveil_8.mp4)
Question: Is there any abnormality in this surveillance video? If so, what type of abnormality is it? Respond with which option is the correct answer and explain why it is the correct answer.
Answer: Robbery
InfiniPot-V-4K: The correct answer is that there is an abnormality in the surveillance video. The abnormality is the presence of two individuals entering the room and engaging in what appears to be a robbery or theft. This is abnormal because it is not typical for people to enter a room and immediately start causing damage and taking items, especially in a public setting like a bank.
# bash scripts/run_ovu.sh [GPU_ID] [MODEL_SIZE] [BLOCK_SIZE] [COMPRESS_FRAMES] [METHOD] [DATASET] [VIDEO MAX_FRAMES]
bash scripts/run_ovu.sh 0 7 32 24 infinipot-v mlvu 768| Benchmark | Description |
|---|---|
mlvu |
MLVU - Multi-task Long Video Understanding |
videomme |
Video-MME - Video Multi-Modal Evaluation |
lvb |
LongVideoBench - Long-context Interleaved Video-Language Understanding |
egoschema |
EgoSchema - Egocentric Video QA |
Please download videos from each benchmark's official website and organize them according to the dataset structure. For faster evaluation, use the --load_dumped option to load pre-dumped pre-processed video pixel values.
| Argument | Description |
|---|---|
--compression_method |
KV cache compression strategy. uniform, swa, infinipot-v |
--block_size |
Block size for continual KV cache compression (KVC). |
--compress_frame_num |
Number of frames to compress in the KV cache after each block. |
--max_frames_num |
Maximum number of frames to sample from the input video. |
--load_dumped |
Load pre-dumped outputs for faster evaluation. |
block_size: Controls the granularity of block-wise processing. Larger values process more frames at once but require more memory. (ex. block size = 32, token_per_frame = 140 => token budget is ~4K)compress_frame_num: Determines how aggressively the KV cache is compressed. Higher values lead to more compression but may affect quality.compression_method:uniform: Uniform frame kv cache selectionswa: sliding window attention (sink + recent tokens)infinipot-v: Our proposed method (TaR + VaN method)
If you find this work useful, please cite:
@inproceedings{
kim2025infinipotv,
title={InfiniPot-V: Memory-Constrained {KV} Cache Compression for Streaming Video Understanding},
author={Minsoo Kim and Kyuhong Shim and Jungwook Choi and Simyung Chang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=hFxOZjHyTg}
}Contributions and extensions to this repository are always welcome 🤗
