InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding [NeurIPS 25]

Disclaimer: This repository is a research-purpose re-implementation for reproducing the results presented in InfiniPot-V. It does not include all components of the full methodology described in the paper.

(260125) To compare the code's output with the numbers reported in the paper, see the value in accuracy.json in the results folder, which is the accuracy averaged over subtasks.
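
"Averaged subtask accuracy" is read here as an unweighted mean over per-subtask accuracies, which can differ from accuracy pooled over all samples. The sketch below illustrates the difference with made-up counts. The subtask names and numbers are placeholders, the function names are hypothetical, and nothing is assumed about the actual layout of accuracy.json.

```python
def subtask_averaged_accuracy(results: dict) -> float:
    """Unweighted mean of per-subtask accuracies.

    `results` maps a subtask name to (num_correct, num_total).
    """
    per_subtask = [correct / total for correct, total in results.values()]
    return sum(per_subtask) / len(per_subtask)


def pooled_accuracy(results: dict) -> float:
    """Accuracy over all samples pooled together, shown for contrast."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


# Placeholder counts, not real benchmark results.
toy_results = {"subtask_a": (9, 10), "subtask_b": (50, 100)}

if __name__ == "__main__":
    print(subtask_averaged_accuracy(toy_results))
    print(pooled_accuracy(toy_results))
```

Small subtasks weigh as much as large ones under the subtask average, so the two numbers generally differ.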

Installation

pip install -r requirements.txt

Tested with torch==2.6 in a CUDA 12.4 environment.

Quick Start

bash scripts/run_sample.sh 0  # GPU ID

Supported Models: Qwen2-VL, Qwen2.5-VL series

Example Result (Qwen2.5-VL-7B)

Setup: A 220-second surveillance video (~32K context length) compressed to 4K tokens using InfiniPot-V. The sample video is from MLVU: samples/video/6_anomaly_reco/surveil_8.mp4

Question: Is there any abnormality in this surveillance video? If so, what type of abnormality is it? Respond with which option is the correct answer and explain why it is the correct answer.

Answer: Robbery

InfiniPot-V-4K: The correct answer is that there is an abnormality in the surveillance video. The abnormality is the presence of two individuals entering the room and engaging in what appears to be a robbery or theft. This is abnormal because it is not typical for people to enter a room and immediately start causing damage and taking items, especially in a public setting like a bank.

Evaluation

# bash scripts/run_ovu.sh [GPU_ID] [MODEL_SIZE] [BLOCK_SIZE] [COMPRESS_FRAMES] [METHOD] [DATASET] [VIDEO_MAX_FRAMES]
bash scripts/run_ovu.sh 0 7 32 24 infinipot-v mlvu 768
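
When sweeping several configurations, it can help to assemble the positional arguments programmatically. The sketch below only builds the command in the order given by the usage comment above. The helper name `build_run_ovu_command` is hypothetical and not part of the repository.

```python
import shlex


def build_run_ovu_command(gpu_id, model_size, block_size, compress_frames,
                          method, dataset, max_frames):
    """Return the argv list for scripts/run_ovu.sh in its positional order."""
    positional = [gpu_id, model_size, block_size, compress_frames,
                  method, dataset, max_frames]
    return ["bash", "scripts/run_ovu.sh"] + [str(arg) for arg in positional]


if __name__ == "__main__":
    cmd = build_run_ovu_command(0, 7, 32, 24, "infinipot-v", "mlvu", 768)
    # Print the shell-quoted command; pass `cmd` to subprocess.run to launch it.
    print(shlex.join(cmd))
```

Called with the values from the example above, the helper reproduces the same command line.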

Supported Long Video Understanding Benchmarks

Benchmark	Description
`mlvu`	MLVU - Multi-task Long Video Understanding
`videomme`	Video-MME - Video Multi-Modal Evaluation
`lvb`	LongVideoBench - Long-context Interleaved Video-Language Understanding
`egoschema`	EgoSchema - Egocentric Video QA

Please download videos from each benchmark's official website and organize them according to the dataset structure. For faster evaluation, use the --load_dumped option to load pre-dumped pre-processed video pixel values.

Key Arguments

Argument	Description
`--compression_method`	KV cache compression strategy: one of `uniform`, `swa`, or `infinipot-v`.
`--block_size`	Block size for continual KV cache compression (KVC).
`--compress_frame_num`	Number of frames to compress in the KV cache after each block.
`--max_frames_num`	Maximum number of frames to sample from the input video.
`--load_dumped`	Load pre-dumped, pre-processed video pixel values for faster evaluation.

Design Choices

block_size: Controls the granularity of block-wise processing. Larger values process more frames at once but require more memory. For example, block_size = 32 with token_per_frame = 140 gives a token budget of about 4K (32 × 140 = 4,480 tokens).
compress_frame_num: Determines how aggressively the KV cache is compressed. Higher values lead to more compression but may affect quality.
compression_method:
- uniform: Uniform selection of per-frame KV cache entries
- swa: Sliding window attention (sink + recent tokens)
- infinipot-v: Our proposed method (TaR + VaN)
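
The token budget arithmetic and the two baseline strategies above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function names and the `num_sink` and `num_recent` parameters are hypothetical, and evenly spaced sampling is only one plausible reading of `uniform`. The infinipot-v method (TaR + VaN) is not reproduced here.

```python
def token_budget(block_size: int, tokens_per_frame: int) -> int:
    """Number of tokens held by one block of frames."""
    return block_size * tokens_per_frame


def uniform_frame_indices(num_frames: int, num_keep: int) -> list:
    """Pick num_keep evenly spaced frame indices (assumed reading of `uniform`)."""
    if num_keep <= 0:
        return []
    if num_keep >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_keep
    return [int(i * step) for i in range(num_keep)]


def swa_token_indices(num_tokens: int, num_sink: int, num_recent: int) -> list:
    """Keep the first num_sink tokens (sink) plus the last num_recent tokens."""
    if num_sink + num_recent >= num_tokens:
        return list(range(num_tokens))
    sink = list(range(num_sink))
    recent = list(range(num_tokens - num_recent, num_tokens))
    return sink + recent


if __name__ == "__main__":
    print(token_budget(32, 140))
    print(uniform_frame_indices(8, 4))
    print(swa_token_indices(10, 2, 3))
```

With block_size = 32 and 140 tokens per frame, `token_budget` returns 4480, in line with the ~4K budget quoted above.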

Citation

If you find this work useful, please cite:

@inproceedings{
kim2025infinipotv,
title={InfiniPot-V: Memory-Constrained {KV} Cache Compression for Streaming Video Understanding},
author={Minsoo Kim and Kyuhong Shim and Jungwook Choi and Simyung Chang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=hFxOZjHyTg}
}

Contributions and extensions to this repository are always welcome 🤗
