This repository provides a curated collection of research papers, models, and datasets focused on Streaming (Online) Video Understanding. The field aims to develop AI assistants capable of J.A.R.V.I.S.-like continuous multimodal perception and interaction. Unlike traditional offline video understanding, where models have access to the complete video beforehand, streaming models must operate under real-time, causal constraints: frames arrive sequentially, and decisions at any moment can only rely on past and present information, without the ability to rewind or preview future content.
This paradigm introduces two fundamental challenges:
- Proactive Decision-Making (When to Act): Determining the optimal moment to generate a response, ask for clarification, or remain silent.
- Efficient Resource Management (How to Sustain): Managing ever-growing context (memory/KV cache) and computational load for perpetual, real-time processing.
The repository is organized to reflect these core challenges and the supporting ecosystem:
- Proactive Streaming Models: Approaches for deciding when to interact, including token-driven triggering (EOS), dedicated classifiers, perplexity validation, and visual-based detection.
- Responsive Streaming Models: Techniques for efficient long-context processing, covering KV cache management, hierarchical memory, retrieval augmentation, and computational optimizations.
- Benchmarks & Datasets: Key datasets for evaluating capabilities in multi-turn dialogue, real-time captioning, and proactive timing.
This list serves as a reference for researchers and practitioners exploring the frontier of always-on, interactive video AI systems. Love this awesome list? Help others discover it by starring the repository! ⭐
- Proactive Streaming Models
- Responsive Streaming Models
- Benchmarks & Datasets
- Complete Model List by Release Date
- Complete Dataset List by Release Date
Models that decide actions (Speak, Wait, etc.) by generating specific tokens or action probabilities within the sequence. Typically, they learn through autoregressive prediction in which an [EOS] token represents silence and regular language tokens represent a response. Because triggering is entangled with generation, this approach may impact the model's general-purpose capabilities.
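A minimal sketch of this per-frame decision loop, assuming a Hugging-Face-style `model(...).logits` interface (function and variable names are illustrative, not from any specific paper):

```python
import torch

def streaming_step(model, tokenizer, context_ids, frame_tokens, max_reply_len=64):
    """One frame arrives: decide to stay silent ([EOS]) or generate a reply.

    context_ids: running token history (past frames + dialogue), shape [1, L].
    frame_tokens: token ids for the newly arrived frame, shape [1, F].
    """
    ids = torch.cat([context_ids, frame_tokens], dim=-1)
    next_id = int(model(ids).logits[:, -1].argmax(dim=-1))
    if next_id == tokenizer.eos_token_id:        # [EOS] == remain silent
        return ids, None
    reply = []                                   # model chose to speak
    for _ in range(max_reply_len):
        reply.append(next_id)
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
        next_id = int(model(ids).logits[:, -1].argmax(dim=-1))
        if next_id == tokenizer.eos_token_id:
            break
    return ids, tokenizer.decode(reply)
```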
| Paper | Model | Date | Link | Venue | Method / Key Contribution |
|---|---|---|---|---|---|
| Streaming Video Instruction Tuning | Streamo | 2025/12 | Link | arXiv | State-Token Unified Triggering: Introduces explicit response state tokens (Silence / Standby / Response) and integrates when to respond and what to say into a single autoregressive sequence; applies focal-weighted loss to mitigate extreme state imbalance. |
| Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video | VideoLLM-EyeWO | 2025/10 | Link | NeurIPS 2025 | Active Perception & Action: Predicts 3 actions (Silence, Respond, Ask-High-Res); proactively requests high-res frames when uncertain to ensure just-in-time accuracy. |
| Proactive Assistant Dialogue Generation from Streaming Egocentric Videos | ProAssist | 2025/06 | Link | EMNLP 2025 | EOS-Based Trigger: Predicts [EOS] token to remain silent or generates text to respond at each frame. Uses Negative Frame Sub-sampling to handle class imbalance between silence and speaking. |
| LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | LiveCC | 2025/04 | Link | CVPR 2025 | EOS-Based: Trains on large-scale streaming ASR data. At inference, the model predicts [EOS] to stay silent or generates commentary tokens frame-by-frame, enabling real-time play-by-play narration. |
| AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis | AssistPDA | 2025/03 | N/A | arXiv | EOS-Based: Predicts [EOS] probability to decide whether to output an anomaly alert/prediction. Features a STRD module to distill offline temporal reasoning into online inference. |
| LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant | LION-FS | 2025/03 | Link | CVPR 2025 | EOS-Based + Fast-Slow Architecture: Uses a Fast Path to efficiently determine when to respond (via token prediction) and a Slow Path with multi-granularity keyframe augmentation to generate detailed responses only when needed. |
| VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation | VideoLLM-MoD | 2024/08 | N/A | NeurIPS 2024 | EOS-Based + MoD Efficiency: Inherits [EOS] token prediction for proactive triggering. Key contribution is Mixture-of-Depths, dynamically skipping redundant vision token computation to enable efficient streaming. |
| What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction | STREAM-VLM | 2024/07 | Link | NeurIPS 2024 | Special Action Tokens Triggering: Uses two special action tokens, <next> (the model opts to say nothing and requests the next video frame via the 3D CNN) and <feedback> (the LLM generates a response), to enable proactive feedback. |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | VideoLLM-online | 2024/06 | Link | CVPR 2024 | Streaming EOS: Pioneered the Streaming EOS training objective. The model predicts an [EOS] token at each frame to decide whether to stay silent or generate a response, enabling real-time, proactive interaction. |
Models that use a lightweight detector, router head, or auxiliary module to trigger responses. A binary classification module determines whether to remain silent or to respond.
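The shared pattern: a small head reads the backbone's hidden state and emits a speak-probability, and the heavy generator runs only when that probability crosses a threshold. A minimal sketch (the name `TriggerHead`, the MLP shape, and the 0.5 threshold are all illustrative):

```python
import torch
import torch.nn as nn

class TriggerHead(nn.Module):
    """Lightweight binary trigger on top of a frozen VLM's hidden state."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 1),
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # Speak-probability in [0, 1] for the current frame.
        return torch.sigmoid(self.mlp(last_hidden)).squeeze(-1)

# Usage: run the heavy generator only when the score crosses a threshold.
# trigger = TriggerHead(hidden_dim=4096)
# if trigger(h_t).item() > 0.5:
#     response = vlm.generate(...)
```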
| Paper | Model | Date | Link | Venue | Method / Key Contribution |
|---|---|---|---|---|---|
| StreamReady: Learning What to Answer and When in Long Streaming Videos | StreamReady | 2026/03 | N/A | CVPR 2026 | Readiness Head Trigger: Introduces a lightweight learnable readiness token within the reasoning module, monitored by a Readiness Head (MLP) that outputs a readiness score ∈ [0, 1]. The model triggers a response only when the score exceeds a threshold, ensuring sufficient visual evidence has been observed. Uses a dual-branch Q-Former for short/long-term reasoning and a hierarchical Visual Memory Tree for efficient context storage. Trains via contrastive loss between pseudo-positive and pseudo-negative temporal regions. |
| Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding | ToM | 2025/12 | N/A | under review | Trigger-centric Responding: Introduces TV-Online and an agent-like paradigm that continuously processes streaming inputs and decides whether to respond or remain silent, trained with progressive training and reinforcement objectives. |
| MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning | MMDuet2 | 2025/12 | Link | under review | Text-only Proactive Interaction + RL: Formulates proactive video interaction as a text-to-text decision where the model outputs either a response or "NO REPLY" at each turn based on dialogue history and visual context up to the current frame. Uses multi-turn reinforcement learning with a PAUC-inspired reward to encourage early and correct responses without requiring precise reply-time annotations. |
| Open-ended Hierarchical Streaming Video Understanding with Vision Language Models | OpenHOUSE | 2025/09 | N/A | ICCV 2025 | Detector-Triggered Hierarchical Captioning: Uses a lightweight Streaming Module (RNN) to detect action boundaries (hybrid actionness/progress). Triggers the frozen VLM only at detected boundaries to generate hierarchical (substep/step) descriptions. |
| StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding | StreamAgent | 2025/08 | N/A | arXiv | Agent-as-Detector: Uses a separate, lightweight Anticipatory Agent (Small VLM) to act as a decision module. It plans and predicts future events to trigger the main responder only when necessary, decoupling decision from generation. |
| StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | StreamBridge | 2025/05 | Link | NeurIPS 2025 | Decoupled Activation Model: Uses a separate, lightweight Activation Model (e.g., 0.5B LLaVA) to detect "when to speak" (triggering), allowing the main offline Video-LLM to be plug-and-play for proactive streaming. Also uses Round-Decayed Compression for memory. |
| ViSpeak: Visual Instruction Feedback in Streaming Videos | ViSpeak | 2025/03 | Link | ICCV 2025 | Classification Head Trigger: Defines "Visual Instruction Feedback" tasks (e.g., visual wake-up, interruption). Uses a trained binary classification head (Informative Head) on top of the VLM to predict "when to speak" based on visual cues. |
| StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition | StreamMind | 2025/03 | Link | ICCV 2025 | Cognition Gate: Introduces an Event-Gated mechanism. A lightweight Cognition Gate (initialized from LLM shallow layers) continuously monitors the stream and only triggers/invokes the heavy LLM when relevant events occur, enabling 100 FPS processing. |
| EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild | EgoSpeak | 2025/02 | Link | NAACL 2025 | Classification Head Trigger: EgoSpeak outputs a continuous speak-probability that a conversational agent can use in real time (e.g., by triggering speech once the probability surpasses a threshold). |
| Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction | Dispider | 2025/01 | Link | CVPR 2025 | Disentangled Decision Module: Decouples Perception (streaming), Decision (when to speak), and Reaction (generation) into asynchronous modules. Uses a lightweight decision model to trigger the heavy reaction model only when needed. |
| VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | MMDuet | 2024/11 | Link | EMNLP 2025 | Dual-Head Trigger: Trains two binary classification heads (Informative Head & Relevance Head) to decide when to interrupt the video stream and generate a response. Enables "Duet" interaction format. |
| Streamlined Dense Video Captioning | SDVC | 2019/04 | Link | CVPR 2019 | Event Sequence Generation: Uses an Event Sequence Generation Network (Pointer Net) to adaptively select a sequence of event proposals, which then triggers the captioning network. (Note: Offline method). |
Models that monitor perplexity (PPL) spikes or uncertainty scores to initiate interaction. Each new frame is validated against the previously spoken content: low perplexity means the content is unchanged, so no new decoding is needed (stay silent); high perplexity signals new content in the frame and triggers a response.
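A minimal sketch of this verification idea, assuming a Hugging-Face-style causal LM interface (`caption_perplexity` and the threshold are illustrative; LiveStar's actual SVeD procedure differs in detail):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def caption_perplexity(model, context_ids, caption_ids):
    """PPL of the previously generated caption given context up to the new frame.

    context_ids: token history including the latest frame's visual tokens.
    caption_ids: tokens of the caption generated earlier, shape [1, C].
    """
    ids = torch.cat([context_ids, caption_ids], dim=-1)
    logits = model(ids).logits
    start = context_ids.shape[-1]
    # Logits at position i-1 predict the token at position i (next-token shift).
    log_probs = F.log_softmax(logits[:, start - 1:-1], dim=-1)
    token_lp = log_probs.gather(-1, caption_ids.unsqueeze(-1)).squeeze(-1)
    return math.exp(-token_lp.mean().item())

# if caption_perplexity(model, ctx, cap) > PPL_THRESHOLD: generate a new response
```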
| Paper | Model | Date | Link | Venue | Method / Key Contribution |
|---|---|---|---|---|---|
| LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | LiveStar | 2025/11 | Link | NeurIPS 2025 | PPL-Based Verification (SVeD): Uses Streaming Verification Decoding (SVeD) which calculates the perplexity (PPL) of the generated caption to verify its validity. If PPL indicates high confidence/necessity, it triggers a response; otherwise, it stays silent. |
Models that trigger responses based on significant changes in the visual stream or detected events. Frames with substantial visual changes often trigger new responses, while frames with minimal changes typically correspond to unchanged content from before.
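A minimal sketch of the underlying signal, using cosine similarity of pooled frame features (real systems such as TimeChat-Online's DTD operate on token-level redundancy rather than a single pooled vector; the threshold is an assumption):

```python
import torch.nn.functional as F

def visual_change_trigger(prev_feat, cur_feat, sim_threshold=0.85):
    """Return True when the scene changed enough to warrant a response.

    prev_feat / cur_feat: pooled feature vectors of consecutive frames.
    """
    sim = F.cosine_similarity(prev_feat.flatten(), cur_feat.flatten(), dim=0)
    return sim.item() < sim_threshold   # low similarity -> big change -> speak
```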
| Paper | Model | Date | Link | Venue | Method / Key Contribution |
|---|---|---|---|---|---|
| TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | TimeChat-Online | 2025/04 | Link | ACM MM 2025 | Visual Change Trigger: Uses Differential Token Drop (DTD) to prune redundant tokens. Monitors the token drop ratio; sudden drops indicate scene transitions, which serve as natural triggers for proactive responding. |
Methods focusing on optimizing the KV cache by evicting less important tokens (e.g., Heavy Hitter, Sliding Window).
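A representative policy combines a few attention-sink tokens with a recency window, as in the sketch below (tensor layout and budget values are generic assumptions, not any single paper's settings):

```python
import torch

def evict_kv(cache, num_sink=4, window=2048):
    """Keep a few 'attention sink' tokens plus a sliding recency window.

    cache: list of (key, value) pairs per layer, each of shape
           [batch, heads, seq_len, head_dim].
    """
    out = []
    for k, v in cache:
        if k.shape[2] <= num_sink + window:      # nothing to evict yet
            out.append((k, v))
            continue
        k = torch.cat([k[:, :, :num_sink], k[:, :, -window:]], dim=2)
        v = torch.cat([v[:, :, :num_sink], v[:, :, -window:]], dim=2)
        out.append((k, v))                       # bounded size from here on
    return out
```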
| Paper | Model | Date | Link | Venue | Method / Key Contribution |
|---|---|---|---|---|---|
| StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding | StreamingAssistant | 2025/12 | N/A | arXiv | Two-Dimensional Token Pruning: Introduces a novel redundancy metric (MSSAVT); video tokens are processed successively by a temporal pruning module and a spatial pruning module. |
| StreamingVLM: Real-Time Understanding for Infinite Video Streams | StreamingVLM | 2025/10 | Link | arXiv | Streaming-Aware KV Cache: Uses Attention Sinks + Sliding Window (Long Text + Short Vision) with Contiguous RoPE to enable infinite streaming without memory explosion or positional drift. Trains with overlapped-chunk full attention. |
| StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding | StreamMem | 2025/08 | Link | arXiv | Query-Agnostic Compression: Uses standard chat template tokens as Proxy Queries to calculate attention scores for Pruning and Merging KV cache, maintaining a fixed memory budget without needing the actual user query. |
| StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling | StreamVLN | 2025/07 | Link | arXiv | SlowFast Context (Pruning): Combines a Sliding Window (Fast Path) for recent dialogue with a 3D-Aware Token Pruning (Slow Path) to compress historical visual states into a compact memory, enabling long-horizon navigation. |
| InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding | InfiniPot-V | 2025/06 | Link | NeurIPS 2025 | Continual KV Compression: Maintains a fixed memory budget by periodically compressing the KV cache using Temporal-axis Redundancy (TaR) (evicting repetitive frames) and Value-Norm (VaN) (keeping semantically important tokens). |
| SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | StreamingChat | 2025/02 | Link | ICLR 2025 | Segment-Based KV Cache Bypass: Introduces a training and inference paradigm that splits long videos into sequential segments and conducts multi-turn dialogues per segment, avoiding unbounded KV cache growth. |
Methods that compress history into events, super-tokens, or hierarchical structures.
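One recurring recipe is clustering: compress the growing token history into a fixed number of centroids so memory stays constant regardless of stream length. A simplified k-means sketch in the spirit of StreamingDVC and Flash-VStream's synopsis memory (initialization and iteration count are assumptions):

```python
import torch

def consolidate_memory(tokens: torch.Tensor, budget: int, iters: int = 10):
    """Compress [N, D] visual tokens into at most `budget` centroids."""
    if tokens.shape[0] <= budget:
        return tokens
    # Initialize centroids from evenly spaced tokens to keep temporal spread.
    idx = torch.linspace(0, tokens.shape[0] - 1, budget).long()
    centroids = tokens[idx].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centroids).argmin(dim=1)   # [N]
        for c in range(budget):
            members = tokens[assign == c]
            if members.numel() > 0:
                centroids[c] = members.mean(dim=0)
    return centroids   # fixed-size memory, regardless of stream length
```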
| Paper | Model | Date | Link | Venue | Method / Key Contribution |
|---|---|---|---|---|---|
| Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models | Think While Watching | 2026/03 | Link | arXiv | Streaming reasoning: Interleaves incremental observation and reasoning under causal streaming constraints. |
| FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding | FluxMem | 2026/03 | Link | CVPR 2026 | Training-Free Adaptive Hierarchical Memory: Maintains three-level memory (short/mid/long-term). Temporal Adjacency Selection (TAS) folds redundant visual tokens from adjacent frames into mid-term memory; Spatial Domain Consolidation (SDC) merges spatially repetitive regions into long-term memory. A self-adaptive compression mechanism auto-determines the compression rate from scene statistics. Achieves 76.4 on StreamingBench and 67.2 on OVO-Bench, reducing latency by 69.9% and peak GPU memory by 34.5%. |
| HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding | HERMES | 2026/01 | N/A | arXiv | Hierarchical KV Cache Memory: Conceptualizes the KV cache as a hierarchical memory framework encapsulating video information across multiple granularities. Reuses compact KV cache for efficient streaming under resource constraints, achieving 10× faster TTFT. |
| VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs | VideoScaffold | 2025/12 | Link | arXiv | Elastic-Scale Event Hierarchy: Introduces Elastic-Scale Event Segmentation (EES) with prediction-guided boundary refinement to dynamically adjust event granularity under causal streaming constraints, and Hierarchical Event Consolidation (HEC) to aggregate multi-level event representations from fine-grained frames to abstract events, preserving temporal continuity and semantic coherence. |
| video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory | video-SALMONN S | 2025/10 | N/A | arXiv | TTT Memory: Uses Test-Time Training (TTT) layers to compress video history into model weights (hidden state) + Prompt-dependent memory reading to extract relevant info from fixed-size memory. First to process >3h video at 1FPS. |
| StreamForest: Efficient Online Video Understanding with Persistent Event Memory | StreamForest | 2025/09 | Link | NeurIPS 2025 | Tree-Structured Event Memory: Organizes video frames into a Persistent Event Memory Forest (tree structure). Adaptively merges event nodes based on penalty functions (time, similarity, merge count) to maintain long-term history within a fixed token budget. |
| OVG-HQ: Online Video Grounding with Hybrid-modal Queries | OVG-HQ-Unify | 2025/08 | Link | ICCV 2025 | Parametric Memory (TTT): Uses a Parametric Memory Block (PMB) instantiated with a Test-Time Training (TTT) layer to compress historical video context into network parameters for online grounding. Supports hybrid-modal queries (text/image/video). |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Flash-VStream | 2025/06 | Link | ICCV 2025 | Flash Memory: Two-process framework with 1. Context Synopsis Memory (CSM): Compresses history via K-means clustering (summarization). 2. Detail Augmentation Memory (DAM): Retrieves high-res spatial details for key frames based on CSM distribution. |
| Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding | ProVideLLM | 2025/04 | Link | ICCV 2025 | Verbalized Memory: Maintains a multimodal cache by verbalizing long-term video history into text steps (summarization) while keeping short-term history as visual tokens (extracted by DETR-QFormer), enabling extremely efficient streaming. |
| VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers | VideoScan | 2025/03 | Link | arXiv | Semantic Carrier Token: Compresses each video frame into a single Semantic Carrier Token via average pooling to serve as a compact memory. Uses a feature duplication-based eviction strategy to maintain a fixed memory bank size. |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | StreamChat | 2025/01 | Link | ICLR 2025 | Hierarchical Memory Tree: Builds a long-term memory tree by clustering and captioning video chunks. Uses a parallel scheduling system to update memory and retrieve relevant context for multi-turn dialogue. |
| Online Video Understanding: OVBench and VideoChat-Online | VideoChat-Online | 2025/01 | Link | CVPR 2025 | Pyramid Memory Bank: Uses a hierarchical memory bank that balances spatial detail and temporal context across multiple levels, maintaining a fixed token budget for real-time streaming. |
| VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges | VideoLLaMB | 2024/09 | Link | ICCV 2025 | Recurrent Memory Bridge: Uses SceneTiling to segment video into semantic clips. Compresses clips into Memory Tokens via recurrent bridge layers, which are periodically updated via retrieval, enabling long-context understanding with linear memory scaling. |
| Streaming Long Video Understanding with Large Language Models | VideoStreaming | 2024/05 | N/A | NeurIPS 2024 | Memory-Propagated Encoding: Segments video into clips and encodes them into condensed memories using a small LLM, with memory propagated recursively. Uses Adaptive Memory Selection to retrieve relevant clips for QA. |
| Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline | VideoNarrator | 2024/05 | Link | ACL 2024 | Memory Consolidation: Defines "Synchronized Video Storytelling". Uses Memory Consolidation to merge past visual tokens into fixed-length memory, and generates narrations guided by a structured storyline. |
| Streaming Dense Video Captioning | StreamingDVC | 2024/04 | Link | CVPR 2024 | Clustering-Based Memory: Compresses incoming visual tokens into a fixed-size memory using K-means clustering. Uses a streaming decoding algorithm to output captions before the entire video is processed. |
Methods employing external memory banks and retrieval systems.
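The common skeleton: encode each segment once, keep its payload (e.g., an offloaded KV cache) off-GPU, and reload only the top-K segments relevant to the incoming query. A generic sketch (class and method names are illustrative):

```python
import torch
import torch.nn.functional as F

class SegmentMemoryBank:
    """Store per-segment embeddings and retrieve the top-K relevant segments."""

    def __init__(self):
        self.keys, self.payloads = [], []

    def add(self, seg_embedding: torch.Tensor, payload) -> None:
        self.keys.append(F.normalize(seg_embedding, dim=-1))
        self.payloads.append(payload)            # kept on CPU/disk until needed

    def retrieve(self, query_embedding: torch.Tensor, k: int = 3):
        sims = torch.stack(self.keys) @ F.normalize(query_embedding, dim=-1)
        top = sims.topk(min(k, len(self.keys))).indices.tolist()
        return [self.payloads[i] for i in top]   # reload these for answering
```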
| Paper | Model | Date | Link | Venue | Method / Key Contribution |
|---|---|---|---|---|---|
| WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs | WeaveTime | 2026/02 | Link | CVPR 2026 | Temporal Reconstruction + Past-Current Dynamic Focus: Introduces Temporal Reconstruction (Streaming Order Perception) to instill order-aware representations. At inference, uses a Past-Current Dynamic Focus Cache for uncertainty-triggered, coarse-to-fine retrieval. Plugs into existing Video-LLMs without architectural changes. |
| V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval | V-Rex | 2025/12 | N/A | HPCA 2026 | Software-Hardware Co-Designed Accelerator: A training-free dynamic KV cache retrieval algorithm (ReSV) paired with a dynamic KV cache retrieval engine (DRE). |
| Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding | Venus | 2025/12 | N/A | IEEE INFOCOM 2026 | Edge-Cloud Disaggregated Architecture: Sinks memory construction and keyframe retrieval from the cloud to the edge, operating in two stages: Ingestion (builds a hierarchical memory) and Querying (employs threshold-based progressive sampling). |
| CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding | CacheFlow | 2025/11 | N/A | arXiv | Consensus-First Retrieval: Offloads KV cache to CPU. Compresses old blocks using a GRU-based memory. Retrieves top-K blocks based on a consensus score from shallow and deep layers, rehydrating them to GPU for inference. |
| StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression | StreamKV | 2025/11 | Link | arXiv | Segment-based Retrieval: Partitions video into semantic segments and uses a Guidance Prompt to compress KV cache. Stores compressed KVs in a bank and retrieves relevant segments based on user query for QA. |
| StreamingTOM: Streaming Token Compression for Efficient Video Understanding | StreamingTOM | 2025/10 | Link | arXiv | Two-stage Framework: 1. CTR (Pre-LLM): Prunes input tokens based on temporal redundancy to speed up prefill. 2. OQM (Post-LLM): Stores 4-bit quantized KV groups and retrieves Top-K relevant groups on-demand for decoding. |
| Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs | rLiVS | 2025/10 | N/A | arXiv | Caption-Based Retrieval: 1. Token Selection: Uses LLM attention scores to select top ~5% visual tokens and passes them recurrently. 2. Retrieval: Generates captions for clips and retrieves top-K text captions to answer user queries, avoiding heavy KV storage. |
| CogStream: Context-guided Streaming Video Question Answering | CogReasoner | 2025/06 | Link | arXiv | Dialogue Retrieval & Visual Compression: 1. Visual Stream Compression: Clusters frames into events and compresses based on question relevance. 2. Historic Dialogue Retrieval: Uses LLM to retrieve relevant past QA pairs to support current reasoning. |
| LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | LiveVLM | 2025/05 | N/A | arXiv | Streaming-Oriented KV Cache & Retrieval: 1. Compresses video KV pairs via attention-based pruning and frame-wise merging. 2. Retrieves relevant long-term KV chunks based on query attention scores to answer questions efficiently. |
| Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | ReKV | 2025/03 | Link | ICLR 2025 | KV-Cache Retrieval: Offloads video KV caches to CPU/Disk. Upon receiving a query, it retrieves and reloads only the relevant KV caches to GPU for efficient answer generation, decoupling encoding from QA. |
Methods reducing FLOPs via dynamic compute, sparse attention, or efficient backbone designs.
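A simple member of this family is feature caching: skip the vision encoder whenever the incoming frame is a near-duplicate of the previous one. A sketch in the spirit of STC-Cacher (the downsampled-pixel similarity proxy and threshold are assumptions of this sketch):

```python
import torch
import torch.nn.functional as F

class FrameFeatureCache:
    """Reuse ViT features for near-duplicate consecutive frames."""

    def __init__(self, vit, sim_threshold: float = 0.95):
        self.vit, self.thr = vit, sim_threshold
        self.prev_pix, self.prev_feat = None, None

    @torch.no_grad()
    def encode(self, frame: torch.Tensor) -> torch.Tensor:
        # Cheap proxy: cosine similarity of downsampled pixels [C, H, W].
        proxy = F.adaptive_avg_pool2d(frame, 16).flatten()
        if self.prev_pix is not None:
            sim = F.cosine_similarity(proxy, self.prev_pix, dim=0)
            if sim > self.thr:
                return self.prev_feat            # skip the expensive encoder
        self.prev_pix = proxy
        self.prev_feat = self.vit(frame.unsqueeze(0))   # full ViT forward
        return self.prev_feat
```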
| Paper | Model | Date | Link | Venue | Method / Key Contribution |
|---|---|---|---|---|---|
| Accelerating Streaming Video Large Language Models via Hierarchical Token Compression | STC | 2025/12 | Link | arXiv | Hierarchical Token Compression: STC-Cacher caches/reuses features of temporally similar frames to reduce ViT encoding, and STC-Pruner compresses visual tokens before LLM prefill by retaining salient tokens based on spatial-temporal relevance (novelty). Designed as a plug-and-play module that can be integrated into diverse streaming frameworks (e.g., ReKV, StreamForest, Dispider, LiveCC) without altering their core logic. |
| Learning Streaming Video Representation via Multitask Training | StreamFormer | 2025/04 | Link | ICCV 2025 | Efficient Streaming Backbone: Introduces Causal Temporal Attention into Vision Transformers to enable efficient frame-by-frame processing. Trained via Multitask Learning (classification, detection, segmentation) to learn robust spatiotemporal representations. |
| Learning from Streaming Video with Orthogonal Gradients | / | 2025/04 | N/A | CVPR 2025 | Orthogonal Optimizer: Employs orthogonal gradients to reduce correlations between consecutive gradients, thereby enhancing the model's learning performance on continuous video streams. |
| VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | VITA-1.5 | 2025/01 | Link | NeurIPS 2025 | Multi-Stage Training Methodology: Enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. |
| StreamChat: Chatting with Streaming Video | StreamChat | 2024/12 | Link | CVPR 2025 | Cross-Attention Streaming Architecture: Dynamically updates visual context during decoding via lightweight cross-attention, enhanced with V-FFN refinement and parallel 3D-RoPE for stable temporal alignment, enabling real-time streaming interaction without trigger modules. |
| Streaming Detection of Queried Event Start | SDQES | 2024/12 | N/A | NeurIPS 2024 | Adapter-Based Approach: Proposes a novel task, Streaming Detection of Queried Event Start, as well as new task-specific metrics. |
| Paper | Dataset | Date | Link | Venue | Tasks |
|---|---|---|---|---|---|
| StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios | StreamEQA | 2025/12 | N/A | arXiv | Two orthogonal dimensions: Embodied (perception, interaction, and planning) and Streaming (backward, real-time, and forward reasoning) |
| StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | StreamingCoT | 2025/10 | Link | ACM MM 2025 | Streaming VideoQA, CoT Reasoning |
| StreamForest: Efficient Online Video Understanding with Persistent Event Memory | ODV-Bench | 2025/09 | Link | NeurIPS 2025 | Streaming VideoQA (Autonomous Driving), Real-time Perception, Future Prediction (Risk/Trajectory), Past Memory |
| OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | OST-Bench | 2025/07 | Link | NeurIPS 2025 | Online Spatio-Temporal QA, Agent State Estimation, 3D Spatial Reasoning, Memory Retrieval |
| RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | RTV-Bench | 2025/05 | Link | NeurIPS 2025 | Real-Time Video Reasoning, Sports/Driving/Ego Scenarios, Hierarchical Evaluation |
| Online Video Understanding: OVBench and VideoChat-Online | OVBench | 2025/04 | Link | CVPR 2025 | Online VideoQA, Past Memory, Future Prediction, Spatial Perception |
| EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild | YT-Conversation | 2025/02 | Link | NAACL 2025 | In-the-wild conversation data from diverse YouTube content (interviews, podcasts, casual dialogues) for learning when to speak |
| SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | SVBench | 2025/02 | Link | ICLR 2025 | Streaming VideoQA, Temporal Multi-Turn Dialogue, Long-Context Reasoning |
| Streaming Video Understanding and Multi-Round Interaction with Memory-Enhanced Knowledge | StreamBench | 2025/01 | Link | ICLR 2025 | Streaming VideoQA, Multi-turn Dialogue, Long/Short-term Memory, Object Search |
| StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | StreamingBench | 2024/11 | Link | arXiv | Real-time Visual QA, Omni-source QA, Contextual QA |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | TemporalBench | 2024/10 | Link | arXiv | Fine-grained Video Descriptions, Video QA, Video Captioning, Long Video Understanding |
| Paper | Dataset | Date | Link | Venue | Tasks |
|---|---|---|---|---|---|
| LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | OmniStar-RNG | 2025/11 | Link | NeurIPS 2025 | Real-time Narration, Streaming Dense Captioning, Streaming Video-Text Alignment |
| LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | Live-CC-5M | 2025/04 | Link | CVPR 2025 | Large-scale Pre-training, Streaming Captioning (ASR-based), Video-Text Alignment |
| LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | Live-WhisperX-526K | 2025/04 | Link | CVPR 2025 | Real-time Video Commentary, Instruction Tuning, Dense Streaming Captioning |
| What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction | QEVD-FIT-COACH | 2024/07 | Link | NeurIPS 2024 | Fitness Activity Recognition and Guidance |
| Paper | Dataset | Date | Link | Venue | Tasks |
|---|---|---|---|---|---|
| StreamReady: Learning What to Answer and When in Long Streaming Videos | StreamReady | 2026/03 | N/A | CVPR 2026 | Answer Readiness Score (ARS): Introduces a timing-aware objective with asymmetric early and late penalties. Proposes the ProReady-QA benchmark with annotated answer-evidence windows and proactive multi-turn questions. Uses a lightweight readiness mechanism to decide whether sufficient evidence has been observed before responding. |
| StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos | StreamGaze | 2025/12 | N/A | arXiv | Gaze-Triggered Alert, Object Transition Prediction, Gaze Sequence Matching |
| Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? | QICD | 2025/11 | Link | NeurIPS 2025 | Streaming Dialogue, Proactive Response Generation, Response Timing (When to speak) |
| Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video | ESTP-Bench | 2025/10 | Link | NeurIPS 2025 | Proactive QA, Just-in-Time Response, Egocentric Reasoning |
| ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models | ProactiveVideoQA | 2025/07 | Link | arXiv | Proactive VideoQA (Web/Ego/TV), Response Timing Evaluation, Anomaly Detection |
| Proactive Assistant Dialogue Generation from Streaming Egocentric Videos | PROASSIST | 2025/06 | Link | EMNLP 2025 | Proactive Task Guidance, Streaming Dialogue, Response Timing (When to speak) |
| OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | OmniMMI | 2025/03 | Link | arXiv | Streaming Video Understanding (State Grounding, Action Planning), Proactive Reasoning (Alerting, Turn-Taking) |
| AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis | VAPDA-127K | 2025/03 | N/A | arXiv | Proactive Anomaly Prediction, Online Anomaly Detection, Interactive Anomaly Analysis |
| OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | OVO-Bench | 2025/01 | Link | CVPR 2025 | Forward Active Responding (When to Answer), Backward Tracing, Real-time Perception |
| VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | MMDuetIT | 2024/11 | Link | EMNLP 2025 | Multi-Answer Grounded QA, Proactive Response Generation |
| Name | Venue |
|---|---|
| AI Coach | CVPR 2026 |
Models
Benchmarks & Datasets
We welcome contributions! To add a resource, you can:
- Open a pull request with a clear title and brief description of your changes.
- Open an issue with a clear title and short explanation.
If you notice any errors, feel free to open an issue; we apologize in advance for any inconvenience.
If you have suggestions or find this project useful, we'd love to hear from you.
Email: yangzhenyu2022@ia.ac.cn and zhangkr2025@shanghaitech.edu.cn
