
Awesome World Models Survey

1. Core Concepts & General World Models
  • World-Simulator · [Paper] [Code]
  • Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond · [Paper] [Code]
  • Understanding World or Predicting Future? A Comprehensive Survey of World Models · [Paper] [Code]
  • World Models in AI: Like a Child · [Paper] [Code]
  • The Trinity of Consistency as a Defining Principle for General World Models · [Paper] [Code]
  • Learning to Model the World: A Survey of World Models in Artificial Intelligence · [Paper]
2. World Representation & Generation
3. Application: Embodied AI
4. Application: Autonomous Driving
  • The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey · [Paper] [Code]
  • A Survey of World Models for Autonomous Driving · [Paper] [Code]
  • World Models for Autonomous Driving: An Initial Survey · [Paper] [Code]
  • Interplay Between Video Generation and World Models in Autonomous Driving · [Paper] [Code]
5. Safety, Efficiency & Learning Methods
6. Awesome Lists
7. Position Papers
  • A Path Towards Autonomous Machine Intelligence · [Paper]
  • Critiques of World Models · [Paper]
  • Positional Encoding Field · [Paper] [Code]
  • Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models · [Paper] [Code]
  • Research on World Models Is Not Injecting World Knowledge into Specific Tasks · [Paper] [Code]

World Model - Reasoning

spatial reasoning

spatial reasoning details
  • Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence · [Paper] [Code]
  • SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors · [Paper] [Code]
  • SpatialBot: Precise Spatial Understanding with Vision Language Models · [Paper] [Code]
  • Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models · [Paper] [Code]
  • SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning · [Paper] [Code]
  • SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities · [Paper] [Code]
  • SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models · [Paper] [Code]
  • SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning · [Paper] [Code]
  • Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models · [Paper] [Code]
  • LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning · [Paper] [Code]
  • SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models · [Paper] [Code]
  • SpaceVista: All-Scale Visual Spatial Reasoning from mm to km · [Paper] [Code]
  • Grounded Reinforcement Learning for Visual Reasoning · [Paper] [Code]
  • SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning · [Paper] [Code]
  • 3D Aware Region Prompted Vision Language Model · [Paper] [Code]
  • 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding · [Paper] [Code]
  • RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics · [Paper] [Code]
  • Continuous 3D Perception Model with Persistent State · [Paper] [Code]
  • Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs · [Paper] [Code]
  • RL makes MLLMs see better than SFT · [Paper] [Code]
  • Identifying and Mitigating Position Bias of Multi-image Vision-Language Models · [Paper] [Code]
  • Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces · [Paper] [Code]
  • Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes · [Paper] [Code]
  • COARSE CORRESPONDENCES Boost Spatial-Temporal Reasoning in Multimodal Language Model · [Paper] [Code]
  • Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models · [Paper] [Code]
  • Cambrian-S · [Paper] [Code]
  • GS-reasoner · [Paper] [Code]
  • UniUGG · [Paper] [Code]
  • SenseNova-SI · [Paper] [Code]

omni reasoning

omni reasoning Details

World Model - Multimodal Synthesis

interactive video generation

interactive video generation Details
video camera view editing details

navigation video generation

navigation video generation Details

(long-term) video generation

general video generation model

audio generation

audio generation details

brain signal

brain signal details
  • Artificial Hippocampus Networks for Efficient Long-Context Modeling · [Paper] [Code]

World Model - Simulator and Representation

Feature Matching & Point Tracking

feature matching & point tracking details

Multi-View Stereo (MVS)

multi-view stereo details

3D generation

general 3D generation details

4D generation

general 4D generation details

Simulator

simulator details

Joint-Embedding Predictive Architecture (JEPA)

JEPA Family Models

World Model - Memory

reasoning memory

reasoning memory details

synthesis memory

synthesis memory details

World Model - VLA

embodied ai

embodied ai models details
embodied ai with video generation
embodied ai with 3D generation

auto-driving

auto-driving details

Other World Model-Related Works

Datasets

Dataset details
data curation framework
  • Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning · [Paper]

Benchmark

video/image reasoning benchmark details
interactive video/image generation benchmark details
navigation generation benchmark details
3D/4D reasoning benchmark details
3D/4D generation benchmark details
spatial intelligence

World Knowledge Model

World Knowledge Editing
Code generation
Detection

World Model Training

world model training methods details