1. Core Concepts & General World Models
- World-Simulator · [Paper] [Code]
- Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond · [Paper] [Code]
- Understanding World or Predicting Future? A Comprehensive Survey of World Models · [Paper] [Code]
- World Models in AI: Like a Child · [Paper] [Code]
- The Trinity of Consistency as a Defining Principle for General World Models · [Paper] [Code]
- Learning to Model the World: A Survey of World Models in Artificial Intelligence · [Paper]
2. World Representation & Generation
- 3D and 4D World Modeling: A Survey · [Paper] [Code]
- Exploring the Evolution of Physics Cognition in Video Generation · [Paper] [Code]
- A Survey of Interactive Generative Video · [Paper] [Code]
- From 2D to 3D Cognition · [Paper] [Code]
- Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey · [Paper] [Code]
- 3D Representation · [Paper] [Code]
3. Application: Embodied AI
4. Application: Autonomous Driving
- The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey · [Paper] [Code]
- A Survey of World Models for Autonomous Driving · [Paper] [Code]
- World Models for Autonomous Driving: An Initial Survey · [Paper] [Code]
- Interplay Between Video Generation and World Models in Autonomous Driving · [Paper] [Code]
5. Safety, Efficiency & Learning Methods
6. Awesome Lists
- Awesome World Model Evolution - Forging the World Model Universe from Unified Multimodal Models · [Paper] [Code]
- Awesome-Efficient-Video-Generation · [Paper] [Code]
- Awesome-World-Models · [Paper] [Code]
- OpenMM-Arena · [Paper] [Code]
- Awesome-From-Video-Generation-to-World-Model · [Paper] [Code]
- Awesome-Video-World-Models-with-AR-Diffusion · [Paper] [Code]
- Unified Operator on Interactive World Model · [Paper] [Code]
- Awesome-VLA-for-AD · [Paper] [Code]
7. Position Papers
- A Path Towards Autonomous Machine Intelligence · [Paper]
- Critiques of World Models · [Paper]
- Positional Encoding Field · [Paper] [Code]
- Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models · [Paper] [Code]
- Research on World Models Is Not Injecting World Knowledge into Specific Tasks · [Paper] [Code]
spatial reasoning details
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence · [Paper] [Code]
- SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors · [Paper] [Code]
- SpatialBot: Precise Spatial Understanding with Vision Language Models · [Paper] [Code]
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models · [Paper] [Code]
- SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning · [Paper] [Code]
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities · [Paper] [Code]
- SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models · [Paper] [Code]
- SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning · [Paper] [Code]
- Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models · [Paper] [Code]
- LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning · [Paper] [Code]
- SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models · [Paper] [Code]
- SpaceVista: All-Scale Visual Spatial Reasoning from mm to km · [Paper] [Code]
- Grounded Reinforcement Learning for Visual Reasoning · [Paper] [Code]
- SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning · [Paper] [Code]
- 3D Aware Region Prompted Vision Language Model · [Paper] [Code]
- 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding · [Paper] [Code]
- RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics · [Paper] [Code]
- Continuous 3D Perception Model with Persistent State · [Paper] [Code]
- Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs · [Paper] [Code]
- RL makes MLLMs see better than SFT · [Paper] [Code]
- Identifying and Mitigating Position Bias of Multi-image Vision-Language Models · [Paper] [Code]
- Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces · [Paper] [Code]
- Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes · [Paper] [Code]
- Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model · [Paper] [Code]
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models · [Paper] [Code]
- Cambrian-S · [Paper] [Code]
- GS-reasoner · [Paper] [Code]
- UniUGG · [Paper] [Code]
- SenseNova-SI · [Paper] [Code]
omni reasoning details
- DenseWorld · [Paper] [Code]
- MindJourney · [Paper] [Code]
- Think with 3D · [Paper] [Code]
- Factored Interactive Object-Centric World Model · [Paper] [Code]
- 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation · [Paper] [Code]
- VIRAL: Visual Representation Alignment for MLLMs · [Paper] [Code]
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction · [Paper] [Code]
- Video models are zero-shot learners and reasoners · [Paper] [Code]
- MLLMs Need 3D-Aware Representation Supervision for Scene Understanding (3DRS) · [Paper] [Code]
- Diffusion Feedback Helps CLIP See Better (DIVA) · [Paper] [Code]
- Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors · [Paper] [Code]
- Thinking by Doing · [Paper] [Code]
- GalaxyWalker · [Paper] [Code]
interactive video generation details
video camera view editing details
navigation video generation details
- Mirage-2 · [Paper] [Code]
- Yan · [Paper] [Code]
- Context-as-Memory · [Paper] [Code]
- Hunyuan-GameCraft-1.0 · [Paper] [Code]
- Hunyuan-GameCraft-2.0 · [Paper] [Code]
- Hunyuan-Worldplay · [Paper] [Code]
- Matrix-Game · [Paper] [Code]
- Matrix-Game 2.0 · [Paper] [Code]
- Matrix-Game 3.0 · [Paper] [Code]
- Longvie-2 · [Paper] [Code]
- Hunyuan-Game · [Paper] [Code]
- Yume1.5 · [Paper] [Code]
- GameNGen · [Paper] [Code]
- Oasis-2.0 · [Paper] [Code]
- WorldExplorer · [Paper] [Code]
- Tinyworlds · [Paper] [Code]
- Videofrom3D · [Paper] [Code]
- LYRA · [Paper] [Code]
- Fantasyworld · [Paper] [Code]
- Navigation World Models · [Paper] [Code]
- WonderWorld · [Paper] [Code]
- Clone Deterministic 3D Worlds with Geometrically-Regularized World Models · [Paper] [Code]
- NFD · [Paper] [Code]
- WorldCache · [Paper] [Code]
- Solaris · [Paper] [Code]
- Infinite-World · [Paper] [Code]
general video generation model
- Helios · [Paper] [Code]
- veo-3 · [Paper] [Code]
- wan-2.5 · [Paper] [Code]
- Waver · [Paper] [Code]
- Sora · [Paper] [Code]
- Seedance · [Paper] [Code]
- Kling · [Paper] [Code]
- Lynx · [Paper] [Code]
- CameraCtrl · [Paper] [Code]
- MotionCtrl · [Paper] [Code]
- Panacea · [Paper] [Code]
- WorldWeaver · [Paper] [Code]
- Longlive · [Paper] [Code]
- Vchain · [Paper] [Code]
- Videocanvas · [Paper] [Code]
- SANA-video · [Paper] [Code]
- Hunyuan-Video · [Paper] [Code]
- CogVideoX · [Paper] [Code]
- longcat-video · [Paper] [Code]
- Pretrained Video Generative Models as World Simulators · [Paper] [Code]
- SFP · [Paper] [Code]
audio generation details
brain signal corresponding details
feature matching & point tracking details
multi-view stereo details
general 3D generation details
- Matrix-3D · [Paper] [Code]
- GenEx · [Paper] [Code]
- HunyuanWorld · [Paper] [Code]
- dreamcube · [Paper] [Code]
- ViPE · [Paper] [Code]
- VGGT · [Paper] [Code]
- LGM · [Paper] [Code]
- GS-LRM · [Paper] [Code]
- LVSM · [Paper] [Code]
- DUSt3R · [Paper] [Code]
- MASt3R · [Paper] [Code]
- sam-3d-object · [Paper] [Code]
- tripo3D · [Paper] [Code]
- MeshAnything · [Paper] [Code]
- BPT · [Paper] [Code]
- TreeMeshGPT · [Paper] [Code]
- DeepMesh · [Paper] [Code]
- MeshMosaic · [Paper] [Code]
- π3 · [Paper] [Code]
- DA2 · [Paper] [Code]
- EvoWorld · [Paper] [Code]
- omniVGGT · [Paper] [Code]
- M3arsSynth · [Paper] [Code]
- PartCrafter · [Paper] [Code]
- nano3D · [Paper] [Code]
- hunyuan3d · [Paper] [Code]
- hitem3d · [Paper] [Code]
- HunyuanWorld-Voyager · [Paper] [Code]
- Skybox · [Paper] [Code]
- WVD · [Paper] [Code]
- Flashworld · [Paper] [Code]
- Real-Time Frame Model (RTFM) · [Paper] [Code]
- WorldMirror · [Paper] [Code]
- Skyfall-GS · [Paper] [Code]
- Worldgrow · [Paper] [Code]
- Continuous 3D Perception Model with Persistent State (CUT3R) · [Paper] [Code]
- WonderZoom · [Paper] [Code]
- EchoWorld · [Paper] [Code]
- LoGeR · [Paper] [Code]
- Utonia · [Paper] [Code]
- inspatio-world · [Paper] [Code]
- DROID-W · [Paper] [Code]
- VGGT-World · [Paper] [Code]
- Marble 1.1 Plus · [Paper] [Code]
general 4D generation details
simulator details
JEPA Family Models
- I-JEPA · [Paper] [Code]
- MC-JEPA · [Paper] [Code]
- V-JEPA · [Paper] [Code]
- Point-JEPA · [Paper] [Code]
- 3D-JEPA · [Paper] [Code]
- ACT-JEPA · [Paper] [Code]
- V-JEPA 2 · [Paper] [Code]
- Audio-JEPA · [Paper] [Code]
- LeJEPA · [Paper] [Code]
- Causal-JEPA · [Paper] [Code]
- LeWorldModel · [Paper] [Code]
- Think-JEPA · [Paper] [Code]
reasoning memory details
synthesis memory details
embodied ai models details
- pi0 · [Paper] [Code]
- pi0.5 · [Paper] [Code]
- pi0.6 · [Ckpt]
- gigabrain · [Paper] [Code]
- wall-oss · [Paper] [Code]
- spirit-v1.5 · [Paper] [Code]
- Evo-0 · [Paper] [Code]
- World-env · [Paper] [Code]
- Masquerade · [Paper] [Code]
- Uwm · [Paper] [Code]
- brickgpt · [Paper] [Code]
- V-JEPA 2 · [Paper] [Code]
- CtrlWorld · [Paper] [Code]
- MomaGEN · [Paper] [Code]
- WorldVLA · [Paper] [Code]
- PhysWorld · [Paper] [Code]
- RoboTracer · [Paper] [Code]
- OSVI-WM · [Paper] [Code]
- LPWM · [Paper] [Code]
- ACoT-VLA · [Paper] [Code]
- GigaWorld-Policy · [Paper] [Code]
embodied ai with video generation
auto driving details
dataset details
- OmniWorld-Game · [Paper] [Code]
- AgiBot · [Paper] [Code]
- DROID · [Paper] [Code]
- RH20T · [Paper]
- HOI4D · [Paper]
- Epic-Kitchens · [Paper]
- Ego-Exo4D · [Paper]
- HoloAssist · [Paper]
- Assembly101 · [Code]
- EgoDex · [Paper]
- CityWalk · [Paper] [Access]
- GameFactory-Dataset · [Paper] [Access]
- Look and Tell · [Paper] [Access]
- WildWorld · [Paper] [Code]
- Sekai · [Paper] [Code]
data curation framework
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning · [Paper]
video/image reasoning benchmark details
- MMWorld · [Paper] [Code]
- MLVU · [Paper] [Code]
- FavorBench · [Paper] [Code]
- Videoverse · [Paper] [Code]
- VinoGround · [Paper] [Code]
- ShortVidBench · [Paper] [Code]
- Motion-Bench · [Paper] [Code]
- OpenMM-Arena · [Code]
- Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models · [Paper] [Code]
interactive video/image generation benchmark details
navigation generation benchmark details
spatial intelligence
- VSI-Bench · [Paper] [Code]
- SITE · [Paper] [Code]
- MMSI-Bench · [Paper] [Code]
- OmniSpatial · [Paper] [Code]
- MindCube · [Paper] [Code]
- Stare · [Paper] [Code]
- CoreCognition · [Paper] [Code]
- SpatialViz-Bench · [Paper] [Code]
- EASI · [Paper] [Code]
- DSI-Bench: A Benchmark for Dynamic Spatial Intelligence · [Paper] [Code]
- Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes · [Paper] [Code]
- Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes · [Paper] [Code]
- Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models · [Paper] [Code]
- VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning · [Paper] [Code]
- From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D · [Paper] [Code]
- MMGR · [Paper] [Code]
- Phyx · [Paper] [Code]
- Seephys · [Paper] [Code]