Awesome Multimodal Intelligence


A curated collection for multimodal intelligence research, covering VLMs, VLAs, World Models, and embodied AI: papers, open-source code, and datasets tracking next-generation agent technologies from perception to decision-making.

If you find this repository helpful, please consider starring ⭐ or sharing ⬆️. Thanks!


📌 News

  • 2026.4: Major VLM update (added SigLIP/SigLIP2, Qwen2.5-VL, Qwen3-VL, InternVL3, Molmo, Emu3, and more); launched dedicated World Models and Embodied AI collections; completed the training datasets and evaluation benchmarks.

📂 Collection Index

| Topic | Description | Link |
| --- | --- | --- |
| 🖼️ Vision Language Models (VLMs) | Perception, understanding, and multimodal reasoning | 📄 View |
| 🤖 Vision Language Action Models (VLAs) | From perception to physical decision-making | 📄 View |
| 🌍 World Models | Environment modeling and predictive planning | 📄 View |
| 🧠 Embodied AI | Unified perception, planning, and execution | 📄 View |

🗺️ Research Landscape

Perception ──► Understanding ──► Reasoning ──► Planning ──► Action
    │               │                │              │           │
   VLMs            VLMs             VLMs           VLAs        VLAs
                                  World Models   World Models  Embodied AI
  • VLMs bridge visual perception and language understanding, serving as the perceptual foundation of the agent stack.
  • VLAs add action outputs on top of VLMs, closing the loop from perception to decision end to end.
  • World Models capture environment dynamics, providing predictive priors for planning.
  • Embodied AI integrates all of the above into general-purpose agents for the real world.

🔖 Quick Links

  • Awesome VLMs →

    • Contrastive Pre-training (CLIP, SigLIP, SigLIP2, EVA-CLIP, MetaCLIP)
    • Generative VLMs (Flamingo, BLIP-2, Emu3, Molmo)
    • Instruction-Tuned VLMs (LLaVA series, InternVL3, Qwen3-VL, Qwen2.5-VL, Idefics3)
    • Grounding & Localization (KOSMOS-2, Grounding DINO 1.5, SAM2)
    • Efficient VLMs (PaliGemma2, Phi-4-Vision, SmolVLM)
    • Proprietary VLMs (GPT-4o, Gemini 2.0, Claude 3.5)
    • Training Datasets (LAION-5B, DataComp, MMC4, PixMo)
    • Benchmarks (MMMU-Pro, MMStar, MathVista, Video-MME)
  • Awesome VLAs →

    • Foundational Policies (ACT, Diffusion Policy, RT-1, SayCan)
    • VLA Base Models (RT-2, OpenVLA, π0, π0.5)
    • Generalist Policies (Octo, CrossFormer)
    • Manipulation, Navigation, Dexterous Control
    • RL & Self-Improvement (OpenVLA-OFT, VLARL)
    • Datasets & Benchmarks (Open X-Embodiment, LIBERO, CALVIN)
  • Awesome World Models →

    • Model-Based RL (DreamerV3, TD-MPC2, MuZero, IRIS)
    • JEPA (I-JEPA, V-JEPA, V-JEPA2)
    • Video Generation (Sora, Cosmos, Genie2, DIAMOND)
    • Autonomous Driving (GAIA-1, DriveDreamer, Vista, OccWorld)
    • Robot World Models (UniSim, GR-1, UniPi, Pandora)
    • Benchmarks (VBench, EvalCrafter, nuPlan, CARLA)
  • Awesome Embodied AI →

    • Perception (EmbodiedScan, ConceptFusion, AnyGrasp)
    • Navigation (NavGPT2, ViNT, NoMaD, CoW)
    • Manipulation (RVT-2, Diffusion Policy, 3D Diffusion Policy)
    • Task Planning (SayCan, Code-as-Policies, Voyager, ReKep)
    • General Agents (Gato, OpenVLA, π0, Octo, LEO)
    • Humanoid (Figure02, HumanPlus, OmniH2O, DexVLA)
    • Simulators (Isaac Lab, ManiSkill3, Habitat 3.0, Genesis)
    • Benchmarks (CALVIN, MetaWorld, EmbodiedBench, BEHAVIOR-1K)

🤝 Contributing

PRs are welcome to add new papers, datasets, or toolkits! Please follow the table format in each sub-document (see the example below).

  1. Fork this repository
  2. Add your entry to the corresponding .md file
  3. Open a Pull Request with a brief description of what you added
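
For reference, here is a minimal sketch of an entry row. The column layout (Title / Venue / Paper / Code) is hypothetical; match the actual table header of the sub-document you are editing:

```markdown
| Title   | Venue     | Paper                                     | Code                                         |
| ------- | --------- | ----------------------------------------- | -------------------------------------------- |
| OpenVLA | CoRL 2024 | [arXiv](https://arxiv.org/abs/2406.09246) | [GitHub](https://github.com/openvla/openvla) |
```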

📜 License

This project is released under the MIT License.


Maintained with ❤️ — tracking the frontier of multimodal intelligence
