
MAC-AutoML/Trainingfree-LLM-Orchestration


Training-Free Multimodal Large Language Model Orchestration

Tianyu Xie · Yuexiao Ma · Yuhang Wu · Wang Chen · Jiayi Ji · Tat-Seng Chua · Xiawu Zheng · Rongrong Ji

🔥 Accepted to ICML 2026 🔥


🚀 A training-free orchestration framework for building interactive omni-modal assistants by composing off-the-shelf modality experts, explicit LLM routing, text-centric cross-modal memory, and interruption-aware streaming interaction.

Highlights · Method · Results · Installation · Citation

[Figure: Training-free orchestration overview]

🔥 News

  • 2026: 🔥🔥🔥 Accepted to ICML 2026.
  • 2026-05-08: 🚀 arXiv v3 is available.
  • 2025-08-06: 📄 Initial arXiv release.

✨ Highlights

  • 🚀 Training-free multimodal integration: composes existing LLM, vision, ASR, and TTS experts into one system without any gradient-based multimodal alignment training.
  • 🧭 Auditable LLM controller: predicts user intent and emits explicit control tokens for expert selection, sequencing, speaking, listening, and interruption.
  • 🧠 Text-centric cross-modal memory: compresses multimodal evidence into lightweight structured records for retrieval and reuse across turns.
  • Unified interaction layer: supports full-duplex streaming, speech interruption, live video grounding, and consistent modality transitions.
  • 🧩 Modular upgrade path: each expert can be replaced or upgraded independently through configuration instead of retraining the full system.
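
To make the "explicit control tokens" idea concrete, here is a minimal sketch of how controller output could be validated before any expert is called. The `<name>` token syntax, the `ALLOWED_ACTIONS` vocabulary, and the `parse_controller_output` helper are all assumptions made for illustration; the actual protocol is defined by the prompts and controller code in `llm/`.

```python
import re
from dataclasses import dataclass

# Hypothetical token vocabulary; the real protocol lives in the repository's prompts.
ALLOWED_ACTIONS = {"vision", "asr", "tts", "speak", "listen", "interrupt"}
TOKEN_PATTERN = re.compile(r"<(\w+)>")


@dataclass
class ControllerDecision:
    actions: list[str]  # validated control tokens, in emission order
    text: str           # natural-language content left after removing tokens


def parse_controller_output(raw: str) -> ControllerDecision:
    """Extract control tokens from raw LLM output and reject unknown ones."""
    actions = []
    for name in TOKEN_PATTERN.findall(raw):
        if name not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown control token: <{name}>")
        actions.append(name)
    text = TOKEN_PATTERN.sub("", raw).strip()
    return ControllerDecision(actions=actions, text=text)


decision = parse_controller_output("<vision><speak> The mug on the desk is red.")
print(decision.actions)  # ['vision', 'speak']
print(decision.text)     # The mug on the desk is red.
```

Because every routing decision passes through a parser like this, each turn can be logged as a plain list of actions, which is what makes the controller auditable.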

🚀 At a Glance

| What changes | Why it matters |
| --- | --- |
| No end-to-end multimodal alignment training | Reduces integration cost and keeps the system easy to extend. |
| Explicit controller tokens | Makes routing decisions inspectable instead of hiding them inside a monolithic model. |
| Memory-backed expert reuse | Avoids repeated expert calls when prior multimodal evidence is already available. |
| Full-duplex runtime | Supports natural turn-taking, interruption, streaming response, and speech output. |

🧭 Method

[Figure: LLM orchestration pipeline]

🏗️ Framework Overview

LLM Orchestration separates multimodal assistant construction into three coordinated layers:

| Layer | Role |
| --- | --- |
| LLM Controller | Interprets the dialogue state, chooses modality experts, and emits protocol-constrained control tokens. |
| Cross-modal Memory | Stores dialogue history and compressed multimodal evidence for retrieval in later turns. |
| Interaction Runtime | Executes ASR, video capture, LLM calls, TTS playback, streaming responses, and interruption handling. |
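
The interruption handling attributed to the Interaction Runtime can be sketched with a shared flag that the output loop checks between chunks. This is a simplified, single-threaded illustration using `threading.Event`; the `stream_response` function and the simulated barge-in are assumptions, not the repository's runtime code.

```python
import threading


def stream_response(chunks, interrupted: threading.Event, emit) -> int:
    """Emit chunks one at a time, stopping once `interrupted` is set.

    Returns the number of chunks actually delivered.
    """
    delivered = 0
    for chunk in chunks:
        if interrupted.is_set():
            break  # the user started speaking: stop producing output
        emit(chunk)
        delivered += 1
    return delivered


interrupted = threading.Event()
spoken = []


def emit(chunk: str) -> None:
    spoken.append(chunk)
    if chunk == "world":
        interrupted.set()  # simulate the user barging in after this chunk


count = stream_response(["hello", "world", "this", "is", "dropped"], interrupted, emit)
print(count, spoken)  # 2 ['hello', 'world']
```

In a real full-duplex setup the flag would be set from a separate listening thread (for example, when voice activity is detected), which is why an `Event` rather than a plain boolean is used here.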

This repository provides a runnable reference implementation with GUI and CLI entry points.

📊 Results

The figures below summarize the paper's reported gains from orchestration and memory reuse.

[Figure: Model improvement comparison]

[Figure: Efficiency analysis]

📁 Repository Structure

.
├── assets/       # README figures and visual assets
├── asr/          # ASR, VAD, and microphone input processing
├── dialogue/     # Main dialogue orchestration loop
├── gui/          # PyQt interface
├── llm/          # LLM controller, prompts, and model manager
├── memory/       # Dialogue and cross-modal memory records
├── paper/        # Local paper PDF
├── tts/          # TTS engines and audio playback
├── utils/        # Configuration and file utilities
├── video/        # Camera capture and visual processing
├── main.py       # CLI / GUI entry point
└── pyproject.toml

🛠️ Installation

This project is managed with uv.

git clone https://github.com/MAC-AutoML/Trainingfree-LLM-Orchestration.git
cd Trainingfree-LLM-Orchestration
uv sync

Create a local environment file:

cp .env.template .env

Then edit .env with the API endpoints and keys for the experts you want to use.

⚡ Quick Start

Run the GUI:

uv run python main.py --mode gui

Run the CLI:

uv run python main.py --mode cli

On startup, the runtime initializes the configured ASR service, camera capture, LLM experts, memory manager, and TTS engine. The hardware devices (microphone, camera, audio output) and third-party APIs referenced in your .env configuration must be available before launch.
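
Both commands select the front end through a `--mode` flag. As a rough sketch of how such a flag can be parsed and dispatched with `argparse`: the handler functions and the `gui` default below are hypothetical and are not taken from `main.py`.

```python
import argparse


def run_gui() -> str:
    return "gui started"  # placeholder for launching the PyQt interface


def run_cli() -> str:
    return "cli started"  # placeholder for launching the terminal loop


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Orchestration demo entry point")
    # The default value is an assumption made for this sketch.
    parser.add_argument("--mode", choices=["gui", "cli"], default="gui")
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    handlers = {"gui": run_gui, "cli": run_cli}
    return handlers[args.mode]()


print(main(["--mode", "cli"]))  # cli started
```

Restricting `--mode` with `choices` makes an unsupported value fail at parse time with a usage message instead of partway through device initialization.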

🔧 Configuration

Most runtime behavior is configured through .env.

| Area | Example variables |
| --- | --- |
| Main dialogue LLM | MAIN_LLM_MODEL, MAIN_LLM_API_BASE, MAIN_LLM_API_KEY |
| Vision LLM | VISION_LLM_MODEL, VISION_LLM_API_BASE, VISION_LLM_API_KEY |
| Visual reasoning LLM | QVQ_LLM_MODEL, QVQ_LLM_API_BASE, QVQ_LLM_API_KEY |
| ASR | ASR_API_URL, ASR_API_KEY, ASR_MODEL, ASR_GAIN |
| TTS | TTS_ENGINE, TTS_VOICE, COSY_API_BASE, GPUSTACK_API_BASE |
| Runtime | DEBUG, VAD_SENSITIVITY, USE_STREAMING |

See .env.template for the full list.
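
The three LLM rows share a `<PREFIX>_MODEL`, `<PREFIX>_API_BASE`, `<PREFIX>_API_KEY` naming pattern, which makes them easy to read generically. The sketch below shows one way to do that from environment variables; `LLMEndpoint`, `load_llm_endpoint`, and `env_flag` are hypothetical helpers, the values are placeholders, and the accepted boolean spellings are an assumption rather than the behavior of the code in `utils/`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMEndpoint:
    model: str
    api_base: str
    api_key: str


def load_llm_endpoint(prefix: str, env=None) -> LLMEndpoint:
    """Read <prefix>_MODEL, <prefix>_API_BASE and <prefix>_API_KEY."""
    env = os.environ if env is None else env
    names = [f"{prefix}_{suffix}" for suffix in ("MODEL", "API_BASE", "API_KEY")]
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"missing configuration: {', '.join(missing)}")
    return LLMEndpoint(*(env[name] for name in names))


def env_flag(name: str, default: bool = False, env=None) -> bool:
    """Interpret a variable such as DEBUG or USE_STREAMING as a boolean."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Placeholder values standing in for a populated .env file.
example_env = {
    "MAIN_LLM_MODEL": "example-model",
    "MAIN_LLM_API_BASE": "https://llm.example.invalid/v1",
    "MAIN_LLM_API_KEY": "sk-placeholder",
    "USE_STREAMING": "true",
}

main_llm = load_llm_endpoint("MAIN_LLM", example_env)
print(main_llm.model, env_flag("USE_STREAMING", env=example_env))
```

Failing fast on missing variables surfaces a misconfigured expert at startup rather than on the first call to it.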

📚 Citation

If this work is useful for your research, please cite:

@article{xie2025trainingfree,
  title   = {Training-Free Multimodal Large Language Model Orchestration},
  author  = {Xie, Tianyu and Ma, Yuexiao and Wu, Yuhang and Chen, Wang and Ji, Jiayi and Chua, Tat-Seng and Zheng, Xiawu and Ji, Rongrong},
  journal = {arXiv preprint arXiv:2508.10016},
  year    = {2025}
}

The citation will be updated when the official ICML 2026 proceedings entry is available.
