Training-Free Multimodal Large Language Model Orchestration

Tianyu Xie · Yuexiao Ma · Yuhang Wu · Wang Chen · Jiayi Ji · Tat-Seng Chua · Xiawu Zheng · Rongrong Ji

🔥 Accepted by ICML 2026 🔥

🚀 A training-free orchestration framework for building interactive omni-modal assistants by composing off-the-shelf modality experts, explicit LLM routing, text-centric cross-modal memory, and interruption-aware streaming interaction.

Highlights · Method · Results · Installation · Citation

🔥 News

2026: 🔥🔥🔥 Accepted to ICML 2026.
2026-05-08: 🚀 arXiv v3 is available.
2025-08-06: 📄 Initial arXiv release.

✨ Highlights

🚀 Training-free multimodal integration: composes existing LLM, vision, ASR, and TTS experts into one system without any gradient-based multimodal alignment training.
🧭 Auditable LLM controller: predicts user intent and emits explicit control tokens for expert selection, sequencing, speaking, listening, and interruption.
🧠 Text-centric cross-modal memory: compresses multimodal evidence into lightweight structured records for retrieval and reuse across turns.
⚡ Unified interaction layer: supports full-duplex streaming, speech interruption, live video grounding, and consistent modality transitions.
🧩 Modular upgrade path: each expert can be replaced or upgraded independently through configuration instead of retraining the full system.
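To make the speech-interruption idea in the highlights concrete, here is a minimal single-threaded sketch, not the repository's implementation. It assumes a reply arrives as a list of text chunks and that a voice-activity callback sets a shared `threading.Event` when the user starts talking. The names `stream_reply` and `reply_with_barge_in` are hypothetical and do not come from the codebase.

```python
import threading
from typing import Iterable, List


def stream_reply(chunks: Iterable[str], interrupted: threading.Event) -> List[str]:
    """Emit reply chunks until the user interrupts; return what was actually spoken."""
    spoken: List[str] = []
    for chunk in chunks:
        # Check the flag before every chunk so output stops at a chunk boundary.
        if interrupted.is_set():
            break
        spoken.append(chunk)  # stand-in for handing the chunk to a TTS engine
    return spoken


def reply_with_barge_in(chunks: List[str], interrupt_after: int) -> List[str]:
    """Simulate user speech being detected once `interrupt_after` chunks were produced."""
    interrupted = threading.Event()

    def chunk_source():
        for index, chunk in enumerate(chunks):
            if index == interrupt_after:
                # Stand-in for a VAD callback that would fire on a microphone thread.
                interrupted.set()
            yield chunk

    return stream_reply(chunk_source(), interrupted)
```

Calling `reply_with_barge_in(["The", "cup", "is", "on", "the", "table"], interrupt_after=3)` returns only the first three chunks. In a real full-duplex runtime the event would be set from a separate listening thread rather than from the chunk generator.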

🚀 At a Glance

| What changes | Why it matters |
| --- | --- |
| No end-to-end multimodal alignment training | Reduces integration cost and keeps the system easy to extend. |
| Explicit controller tokens | Makes routing decisions inspectable instead of hiding them inside a monolithic model. |
| Memory-backed expert reuse | Avoids repeated expert calls when prior multimodal evidence is already available. |
| Full-duplex runtime | Supports natural turn-taking, interruption, streaming response, and speech output. |
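The memory-backed reuse row can be illustrated with a small sketch. It assumes that expert outputs are stored as short text summaries and that a naive keyword overlap decides whether an earlier record answers the current query. The record fields, the stop-word list, and the matching rule are all illustrative assumptions; the actual record format lives in the memory/ package and may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Tiny illustrative stop-word list so that words like "the" do not count as a match.
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "of", "what", "where"}


def keywords(text: str) -> set:
    """Lower-case the text and drop stop words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


@dataclass
class MemoryRecord:
    modality: str  # e.g. "vision" or "audio"
    summary: str   # text description produced by an expert
    turn: int      # dialogue turn at which the record was written


@dataclass
class CrossModalMemory:
    records: List[MemoryRecord] = field(default_factory=list)

    def add(self, record: MemoryRecord) -> None:
        self.records.append(record)

    def lookup(self, modality: str, query: str) -> Optional[MemoryRecord]:
        """Return the newest record of this modality that shares a keyword with the query."""
        wanted = keywords(query)
        for record in reversed(self.records):
            if record.modality == modality and wanted & keywords(record.summary):
                return record
        return None


def describe_scene(memory: CrossModalMemory, query: str, turn: int,
                   vision_expert: Callable[[str], str]) -> str:
    """Reuse a stored visual summary when one matches; otherwise call the expert once."""
    hit = memory.lookup("vision", query)
    if hit is not None:
        return hit.summary
    summary = vision_expert(query)
    memory.add(MemoryRecord("vision", summary, turn))
    return summary
```

With this sketch, a follow-up question that mentions an object already described in memory is answered from the stored summary, so the vision expert is not called again. A production system would likely use a stronger retrieval signal than keyword overlap.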

🧭 Method

🏗️ Framework Overview

LLM Orchestration separates multimodal assistant construction into three coordinated layers:

| Layer | Role |
| --- | --- |
| LLM Controller | Interprets the dialogue state, chooses modality experts, and emits protocol-constrained control tokens. |
| Cross-modal Memory | Stores dialogue history and compressed multimodal evidence for retrieval in later turns. |
| Interaction Runtime | Executes ASR, video capture, LLM calls, TTS playback, streaming responses, and interruption handling. |
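As a rough illustration of how protocol-constrained control tokens could be parsed and dispatched, consider the sketch below. The token syntax (`<vision>`, `<speak>`, and so on), the allowed vocabulary, and the function names are hypothetical placeholders; the actual protocol is defined by the paper and the llm/ package.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical token vocabulary, chosen only for this example.
ALLOWED_TOKENS = {"vision", "speak", "listen", "interrupt"}
TOKEN_PATTERN = re.compile(r"<(\w+)>")


def parse_control_tokens(controller_output: str) -> List[str]:
    """Extract control tokens in order and reject anything outside the protocol."""
    tokens = TOKEN_PATTERN.findall(controller_output)
    unknown = [token for token in tokens if token not in ALLOWED_TOKENS]
    if unknown:
        raise ValueError(f"controller emitted tokens outside the protocol: {unknown}")
    return tokens


def dispatch(tokens: List[str],
             experts: Dict[str, Callable[[], str]]) -> List[Tuple[str, str]]:
    """Run the handler registered for each token, in order, and keep an auditable trace."""
    trace: List[Tuple[str, str]] = []
    for token in tokens:
        trace.append((token, experts[token]()))
    return trace
```

For example, `parse_control_tokens("<vision><speak> The mug is red.")` yields `["vision", "speak"]`, and `dispatch` then records which handler ran and what it returned. Keeping that trace is one way to make routing decisions inspectable.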

This repository provides a runnable reference implementation with GUI and CLI entry points.

📊 Results

The figures (stored under assets/) summarize the paper's reported gains from orchestration and memory reuse.

📁 Repository Structure

.
├── assets/       # README figures and visual assets
├── asr/          # ASR, VAD, and microphone input processing
├── dialogue/     # Main dialogue orchestration loop
├── gui/          # PyQt interface
├── llm/          # LLM controller, prompts, and model manager
├── memory/       # Dialogue and cross-modal memory records
├── paper/        # Local paper PDF
├── tts/          # TTS engines and audio playback
├── utils/        # Configuration and file utilities
├── video/        # Camera capture and visual processing
├── main.py       # CLI / GUI entry point
└── pyproject.toml

🛠️ Installation

This project is managed with uv.

git clone https://github.com/MAC-AutoML/Trainingfree-LLM-Orchestration.git
cd Trainingfree-LLM-Orchestration
uv sync

Create a local environment file:

cp .env.template .env

Then edit .env with the API endpoints and keys for the experts you want to use.

⚡ Quick Start

Run the GUI:

uv run python main.py --mode gui

Run the CLI:

uv run python main.py --mode cli

The runtime initializes the configured ASR service, camera capture, LLM experts, memory manager, and TTS engine. The hardware devices (microphone, camera, audio output) and third-party APIs referenced in your .env must be reachable before you start.

🔧 Configuration

Most runtime behavior is configured through .env.

| Area | Example variables |
| --- | --- |
| Main dialogue LLM | `MAIN_LLM_MODEL`, `MAIN_LLM_API_BASE`, `MAIN_LLM_API_KEY` |
| Vision LLM | `VISION_LLM_MODEL`, `VISION_LLM_API_BASE`, `VISION_LLM_API_KEY` |
| Visual reasoning LLM | `QVQ_LLM_MODEL`, `QVQ_LLM_API_BASE`, `QVQ_LLM_API_KEY` |
| ASR | `ASR_API_URL`, `ASR_API_KEY`, `ASR_MODEL`, `ASR_GAIN` |
| TTS | `TTS_ENGINE`, `TTS_VOICE`, `COSY_API_BASE`, `GPUSTACK_API_BASE` |
| Runtime | `DEBUG`, `VAD_SENSITIVITY`, `USE_STREAMING` |

See .env.template for the full list.
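The sketch below shows one way such variables might be read and validated at startup. It assumes the values from .env have already been exported into the process environment, and that boolean flags use strings such as `true` or `1`. The helpers `load_llm_endpoint` and `env_flag` are hypothetical and are not taken from the utils/ package.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMEndpoint:
    model: str
    api_base: str
    api_key: str


def load_llm_endpoint(prefix: str, env: Mapping[str, str] = os.environ) -> LLMEndpoint:
    """Read <PREFIX>_MODEL, <PREFIX>_API_BASE and <PREFIX>_API_KEY, failing fast on gaps."""
    names = [f"{prefix}_{suffix}" for suffix in ("MODEL", "API_BASE", "API_KEY")]
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"missing configuration: {', '.join(missing)}")
    return LLMEndpoint(*(env[name] for name in names))


def env_flag(name: str, env: Mapping[str, str] = os.environ, default: bool = False) -> bool:
    """Interpret a variable as a boolean; the accepted spellings are an assumption."""
    value = env.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}
```

Calling `load_llm_endpoint("MAIN_LLM")`, `load_llm_endpoint("VISION_LLM")`, or `load_llm_endpoint("QVQ_LLM")` covers the three LLM rows in the table, and `env_flag("USE_STREAMING")` covers a runtime switch. Failing fast on a missing key surfaces configuration problems before any device or API is touched.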

📄 Paper

arXiv: https://arxiv.org/abs/2508.10016
Local PDF: paper/2508.10016.pdf

📚 Citation

If this work is useful for your research, please cite:

@article{xie2025trainingfree,
  title   = {Training-Free Multimodal Large Language Model Orchestration},
  author  = {Xie, Tianyu and Ma, Yuexiao and Wu, Yuhang and Chen, Wang and Ji, Jiayi and Chua, Tat-Seng and Zheng, Xiawu and Ji, Rongrong},
  journal = {arXiv preprint arXiv:2508.10016},
  year    = {2025}
}

The citation will be updated when the official ICML 2026 proceedings entry is available.
