vantagewithai/Vantage-Step-Audio-EditX

Step‑Audio‑EditX Multi‑Voice Cloner Node 🎙️

This project is a custom node implementation built on top of Step-Audio-EditX. It adapts and extends EditX's capabilities to support multi‑speaker, long‑format voice cloning with emotion/style/speed editing: feed in a script with multiple speakers, inline pauses, and paralinguistic cues (like laughter or breathing), and get a concatenated audio output, all in one pass.


Description

The original Step‑Audio‑EditX model enables single‑speaker voice cloning and emotion/style editing given a reference prompt audio + text.

This node extends that capability, allowing you to:

  • Provide multiple “speaker” reference voices at once.
  • Write a simple script with speaker tags, inline pauses, and optionally emotion/style/speed tags.
  • Generate a single contiguous audio file with all voices, pauses, and editing applied.
  • Handle paralinguistic markers (like [Laughter], [Breathing], etc.): these are preserved, and synthesis attempts to reflect them as natural speech or silence, depending on your engine's capabilities.

In short: you can build multi‑voice dialogues, audio stories, podcasts, or voice‑over sequences in one go.


Features

  • Multi‑speaker support (map each speaker to a reference audio + prompt).
  • Inline speaker switching via [speakerX] tags.
  • Inline pauses via the [pause]N syntax (a pause of N milliseconds).
  • Emotion / style / speed tags (e.g. [happy], [serious], [faster]) for each line.
  • Paralinguistic tag support — e.g. [Laughter], [Breathing], [Sigh], [Dissatisfaction-hnn], etc. Those tags remain in the output text.
  • Automatic concatenation of generated audio segments into one final waveform.
  • Progress reporting (with progress bar).
  • Graceful handling of missing speaker‑tags (defaults to first speaker).

How It Works

  1. Parse the input script line by line.
  2. Detect tags:
    • [speakerX] — which reference voice/prompt to use.
    • Optional leading tags like emotion, style, speed (e.g. [happy], [whisper], [slower]).
    • Paralinguistic tags (preserved).
    • [pause] tags — interpreted as “generate silence for N ms.”
  3. For each speech line, call clone_from_tensor(...) (and optionally repeated editing for emotion / style / speed).
  4. For pause lines, generate a tensor of zeros of the requested duration.
  5. Collect all segments (speech or silence), concatenate them, and return a single audio output.
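
The parsing steps above can be sketched as follows. This is a minimal illustration only: the tag syntax matches the script format described in this README, but the function name and segment layout are assumptions, not the node's actual code (note that here paralinguistic tags land in the tag list alongside emotion/style tags; the real node keeps them in the text passed to the engine):

```python
import re

SPEAKER_RE = re.compile(r"^\[speaker(\d+)\]")
PAUSE_RE = re.compile(r"^\[pause\](\d+)\s*$")
TAG_RE = re.compile(r"^\[([A-Za-z_-]+)\]")  # emotion/style/speed/paralinguistic tags

def parse_script(script):
    """Split a script into speech and pause segments."""
    segments = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        pause = PAUSE_RE.match(line)
        if pause:
            segments.append({"type": "pause", "ms": int(pause.group(1))})
            continue
        speaker = 1  # missing speaker tag defaults to the first speaker
        m = SPEAKER_RE.match(line)
        if m:
            speaker = int(m.group(1))
            line = line[m.end():]
        tags = []
        while (m := TAG_RE.match(line)):
            tags.append(m.group(1))
            line = line[m.end():]
        segments.append({"type": "speech", "speaker": speaker,
                         "tags": tags, "text": line.strip()})
    return segments
```

For example, `parse_script("[speaker2][sad][whisper]Hmm")` yields one speech segment for speaker 2 with tags `["sad", "whisper"]`.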

Usage

Example usage in a ComfyUI flow:

[speaker1][happy]Hello there!  
[pause]500  
[speaker2][sad][whisper]I’m not sure about this…  
[speaker1][Laughter]That’s hilarious!  
  • Provide reference audios & prompts for each speaker.

  • Feed this script to the node.

  • Get a single AUDIO output: concatenated waveform with cloned voices, pauses, and editing.

  • Emotion and Speaking Style Editing

    • Remarkably effective iterative control over emotions and styles, supporting dozens of editing options.
      • Emotion editing: [ Angry, Happy, Sad, Excited, Fearful, Surprised, Disgusted, etc. ]
      • Speaking style editing: [ Act_coy, Older, Child, Whisper, Serious, Generous, Exaggerated, etc. ]
      • More emotions and speaking styles are on the way. Get ready! 🚀
  • Paralinguistic Editing

    • Precise control over 10 types of paralinguistic features for more natural, human-like, and expressive synthetic audio.
    • Supported tags:
      • [ Breathing, Laughter, Surprise-oh, Confirmation-en, Uhm, Surprise-ah, Surprise-wa, Sigh, Question-ei, Dissatisfaction-hnn ]
  • Available Tags

Emotion
  happy        Expressing happiness
  angry        Expressing anger
  sad          Expressing sadness
  fear         Expressing fear
  surprised    Expressing surprise
  confusion    Expressing confusion
  empathy      Expressing empathy and understanding
  embarrass    Expressing embarrassment
  excited      Expressing excitement and enthusiasm
  depressed    Expressing a depressed or discouraged mood
  admiration   Expressing admiration or respect
  coldness     Expressing coldness and indifference
  disgusted    Expressing disgust or aversion
  humour       Expressing humor or playfulness

Speaking style
  serious      Speaking in a serious or solemn manner
  arrogant     Speaking in an arrogant manner
  child        Speaking in a childlike manner
  older        Speaking in an elderly-sounding manner
  girl         Speaking in a light, youthful feminine manner
  pure         Speaking in a pure, innocent manner
  sister       Speaking in a mature, confident feminine manner
  sweet        Speaking in a sweet, lovely manner
  exaggerated  Speaking in an exaggerated, dramatic manner
  ethereal     Speaking in a soft, airy, dreamy manner
  whisper      Speaking in a whispering, very soft manner
  generous     Speaking in a hearty, outgoing, and straight-talking manner
  recite       Speaking in a clear, well-paced, poetry-reading manner
  act_coy      Speaking in a sweet, playful, and endearing manner
  warm         Speaking in a warm, friendly manner
  shy          Speaking in a shy, timid manner
  comfort      Speaking in a comforting, reassuring manner
  authority    Speaking in an authoritative, commanding manner
  chat         Speaking in a casual, conversational manner
  radio        Speaking in a radio-broadcast manner
  soulful      Speaking in a heartfelt, deeply emotional manner
  gentle       Speaking in a gentle, soft manner
  story        Speaking in a narrative, audiobook-style manner
  vivid        Speaking in a lively, expressive manner
  program      Speaking in a show-host/presenter manner
  news         Speaking in a news broadcasting manner
  advertising  Speaking in a polished, high-end commercial voiceover manner
  roar         Speaking in a loud, deep, roaring manner
  murmur       Speaking in a quiet, low manner
  shout        Speaking in a loud, sharp, shouting manner
  deeply       Speaking in a deep and low-pitched tone
  loudly       Speaking in a loud and high-pitched tone

Paralinguistic
  Breathing            Breathing sound
  Laughter             Laughter or laughing sound
  Uhm                  Hesitation sound: "Uhm"
  Sigh                 Sighing sound
  Surprise-oh          Expressing surprise: "Oh"
  Surprise-ah          Expressing surprise: "Ah"
  Surprise-wa          Expressing surprise: "Wa"
  Confirmation-en      Confirming: "En"
  Question-ei          Questioning: "Ei"
  Dissatisfaction-hnn  Dissatisfied sound: "Hnn"

Installation / Integration

   cd custom_nodes
   git clone https://github.com/vantagewithai/Vantage-Step-Audio-EditX.git
   cd Vantage-Step-Audio-EditX
   pip install -r requirements.txt
  1. Launch ComfyUI — the node should appear under category Vantage/Step-Audio-EditX.

Download Models


After downloading the models, copy them into ComfyUI/models; you should end up with the following structure:

ComfyUI/
└── models/
    ├── Step-Audio-EditX/
    │   ├── CosyVoice-300M-25Hz/
    │   │   ├── campplus.onnx
    │   │   ├── cosyvoice.yaml
    │   │   ├── flow.pt
    │   │   └── hift.pt
    │   └── dengcunqin/
    │       └── speech_paraformer-large_asr_nat-zh-cantonese-en-16k-vocab8501-online/
    │           ├── am.mvn
    │           ├── config.yaml
    │           ├── configuration.json
    │           ├── model.pt
    │           ├── seg_dict
    │           ├── tokens.json
    │           ├── tokens.txt
    │           └── write_tokens_from_txt.py
    ├── model.safetensors
    └── speech_tokenizer_v1.onnx
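
A quick way to sanity-check the layout before launching ComfyUI is to walk the expected paths. This is a convenience sketch, not part of the node; the function name is made up, and the path list mirrors the tree above (only the fixed-name files are checked):

```python
from pathlib import Path

# Expected files, relative to ComfyUI/models (mirrors the tree above)
REQUIRED = [
    "Step-Audio-EditX/CosyVoice-300M-25Hz/campplus.onnx",
    "Step-Audio-EditX/CosyVoice-300M-25Hz/cosyvoice.yaml",
    "Step-Audio-EditX/CosyVoice-300M-25Hz/flow.pt",
    "Step-Audio-EditX/CosyVoice-300M-25Hz/hift.pt",
    "model.safetensors",
    "speech_tokenizer_v1.onnx",
]

def missing_model_files(models_dir):
    """Return the expected model files that are absent under models_dir."""
    root = Path(models_dir)
    return [rel for rel in REQUIRED if not (root / rel).exists()]
```

Run it against your `ComfyUI/models` directory; an empty list means the fixed-name files are in place.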

Script Syntax

Tag Type            Syntax                             Meaning
Speaker switch      [speakerX]                         Use speaker number X (1-based)
Pause / silence     [pause]300                         Insert 300 ms of silence
Emotion             [happy], [sad], …                  First valid emotion tag per line is applied
Style               [whisper], [serious], …            First valid style tag per line is applied
Speed modifier      [faster], [slower], …              First valid speed tag per line is applied
Paralinguistic cue  [Laughter], [Breathing], [Sigh], … Preserved in the text, not stripped; may be used for downstream effects

Tags must come before the actual text of the line (after [speakerX]).

Example:

[speaker2][happy][whisper][slower]I am fine!  
[speaker1][Laughter]That was funny!  
[pause]500  
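
A [pause]N line simply maps to N milliseconds of zero samples at the output sample rate, stitched between the speech segments. The sketch below shows that assembly step; the segment dict layout, the injected `synthesize` callable (standing in for the node's clone/edit pipeline), and the 24 kHz default are all assumptions for illustration:

```python
import numpy as np

def render_script(segments, synthesize, sample_rate=24000):
    """Join parsed segments into one waveform.

    segments: dicts shaped like {"type": "pause", "ms": 500} or
              {"type": "speech", "speaker": 1, "tags": [...], "text": "..."}.
    synthesize: callable (speaker, tags, text) -> float32 waveform; a
                stand-in for the actual cloning/editing engine.
    """
    parts = []
    for seg in segments:
        if seg["type"] == "pause":
            # N ms of silence as zeros at the output sample rate
            n = int(sample_rate * seg["ms"] / 1000)
            parts.append(np.zeros(n, dtype=np.float32))
        else:
            parts.append(synthesize(seg["speaker"], seg["tags"], seg["text"]))
    return np.concatenate(parts)
```

At 24 kHz, `[pause]500` becomes 12,000 zero samples between the neighboring speech segments.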

Limitations & Notes

  • The quality of voice cloning / emotion/style editing depends on the underlying Step‑Audio‑EditX engine and your reference audio & prompt.
  • Paralinguistic tags are preserved in the text passed to the engine — if the engine doesn’t support them, they may just render as silence or be ignored.
  • If speaker reference audios use different sample rates, be aware that the node currently assumes a uniform sample rate in the concatenation step.
  • Long scripts may consume significant VRAM / memory — monitor usage accordingly.
  • The node does not perform grammar or punctuation correction — the script should be well formatted.

License & Credits

License: MIT

This project builds upon the original Step‑Audio‑EditX repository (see https://github.com/stepfun-ai/Step-Audio-EditX).

Please refer to the original repository for the base license.
