This project is a custom node implementation built on top of Step-Audio-EditX. It adapts and extends the EditX capabilities to support multi‑speaker, long‑format audio generation with voice cloning and emotion/style/speed editing: feed in a script with multiple speakers, inline pauses, and paralinguistic cues (like laughter or breathing), and get a single concatenated audio output, all in one pass.
The original Step‑Audio‑EditX model enables single‑speaker voice cloning and emotion/style editing given a reference prompt audio + text.
This node extends that capability, allowing you to:
- Provide multiple “speaker” reference voices at once.
- Write a simple script with speaker tags, inline pauses, and optionally emotion/style/speed tags.
- Generate a single contiguous audio file with all voices, pauses, and editing applied.
- Handle paralinguistic markers (like `[Laughter]`, `[Breathing]`, etc.): these are preserved, and synthesis attempts to reflect them as natural speech or silence, depending on your engine’s capabilities.
In short: you can build multi‑voice dialogues, audio stories, podcasts, or voice‑over sequences in one go.
- Multi‑speaker support (map each speaker to a reference audio + prompt).
- Inline speaker switching via `[speakerX]` tags.
- Inline pauses via `[pause]N` syntax (a pause of N milliseconds).
- Emotion / style / speed tags (e.g. `[happy]`, `[serious]`, `[faster]`) for each line.
- Paralinguistic tag support, e.g. `[Laughter]`, `[Breathing]`, `[Sigh]`, `[Dissatisfaction-hnn]`; these tags remain in the output text.
- Automatic concatenation of generated audio segments into one final waveform.
- Progress reporting (with progress bar).
- Graceful handling of missing speaker tags (defaults to the first speaker).
- Parse the input script line by line.
- Detect tags:
  - `[speakerX]`: which reference voice/prompt to use.
  - Optional leading emotion, style, and speed tags (e.g. `[happy]`, `[whisper]`, `[slower]`).
  - Paralinguistic tags (preserved in the text).
  - `[pause]N` tags: interpreted as “generate silence for N ms.”
- For each speech line, call `clone_from_tensor(...)` (optionally followed by repeated editing passes for emotion / style / speed).
- For pause lines, generate a tensor of zeros of the requested duration.
- Collect all segments (speech or silence), concatenate them, and return a single audio output.
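The per-line parsing steps above can be sketched roughly as follows. The regexes and the returned dict shape are illustrative assumptions, not the node’s actual internals; also note that, unlike this sketch, the real node keeps paralinguistic tags in the text it passes to the engine.

```python
import re

# Illustrative sketch of the line parser described above; not the node's real code.
PAUSE_RE = re.compile(r"^\[pause\](\d+)")
TAG_RE = re.compile(r"^\[([A-Za-z0-9_-]+)\]")

def parse_line(line):
    """Classify one script line as a pause or a speech segment."""
    line = line.strip()
    pause = PAUSE_RE.match(line)
    if pause:
        return {"type": "pause", "ms": int(pause.group(1))}
    speaker, tags = 1, []                  # missing speaker tag -> first speaker
    while (m := TAG_RE.match(line)):
        tag = m.group(1)
        if tag.lower().startswith("speaker"):
            speaker = int(tag[7:])         # "[speaker2]" -> 2
        else:
            tags.append(tag)               # emotion / style / speed / paralinguistic
        line = line[m.end():]
    return {"type": "speech", "speaker": speaker, "tags": tags, "text": line}
```

Each parsed result then drives either a `clone_from_tensor(...)` call or silence generation before concatenation.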
Example usage in a ComfyUI flow:
```
[speaker1][happy]Hello there!
[pause]500
[speaker2][sad][whisper]I’m not sure about this…
[speaker1][Laughter]That’s hilarious!
```
- Provide reference audios & prompts for each speaker.
- Feed this script to the node.
- Get a single AUDIO output: a concatenated waveform with cloned voices, pauses, and editing applied.
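A `[pause]N` line (such as `[pause]500` in the script above) becomes a block of zero samples in the final waveform. A minimal numpy sketch; the real node works with torch tensors, and the 24 kHz rate here is an assumed default, not necessarily the engine’s:

```python
import numpy as np

def make_silence(ms, sample_rate=24000):
    """Return `ms` milliseconds of mono silence as zero samples."""
    num_samples = int(sample_rate * ms / 1000)
    return np.zeros(num_samples, dtype=np.float32)

half_second = make_silence(500)  # 12000 samples at an assumed 24 kHz
```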
- Emotion and Speaking Style Editing
  - Remarkably effective iterative control over emotions and styles, supporting dozens of editing options.
  - Emotion editing: [ Angry, Happy, Sad, Excited, Fearful, Surprised, Disgusted, etc. ]
  - Speaking style editing: [ Act_coy, Older, Child, Whisper, Serious, Generous, Exaggerated, etc. ]
  - More emotions and speaking styles are on the way. Get ready! 🚀
- Paralinguistic Editing
  - Precise control over 10 types of paralinguistic features for more natural, human-like, and expressive synthetic audio.
  - Supported tags: [ Breathing, Laughter, Surprise-oh, Confirmation-en, Uhm, Surprise-ah, Surprise-wa, Sigh, Question-ei, Dissatisfaction-hnn ]
Available Tags
| Category | Tag | Description | Tag | Description |
|---|---|---|---|---|
| emotion | happy | Expressing happiness | angry | Expressing anger |
| | sad | Expressing sadness | fear | Expressing fear |
| | surprised | Expressing surprise | confusion | Expressing confusion |
| | empathy | Expressing empathy and understanding | embarrass | Expressing embarrassment |
| | excited | Expressing excitement and enthusiasm | depressed | Expressing a depressed or discouraged mood |
| | admiration | Expressing admiration or respect | coldness | Expressing coldness and indifference |
| | disgusted | Expressing disgust or aversion | humour | Expressing humor or playfulness |
| speaking style | serious | Speaking in a serious or solemn manner | arrogant | Speaking in an arrogant manner |
| | child | Speaking in a childlike manner | older | Speaking in an elderly-sounding manner |
| | girl | Speaking in a light, youthful feminine manner | pure | Speaking in a pure, innocent manner |
| | sister | Speaking in a mature, confident feminine manner | sweet | Speaking in a sweet, lovely manner |
| | exaggerated | Speaking in an exaggerated, dramatic manner | ethereal | Speaking in a soft, airy, dreamy manner |
| | whisper | Speaking in a whispering, very soft manner | generous | Speaking in a hearty, outgoing, and straight-talking manner |
| | recite | Speaking in a clear, well-paced, poetry-reading manner | act_coy | Speaking in a sweet, playful, and endearing manner |
| | warm | Speaking in a warm, friendly manner | shy | Speaking in a shy, timid manner |
| | comfort | Speaking in a comforting, reassuring manner | authority | Speaking in an authoritative, commanding manner |
| | chat | Speaking in a casual, conversational manner | radio | Speaking in a radio-broadcast manner |
| | soulful | Speaking in a heartfelt, deeply emotional manner | gentle | Speaking in a gentle, soft manner |
| | story | Speaking in a narrative, audiobook-style manner | vivid | Speaking in a lively, expressive manner |
| | program | Speaking in a show-host/presenter manner | news | Speaking in a news broadcasting manner |
| | advertising | Speaking in a polished, high-end commercial voiceover manner | roar | Speaking in a loud, deep, roaring manner |
| | murmur | Speaking in a quiet, low manner | shout | Speaking in a loud, sharp, shouting manner |
| | deeply | Speaking in a deep and low-pitched tone | loudly | Speaking in a loud and high-pitched tone |
| paralinguistic | Breathing | Breathing sound | Laughter | Laughter or laughing sound |
| | Uhm | Hesitation sound: "Uhm" | Sigh | Sighing sound |
| | Surprise-oh | Expressing surprise: "Oh" | Surprise-ah | Expressing surprise: "Ah" |
| | Surprise-wa | Expressing surprise: "Wa" | Confirmation-en | Confirming: "En" |
| | Question-ei | Questioning: "Ei" | Dissatisfaction-hnn | Dissatisfied sound: "Hnn" |
```
cd custom_nodes
git clone https://github.com/vantagewithai/Vantage-Step-Audio-EditX.git
cd Vantage-Step-Audio-EditX
pip install -r requirements.txt
```

Launch ComfyUI; the node should appear under the category `Vantage/Step-Audio-EditX`.
After downloading the models, copy them into `ComfyUI/models`; you should end up with the following structure:
```
ComfyUI/
└── models/
    └── Step-Audio-EditX/
        ├── CosyVoice-300M-25Hz/
        │   ├── campplus.onnx
        │   ├── cosyvoice.yaml
        │   ├── flow.pt
        │   └── hift.pt
        ├── dengcunqin/
        │   └── speech_paraformer-large_asr_nat-zh-cantonese-en-16k-vocab8501-online/
        │       ├── am.mvn
        │       ├── config.yaml
        │       ├── configuration.json
        │       ├── model.pt
        │       ├── seg_dict
        │       ├── tokens.json
        │       ├── tokens.txt
        │       └── write_tokens_from_txt.py
        ├── model.safetensors
        └── speech_tokenizer_v1.onnx
```
| Tag Type | Syntax | Meaning |
|---|---|---|
| Speaker switch | `[speakerX]` | Use speaker number X (1-based) |
| Pause / silence | `[pause]300` | Insert 300 ms of silence |
| Emotion | `[happy]`, `[sad]`, … | First valid emotion tag per line is applied |
| Style | `[whisper]`, `[serious]`, … | First valid style tag per line is applied |
| Speed modifier | `[faster]`, `[slower]`, … | First valid speed tag per line is applied |
| Paralinguistic cue | `[Laughter]`, `[Breathing]`, `[Sigh]`, … | Preserved in the text, not stripped; may be used for downstream effects |
Tags must come before the actual text of the line (after `[speakerX]`).
Example:

```
[speaker2][happy][whisper][slower]I am fine!
[speaker1][Laughter]That was funny!
[pause]500
```
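The “first valid tag per line” rule can be sketched as a lookup against per-category tag sets. The sets below are abbreviated examples for illustration, not the node’s full lists:

```python
import re

# Abbreviated example sets; the real node recognizes many more tags.
EMOTIONS = {"happy", "sad", "angry", "fear"}
STYLES = {"whisper", "serious", "gentle"}
SPEEDS = {"faster", "slower"}

def pick_edits(line):
    """Return the first emotion, style, and speed tag found among a line's tags."""
    tags = re.findall(r"\[([A-Za-z0-9_-]+)\]", line)
    first = lambda pool: next((t for t in tags if t in pool), None)
    return first(EMOTIONS), first(STYLES), first(SPEEDS)
```

For the example line above, `pick_edits("[speaker2][happy][whisper][slower]I am fine!")` selects one edit per category; extra tags from the same category would be ignored.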
- The quality of voice cloning and emotion/style editing depends on the underlying Step‑Audio‑EditX engine and on your reference audio & prompt.
- Paralinguistic tags are preserved in the text passed to the engine; if the engine doesn’t support them, they may simply render as silence or be ignored.
- If speakers’ audio uses different sample rates, be aware that the node currently assumes a uniform sample rate in the concatenation step.
- Long scripts may consume significant VRAM / memory; monitor usage accordingly.
- The node does not perform grammar or punctuation correction; the script should be well formatted.
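Because concatenation assumes one sample rate, reference audio at other rates should be resampled first. A crude linear-interpolation sketch in numpy for illustration; a real pipeline would use a dedicated resampler (e.g. torchaudio’s):

```python
import numpy as np

def resample_linear(wave, src_rate, dst_rate):
    """Crudely resample a mono waveform via linear interpolation."""
    if src_rate == dst_rate:
        return wave
    duration = len(wave) / src_rate
    n_out = int(round(duration * dst_rate))
    src_t = np.arange(len(wave)) / src_rate   # sample timestamps in seconds
    dst_t = np.arange(n_out) / dst_rate
    return np.interp(dst_t, src_t, wave).astype(wave.dtype)
```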
License: MIT
This project builds upon the original Step‑Audio‑EditX repository (see https://github.com/stepfun-ai/Step-Audio-EditX).
Please refer to the original repository for the base license.