3 changes: 2 additions & 1 deletion README.md
@@ -25,6 +25,7 @@
|-------|----------|----------|
| [collaborating-with-codex](https://github.com/GuDaStudio/collaborating-with-codex) | Delegates coding tasks to Codex CLI for prototyping, debugging, and code review | OpenAI Codex |
| [collaborating-with-gemini](https://github.com/GuDaStudio/collaborating-with-gemini) | Delegates coding tasks to Gemini CLI for prototyping, debugging, and code review | Google Gemini |
| [local-tts](./local-tts) | Generates and checks OmniVoice/VoxCPM speech locally, verifying loudness, silence, and ASR recognizability before delivery | Local GPU / TTS |

</details>

@@ -88,7 +89,7 @@ cd skills
./install.sh --user --skill collaborating-with-codex

# Install multiple specific Skills
./install.sh --user -s collaborating-with-codex -s collaborating-with-gemini
./install.sh --user -s collaborating-with-codex -s collaborating-with-gemini -s local-tts
```

**Option 3: Custom Installation Path**
3 changes: 2 additions & 1 deletion docs/README_EN.md
@@ -24,6 +24,7 @@ Star us on GitHub — your support means a lot! 🙏😊
|-------|-------------|---------------------|
| [collaborating-with-codex](../collaborating-with-codex) | Delegates coding tasks to Codex CLI for prototyping, debugging, and code review | OpenAI Codex |
| [collaborating-with-gemini](../collaborating-with-gemini) | Delegates coding tasks to Gemini CLI for prototyping, debugging, and code review | Google Gemini |
| [local-tts](../local-tts) | Generates and verifies local OmniVoice/VoxCPM speech, including loudness, silence, and ASR checks before delivery | Local GPU / TTS |

---

@@ -75,7 +76,7 @@ The install script provides flexible options for scope and target location.
./install.sh --user --skill collaborating-with-codex

# Install multiple specific Skills
./install.sh --user -s collaborating-with-codex -s collaborating-with-gemini
./install.sh --user -s collaborating-with-codex -s collaborating-with-gemini -s local-tts
```

**Option 3: Custom Installation Path**
2 changes: 1 addition & 1 deletion install.ps1
@@ -13,7 +13,7 @@ param(

$ErrorActionPreference = "Stop"
$ScriptDir = Split-Path -Parent $MyInvocation.MyCommand.Path
$AvailableSkills = @("collaborating-with-codex", "collaborating-with-gemini")
$AvailableSkills = @("collaborating-with-codex", "collaborating-with-gemini", "local-tts")

function Write-ColorOutput {
param([string]$Text, [string]$Color = "White")
2 changes: 1 addition & 1 deletion install.sh
100644 → 100755
@@ -12,7 +12,7 @@ BLUE='\033[0;34m'
NC='\033[0m'

# Available skills
AVAILABLE_SKILLS=("collaborating-with-codex" "collaborating-with-gemini")
AVAILABLE_SKILLS=("collaborating-with-codex" "collaborating-with-gemini" "local-tts")

# Script directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
161 changes: 161 additions & 0 deletions local-tts/SKILL.md
@@ -0,0 +1,161 @@
---
name: local-tts
description: "Generate, verify, and benchmark the local OmniVoice-Studio and VoxCPM text-to-speech setups on this machine. Use when the user asks whether OmniVoice-Studio or voxcpm can run locally, wants sample speech generation, wants movie/video narration voiceover, wants final-effect/performance comparisons, or wants to integrate these local TTS engines into a skill or workflow."
---

# Local TTS

## Quick Start

Use the generation runner for normal TTS output:

```bash
python /home/slam/.codex/skills/local-tts/scripts/tts_generate.py --engine auto --device auto --text "你好,这是本地语音合成测试。" --output /tmp/local_tts.wav --report-json /tmp/local_tts.json
```

Use the benchmark runner when the user wants performance numbers:

```bash
python /home/slam/.codex/skills/local-tts/scripts/tts_benchmark.py --device both --out-dir /tmp/tts_bench_run
```

Default assumptions:
- OmniVoice-Studio repo: `/home/slam/OmniVoice-Studio`
- Python env: `/home/slam/OmniVoice-Studio/.venv`
- OmniVoice CLI: `/home/slam/OmniVoice-Studio/.venv/bin/omnivoice-infer`
- VoxCPM package installed inside the same venv

If `CUDA_VISIBLE_DEVICES` is set incorrectly, unset it for GPU tests. For CPU tests, the runners force `CUDA_VISIBLE_DEVICES=""`.

## Workflow

1. Check environment:
   - Run `env -u CUDA_VISIBLE_DEVICES /home/slam/OmniVoice-Studio/.venv/bin/python -c "import torch, voxcpm; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"`.
   - Run `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader` before GPU tests.
2. Use `tts_generate.py` for a normal short output file. Request `--report-json` when another agent or script needs structured output.
3. For user-facing narration longer than a few sentences, use the checked-delivery workflow below instead of one long raw generation.
4. Use `tts_benchmark.py` for wall-clock time, audio duration, RTF, sample rate, output path, and failures.
5. Separate cold-start/compile overhead from steady-state speed when interpreting VoxCPM GPU results.
6. Do not kill GPU processes unless the user explicitly authorizes it.
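
The `nvidia-smi` query in step 1 prints one CSV row per compute process. A minimal sketch of turning that output into structured data, assuming the `pid, process_name, used_memory` column order requested above; the helper name `parse_compute_apps` is hypothetical and not part of this skill's scripts. It only reports what is running, in line with step 6.

```python
from __future__ import annotations


def parse_compute_apps(csv_text: str) -> list[dict]:
    """Parse the output of
    `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader`.
    """
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from both ends so a comma inside the process name survives.
        pid, rest = line.split(",", 1)
        name, memory = rest.rsplit(",", 1)
        apps.append({
            "pid": int(pid),
            "process_name": name.strip(),
            "used_memory": memory.strip(),  # kept as text, e.g. "2048 MiB"
        })
    return apps


# Illustrative input only; real rows come from nvidia-smi.
sample = "4321, /usr/bin/python3, 2048 MiB\n"
for app in parse_compute_apps(sample):
    print(app)
```

An empty result means no compute process currently holds GPU memory.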

## Checked Delivery Workflow

Use this workflow before giving the user a local TTS result or uploading it to Baidu Netdisk.

1. Split long narration into short sentence groups and generate each chunk separately.
   - Keep each chunk to roughly 1-2 sentences.
   - Prefer OmniVoice with `--postprocess_output True` for these chunks if calling `omnivoice-infer` directly.
   - Use stable narration settings such as `--language Chinese --speed 0.95 --num_step 8 --instruct "男,青年,低音调"` unless the user asks for a different voice.
2. Concatenate chunks with `ffmpeg` or another audio tool; insert only short, intentional pauses.
3. Convert the deliverable to MP3 for compatibility:

```bash
ffmpeg -y -hide_banner -i input.wav \
-af 'highpass=f=80,lowpass=f=8000,loudnorm=I=-16:TP=-1.5:LRA=11' \
-ar 48000 -ac 1 -codec:a libmp3lame -b:a 192k output.mp3
```

4. Verify technical decode and loudness:

```bash
ffprobe -v error -show_entries format=duration,size,bit_rate:stream=codec_name,sample_rate,channels -of json output.mp3
ffmpeg -hide_banner -nostats -i output.mp3 -af silencedetect=noise=-35dB:d=0.8,volumedetect -f null -
```

5. Verify audible speech with ASR before delivery. Use faster-whisper from the OmniVoice venv; on this dual-GPU machine, `CUDA_VISIBLE_DEVICES=1` selects the RTX 4090:

```bash
CUDA_VISIBLE_DEVICES=1 /home/slam/OmniVoice-Studio/.venv/bin/python - <<'PY'
from faster_whisper import WhisperModel
path = "output.mp3"
model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe(path, language="zh", beam_size=5, vad_filter=False)
texts = []
last_end = 0.0
for seg in segments:
    texts.append(seg.text.strip())
    last_end = max(last_end, float(seg.end))
    print(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.text.strip()}")
print({"language": info.language, "duration": info.duration, "last_end": last_end, "text": "".join(texts)})
PY
```
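
Step 1 of this workflow (split narration into groups of 1-2 sentences) can be sketched as follows. This is an illustration under stated assumptions, not code from the skill: `split_into_chunks` is a hypothetical helper, and the sentence terminators it recognizes (Chinese full stop, full-width and ASCII `!`/`?`, and an ASCII period followed by whitespace or end of text) are a guess at what narration scripts contain.

```python
from __future__ import annotations

import re

# A sentence ends at a Chinese full stop, a full-width or ASCII "!"/"?", or an
# ASCII period followed by whitespace or end of text (so "0.95" stays whole).
_SENTENCE = re.compile(r".+?(?:[\u3002\uff01\uff1f!?]+|\.(?=\s|$)|$)", re.S)


def _join(group: list[str]) -> str:
    # Chinese sentences join directly; ASCII-terminated ones get a space.
    out = group[0]
    for sentence in group[1:]:
        out += ("" if ord(out[-1]) > 127 else " ") + sentence
    return out


def split_into_chunks(text: str, max_sentences: int = 2) -> list[str]:
    """Group sentences into chunks of at most `max_sentences` each."""
    sentences = [m.group().strip() for m in _SENTENCE.finditer(text)]
    sentences = [s for s in sentences if s]
    return [
        _join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


narration = "在这个看似平静的小镇里,一场秘密正在浮出水面。没有人知道真相。直到那一天。"
for index, chunk in enumerate(split_into_chunks(narration), start=1):
    print(index, chunk)
```

Each chunk would then be synthesized separately and the resulting files concatenated as described in step 2.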

Pass criteria:
- `ffprobe` duration and file size are plausible for the requested output.
- `volumedetect` mean volume is not near silence, and `silencedetect` does not show large unintended blank spans.
- ASR returns recognizable speech across the whole file, especially near the end.
- Do not upload or share the file until these checks pass.
- Prefer delivering the checked MP3. Keep the WAV as a working artifact unless the user explicitly asks for WAV.
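
The loudness and silence criteria can be checked mechanically from the log that the `silencedetect`/`volumedetect` command in step 4 writes to stderr. The sketch below is not part of the skill's scripts: `parse_ffmpeg_checks` and `passes_checks` are hypothetical names, and the thresholds (`-35.0` dB mean volume, `2.0` s maximum silence) are illustrative assumptions that should be tuned per project.

```python
from __future__ import annotations

import re


def parse_ffmpeg_checks(stderr_text: str) -> dict:
    """Extract mean/max volume and silence durations from ffmpeg log text."""
    mean = re.search(r"mean_volume:\s*(-?[\d.]+) dB", stderr_text)
    peak = re.search(r"max_volume:\s*(-?[\d.]+) dB", stderr_text)
    silences = re.findall(r"silence_duration:\s*([\d.]+)", stderr_text)
    return {
        "mean_volume_db": float(mean.group(1)) if mean else None,
        "max_volume_db": float(peak.group(1)) if peak else None,
        "silence_durations_sec": [float(d) for d in silences],
    }


def passes_checks(report: dict, min_mean_db: float = -35.0,
                  max_silence_sec: float = 2.0) -> bool:
    # Assumed thresholds: fail when the file is nearly silent overall
    # or contains any single silent span longer than max_silence_sec.
    mean = report["mean_volume_db"]
    if mean is None or mean < min_mean_db:
        return False
    return all(d <= max_silence_sec for d in report["silence_durations_sec"])


# Illustrative log excerpt in the shape ffmpeg prints for these filters.
sample_log = """
[silencedetect @ 0x55d0] silence_start: 3.2
[silencedetect @ 0x55d0] silence_end: 4.1 | silence_duration: 0.9
[Parsed_volumedetect_1 @ 0x55e0] mean_volume: -19.3 dB
[Parsed_volumedetect_1 @ 0x55e0] max_volume: -1.5 dB
"""
print(passes_checks(parse_ffmpeg_checks(sample_log)))
```

A missing `mean_volume` line is treated as a failure, so a run where ffmpeg could not decode the file does not pass silently.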

Known good pattern from the 4090 test: chunked OmniVoice generation, postprocessing enabled, final MP3 at 48 kHz mono 192 kbps, then ASR verification. A 64.8-second checked sample transcribed continuously from start to finish and avoided the earlier failure in which a long WAV came out mostly blank.
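
The ASR criterion (recognizable speech across the whole file, especially near the end) can be approximated by comparing the transcript with the requested text and by checking how close the last segment ends to the file duration. A minimal sketch, assuming the `duration`, `last_end`, and `text` values printed by the ASR step above; `asr_check` and its thresholds are hypothetical. ASR output can differ from the script in numerals or character variants, so the similarity threshold is deliberately loose.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    # Keep letters and digits only (CJK characters count as letters).
    return "".join(ch for ch in text.lower() if ch.isalnum())


def asr_check(expected_text: str, asr_text: str, duration: float, last_end: float,
              min_ratio: float = 0.8, max_tail_gap_sec: float = 2.0) -> dict:
    """Compare the ASR transcript with the expected text and check tail coverage."""
    expected = _normalize(expected_text)
    heard = _normalize(asr_text)
    ratio = SequenceMatcher(None, expected, heard).ratio() if expected else 0.0
    # A large gap between the last recognized segment and the end of the file
    # suggests a blank tail.
    tail_gap = max(0.0, duration - last_end)
    return {
        "similarity": ratio,
        "tail_gap_sec": tail_gap,
        "ok": ratio >= min_ratio and tail_gap <= max_tail_gap_sec,
    }


print(asr_check("你好,世界。", "你好世界", duration=10.0, last_end=9.5))
```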

## Generation

Auto-select engine and device:

```bash
python /home/slam/.codex/skills/local-tts/scripts/tts_generate.py --engine auto --device auto --text "Hello." --output /tmp/local_tts.wav --report-json /tmp/local_tts.json
```

Force OmniVoice:

```bash
python /home/slam/.codex/skills/local-tts/scripts/tts_generate.py --engine omnivoice --device cuda --text "Hello." --output /tmp/local_tts.wav
```

Force VoxCPM:

```bash
python /home/slam/.codex/skills/local-tts/scripts/tts_generate.py --engine voxcpm --device cuda --text "Hello." --output /tmp/local_tts.wav
```

Movie narration preset:

```bash
python /home/slam/.codex/skills/local-tts/scripts/tts_generate.py \
--engine omnivoice \
--device cuda \
--language Chinese \
--speed 0.95 \
--num-step 8 \
--instruct "男,青年,低音调" \
--text "在这个看似平静的小镇里,一场被隐藏多年的秘密,正在悄悄浮出水面。" \
--output /tmp/movie_narration.wav \
--report-json /tmp/movie_narration.json
```

Use `--duration` when a narration segment must match a fixed video slot. For suspense narration, prefer a `--speed` between `0.9` and `1.0`, and use `--num-step 8` for better quality than the fastest benchmark setting.

OmniVoice `--instruct` accepts fixed labels, not arbitrary prose. Chinese labels must be separated by full-width commas (`,`), for example `男,青年,低音调`, `女,中年,中音调`, or `男,中年,低音调`.
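
Because ASCII commas are easy to type by accident, a caller may want to normalize the separator before passing the value to `--instruct`. The helper below is a hypothetical sketch and not part of the skill. It only fixes separators and does not validate the labels themselves, since the accepted label set is not listed here.

```python
def normalize_instruct(label: str) -> str:
    """Join instruct labels with full-width commas (U+FF0C), trimming spaces."""
    parts = [p.strip() for p in label.replace("\uff0c", ",").split(",")]
    return "\uff0c".join(p for p in parts if p)


print(normalize_instruct("男, 青年,低音调"))
```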

## Benchmark

CPU only:

```bash
python /home/slam/.codex/skills/local-tts/scripts/tts_benchmark.py --device cpu --out-dir /tmp/tts_bench_cpu
```

GPU only:

```bash
python /home/slam/.codex/skills/local-tts/scripts/tts_benchmark.py --device cuda --out-dir /tmp/tts_bench_gpu
```

Custom text:

```bash
python /home/slam/.codex/skills/local-tts/scripts/tts_benchmark.py --device both --text "你好,这是本地语音合成测试。" --out-dir /tmp/tts_bench_zh
```
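
The benchmark runner writes `report.json` into the output directory. A short sketch of summarizing it, assuming the fields that `tts_benchmark.py` emits (`name`, `ok`, `returncode`, `elapsed_sec`, `audio_duration_sec`, `rtf`); the function name `summarize_report` is hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def summarize_report(report: dict) -> list[str]:
    """Return one human-readable line per benchmark result."""
    rows = []
    for result in report.get("results", []):
        if not result.get("ok"):
            rows.append(f"{result['name']}: FAILED (returncode {result.get('returncode')})")
            continue
        rtf = result.get("rtf")
        rtf_text = f"{rtf:.2f}" if rtf is not None else "n/a"
        rows.append(
            f"{result['name']}: {result['elapsed_sec']:.1f}s elapsed, "
            f"{result.get('audio_duration_sec', 0.0):.2f}s audio, RTF {rtf_text}"
        )
    return rows


# Default output location used by the examples above.
report_path = Path("/tmp/tts_bench_run/report.json")
if report_path.exists():
    for row in summarize_report(json.loads(report_path.read_text())):
        print(row)
```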

## Interpretation

Use `elapsed_sec / audio_duration_sec` as RTF. Lower is faster. RTF is only a rough comparison when engines produce different output durations or sample rates.
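
As a worked instance of the formula, the sketch below mirrors how the benchmark script guards against a zero-length output; the function name `rtf` is illustrative. A value below 1.0 means synthesis ran faster than real time.

```python
from __future__ import annotations


def rtf(elapsed_sec: float, audio_duration_sec: float) -> float | None:
    """Real-time factor: wall-clock seconds spent per second of audio produced."""
    if not audio_duration_sec:
        return None  # no audio produced, RTF is undefined
    return elapsed_sec / audio_duration_sec


print(rtf(10.0, 5.0))  # 10 s of compute for 5 s of audio
```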

OmniVoice-Studio is the app/integration layer. Prefer it when the user needs a service, UI, API, or multi-engine workflow.

VoxCPM is a direct model library. Prefer it when the user needs direct programmatic TTS calls and accepts model-level setup/latency tradeoffs.

`tts_generate.py --engine auto` currently chooses OmniVoice as the stable default integration path. Force `--engine voxcpm` when direct VoxCPM behavior is specifically needed.

Read `references/local-results.md` when the user asks what has already been tested on this machine.
4 changes: 4 additions & 0 deletions local-tts/agents/openai.yaml
@@ -0,0 +1,4 @@
interface:
  display_name: "Local TTS"
  short_description: "Generate and verify local TTS audio"
  default_prompt: "Use $local-tts to generate checked local narration, convert it to a compatible MP3, and verify speech quality before delivery."
27 changes: 27 additions & 0 deletions local-tts/references/local-results.md
@@ -0,0 +1,27 @@
# Local Results

Machine state observed on 2026-06-16:

- GPU: NVIDIA GeForce RTX 4060 Ti
- Torch in `/home/slam/OmniVoice-Studio/.venv`: `2.8.0+cu128`
- CUDA runtime reported by torch: `12.8`
- `env -u CUDA_VISIBLE_DEVICES` detects one CUDA GPU.
- VoxCPM package is installed in `/home/slam/OmniVoice-Studio/.venv`.
- OmniVoice-Studio diagnostics and backend health check passed previously.

Benchmark text:

```text
Hello from the local text to speech benchmark. This is a short performance test.
```

Observed benchmark:

| Engine | Device | Elapsed | Audio | RTF | Sample rate |
| --- | ---: | ---: | ---: | ---: | ---: |
| OmniVoice | CPU | 59.9s | 5.04s | 11.88 | 24000 |
| VoxCPM | CPU | 70.8s | 10.16s | 6.97 | 16000 |
| OmniVoice | CUDA | 10.9s | 5.04s | 2.17 | 24000 |
| VoxCPM | CUDA | 55.9s | 10.16s | 5.50 | 16000 |

Interpret with care: outputs differ in duration and sample rate, and VoxCPM GPU includes heavy first-run torch compile/warmup overhead.
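
The benchmark runner starts a fresh subprocess for every measurement, so each number above includes model loading and warmup. One way to separate cold-start cost from steady-state speed is to time repeated calls inside a single process. The sketch below assumes a callable `fn` that wraps one generation call on an already imported engine; `time_cold_and_steady` is a hypothetical helper, not part of the skill.

```python
import time


def time_cold_and_steady(fn, steady_runs: int = 3) -> dict:
    """Time the first call separately from the mean of the following calls."""
    if steady_runs < 1:
        raise ValueError("steady_runs must be at least 1")
    start = time.perf_counter()
    fn()  # first call: includes lazy loading, compilation, and warmup
    cold = time.perf_counter() - start

    steady = []
    for _ in range(steady_runs):
        start = time.perf_counter()
        fn()
        steady.append(time.perf_counter() - start)
    return {"cold_sec": cold, "steady_mean_sec": sum(steady) / len(steady)}


# Stand-in workload; replace with a real generation call.
print(time_cold_and_steady(lambda: sum(range(100_000))))
```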
132 changes: 132 additions & 0 deletions local-tts/scripts/tts_benchmark.py
@@ -0,0 +1,132 @@
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
import os
import subprocess
import sys
import time
import wave
from pathlib import Path


DEFAULT_TEXT = "Hello from the local text to speech benchmark. This is a short performance test."
DEFAULT_ROOT = Path("/home/slam/OmniVoice-Studio")


def wav_info(path: Path) -> dict:
    with wave.open(str(path), "rb") as f:
        frames = f.getnframes()
        rate = f.getframerate()
        duration = frames / float(rate) if rate else 0.0
        return {
            "audio_duration_sec": duration,
            "sample_rate": rate,
            "channels": f.getnchannels(),
            "bytes": path.stat().st_size,
        }


def run_cmd(name: str, cmd: list[str], output: Path, env: dict[str, str], cwd: Path) -> dict:
    if output.exists():
        output.unlink()
    start = time.perf_counter()
    proc = subprocess.run(cmd, cwd=str(cwd), env=env, text=True, capture_output=True)
    elapsed = time.perf_counter() - start
    result = {
        "name": name,
        "ok": proc.returncode == 0,
        "returncode": proc.returncode,
        "elapsed_sec": elapsed,
        "output": str(output),
        "stdout_tail": proc.stdout[-2000:],
        "stderr_tail": proc.stderr[-4000:],
    }
    if output.exists():
        info = wav_info(output)
        result.update(info)
        duration = info["audio_duration_sec"]
        result["rtf"] = elapsed / duration if duration else None
    return result


def device_env(device: str) -> dict[str, str]:
    env = os.environ.copy()
    if device == "cpu":
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def build_voxcpm_script(path: Path, text: str, output: Path) -> None:
    path.write_text(
        "from voxcpm import VoxCPM\n"
        "import soundfile as sf\n"
        "model = VoxCPM.from_pretrained('openbmb/VoxCPM-0.5B', load_denoiser=False)\n"
        f"wav = model.generate(text={text!r}, normalize=True, denoise=False, inference_timesteps=1, max_length=128, retry_badcase=False)\n"
        f"sf.write({str(output)!r}, wav, 16000)\n"
    )


def benchmark_device(root: Path, out_dir: Path, text: str, device: str) -> list[dict]:
    py = root / ".venv/bin/python"
    omni = root / ".venv/bin/omnivoice-infer"
    env = device_env(device)
    omni_out = out_dir / f"omnivoice_{device}.wav"
    vox_out = out_dir / f"voxcpm_{device}.wav"
    vox_script = out_dir / f"run_voxcpm_{device}_once.py"
    build_voxcpm_script(vox_script, text, vox_out)

    omni_cmd = [
        str(omni),
        "--text",
        text,
        "--output",
        str(omni_out),
        "--device",
        "cpu" if device == "cpu" else "cuda",
        "--num_step",
        "4",
        "--denoise",
        "False",
        "--postprocess_output",
        "False",
    ]

    return [
        run_cmd(f"omnivoice_{device}", omni_cmd, omni_out, env, root),
        run_cmd(f"voxcpm_{device}", [str(py), str(vox_script)], vox_out, env, root),
    ]


def main() -> int:
    parser = argparse.ArgumentParser(description="Benchmark local OmniVoice-Studio and VoxCPM TTS.")
    parser.add_argument("--root", default=str(DEFAULT_ROOT), help="OmniVoice-Studio repo path.")
    parser.add_argument("--out-dir", default="/tmp/tts_bench_run", help="Output directory for wav files and report.json.")
    parser.add_argument("--text", default=DEFAULT_TEXT, help="Text to synthesize.")
    parser.add_argument("--device", choices=["cpu", "cuda", "both"], default="cpu", help="Device mode to test.")
    args = parser.parse_args()

    root = Path(args.root).expanduser().resolve()
    out_dir = Path(args.out_dir).expanduser().resolve()
    out_dir.mkdir(parents=True, exist_ok=True)

    devices = ["cpu", "cuda"] if args.device == "both" else [args.device]
    results: list[dict] = []
    for device in devices:
        results.extend(benchmark_device(root, out_dir, args.text, device))

    report = {
        "root": str(root),
        "text": args.text,
        "device": args.device,
        "results": results,
    }
    report_path = out_dir / "report.json"
    report_path.write_text(json.dumps(report, indent=2, ensure_ascii=False))
    print(json.dumps(report, indent=2, ensure_ascii=False))
    return 0 if all(r["ok"] for r in results) else 1


if __name__ == "__main__":
    raise SystemExit(main())