This project contains modified versions of CosyVoice's Flow and HiFT modules to make them exportable to ONNX format. It also includes necessary source code from the original CosyVoice and Matcha-TTS repositories to ensure compatibility and usability.
- matcha/hifigan/xutils.py: Removed dependency on `matplotlib`.
- matcha/models/components/flow_matching.py: Removed logger.
- cosyvoice/utils/class_utils.py: Removed unused imports and functions.
- cosyvoice/flow/flow_matching.py: Adjusted `forward` to meet TorchScript requirements.
- cosyvoice/flow/flow.py: Adjusted `forward` to meet TorchScript requirements; removed test code.
- cosyvoice/flow/DiT/modules.py: Modified `AttnProcessor` to manually implement scaled dot-product attention, casting QK to float32 before calculating attention scores to avoid NaN issues.
- cosyvoice/hifigan/generator.py: Added `ISTFT` class to replace `torch.istft`, avoiding complex tensors to adapt to ONNX; removed test code.
- cosyvoice/transformer/attention.py: Removed cache support.
- cosyvoice/transformer/encoder_layer.py: Removed cache support.
- cosyvoice/transformer/upsample_encoder.py: Removed streaming support; adjusted `forward` implementation to meet TorchScript requirements.
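To illustrate what "avoiding complex tensors" can look like, the sketch below computes an inverse DFT using only real arithmetic and checks it against a complex-number reference. It is not the repository's `ISTFT` class: the function names are hypothetical, it works on plain Python lists rather than tensors, and it omits windowing and overlap-add. It only shows the identity that such a replacement can rely on.

```python
import cmath
import math


def dft(signal):
    """Forward DFT with complex numbers, used only to produce test input."""
    n_points = len(signal)
    return [
        sum(signal[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def inverse_dft_real_arithmetic(real, imag):
    """Inverse DFT that never forms a complex value.

    `real` and `imag` are the real and imaginary parts of a full
    N-point spectrum. Returns the real part of the reconstruction.
    """
    n_points = len(real)
    signal = []
    for n in range(n_points):
        acc = 0.0
        for k in range(n_points):
            angle = 2.0 * math.pi * k * n / n_points
            # Re[(a + ib)(cos t + i sin t)] = a*cos t - b*sin t
            acc += real[k] * math.cos(angle) - imag[k] * math.sin(angle)
        signal.append(acc / n_points)
    return signal


def inverse_dft_complex_reference(spectrum):
    """Reference inverse DFT using Python complex numbers."""
    n_points = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / n_points)
            for k in range(n_points)).real / n_points
        for n in range(n_points)
    ]


if __name__ == "__main__":
    samples = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.1, -0.6]
    spec = dft(samples)
    rebuilt = inverse_dft_real_arithmetic(
        [c.real for c in spec], [c.imag for c in spec]
    )
    print(max(abs(a - b) for a, b in zip(samples, rebuilt)))
```

Keeping the real and imaginary parts as two separate real-valued inputs is what lets this kind of computation be expressed with ordinary real-valued operators.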
Please install dependencies first:

    pip install -r requirements.txt

Note: The export scripts themselves do not depend on ONNX Runtime. Install it additionally only if you need local inference verification.
Use the script convert_flow_to_onnx.py to export the model to ONNX.
- --model_path: Path to the CosyVoice model checkpoint directory.
- --flow_name: Filename of the Flow module (default: `flow.pt`).
- --half: Convert module parameters to half precision.
- --output_path: Path to save the exported ONNX file.
- --int32_token: Use int32 for input tokens (default is int64 if not specified).
- --add_speed_control: Add speed control input for the Flow module.
- --device: PyTorch device used for export (default: `default`). Must be a valid device string (e.g., `cuda:0`, `cpu`). When `default`, the script uses `cuda` if available, otherwise `cpu`.
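The `--device` and `--int32_token` rules above can be sketched as two small functions. This is an illustration of the described behavior, not the script's actual code: `resolve_device` and `token_dtype_name` are hypothetical names, and CUDA availability is passed in as a plain boolean (in a real script it would presumably come from `torch.cuda.is_available()`).

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a --device value to a concrete device string.

    "default" picks "cuda" when CUDA is available and "cpu" otherwise;
    any other value (e.g. "cuda:0", "cpu") is passed through unchanged.
    """
    if requested == "default":
        return "cuda" if cuda_available else "cpu"
    return requested


def token_dtype_name(int32_token: bool) -> str:
    """Return the token dtype implied by the --int32_token flag."""
    return "int32" if int32_token else "int64"


if __name__ == "__main__":
    print(resolve_device("default", cuda_available=False))
    print(resolve_device("cuda:0", cuda_available=True))
    print(token_dtype_name(int32_token=True))
```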

    python convert_flow_to_onnx.py --device cpu --half --model_path path/to/model --output_path path/to/output.onnx --int32_token --add_speed_control

Use the script convert_hift_to_onnx.py to export the model to ONNX.
- --model_path: Path to the CosyVoice model checkpoint directory.
- --hift_name: Filename of the HiFT module (default: `hift.pt`).
- --output_path: Path to save the exported ONNX file.
- --device: PyTorch device used for export (default: `default`). Behavior is the same as the Flow export script.

    python convert_hift_to_onnx.py --model_path path/to/model --output_path path/to/output.onnx

Use the script compose_flow_hift.py to combine the exported Flow and HiFT modules into a single ONNX.
- --flow_path: Path to the Flow ONNX.
- --hift_path: Path to the HiFT ONNX.
- --output_path: Path to save the composed ONNX.

    python compose_flow_hift.py --flow_path path/to/flow.onnx --hift_path path/to/hift.onnx --output_path path/to/composed.onnx

- Streaming support has been removed.
- The ONNX conversion tool only supports CosyVoice2-0.5B and Fun-CosyVoice3-0.5B-2512 versions.
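For a supported model, the three export steps described earlier (Flow export, HiFT export, composition) can be chained from one driver. The sketch below only builds the command lines from the documented options; the helper names and the output filenames `flow.onnx`, `hift.onnx`, and `composed.onnx` are assumptions chosen for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_export_commands(model_path, output_dir):
    """Return the three command lines: Flow export, HiFT export, compose."""
    out = Path(output_dir)
    flow = str(out / "flow.onnx")          # assumed filename
    hift = str(out / "hift.onnx")          # assumed filename
    composed = str(out / "composed.onnx")  # assumed filename
    return [
        [sys.executable, "convert_flow_to_onnx.py",
         "--model_path", model_path, "--output_path", flow],
        [sys.executable, "convert_hift_to_onnx.py",
         "--model_path", model_path, "--output_path", hift],
        [sys.executable, "compose_flow_hift.py",
         "--flow_path", flow, "--hift_path", hift,
         "--output_path", composed],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of executing them; call run_all(...)
    # from the repository root to actually perform the export.
    for command in build_export_commands("path/to/model", "path/to/out"):
        print(" ".join(command))
```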
Combination error: "Bad node spec for node. Name: flow_/decoder/Loop OpType: Loop". This error has been observed with onnx version 1.19.0. If it occurs, use onnx version 1.16.0 for the combination step, which has been tested to work.
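A small pre-flight check can flag the onnx version before running the combination step. This is a minimal sketch using only the two versions mentioned above (1.16.0 tested, 1.19.0 observed to fail); the function names and status labels are hypothetical, and any other version is simply reported as untested.

```python
from importlib import metadata

KNOWN_GOOD = (1, 16, 0)  # reported to work for the combination step
KNOWN_BAD = (1, 19, 0)   # reported to raise the "Bad node spec" error


def parse_version(text):
    """Turn '1.16.0' into (1, 16, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def compose_version_status(version_text):
    """Classify an onnx version string for the combination step."""
    version = parse_version(version_text)
    if version == KNOWN_GOOD:
        return "tested"
    if version == KNOWN_BAD:
        return "known-bad"
    return "untested"


def installed_onnx_version():
    """Return the installed onnx version string, or None if missing."""
    try:
        return metadata.version("onnx")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_onnx_version()
    if found is None:
        print("onnx is not installed")
    else:
        print(found, compose_version_status(found))
```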
Error when running Flow ONNX model with speed control.
Ensure the speed input is a scalar of type float32. For example:

    import numpy as np
    speed = np.array(1.25, dtype=np.float32)

Or C++ code:

    float speed_value = 1.25f;
    Ort::Value speed = Ort::Value::CreateTensor(memory_info, &speed_value, sizeof(float), nullptr, 0, ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT);

Combination error: "google.protobuf.message.EncodeError: Failed to serialize proto". This error is usually caused by insufficient memory. Try closing other programs that consume a lot of memory, or run the combination script on a computer with more memory.
I have tested in the following environment:
- Python 3.12.9/3.10.18
- PyTorch 2.8.0+cu129/2.3.1+cu121
- ONNX 1.16.0
- ONNX Runtime DirectML 1.22.0 (Python) / 1.23.0 (C++) (for runtime verification)
- Windows 11 25H2 x64
- CosyVoice (Apache-2.0): https://github.com/FunAudioLLM/CosyVoice
- Matcha-TTS (MIT): https://github.com/shivammehta25/Matcha-TTS
This repository redistributes code covered by the Apache-2.0 (CosyVoice) and MIT (Matcha-TTS) licenses; the custom conversion scripts in the repository root and in utils are released under the MIT license. See the LICENSE and NOTICE files for details; please retain upstream license headers when modifying the source code.