OmniCodec: Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement
- Demo Page
- Hugging Face
- arXiv
This repo contains:
- Training: `train.py` (Accelerate + GAN / WavLM-related losses per config)
- Dataset: `dataset.py` (multi-domain mixing; loads audio paths from scp)
- Inference: `infer.py` (reconstructs audio with a pretrained checkpoint)
- Config: `config/config_omnicodec.yaml`
Install Python dependencies:

```
pip install -r requirements.txt
```

Notes:
- `requirements.txt` contains an editable install line `-e OmniCodec/transformers-main`. Make sure the referenced path exists in your environment, or adjust/remove that line if you already have `transformers` installed.
The training config expects 3 scp files (one per domain): speech / music / sound.
Each line in an scp file can be either:
- `utt_id /abs/or/rel/path/to/audio.wav`
- `/abs/or/rel/path/to/audio.wav` (the utt id will be inferred from the filename)

Example:

```
utt0001 /data/speech/utt0001.wav
utt0002 /data/speech/utt0002.wav
```
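Both line formats above can be handled with a small helper like the following (an illustrative sketch, not the repo's actual code; `parse_scp_line` is a hypothetical name):

```python
import os


def parse_scp_line(line):
    """Parse one scp line into (utt_id, audio_path).

    Accepts either "utt_id /path/to/audio.wav" or a bare
    "/path/to/audio.wav"; in the latter case the utt id is
    taken from the filename without its extension.
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) == 2:
        utt_id, path = parts
    else:
        path = parts[0]
        utt_id = os.path.splitext(os.path.basename(path))[0]
    return utt_id, path


print(parse_scp_line("utt0001 /data/speech/utt0001.wav"))
# ('utt0001', '/data/speech/utt0001.wav')
print(parse_scp_line("/data/speech/utt0002.wav"))
# ('utt0002', '/data/speech/utt0002.wav')
```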
For each item, `dataset.py` will:
- load audio with `librosa.load(..., sr=sample_rate, mono=True)`
- apply `librosa.util.normalize(wav) * 0.95`
- crop/pad/repeat to `segment_size` (default: 240000 samples @ 24 kHz = 10 s)
- return a dict: `{"wav": Tensor[T], "utt": str, "text": None}`

Failed samples return `None` and are filtered out by `collate_fn` in `train.py`.
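The crop/pad/repeat step can be sketched as below (an illustrative reimplementation on plain Python lists, not the repo's actual code, which operates on the loaded waveform array):

```python
def fit_to_segment(wav, segment_size, start=0):
    """Force a waveform to exactly segment_size samples.

    - longer inputs are cropped starting at `start` (a training
      pipeline would typically pick a random start offset)
    - shorter inputs are tiled (repeated) until they reach
      segment_size, then truncated
    """
    if len(wav) >= segment_size:
        return wav[start:start + segment_size]
    # Repeat the clip enough times to cover the segment, then cut.
    reps = segment_size // len(wav) + 1
    return (wav * reps)[:segment_size]


long_clip = list(range(10))
short_clip = [1.0, 2.0, 3.0]
print(fit_to_segment(long_clip, 4, start=2))  # [2, 3, 4, 5]
print(fit_to_segment(short_clip, 7))          # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0]
```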
Edit `config/config_omnicodec.yaml`:
- Data
  - `data.speech_train_shards_dir`: path to `speech.scp`
  - `data.music_train_shards_dir`: path to `music.scp`
  - `data.sound_train_shards_dir`: path to `sound.scp`
  - `data.sample_rate`: default `24000`
  - `data.segment_size`: default `240000`
- Pretrained SSL (WavLM)
  - `model.wavlmloss.ckpt_path`: default `pretrain_model/ssl/wavlm-base-plus`
  - `wav_lm_model`: default `pretrain_model/ssl/wavlm_model/wavlm`
- Output
  - `train.save_dir`: default `./exps/omnicodec`
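Putting the keys above together, a minimal fragment might look like this (paths are placeholders and the exact nesting of `wav_lm_model` is an assumption; consult the shipped `config/config_omnicodec.yaml` for the full set of options):

```yaml
data:
  speech_train_shards_dir: /data/scp/speech.scp   # placeholder path
  music_train_shards_dir: /data/scp/music.scp     # placeholder path
  sound_train_shards_dir: /data/scp/sound.scp     # placeholder path
  sample_rate: 24000
  segment_size: 240000

model:
  wavlmloss:
    ckpt_path: pretrain_model/ssl/wavlm-base-plus
    # nesting of this key is assumed; check the shipped config
    wav_lm_model: pretrain_model/ssl/wavlm_model/wavlm

train:
  save_dir: ./exps/omnicodec
```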
Run training with the provided config:

```
python train.py -c config/config_omnicodec.yaml
```

Checkpoints and logs are written to `train.save_dir` (default: `./exps/omnicodec`).
`infer.py` loads the checkpoint from `pretrained_model/omnicodec.pth`.
Place your pretrained weights at that path (or edit infer.py to point to your checkpoint).
Put test audio files in `./testset/speech/`.
Then run:
```
python infer.py -c config/config_omnicodec.yaml
```

Outputs will be written to `./outputs/`.
```
.
├─ config/
│  └─ config_omnicodec.yaml
├─ dataset.py
├─ train.py
├─ infer.py
├─ models/
├─ modules/
├─ quantization/
├─ discriminators/
├─ losses/
├─ utils/
└─ requirements.txt
```
This repo benefits from:
- moshi
- Qwen3Omni
- DAC
- BigVGAN
- SpeechTokenizer
If you use this work, please cite:

```
@misc{hu2026omnicodeclowframerate,
      title={OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement},
      author={Jingbin Hu and Haoyu Zhang and Dake Guo and Qirui Zhan and Wenhao Li and Huakang Chen and Guobin Ma and Hanke Xie and Chengyou Wang and Pengyuan Xie and Chuan Xie and Qiang Zhang and Lei Xie},
      year={2026},
      eprint={2603.20638},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.20638},
}
```

See the repository license.
