This repo is mainly based on the 🍵 Matcha-TTS Official Github, with some of the code modified. The purpose of this repository is to study 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching.
- 🔥 Pytorch, ⚡ Lightning, 🐉🐲🐲 hydra-core
- 🤗 wandb
While studying the 🍵 Matcha-TTS Official Github, I modified some of the code to make it simpler.
- Logger: 🤗 wandb (more comfortable and easier to access)
- Vocoder: 🔥 [Pytorch-Hub] NVIDIA/HiFi-GAN
- Alignment: resemble-ai/monotonic_align
The code was run and the example speech was synthesized in my VS Code environment. I moved the Jupyter notebooks to Colab to share the synthesized examples below:
- 😲 trim_butterfly_16.ipynb | BS: 16 | NVIDIA GeForce RTX 4080 (x1)
- 😵 decent_meadow_46.ipynb | BS: 32 | LR: 2e-5 | NVIDIA GeForce RTX 4080 (x1)
- ⭐ wobbly_frog_53.ipynb | BS: 16 | bf16-mixed | NVIDIA GeForce RTX 4080 (x1)
- 👽 wobbly_serenity_54.ipynb | BS: 32 | bf16-mixed | NVIDIA GeForce RTX 4080 (x1)
- 😣 jolly_frog_47.ipynb | BS: 32 | LR: 2e-5 | NVIDIA GeForce RTX 4090 (x1)
- 🌟 eager_frost_50.ipynb | BS: 16 | NVIDIA GeForce RTX 4090 (x1)
- ✨ royal_grass_56.ipynb | BS: 16 | bf16-mixed | NVIDIA GeForce RTX 4090 (x1)
import gc

import torch
import lightning as L


class MemoryCleanupCallback(L.Callback):
    """Release cached GPU memory and run garbage collection after each epoch."""

    def on_train_epoch_end(self, trainer, pl_module):
        # Return unused cached blocks to the GPU driver, then collect Python garbage.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

    def on_validation_epoch_end(self, trainer, pl_module):
        # Same cleanup after every validation epoch.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

MAS (Monotonic Alignment Search) is not included in requirements.txt. You can install it with the following command:
pip install git+https://github.com/resemble-ai/monotonic_align.git

You can then use it like this:

import monotonic_align

Dataset: LJSpeech
- Language: English 🇺🇸
- Speaker: Single Speaker
- sample_rate: 22.05kHz
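Since the dataset is listed at 22.05kHz, it can be worth confirming that the extracted wav files really use that rate before training. The following is only a minimal sketch using Python's standard `wave` module; the helper name `find_mismatched_wavs` and the example directory are assumptions for illustration and are not part of this repo.

```python
import wave
from pathlib import Path

EXPECTED_SAMPLE_RATE = 22050  # 22.05kHz, as listed for LJSpeech above


def find_mismatched_wavs(wav_dir, expected_rate=EXPECTED_SAMPLE_RATE):
    """Return (path, rate) pairs for PCM wav files whose sample rate differs.

    Hypothetical helper: it only reads the wav headers, not the audio data.
    """
    mismatched = []
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatched.append((wav_path, rate))
    return mismatched


if __name__ == "__main__":
    # Assumed location of the extracted wavs; adjust to your own data dir.
    bad = find_mismatched_wavs("data/LJSpeech/ljs/LJSpeech-1.1/wavs")
    print(f"{len(bad)} file(s) with an unexpected sample rate")
```

If the list comes back non-empty, those files would need resampling before the filelists are used.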
Let's assume we are training with LJ Speech
- Download the dataset from here, extract it to your own data dir (in my case: data/LJSpeech/ljs/LJSpeech-1.1), and prepare the file lists to point to the extracted data, as in item 5 of the setup of the NVIDIA Tacotron 2 repo.
- Go to configs/data/ljspeech.yaml and change

train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt

- Generate normalisation statistics with the yaml file of the dataset configuration

PYTHONPATH=. python matcha/utils/generate_data_statistics.py

- Update these values in configs/data/ljspeech.yaml under the data_statistics key.

data_statistics: # Computed for ljspeech dataset
  mel_mean: -5.5170512199401855
  mel_std: 2.0643811225891113

Now you are ready to train!
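For intuition about what `mel_mean` and `mel_std` represent, here is a rough sketch of a global mean and standard deviation accumulated over every mel bin of every frame. It assumes the statistics are single scalars, as the two values above suggest; it is not the code of `matcha/utils/generate_data_statistics.py`, and the function name `mel_statistics` and the nested-list input format are made up for illustration.

```python
import math


def mel_statistics(mels):
    """Compute a global mean/std over all values of all mel spectrograms.

    `mels` is an iterable of spectrograms, each given as a list of frames,
    each frame being a list of mel-bin values (hypothetical input format).
    Running sums are used so the data never has to be held in one array.
    """
    total = 0.0
    total_sq = 0.0
    count = 0
    for mel in mels:
        for frame in mel:
            for value in frame:
                total += value
                total_sq += value * value
                count += 1
    if count == 0:
        raise ValueError("no mel values were provided")
    mean = total / count
    # Var[x] = E[x^2] - E[x]^2; clamp at 0 to guard against rounding error.
    variance = max(total_sq / count - mean * mean, 0.0)
    return {"mel_mean": mean, "mel_std": math.sqrt(variance)}


if __name__ == "__main__":
    # Two tiny fake "spectrograms" with 2 mel bins each.
    toy_mels = [[[1.0, 3.0]], [[1.0, 3.0], [1.0, 3.0]]]
    print(mel_statistics(toy_mels))
```

Statistics of this kind are typically used to normalise each mel value as `(mel - mel_mean) / mel_std`, which is presumably why they must match the dataset you train on.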
First, log in to wandb from the CLI with your API token.
wandb login --relogin '<your-wandb-api-token>'
Then you can run training with one of these commands:
PYTHONPATH=. python matcha/train.py experiment=ljspeech

# If you run training on a certain gpu_id:
CUDA_VISIBLE_DEVICES=2 PYTHONPATH=. python matcha/train.py experiment=ljspeech

You can also run multi-GPU training:

# If you run multi-gpu training:
CUDA_VISIBLE_DEVICES=2,3 PYTHONPATH=. python matcha/train.py experiment=ljspeech trainer.devices=[0,1]

With CUDA_VISIBLE_DEVICES=2,3, the two visible GPUs are renumbered 0 and 1 inside the process, which is why trainer.devices is set to [0,1].

The code was run and the example speech was synthesized in my VS Code environment. I moved the Jupyter notebook to Colab to share the synthesized examples.
- You can check more samples in the Colab notebooks (Examples) above.
- You can refer to the code for synthesis: matcha/utils/synthesize_utils.py
- This notebook is also in this GitHub repo: notebooks/Samples_wobbly_frog_53.ipynb

CLI Arguments: Will be Updated!
- 🍵 Paper: Matcha-TTS: A fast TTS architecture with conditional flow matching
  └ Github: 🍵 Matcha-TTS Official Github
- MAS (Monotonic Alignment Search)
  └ resemble-ai/monotonic_align
- 🔥 Pytorch
- ⚡ Lightning
- 🐉🐲🐲 hydra-core