An experimental variation of VITS with Microsoft's Differential Transformer method applied to its text encoder.
normal is the original VITS model with its default setting of 2 transformer heads, included for comparison.
dtf is a modified VITS model with 1 differential transformer head.
dtf_v2 is a modified VITS model with 2 differential transformer heads.
Each model was trained on the LJ Speech dataset for 20,000 steps.
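For readers unfamiliar with the method, the core idea of differential attention can be sketched as follows: two softmax attention maps are computed from two separate query/key projections, and the second map, scaled by a factor λ, is subtracted from the first before weighting the values. This is a minimal pure-Python illustration, not the code used in this repo. It assumes a single head and a fixed scalar `lam`, and it omits the learnable λ re-parameterization and per-head normalization of the Differential Transformer paper as well as the relative positional encoding of the VITS text encoder. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k):
    """Row-wise softmax(q @ k^T / sqrt(d)) for lists of d-dimensional vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]


def differential_attention(q1, k1, q2, k2, v, lam):
    """(softmax(q1 k1^T) - lam * softmax(q2 k2^T)) @ v, for one head.

    `lam` is a fixed scalar here (an assumption made for illustration);
    in the paper it is a learned quantity.
    """
    a1 = attention_weights(q1, k1)
    a2 = attention_weights(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    dim_v = len(v[0])
    # Each output row is the weighted sum of the value vectors.
    return [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(dim_v)]
        for row in diff
    ]


# Toy inputs: 2 positions, 2-dimensional queries, keys, and values.
q1 = [[1.0, 0.0], [0.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0]]
q2 = [[0.5, 0.5], [0.5, 0.5]]
k2 = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(differential_attention(q1, k1, q2, k2, v, lam=0.5))
```

With `lam=0` the function reduces to ordinary scaled dot-product attention, which makes it easy to compare the two behaviours on the same inputs.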
democomp.mp4
Demonstration of all 50 test samples by each model.
demo2.mp4
Comparison of normal and dtf.
melcompv.mp4
Comparison for the text "These principles of homology are essential to a correct interpretation of the facts of morphology." (LJ027-0052.wav from the validation set defined in the original VITS repo). The spectrograms shown can be found here.
Test results using NISQA
For each sentence in the test script, 10 WAV files were generated and scored with the NISQA (v2.0) model.
| Model | MOS | Noisiness | Discontinuity | Coloration | Loudness |
|---|---|---|---|---|---|
| normal | 4.32 ± 0.37 | 3.87 ± 0.41 | 4.53 ± 0.28 | 4.31 ± 0.22 | 4.51 ± 0.18 |
| dtf | 4.24 ± 0.37 | 3.79 ± 0.42 | 4.51 ± 0.30 | 4.28 ± 0.24 | 4.49 ± 0.20 |
| dtf_v2 | 4.24 ± 0.37 | 3.86 ± 0.44 | 4.53 ± 0.27 | 4.26 ± 0.23 | 4.47 ± 0.20 |
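A summary in this "value ± spread" form could be produced from per-file NISQA scores roughly as sketched below. This assumes that the ± figure is the sample standard deviation over all generated files, which is not stated here. The helper name and the scores in the example are made-up placeholders, not measurements from this project.

```python
from statistics import mean, stdev


def summarize(scores):
    """Format a list of per-file scores as 'mean ± sample standard deviation'."""
    return f"{mean(scores):.2f} ± {stdev(scores):.2f}"


# Placeholder per-file scores for one model (illustrative values only).
example_scores = {
    "MOS": [4.0, 4.5, 5.0],
    "Noisiness": [3.5, 4.0, 4.5],
}

for dimension, scores in example_scores.items():
    print(f"{dimension}: {summarize(scores)}")
```

Pooling all files per model before taking the mean treats every generated file equally; averaging per sentence first would give a different spread.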
