FENRlR/DTF-VITS

DTF-VITS

An experimental variation of VITS with Microsoft's Differential Transformer method applied to its text encoder.
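For readers unfamiliar with the method, the following is a minimal sketch of the differential attention idea, in which two softmax attention maps are subtracted before being applied to the values: `(softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V`. It is an illustration only, not the code used in this repository. The function names and the fixed scalar `lam` are assumptions made for the example, and it leaves out the learnable parameterization of `lam` and the per-head normalization described for the Differential Transformer.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """Row-wise softmax of scaled dot products between queries and keys."""
    d = len(q[0])
    scale = math.sqrt(d)
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        for qi in q
    ]


def differential_attention(q1, k1, q2, k2, v, lam):
    """Hypothetical helper: subtract a second attention map, scaled by lam,
    from the first one, then take the weighted sum of the values."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    width = len(v[0])
    return [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(width)]
        for row in diff
    ]
```

With `lam = 0` the function reduces to ordinary softmax attention over `q1` and `k1`. With identical query/key pairs and `lam = 1` the two maps cancel and the output is all zeros, which is the sense in which the second map removes attention that both maps share.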

Models

normal is the original VITS model with its default setting of 2 transformer heads, included for comparison.

dtf is a modified VITS model with 1 differential transformer head.

dtf_v2 is a modified VITS model with 2 differential transformer heads.

Each model was trained on the LJ Speech dataset for 20,000 steps.

Output demonstrations

democomp.mp4

Demonstration of all 50 test samples by each model.

demo2.mp4

Comparison of normal and dtf.

melcompv.mp4

Comparison for the text "These principles of homology are essential to a correct interpretation of the facts of morphology." (LJ027-0052.wav from the validation set defined in the original VITS repo). The spectrograms shown can be found here.

Test results using NISQA

For each sentence in the test script, 10 wav files were generated and scored with the NISQA (v2.0) model.

| Model | MOS | Noisiness | Discontinuity | Coloration | Loudness |
| --- | --- | --- | --- | --- | --- |
| normal | 4.32 ± 0.37 | 3.87 ± 0.41 | 4.53 ± 0.28 | 4.31 ± 0.22 | 4.51 ± 0.18 |
| dtf | 4.24 ± 0.37 | 3.79 ± 0.42 | 4.51 ± 0.30 | 4.28 ± 0.24 | 4.49 ± 0.20 |
| dtf_v2 | 4.24 ± 0.37 | 3.86 ± 0.44 | 4.53 ± 0.27 | 4.26 ± 0.23 | 4.47 ± 0.20 |
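A rough sketch of how per-file NISQA scores could be collapsed into "mean ± spread" entries like the ones above. It assumes the ± value is a sample standard deviation, which this README does not state. The helper names and the dimension keys (`mos`, `noi`) are placeholders for the example, and the numbers in the usage snippet are made up rather than taken from the actual runs.

```python
from statistics import mean, stdev


def summarize(scores):
    """Format a list of per-file scores as 'mean ± sample standard deviation'."""
    return f"{mean(scores):.2f} ± {stdev(scores):.2f}"


def summarize_model(scores_by_dimension):
    """Hypothetical helper: summarize each quality dimension of one model.

    scores_by_dimension maps a dimension name to the list of scores
    predicted for every generated wav file of that model.
    """
    return {name: summarize(vals) for name, vals in scores_by_dimension.items()}
```

Usage with placeholder scores (not real measurements):

```python
example = {"mos": [4.0, 4.2, 4.4], "noi": [3.5, 3.9, 4.3]}
print(summarize_model(example))
```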
