Hi,
Thanks for sharing the code.
Have you tried finetuning ELLA on SD1.5?
Also, shouldn't training sample a single random timestep per batch instead of running a full generation? The paper also mentions a weight decay of 0.01.
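For reference, here is a minimal sketch of what I mean by single-timestep training (standard diffusion objective, epsilon prediction); the function and argument names are illustrative, not from this repo:

```python
import torch

def training_step(unet, latents, cond_emb, alphas_cumprod, num_timesteps=1000):
    """One training step: draw ONE random timestep per sample, add noise,
    and regress the noise -- no full denoising loop is ever run."""
    b = latents.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    pred = unet(noisy, t, cond_emb)
    return torch.nn.functional.mse_loss(pred, noise)

# The paper's optimizer setting would then be something like:
# torch.optim.AdamW(adapter.parameters(), lr=1e-4, weight_decay=0.01)
```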
Also, basing this on one of the diffusers training scripts might work better: it would give you xformers, different dtypes, batch size & gradient accumulation, AdamW weight decay, etc.
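For example, something along these lines with diffusers' `train_text_to_image.py` (flag names may differ slightly between versions, so treat this as a sketch):

```shell
accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --mixed_precision="fp16" \
  --enable_xformers_memory_efficient_attention \
  --train_batch_size=4 \
  --gradient_accumulation_steps=4 \
  --adam_weight_decay=0.01 \
  --learning_rate=1e-5
```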
I'm currently running a finetune of the existing SD1.5 weights as a test (LR 1e-5, xformers + fp16 for the pipeline, fp16 for the T5 encoder); I'll let it run for a few hours.