Based on extensive feedback from users and researchers, we have identified several differences between the Community Edition FantasyTalking ComfyUI workflow and the official implementation that may degrade the user experience:
### Inaccurate Lip-Sync
- See Issue #54.
- Cause Analysis: The ComfyUI workflow introduces an Audio CFG Scheduling mechanism that applies audio conditioning only during the first 10% of denoising steps. By contrast, the official implementation applies Audio CFG throughout all denoising steps, ensuring precise audio-video alignment.
- Recommendation: In the Audio CFG Schedule node, set `end_percent` to 0.7 or higher (the paper uses 1.0). This extends the range of denoising steps influenced by audio conditioning and markedly improves lip-sync quality.
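To illustrate why this default matters, here is a small sketch (not part of the workflow; the helper name is made up) that converts `end_percent` into the number of denoising steps that actually receive audio conditioning:

```python
def audio_conditioned_steps(total_steps: int, end_percent: float) -> int:
    """Number of the first `total_steps` denoising steps that fall
    inside the audio-CFG window [0, end_percent]. Illustrative helper,
    not actual node code."""
    return min(total_steps, int(round(total_steps * end_percent)))

# With 30 sampling steps, the community default of 0.1 conditions only
# 3 steps, while the recommended 0.7 conditions 21 of the 30 steps.
print(audio_conditioned_steps(30, 0.1))  # 3
print(audio_conditioned_steps(30, 0.7))  # 21
```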
### First-Frame Color Shift and Blur
- See Issue #194.
- This is a common artifact in current video diffusion / synthesis models: the first frame may suffer from color drift or loss of focus. Targeted post-processing or model configuration tweaks are usually required.
### Limited Control and Consistency in Long-Form Video Generation
- Long sequences often exhibit motion drift or detail inconsistency. Mitigation strategies include segmented generation, key-frame locking and cyclic consistency constraints.
### Key Parameters
- `end_percent` (Audio CFG Schedule): the proportion of denoising steps that receive audio conditioning. Official recommendation: ≥ 0.7.
- Other critical hyper-parameters (sampling steps, resolution, CFG scale, etc.) follow standard video diffusion practice; adjust according to the official docs/examples.
### Common Troubleshooting
- Lip-sync issues → Verify that Audio CFG Schedule is enabled and `end_percent` is set correctly.
- Blurred first frame → Add “Frame Interpolation” or “Deblur” post-processing nodes, or reduce the initial noise strength.
- Long-form drift → Generate in segments with key-frame interpolation, or use the official long-video consistency script.
### Recommended Defaults
- `end_percent` ≥ 0.7 (strongly recommended)
- Sampling steps ≥ 30; output resolution = target resolution
- Tune `CFG scale` (typically 3.0 – 7.0) to balance fidelity and creativity
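These defaults can be collected into a plain settings dict for reference (the key names here are illustrative, not actual ComfyUI node field names):

```python
# Illustrative summary of the recommended defaults above.
# Key names are hypothetical, not real ComfyUI node inputs.
recommended_defaults = {
    "end_percent": 0.7,    # >= 0.7 strongly recommended
    "sampling_steps": 30,  # >= 30
    "cfg_scale": 5.0,      # tune within 3.0 - 7.0
}

assert 3.0 <= recommended_defaults["cfg_scale"] <= 7.0
```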
Following these guidelines should bring the Community Edition ComfyUI workflow much closer to the official implementation in terms of lip-sync accuracy and visual quality. If further assistance is needed, please comment under the relevant issue or contact us directly.
### ComfyUI Workflow
- In the FantasyTalking ComfyUI workflow, set `end_percent` in the Audio CFG Schedule node to ≥ 0.7.
- This parameter defines the fraction of denoising steps receiving audio conditioning. The original workflow default is 0.1, which allows too few steps for FantasyTalking to influence lip-sync adequately.
### Personalization
- You can control character motion, behavior and emotion via prompts, and adjust the influence of prompts and audio using `prompt_cfg_scale` and `audio_cfg_scale` (both exposed in ComfyUI).
- For stronger lip-sync, increase `audio_cfg_scale`. A typical range for both scales is 3.0 – 7.0.
### WanVideo CFG Schedule Float List
- `cfg_scale_start`, `cfg_scale_end`, `interpolation`: define the scaling range of the audio CFG.
- `start_percent`, `end_percent`: delimit the denoising interval where audio CFG is applied.
The node outputs a float list assigning an audio CFG value to each denoising step; steps outside the specified range default to 1.0.
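Based on that description, the node’s behavior can be approximated as follows (an assumed re-implementation sketch, not the node’s actual source; only linear interpolation is shown, and the function name is made up):

```python
def cfg_schedule_float_list(steps, cfg_scale_start, cfg_scale_end,
                            start_percent, end_percent):
    """Sketch of a per-step audio-CFG schedule: linearly interpolate
    between cfg_scale_start and cfg_scale_end inside the window
    [start_percent, end_percent]; steps outside the window get 1.0."""
    values = []
    for i in range(steps):
        t = i / max(steps - 1, 1)  # progress through denoising, 0..1
        if start_percent <= t <= end_percent:
            span = max(end_percent - start_percent, 1e-8)
            frac = (t - start_percent) / span
            values.append(cfg_scale_start
                          + frac * (cfg_scale_end - cfg_scale_start))
        else:
            values.append(1.0)  # outside the window: default to 1.0
    return values

# Community default: audio CFG of 5 only in the first 10% of 20 steps.
sched = cfg_schedule_float_list(20, 5.0, 5.0, 0.0, 0.1)
```

With these inputs only the first two of twenty steps receive the audio CFG value; every later step falls back to 1.0, which is exactly the behavior criticized below.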
The community workflow applies `audio_cfg = 5` only during the first 10% of denoising steps, limiting FantasyTalking’s influence and causing misalignment. The official implementation applies `audio_cfg = 4.5` across all steps.
🔥 Solution: Increase `end_percent` so that audio CFG spans more denoising steps; this greatly improves lip-sync.
Because FantasyTalking is trained on an I2V-based video generation model, first-frame artifacts (color shift, blur) inherit from the base model. FantasyTalking’s audio conditioning can further amplify the issue (see Issue #194).
🔥 Solution: Lower FantasyTalking’s influence by reducing the number of denoising steps affected (`end_percent` = 0.7) or lowering `audio_cfg_scale`. This biases generation toward the base model’s first-frame reference, at the cost of some lip-sync precision.
### Experimental Comparison
Adjusting end_percent in the Audio CFG Schedule node yields markedly different results:
| end_percent | Generated Video | Effect Description |
|---|---|---|
| 0.1 | end0.1.mp4 | The lip-sync is not very good. |
| 0.4 | end0.4.mp4 | The lip-sync is quite good. |
| 0.7 | end0.7.mp4 | The lip-sync is very good. |
| 1.0 | end1.0.mp4 | The first frame may be blurry and have color deviations. |
Wan 2.1 still faces accuracy and consistency challenges for long clips (see Issue #87 and Issue #166). Current workarounds include:
- Concatenating head/tail clips
- Sliding-window generation
We recommend the sliding-window approach: denoise in overlapping chunks and fuse latent features in the overlap region. See kijai’s long-video generation workflow for an implementation.
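The sliding-window idea can be sketched on toy 1-D “latents” (one value per frame) as follows. This is a simplified illustration, not kijai’s implementation; `denoise_chunk` is a placeholder for the real per-chunk sampler, and the linear cross-fade weights are an assumption:

```python
import numpy as np

def sliding_window_latents(num_frames, window, overlap, denoise_chunk):
    """Toy sliding-window generation: denoise overlapping chunks and
    cross-fade latent features in the overlap regions, then normalize
    by the accumulated blend weights."""
    out = np.zeros(num_frames)
    weight = np.zeros(num_frames)
    start = 0
    while True:
        end = min(start + window, num_frames)
        chunk = denoise_chunk(start, end)   # placeholder sampler call
        w = np.ones(end - start)
        fade = min(overlap, end - start)
        if start > 0:                       # fade in over the overlap
            w[:fade] = np.linspace(0, 1, fade + 2)[1:-1]
        if end < num_frames:                # fade out into the next chunk
            w[-fade:] = np.linspace(1, 0, fade + 2)[1:-1]
        out[start:end] += chunk * w
        weight[start:end] += w
        if end == num_frames:
            break
        start += window - overlap
    return out / weight

# Sanity check: constant chunks should reconstruct a constant sequence.
result = sliding_window_latents(40, 16, 4, lambda s, e: np.ones(e - s))
```

Because the fade-in and fade-out ramps in each overlap sum to one, adjacent chunks blend smoothly instead of producing a hard seam at the chunk boundary.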
