Hi authors,
Thanks for sharing your inspiring work! After carefully reading through your paper, I have some questions regarding the method and dataset:
- In the two-stage training, you mentioned that in Stage 1 only the VGGT part is trained while the video generation part from Wan is frozen, and camera control is added in Stage 2. Why not train the camera control modules separately before Stage 1?
- Could you elaborate on the dataset processing used in this work, specifically the reconstruction method used to "generate multiview consistent depth maps using a reconstruction-based pipeline"?
- What are the minimum GPU hardware requirements to train and test this method?
- Is the FantasyWorld-1.0 on the leaderboard associated with the Wan 2.2 version in this repository?
Thanks a lot for your help in advance!