Why does inference generate the same video 3 times?

I have a single text-and-image prompt in the TXT file.
My configuration is the default, as downloaded from GitHub.
<img width="857" alt="image" src="https://github.com/user-attachments/assets/1ce294a6-697b-4aaa-8654-16699d4a1bbe">

Does it have anything to do with `'gpus_per_machine': 3`?

```

[2024-09-16 13:24:54,206] INFO: {'__name__': 'Config: VideoLDM Decoder', 'mean': [0.5, 0.5, 0.5], 'std': [0.5, 0.5, 0.5], 'max_words': 1000, 'num_workers': 6, 'prefetch_factor': 2, 'resolution': [1280, 704], 'vit_out_dim': 1024, 'vit_resolution': [224, 224], 'depth_clamp': 10.0, 'misc_size': 384, 'depth_std': 20.0, 'frame_lens': [16, 16, 16, 16, 16, 32, 32, 32], 'sample_fps': [8, 8, 16, 16, 16, 8, 16, 16], 'vid_dataset': {'type': 'VideoDataset', 'data_list': ['data/vid_list.txt'], 'max_words': 1000, 'resolution': [1280, 704], 'data_dir_list': ['data/videos/'], 'vit_resolution': [224, 224], 'get_first_frame': True}, 'img_dataset': {'type': 'ImageDataset', 'data_list': ['data/img_list.txt'], 'max_words': 1000, 'resolution': [1280, 704], 'data_dir_list': ['data/images'], 'vit_resolution': [224, 224]}, 'batch_sizes': {'1': 32, '4': 8, '8': 4, '16': 2, '32': 1}, 'Diffusion': {'type': 'DiffusionDDIM', 'schedule': 'cosine', 'schedule_param': {'num_timesteps': 1000, 'cosine_s': 0.008, 'zero_terminal_snr': True}, 'mean_type': 'v', 'loss_type': 'mse', 'var_type': 'fixed_small', 'rescale_timesteps': False, 'noise_strength': 0.1, 'ddim_timesteps': 50}, 'ddim_timesteps': 50, 'use_div_loss': False, 'p_zero': 0.0, 'guide_scale': 9.0, 'vit_mean': [0.48145466, 0.4578275, 0.40821073], 'vit_std': [0.26862954, 0.26130258, 0.27577711], 'sketch_mean': [0.485, 0.456, 0.406], 'sketch_std': [0.229, 0.224, 0.225], 'hist_sigma': 10.0, 'scale_factor': 0.18215, 'use_checkpoint': True, 'use_sharded_ddp': False, 'use_fsdp': False, 'use_fp16': True, 'temporal_attention': True, 'UNet': {'type': 'UNetSD_I2VGen', 'in_dim': 4, 'dim': 320, 'y_dim': 1024, 'context_dim': 1024, 'out_dim': 4, 'dim_mult': [1, 2, 4, 4], 'num_heads': 8, 'head_dim': 64, 'num_res_blocks': 2, 'attn_scales': [1.0, 0.5, 0.25], 'dropout': 0.1, 'temporal_attention': True, 'temporal_attn_times': 1, 'use_checkpoint': True, 'use_fps_condition': False, 'use_sim_mask': False, 'upper_len': 128, 'concat_dim': 4, 'default_fps': 8}, 
'guidances': [], 'auto_encoder': {'type': 'AutoencoderKL', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'video_kernel_size': [3, 1, 1]}, 'embed_dim': 4, 'pretrained': 'models/v2-1_512-ema-pruned.ckpt'}, 'embedder': {'type': 'FrozenOpenCLIPTextVisualEmbedder', 'layer': 'penultimate', 'pretrained': 'models/open_clip_pytorch_model.bin', 'vit_resolution': [224, 224]}, 'ema_decay': 0.9999, 'num_steps': 1000000, 'lr': 3e-05, 'weight_decay': 0.0, 'betas': [0.9, 0.999], 'eps': 1e-08, 'chunk_size': 2, 'decoder_bs': 2, 'alpha': 0.7, 'save_ckp_interval': 50, 'warmup_steps': 10, 'decay_mode': 'cosine', 'use_ema': True, 'load_from': None, 'Pretrain': {'type': 'pretrain_specific_strategies', 'fix_weight': False, 'grad_scale': 0.5, 'resume_checkpoint': 'models/i2vgen_xl_00854500.pth', 'sd_keys_path': 'models/stable_diffusion_image_key_temporal_attention_x1.json'}, 'viz_interval': 50, 'visual_train': {'type': 'VisualTrainTextImageToVideo', 'partial_keys': [['y', 'image', 'local_image', 'fps']], 'use_offset_noise': True, 'guide_scale': 9.0}, 'visual_inference': {'type': 'VisualGeneratedVideos'}, 'inference_list_path': '', 'log_interval': 1, 'log_dir': 'workspace/experiments/test_list_for_i2vgen', 'reward_type': 'HPSv2', 'temporal_reward_type': [], 'data_align_method': None, 'data_align_coef': 10, 'segments': 8, 'selection_method': 'fixed_first', 'exponential_TSN': True, 'lambda_TAR': 1.0, 'reward_normalization': False, 'positive_reward': False, 'partial_timestep': None, 'ddim_steps': [981, 961, 941, 921, 901, 881, 861, 841, 821, 801, 781, 761, 741, 721, 701, 681, 661, 641, 621, 601, 581, 561, 541, 521, 501, 481, 461, 441, 421, 401, 381, 361, 341, 321, 301, 281, 261, 241, 221, 201, 181, 161, 141, 121, 101, 81, 61, 41, 21, 1], 'motion_rep': None, 'low_penal_threshold': 0.05, 'reward_weights': {'reward': 1, 'reg': 1}, 'temp_dir': 
'workspace/temp_dir', 'adv_clip_max': 5, 'ST_reward_weights': {'spatial': 1, 'temporal': 1}, 'seed': 8888, 'negative_prompt': 'Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms', 'ENABLE': True, 'DATASET': 'webvid10m', 'TASK_TYPE': 'inference_i2vgen_entrance', 'max_frames': 16, 'target_fps': 16, 'scale': 8, 'round': 1, 'batch_size': 1, 'use_zero_infer': True, 'vldm_cfg': 'configs/i2vgen_xl_train.yaml', 'test_list_path': 'data/test_list_for_i2vgen.txt', 'test_model': 'models/i2vgen_xl_00854500.pth', 'cfg_file': 'configs/i2vgen_xl_infer.yaml', 'init_method': 'tcp://localhost:9999', 'debug': False, 'opts': [], 'pmi_rank': 0, 'pmi_world_size': 1, 'gpus_per_machine': 3, 'world_size': 3, 'noise_strength': 0.1, 'gpu': 1, 'rank': 1, 'log_file': 'workspace/experiments/test_list_for_i2vgen/log_01.txt'}
[2024-09-16 13:24:54,206] INFO: {'__name__': 'Config: VideoLDM Decoder', 'mean': [0.5, 0.5, 0.5], 'std': [0.5, 0.5, 0.5], 'max_words': 1000, 'num_workers': 6, 'prefetch_factor': 2, 'resolution': [1280, 704], 'vit_out_dim': 1024, 'vit_resolution': [224, 224], 'depth_clamp': 10.0, 'misc_size': 384, 'depth_std': 20.0, 'frame_lens': [16, 16, 16, 16, 16, 32, 32, 32], 'sample_fps': [8, 8, 16, 16, 16, 8, 16, 16], 'vid_dataset': {'type': 'VideoDataset', 'data_list': ['data/vid_list.txt'], 'max_words': 1000, 'resolution': [1280, 704], 'data_dir_list': ['data/videos/'], 'vit_resolution': [224, 224], 'get_first_frame': True}, 'img_dataset': {'type': 'ImageDataset', 'data_list': ['data/img_list.txt'], 'max_words': 1000, 'resolution': [1280, 704], 'data_dir_list': ['data/images'], 'vit_resolution': [224, 224]}, 'batch_sizes': {'1': 32, '4': 8, '8': 4, '16': 2, '32': 1}, 'Diffusion': {'type': 'DiffusionDDIM', 'schedule': 'cosine', 'schedule_param': {'num_timesteps': 1000, 'cosine_s': 0.008, 'zero_terminal_snr': True}, 'mean_type': 'v', 'loss_type': 'mse', 'var_type': 'fixed_small', 'rescale_timesteps': False, 'noise_strength': 0.1, 'ddim_timesteps': 50}, 'ddim_timesteps': 50, 'use_div_loss': False, 'p_zero': 0.0, 'guide_scale': 9.0, 'vit_mean': [0.48145466, 0.4578275, 0.40821073], 'vit_std': [0.26862954, 0.26130258, 0.27577711], 'sketch_mean': [0.485, 0.456, 0.406], 'sketch_std': [0.229, 0.224, 0.225], 'hist_sigma': 10.0, 'scale_factor': 0.18215, 'use_checkpoint': True, 'use_sharded_ddp': False, 'use_fsdp': False, 'use_fp16': True, 'temporal_attention': True, 'UNet': {'type': 'UNetSD_I2VGen', 'in_dim': 4, 'dim': 320, 'y_dim': 1024, 'context_dim': 1024, 'out_dim': 4, 'dim_mult': [1, 2, 4, 4], 'num_heads': 8, 'head_dim': 64, 'num_res_blocks': 2, 'attn_scales': [1.0, 0.5, 0.25], 'dropout': 0.1, 'temporal_attention': True, 'temporal_attn_times': 1, 'use_checkpoint': True, 'use_fps_condition': False, 'use_sim_mask': False, 'upper_len': 128, 'concat_dim': 4, 'default_fps': 8}, 
'guidances': [], 'auto_encoder': {'type': 'AutoencoderKL', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'video_kernel_size': [3, 1, 1]}, 'embed_dim': 4, 'pretrained': 'models/v2-1_512-ema-pruned.ckpt'}, 'embedder': {'type': 'FrozenOpenCLIPTextVisualEmbedder', 'layer': 'penultimate', 'pretrained': 'models/open_clip_pytorch_model.bin', 'vit_resolution': [224, 224]}, 'ema_decay': 0.9999, 'num_steps': 1000000, 'lr': 3e-05, 'weight_decay': 0.0, 'betas': [0.9, 0.999], 'eps': 1e-08, 'chunk_size': 2, 'decoder_bs': 2, 'alpha': 0.7, 'save_ckp_interval': 50, 'warmup_steps': 10, 'decay_mode': 'cosine', 'use_ema': True, 'load_from': None, 'Pretrain': {'type': 'pretrain_specific_strategies', 'fix_weight': False, 'grad_scale': 0.5, 'resume_checkpoint': 'models/i2vgen_xl_00854500.pth', 'sd_keys_path': 'models/stable_diffusion_image_key_temporal_attention_x1.json'}, 'viz_interval': 50, 'visual_train': {'type': 'VisualTrainTextImageToVideo', 'partial_keys': [['y', 'image', 'local_image', 'fps']], 'use_offset_noise': True, 'guide_scale': 9.0}, 'visual_inference': {'type': 'VisualGeneratedVideos'}, 'inference_list_path': '', 'log_interval': 1, 'log_dir': 'workspace/experiments/test_list_for_i2vgen', 'reward_type': 'HPSv2', 'temporal_reward_type': [], 'data_align_method': None, 'data_align_coef': 10, 'segments': 8, 'selection_method': 'fixed_first', 'exponential_TSN': True, 'lambda_TAR': 1.0, 'reward_normalization': False, 'positive_reward': False, 'partial_timestep': None, 'ddim_steps': [981, 961, 941, 921, 901, 881, 861, 841, 821, 801, 781, 761, 741, 721, 701, 681, 661, 641, 621, 601, 581, 561, 541, 521, 501, 481, 461, 441, 421, 401, 381, 361, 341, 321, 301, 281, 261, 241, 221, 201, 181, 161, 141, 121, 101, 81, 61, 41, 21, 1], 'motion_rep': None, 'low_penal_threshold': 0.05, 'reward_weights': {'reward': 1, 'reg': 1}, 'temp_dir': 
'workspace/temp_dir', 'adv_clip_max': 5, 'ST_reward_weights': {'spatial': 1, 'temporal': 1}, 'seed': 8888, 'negative_prompt': 'Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms', 'ENABLE': True, 'DATASET': 'webvid10m', 'TASK_TYPE': 'inference_i2vgen_entrance', 'max_frames': 16, 'target_fps': 16, 'scale': 8, 'round': 1, 'batch_size': 1, 'use_zero_infer': True, 'vldm_cfg': 'configs/i2vgen_xl_train.yaml', 'test_list_path': 'data/test_list_for_i2vgen.txt', 'test_model': 'models/i2vgen_xl_00854500.pth', 'cfg_file': 'configs/i2vgen_xl_infer.yaml', 'init_method': 'tcp://localhost:9999', 'debug': False, 'opts': [], 'pmi_rank': 0, 'pmi_world_size': 1, 'gpus_per_machine': 3, 'world_size': 3, 'noise_strength': 0.1, 'gpu': 0, 'rank': 0, 'log_file': 'workspace/experiments/test_list_for_i2vgen/log_00.txt'}
```
```
[2024-09-16 13:24:54,206] INFO: Going into it2v_fullid_img_text inference on 1 gpu
[2024-09-16 13:24:54,207] INFO: Going into it2v_fullid_img_text inference on 0 gpu
[2024-09-16 13:24:54,208] INFO: {'__name__': 'Config: VideoLDM Decoder', 'mean': [0.5, 0.5, 0.5], 'std': [0.5, 0.5, 0.5], 'max_words': 1000, 'num_workers': 6, 'prefetch_factor': 2, 'resolution': [1280, 704], 'vit_out_dim': 1024, 'vit_resolution': [224, 224], 'depth_clamp': 10.0, 'misc_size': 384, 'depth_std': 20.0, 'frame_lens': [16, 16, 16, 16, 16, 32, 32, 32], 'sample_fps': [8, 8, 16, 16, 16, 8, 16, 16], 'vid_dataset': {'type': 'VideoDataset', 'data_list': ['data/vid_list.txt'], 'max_words': 1000, 'resolution': [1280, 704], 'data_dir_list': ['data/videos/'], 'vit_resolution': [224, 224], 'get_first_frame': True}, 'img_dataset': {'type': 'ImageDataset', 'data_list': ['data/img_list.txt'], 'max_words': 1000, 'resolution': [1280, 704], 'data_dir_list': ['data/images'], 'vit_resolution': [224, 224]}, 'batch_sizes': {'1': 32, '4': 8, '8': 4, '16': 2, '32': 1}, 'Diffusion': {'type': 'DiffusionDDIM', 'schedule': 'cosine', 'schedule_param': {'num_timesteps': 1000, 'cosine_s': 0.008, 'zero_terminal_snr': True}, 'mean_type': 'v', 'loss_type': 'mse', 'var_type': 'fixed_small', 'rescale_timesteps': False, 'noise_strength': 0.1, 'ddim_timesteps': 50}, 'ddim_timesteps': 50, 'use_div_loss': False, 'p_zero': 0.0, 'guide_scale': 9.0, 'vit_mean': [0.48145466, 0.4578275, 0.40821073], 'vit_std': [0.26862954, 0.26130258, 0.27577711], 'sketch_mean': [0.485, 0.456, 0.406], 'sketch_std': [0.229, 0.224, 0.225], 'hist_sigma': 10.0, 'scale_factor': 0.18215, 'use_checkpoint': True, 'use_sharded_ddp': False, 'use_fsdp': False, 'use_fp16': True, 'temporal_attention': True, 'UNet': {'type': 'UNetSD_I2VGen', 'in_dim': 4, 'dim': 320, 'y_dim': 1024, 'context_dim': 1024, 'out_dim': 4, 'dim_mult': [1, 2, 4, 4], 'num_heads': 8, 'head_dim': 64, 'num_res_blocks': 2, 'attn_scales': [1.0, 0.5, 0.25], 'dropout': 0.1, 'temporal_attention': True, 'temporal_attn_times': 1, 'use_checkpoint': True, 'use_fps_condition': False, 'use_sim_mask': False, 'upper_len': 128, 'concat_dim': 4, 'default_fps': 8}, 
'guidances': [], 'auto_encoder': {'type': 'AutoencoderKL', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'video_kernel_size': [3, 1, 1]}, 'embed_dim': 4, 'pretrained': 'models/v2-1_512-ema-pruned.ckpt'}, 'embedder': {'type': 'FrozenOpenCLIPTextVisualEmbedder', 'layer': 'penultimate', 'pretrained': 'models/open_clip_pytorch_model.bin', 'vit_resolution': [224, 224]}, 'ema_decay': 0.9999, 'num_steps': 1000000, 'lr': 3e-05, 'weight_decay': 0.0, 'betas': [0.9, 0.999], 'eps': 1e-08, 'chunk_size': 2, 'decoder_bs': 2, 'alpha': 0.7, 'save_ckp_interval': 50, 'warmup_steps': 10, 'decay_mode': 'cosine', 'use_ema': True, 'load_from': None, 'Pretrain': {'type': 'pretrain_specific_strategies', 'fix_weight': False, 'grad_scale': 0.5, 'resume_checkpoint': 'models/i2vgen_xl_00854500.pth', 'sd_keys_path': 'models/stable_diffusion_image_key_temporal_attention_x1.json'}, 'viz_interval': 50, 'visual_train': {'type': 'VisualTrainTextImageToVideo', 'partial_keys': [['y', 'image', 'local_image', 'fps']], 'use_offset_noise': True, 'guide_scale': 9.0}, 'visual_inference': {'type': 'VisualGeneratedVideos'}, 'inference_list_path': '', 'log_interval': 1, 'log_dir': 'workspace/experiments/test_list_for_i2vgen', 'reward_type': 'HPSv2', 'temporal_reward_type': [], 'data_align_method': None, 'data_align_coef': 10, 'segments': 8, 'selection_method': 'fixed_first', 'exponential_TSN': True, 'lambda_TAR': 1.0, 'reward_normalization': False, 'positive_reward': False, 'partial_timestep': None, 'ddim_steps': [981, 961, 941, 921, 901, 881, 861, 841, 821, 801, 781, 761, 741, 721, 701, 681, 661, 641, 621, 601, 581, 561, 541, 521, 501, 481, 461, 441, 421, 401, 381, 361, 341, 321, 301, 281, 261, 241, 221, 201, 181, 161, 141, 121, 101, 81, 61, 41, 21, 1], 'motion_rep': None, 'low_penal_threshold': 0.05, 'reward_weights': {'reward': 1, 'reg': 1}, 'temp_dir': 
'workspace/temp_dir', 'adv_clip_max': 5, 'ST_reward_weights': {'spatial': 1, 'temporal': 1}, 'seed': 8888, 'negative_prompt': 'Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms', 'ENABLE': True, 'DATASET': 'webvid10m', 'TASK_TYPE': 'inference_i2vgen_entrance', 'max_frames': 16, 'target_fps': 16, 'scale': 8, 'round': 1, 'batch_size': 1, 'use_zero_infer': True, 'vldm_cfg': 'configs/i2vgen_xl_train.yaml', 'test_list_path': 'data/test_list_for_i2vgen.txt', 'test_model': 'models/i2vgen_xl_00854500.pth', 'cfg_file': 'configs/i2vgen_xl_infer.yaml', 'init_method': 'tcp://localhost:9999', 'debug': False, 'opts': [], 'pmi_rank': 0, 'pmi_world_size': 1, 'gpus_per_machine': 3, 'world_size': 3, 'noise_strength': 0.1, 'gpu': 2, 'rank': 2, 'log_file': 'workspace/experiments/test_list_for_i2vgen/log_02.txt'}
[2024-09-16 13:24:54,209] INFO: Going into it2v_fullid_img_text inference on 2 gpu
[2024-09-16 13:24:54,237] INFO: Loading ViT-H-14 model config.
[2024-09-16 13:24:54,238] INFO: Loading ViT-H-14 model config.
[2024-09-16 13:24:54,247] INFO: Loading ViT-H-14 model config.
[2024-09-16 13:25:02,048] INFO: Loading pretrained ViT-H-14 weights (models/open_clip_pytorch_model.bin).
[2024-09-16 13:25:02,208] INFO: Loading pretrained ViT-H-14 weights (models/open_clip_pytorch_model.bin).
[2024-09-16 13:25:02,218] INFO: Loading pretrained ViT-H-14 weights (models/open_clip_pytorch_model.bin).
[2024-09-16 13:25:13,304] INFO: Restored from models/v2-1_512-ema-pruned.ckpt
[2024-09-16 13:25:13,529] INFO: Restored from models/v2-1_512-ema-pruned.ckpt
[2024-09-16 13:25:13,579] INFO: Restored from models/v2-1_512-ema-pruned.ckpt
[2024-09-16 13:25:27,219] INFO: Load model from models/i2vgen_xl_00854500.pth with status <All keys matched successfully>
[2024-09-16 13:25:27,337] INFO: Load model from models/i2vgen_xl_00854500.pth with status <All keys matched successfully>
[2024-09-16 13:25:27,771] INFO: Load model from models/i2vgen_xl_00854500.pth with status <All keys matched successfully>
[2024-09-16 13:25:34,727] INFO: There are 1 videos. with 1 times
[2024-09-16 13:25:34,728] INFO: [0]/[1] Begin to sample data/test_images/img_0013.png|||Five goldfish in the style of Chinese painting are swimming ...
[2024-09-16 13:25:34,729] INFO: There are 1 videos. with 1 times
[2024-09-16 13:25:34,729] INFO: [0]/[1] Begin to sample data/test_images/img_0013.png|||Five goldfish in the style of Chinese painting are swimming ...
[2024-09-16 13:25:34,730] INFO: There are 1 videos. with 1 times
[2024-09-16 13:25:34,730] INFO: [0]/[1] Begin to sample data/test_images/img_0013.png|||Five goldfish in the style of Chinese painting are swimming ...
[2024-09-16 13:25:37,776] INFO: GPU Memory used 17.43 GB
[2024-09-16 13:25:37,795] INFO: GPU Memory used 17.43 GB
[2024-09-16 13:25:37,818] INFO: GPU Memory used 17.43 GB
```
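
For reference, the log shows three processes (`'rank': 0`, `1`, `2`, with `'world_size': 3`), and each one reports "There are 1 videos" and starts sampling the same prompt. Below is a minimal sketch of why that would give three copies, assuming every rank iterates over the full test list. The function `items_for_rank` and its `shard` flag are hypothetical names for illustration only and are not taken from the repository's code.

```python
def items_for_rank(test_list, rank, world_size, shard):
    """Return the prompts one worker process (rank) would handle.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if shard:
        # Round-robin split: each prompt is assigned to exactly one rank.
        return [item for i, item in enumerate(test_list) if i % world_size == rank]
    # No split: every rank sees, and samples, the full list.
    return list(test_list)


# The single prompt from the log (kept truncated, as logged).
test_list = [
    "data/test_images/img_0013.png|||Five goldfish in the style of Chinese painting are swimming ..."
]
world_size = 3  # matches 'world_size': 3 in the logged config

unsharded = [items_for_rank(test_list, r, world_size, shard=False) for r in range(world_size)]
sharded = [items_for_rank(test_list, r, world_size, shard=True) for r in range(world_size)]

# Total number of videos produced across all ranks in each case.
print("without sharding:", sum(len(x) for x in unsharded))
print("with sharding:", sum(len(x) for x in sharded))
```

If this assumption holds, running on a single GPU or splitting the list by rank would produce one video per prompt instead of one per GPU.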
