Results for 4090 #1

@BenjaminBossan

Unfortunately, only the first two experiments, no_fa3 and no_compile_fa3, succeeded; all the others ran OOM. So FP8 appears to be the common denominator: it is required for the experiments to succeed on this card.
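As a quick sanity check on the numbers in the logs below, the reported `time_mean` and `time_var` can be reproduced from the raw `timings` list. This is just a sketch; note that the reported variance matches the sample (n-1) estimator, which is also `torch.var`'s default:

```python
import statistics

# Per-image timings (seconds) from the no_fa3 run's out_dict.
timings = [17.71, 17.722]

time_mean = statistics.mean(timings)     # arithmetic mean -> 17.716
time_var = statistics.variance(timings)  # sample variance (n-1) -> ~7.2e-05

print(time_mean, time_var)
```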

Click to see logs
args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=False, disable_compile=False, disable_recompile_error=False, disable_hotswap=False, quantize_t5=True, offload=False, max_rank=128, out_dir=PosixPath('no_fa3'))
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.44s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00,  8.74s/it]
Loading pipeline components...: 100%|██████████| 7/7 [00:31<00:00,  4.50s/it]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
Token indices sequence length is longer than the specified maximum sequence length for this model (96 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
Loading repo_id='renderartist/retrocomicflux'
Benchmark completed in 351.19 seconds.
out_dict={'timings': [17.71, 17.722], 'time_mean': 17.715999603271484, 'time_var': 7.201245171017945e-05, 'img_paths': ['no_fa3/glif_l0w-r3z.png', 'no_fa3/renderartist_retrocomicflux.png']}


args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=False, disable_compile=True, disable_recompile_error=False, disable_hotswap=False, quantize_t5=True, offload=False, max_rank=128, out_dir=PosixPath('no_compile_fa3'))
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00,  8.76s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.43s/it]
Loading pipeline components...:  29%|██▊       | 2/7 [00:31<01:08, 13.78s/it]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████| 7/7 [00:31<00:00,  4.50s/it]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
Token indices sequence length is longer than the specified maximum sequence length for this model (96 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['in the lower corner show a price of 1 5 cents and the date sep 2 0 2 4']
Loading repo_id='renderartist/retrocomicflux'
Benchmark completed in 237.86 seconds.
out_dict={'timings': [23.446, 23.306], 'time_mean': 23.375999450683594, 'time_var': 0.009799914434552193, 'img_paths': ['no_compile_fa3/glif_l0w-r3z.png', 'no_compile_fa3/renderartist_retrocomicflux.png']}


args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=True, disable_compile=False, disable_recompile_error=True, disable_hotswap=False, quantize_t5=False, offload=True, max_rank=128, out_dir=PosixPath('no_fa3_fp8_nf4'))
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 120.79it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 88.45it/s]
Loading pipeline components...:  43%|████▎     | 3/7 [00:00<00:00, 26.85it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00, 24.32it/s]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `posix.putenv.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
Traceback (most recent call last):
  File "/home/name/work/clones/lora-fast/run_benchmark.py", line 26, in <module>
    out_dict = bench_manager.run_benchmark(LORA_MAPPINGS)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 108, in run_benchmark
    image = self.run_inference(self.pipe, pipe_kwargs, args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 85, in run_inference
    return pipe(**pipe_kwargs).images[0]
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 913, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 718, in pre_forward
    self.prev_module_hook.offload()
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 719, in torch_dynamo_resume_in_pre_forward_at_718
    clear_device_cache()
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 720, in torch_dynamo_resume_in_pre_forward_at_719
    module.to(self.execution_device)
  File "/home/name/work/forks/diffusers/src/diffusers/models/modeling_utils.py", line 1383, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
           ^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 45.75 MiB is free. Including non-PyTorch memory, this process has 23.48 GiB memory in use. Of the allocated memory 22.72 GiB is allocated by PyTorch, and 257.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=True, disable_compile=False, disable_recompile_error=True, disable_hotswap=False, quantize_t5=True, offload=True, max_rank=128, out_dir=PosixPath('no_fa3_fp8'))
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.49s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 96.44it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████| 7/7 [00:05<00:00,  1.30it/s]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `<unknown module>.TensorBase._make_subclass.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `posix.putenv.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
Traceback (most recent call last):
  File "/home/name/work/clones/lora-fast/run_benchmark.py", line 26, in <module>
    out_dict = bench_manager.run_benchmark(LORA_MAPPINGS)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 108, in run_benchmark
    image = self.run_inference(self.pipe, pipe_kwargs, args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 85, in run_inference
    return pipe(**pipe_kwargs).images[0]
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 913, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 718, in pre_forward
    self.prev_module_hook.offload()
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 719, in torch_dynamo_resume_in_pre_forward_at_718
    clear_device_cache()
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 720, in torch_dynamo_resume_in_pre_forward_at_719
    module.to(self.execution_device)
  File "/home/name/work/forks/diffusers/src/diffusers/models/modeling_utils.py", line 1383, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
           ^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 47.75 MiB is free. Including non-PyTorch memory, this process has 23.48 GiB memory in use. Of the allocated memory 22.72 GiB is allocated by PyTorch, and 256.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=True, disable_compile=True, disable_recompile_error=True, disable_hotswap=False, quantize_t5=False, offload=True, max_rank=128, out_dir=PosixPath('no_fa3_fp8_nf4_compile'))
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 129.33it/s]
Loading pipeline components...:  71%|███████▏  | 5/7 [00:00<00:00, 48.06it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 92.10it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00, 24.94it/s]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
Traceback (most recent call last):
  File "/home/name/work/clones/lora-fast/run_benchmark.py", line 26, in <module>
    out_dict = bench_manager.run_benchmark(LORA_MAPPINGS)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 108, in run_benchmark
    image = self.run_inference(self.pipe, pipe_kwargs, args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 85, in run_inference
    return pipe(**pipe_kwargs).images[0]
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 913, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 720, in pre_forward
    module.to(self.execution_device)
  File "/home/name/work/forks/diffusers/src/diffusers/models/modeling_utils.py", line 1383, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
           ^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 45.75 MiB is free. Including non-PyTorch memory, this process has 23.48 GiB memory in use. Of the allocated memory 22.72 GiB is allocated by PyTorch, and 257.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=True, disable_compile=True, disable_recompile_error=True, disable_hotswap=False, quantize_t5=True, offload=True, max_rank=128, out_dir=PosixPath('no_fa3_fp8_compile'))
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 87.39it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.50s/it]
Loading pipeline components...: 100%|██████████| 7/7 [00:05<00:00,  1.30it/s]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
Traceback (most recent call last):
  File "/home/name/work/clones/lora-fast/run_benchmark.py", line 26, in <module>
    out_dict = bench_manager.run_benchmark(LORA_MAPPINGS)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 108, in run_benchmark
    image = self.run_inference(self.pipe, pipe_kwargs, args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 85, in run_inference
    return pipe(**pipe_kwargs).images[0]
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 913, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 720, in pre_forward
    module.to(self.execution_device)
  File "/home/name/work/forks/diffusers/src/diffusers/models/modeling_utils.py", line 1383, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
           ^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 47.75 MiB is free. Including non-PyTorch memory, this process has 23.48 GiB memory in use. Of the allocated memory 22.72 GiB is allocated by PyTorch, and 256.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=True, disable_compile=False, disable_recompile_error=True, disable_hotswap=True, quantize_t5=False, offload=True, max_rank=128, out_dir=PosixPath('no_fa3_fp8_nf4_hotswap'))
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 92.39it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 129.88it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00, 25.84it/s]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `posix.putenv.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
Traceback (most recent call last):
  File "/home/name/work/clones/lora-fast/run_benchmark.py", line 26, in <module>
    out_dict = bench_manager.run_benchmark(LORA_MAPPINGS)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 108, in run_benchmark
    image = self.run_inference(self.pipe, pipe_kwargs, args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 85, in run_inference
    return pipe(**pipe_kwargs).images[0]
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 913, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in torch_dynamo_resume_in_new_forward_at_170
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1209, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
    outs = compiled_fn(args)
           ^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 460, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/utils.py", line 2404, in run
    return model(new_inputs)
           ^^^^^^^^^^^^^^^^^
  File "/tmp/torchinductor_name/ju/cjulupfc4jcwfrfojcurabtjn7mwycfwl354nfp5hsnbhvhoys3h.py", line 5892, in call
    triton_poi_fused_mm_18.run(buf94, buf95, 196608, stream=stream0)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 909, in run
    self.autotune_to_one_config(*args, **kwargs)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 763, in autotune_to_one_config
    timings = self.benchmark_all_configs(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 738, in benchmark_all_configs
    launcher: self.bench(launcher, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 616, in bench
    return benchmarker.benchmark_gpu(kernel_call, rep=40)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 247, in benchmark_gpu
    buffer = torch.empty(self.L2_cache_size // 4, dtype=torch.int, device="cuda")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 59.75 MiB is free. Including non-PyTorch memory, this process has 23.46 GiB memory in use. Of the allocated memory 22.94 GiB is allocated by PyTorch, and 17.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=True, disable_compile=False, disable_recompile_error=True, disable_hotswap=True, quantize_t5=True, offload=True, max_rank=128, out_dir=PosixPath('no_fa3_fp8_hotswap'))
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.51s/it]
Loading pipeline components...:  14%|█▍        | 1/7 [00:05<00:31,  5.19s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 91.64it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:05<00:00,  1.29it/s]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `<unknown module>.TensorBase._make_subclass.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `posix.putenv.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
Traceback (most recent call last):
  File "/home/name/work/clones/lora-fast/run_benchmark.py", line 26, in <module>
    out_dict = bench_manager.run_benchmark(LORA_MAPPINGS)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 108, in run_benchmark
    image = self.run_inference(self.pipe, pipe_kwargs, args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 85, in run_inference
    return pipe(**pipe_kwargs).images[0]
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 913, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in torch_dynamo_resume_in_new_forward_at_170
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1209, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 460, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/utils.py", line 2404, in run
    return model(new_inputs)
           ^^^^^^^^^^^^^^^^^
  File "/tmp/torchinductor_name/ju/cjulupfc4jcwfrfojcurabtjn7mwycfwl354nfp5hsnbhvhoys3h.py", line 5892, in call
    triton_poi_fused_mm_18.run(buf94, buf95, 196608, stream=stream0)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 909, in run
    self.autotune_to_one_config(*args, **kwargs)
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 763, in autotune_to_one_config
    timings = self.benchmark_all_configs(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 738, in benchmark_all_configs
    launcher: self.bench(launcher, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 616, in bench
    return benchmarker.benchmark_gpu(kernel_call, rep=40)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 247, in benchmark_gpu
    buffer = torch.empty(self.L2_cache_size // 4, dtype=torch.int, device="cuda")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 67.75 MiB is free. Including non-PyTorch memory, this process has 23.46 GiB memory in use. Of the allocated memory 22.95 GiB is allocated by PyTorch, and 6.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=True, disable_compile=True, disable_recompile_error=True, disable_hotswap=True, quantize_t5=False, offload=True, max_rank=128, out_dir=PosixPath('no_fa3_fp8_nf4_hotswap_comp'))
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 89.40it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 133.26it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00, 25.31it/s]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
Traceback (most recent call last):
  File "/home/name/work/clones/lora-fast/run_benchmark.py", line 26, in <module>
    out_dict = bench_manager.run_benchmark(LORA_MAPPINGS)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 108, in run_benchmark
    image = self.run_inference(self.pipe, pipe_kwargs, args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 85, in run_inference
    return pipe(**pipe_kwargs).images[0]
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 913, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 175, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 490, in forward
    encoder_hidden_states, hidden_states = block(
                                           ^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 151, in forward
    attention_outputs = self.attn(
                        ^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/attention_processor.py", line 605, in forward
    return self.processor(
           ^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/attention_processor.py", line 2339, in __call__
    query = apply_rotary_emb(query, image_rotary_emb)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/embeddings.py", line 1211, in apply_rotary_emb
    out = (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
                             ~~~~~~~~~~~~~~~~~~^~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 41.75 MiB is free. Including non-PyTorch memory, this process has 23.48 GiB memory in use. Of the allocated memory 22.91 GiB is allocated by PyTorch, and 68.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


args=Namespace(ckpt_id='black-forest-labs/FLUX.1-dev', seed=0, disable_fa3=True, disable_fp8=True, disable_compile=True, disable_recompile_error=True, disable_hotswap=True, quantize_t5=True, offload=True, max_rank=128, out_dir=PosixPath('no_fa3_fp8_hotswap_comp'))
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.48s/it]
Loading pipeline components...:  43%|████▎     | 3/7 [00:05<00:06,  1.72s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 92.49it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:05<00:00,  1.30it/s]
Loading repo_id='glif/l0w-r3z'

WARN  Feature `utils/Perplexity` requires python GIL. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying `prefix=None` to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
Traceback (most recent call last):
  File "/home/name/work/clones/lora-fast/run_benchmark.py", line 26, in <module>
    out_dict = bench_manager.run_benchmark(LORA_MAPPINGS)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 108, in run_benchmark
    image = self.run_inference(self.pipe, pipe_kwargs, args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/clones/lora-fast/utils/benchmark_utils.py", line 85, in run_inference
    return pipe(**pipe_kwargs).images[0]
           ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 913, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/accelerate/hooks.py", line 175, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 490, in forward
    encoder_hidden_states, hidden_states = block(
                                           ^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 151, in forward
    attention_outputs = self.attn(
                        ^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/attention_processor.py", line 605, in forward
    return self.processor(
           ^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/attention_processor.py", line 2339, in __call__
    query = apply_rotary_emb(query, image_rotary_emb)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/diffusers/src/diffusers/models/embeddings.py", line 1211, in apply_rotary_emb
    out = (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
                             ~~~~~~~~~~~~~~~~~~^~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 41.75 MiB is free. Including non-PyTorch memory, this process has 23.48 GiB memory in use. Of the allocated memory 22.91 GiB is allocated by PyTorch, and 68.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
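A rough sketch of why FP8 looks like the common denominator here. Assuming the FLUX.1-dev transformer has roughly 12B parameters (the parameter count is an assumption on my part, not taken from the logs), the bf16 weights alone nearly fill the 23.55 GiB the OOM messages report, leaving no headroom for activations, the text encoders, or Inductor's autotuning buffers, while 1-byte FP8 weights would leave about half the card free:

```python
# Back-of-envelope VRAM estimate; ~12B params for the FLUX.1-dev
# transformer is an assumption, 23.55 GiB is taken from the OOM logs.
PARAMS = 12e9
GIB = 1024 ** 3
CAPACITY_GIB = 23.55  # usable VRAM reported by torch.OutOfMemoryError above

bf16_gib = PARAMS * 2 / GIB  # 2 bytes per parameter in bf16
fp8_gib = PARAMS * 1 / GIB   # 1 byte per parameter in fp8

print(f"bf16 weights: {bf16_gib:.1f} GiB of {CAPACITY_GIB} GiB")
print(f"fp8  weights: {fp8_gib:.1f} GiB of {CAPACITY_GIB} GiB")
```

This also fits the failure mode: every run dies on a small (54-72 MiB) allocation, i.e. the card is already almost exactly full, so shaving the weight footprint (FP8) rather than the activation footprint is what tips the runs from OOM to success.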
