
Per-chunk gc.collect() + cuda.empty_cache() in batch loops (runs on CPU-only too) #327

Description

@stanlrt

From /simplify codebase sweep (2026-06-10).

A full gc.collect() (all generations) plus a CUDA allocator flush (empty_cache()) runs once per mini-batch in transparency/explainers/base_explainer.py:231-233,265-268 and robustness/assessors/base_assessor.py:333-336, with the same pattern in task_families/classification.py:88-99. On CPU-only runs empty_cache() is skipped but gc.collect() still runs.

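Cost: a full gc.collect() walks every GC-tracked object, so its cost scales with the number of live container objects. With batch sizes 4-8 over hundreds of samples this adds tens of ms per chunk and can dominate cheap explainers (Saliency). Per-chunk empty_cache() also returns cached blocks to the driver, so subsequent allocations cannot reuse the cache and have to go back through the CUDA allocator.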
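To check the per-chunk cost on a given machine, a rough timing sketch along these lines may help. The helper name `mean_collect_seconds` and the object counts are illustrative assumptions, not code from the repository, and the numbers will vary with the interpreter and the real heap contents.

```python
import gc
import time


def mean_collect_seconds(n_live_objects=200_000, repeats=5):
    """Average wall time of a full gc.collect() while many containers stay alive.

    Hypothetical helper for illustration; not part of the codebase.
    """
    # Small lists are GC-tracked, so each one adds to the work of a full collection.
    live = [[i] for i in range(n_live_objects)]
    gc.collect()  # warm-up pass so earlier garbage does not skew the timing
    start = time.perf_counter()
    for _ in range(repeats):
        gc.collect()
    elapsed = time.perf_counter() - start
    assert len(live) == n_live_objects  # keep `live` referenced until timing ends
    return elapsed / repeats


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9} live objects: {mean_collect_seconds(n) * 1e3:.2f} ms per collect")
```

Comparing the printed figure against the runtime of one Saliency chunk would show whether the collect really dominates in practice.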

Fix: collect once after the loop, or guard the per-chunk collect behind torch.cuda.is_available(). The leak it mitigates is CUDA-graph/hook retention, so CPU runs gain nothing. Keep per-chunk empty_cache() only if the memory-leak tests demand it. Verify against src/raitap/tests/test_memory_leaks.py before changing cadence.
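A minimal sketch of the proposed cadence, assuming a generic chunked loop. The names `release_memory`, `run_in_chunks`, and `per_chunk_cleanup` are hypothetical and do not match the actual explainer or assessor code; only `gc.collect()`, `torch.cuda.is_available()`, and `torch.cuda.empty_cache()` are real calls.

```python
import gc

try:
    import torch
    _CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None
    _CUDA = False


def release_memory(cuda_available=_CUDA):
    """Collect garbage and flush the CUDA cache, but only when CUDA is in use.

    Returns True if a cleanup ran, False if it was skipped (CPU-only).
    """
    if not cuda_available:
        return False  # CPU-only: skip gc.collect() as well as empty_cache()
    gc.collect()
    torch.cuda.empty_cache()
    return True


def run_in_chunks(samples, process_chunk, chunk_size=8, per_chunk_cleanup=False):
    """Process samples in mini-batches and clean up once after the loop.

    per_chunk_cleanup is an assumed opt-in for the case where the
    memory-leak tests require the old per-chunk cadence.
    """
    results = []
    for start in range(0, len(samples), chunk_size):
        results.extend(process_chunk(samples[start:start + chunk_size]))
        if per_chunk_cleanup:
            release_memory()
    release_memory()  # single cleanup after the loop
    return results
```

With `per_chunk_cleanup=False` the loop does no collection at all on CPU and exactly one collect plus flush on CUDA; flipping the flag restores the current behaviour on CUDA only, which makes it easy to compare both cadences against the leak tests.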
