
[Discussion] Sub-file batching — Decouple GPU memory usage from model file size #71

@ABNER-1


Problem

Current state: Loading granularity is tightly coupled with model file size

Currently, fastsafetensors requires each rank to load an entire safetensors shard file into GPU memory (device buffer) at once, then broadcast the tensors within it to all other ranks before moving on to the next batch.

This means the peak extra GPU memory per rank is at least the size of a single safetensors shard file (for the device buffer), plus additional overhead from clone/empty tensors allocated during broadcast.

For models with small shards (e.g., 2–5 GB per file), this overhead is manageable. However, state-of-the-art open-source models are shipping increasingly large individual shard files:

Model                 Shard size   Params
DeepSeek-V3           ~4 GB        671B
Qwen3.5-397B-A17B     ~8.6 GB      397B
DeepSeek-V4 (base)    ~25 GB       861B
DeepSeek-V4 (fp4)     ~13.9 GB     861B

A 25 GB shard means 25 GB of extra GPU memory just for the file buffer — on top of the memory needed for the model weights themselves. On an 80 GB GPU, this alone consumes 31% of total VRAM, making OOM very likely.
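As a quick back-of-the-envelope check, the sketch below recomputes that fraction. It is purely illustrative: the shard sizes come from the table above, and broadcast_overhead_bytes is an assumed placeholder for the extra clone/empty tensors, not a measured value.

# Rough estimate of peak *extra* GPU memory per rank during loading.
# Illustrative helper only; broadcast_overhead_bytes is an assumption.
GiB = 1 << 30

def extra_vram_fraction(shard_bytes: int, total_vram_bytes: int,
                        broadcast_overhead_bytes: int = 0) -> float:
    """Fraction of VRAM taken by the whole-file device buffer alone."""
    return (shard_bytes + broadcast_overhead_bytes) / total_vram_bytes

# A single ~25 GB shard (DeepSeek-V4 base) on an 80 GB GPU:
print(f"{extra_vram_fraction(25 * GiB, 80 * GiB):.0%}")  # -> 31%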

This effectively narrows fastsafetensors' applicability, even though achieving optimal storage performance does not necessarily require loading an entire file at once.

Root cause: The Loader's core interface semantics operate at file granularity

The root cause of this limitation is not in any particular upper-layer caller (such as PipelineParallel), but in the fact that the two core interfaces of BaseSafeTensorsFileLoader themselves have "entire file" as their semantic granularity:

# fastsafetensors/loader.py

def add_filenames(self, filenames: Dict[int, List[str]]):
    """Register files to ranks to be copied at copy_file_to_device()."""
    # Registers by filename, fully parsing metadata for each file and associating it with a rank
    for rank in filenames.keys():
        realpath = filenames[rank][next_idx]
        metadata = SafeTensorsMetadata.from_file(realpath, self.framework)
        self.meta[realpath] = (metadata, rank)
        ...

def copy_files_to_device(self, ...) -> FilesBufferOnDevice:
    """trigger copying all the files to device buffers."""
    # Iterates over each registered file, creates a copier and allocates a device buffer for the entire file
    for _, (meta, rank) in sorted(self.meta.items(), ...):
        copier = self.copier_constructor(meta, self.device, self.framework)
        factory = LazyTensorFactory(meta, ...)
        factory.submit_io(use_buf_register, max_copy_block_size)
        ...
  • add_filenames: Input is Dict[int, List[str]] (rank → file path list), with granularity at the whole-file level.
  • copy_files_to_device: Iterates over all registered files, creates a CopierInterface instance per file, and submits I/O. Internally, CopierInterface.submit_io allocates a device buffer the size of the entire file (e.g., alloc_tensor_memory(total_length, ...) in NoGdsFileCopier).

Since the Loader's interface semantics are at the file level, all upper-layer callers that depend on this interface — whether PipelineParallel, UnifiedLoader, or user code directly using SafeTensorsFileLoader — can only perform batch partitioning at the "file" granularity, unable to split tensors from a single file across multiple batches.

As an example, PipelineParallel's batching logic is constrained by this to group exactly world_size files per batch:

# fastsafetensors/parallel_loader.py
def _create_batches(self, pg) -> List[List[str]]:
    batch_size = pg.size()  # == world_size
    return [
        self.hf_weights_files[i : i + batch_size]
        for i in range(0, len(self.hf_weights_files), batch_size)
    ]

To enable sub-file batching, the core requirement is to extend or redesign the Loader's interface semantics to support tensor-level (or byte-range-level) registration and loading.
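One possible shape for such an extension is sketched below. It is purely illustrative: TensorRange, add_tensor_ranges, and copy_ranges_to_device are hypothetical names, not part of the current fastsafetensors API; the point is only that registration and buffer allocation move from whole files to byte ranges.

# Hypothetical interface sketch -- NOT the existing fastsafetensors API.
# Idea: register (file, byte-range) slices per rank instead of whole files,
# so no device buffer ever needs to cover more than one sub-batch.
from dataclasses import dataclass
from typing import Dict, List, Protocol

@dataclass(frozen=True)
class TensorRange:
    filename: str            # shard file this slice comes from
    tensor_names: List[str]  # tensors fully contained in the byte range
    offset: int              # start of the range within the file's data section
    length: int              # number of bytes to read and buffer on device

class SubFileLoader(Protocol):
    def add_tensor_ranges(self, ranges: Dict[int, List[TensorRange]]) -> None:
        """Register rank -> byte-range assignments instead of rank -> files."""
        ...

    def copy_ranges_to_device(self, max_batch_bytes: int):
        """Allocate device buffers per registered range, never per whole file,
        so peak extra GPU memory is bounded by max_batch_bytes."""
        ...

A compatibility path could keep add_filenames / copy_files_to_device as thin wrappers that register each file as a single range, so existing callers keep working unchanged.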

Related

  • The existing max_copy_block_size (default 16 GB) controls the maximum single I/O submission size, but does not limit the total device buffer allocation per file — the entire file's buffer is allocated upfront regardless of this setting.
  • queue_size controls the number of concurrent batches in the pipeline, but each batch still loads a full file per rank.

Discussion

The above is an idea based on GPU memory pressure we observed while deploying large model inference in production. We're sharing it to open a discussion and gauge whether this direction has broad value for the community.

This is not a proposal that needs to land in the short term, nor does it have to be completed in a single PR. If the direction is considered valuable, it could be approached incrementally, for example:

  1. Phase 1: Loader interface extension — enable BaseSafeTensorsFileLoader to support tensor-level (or byte-range-level) registration and loading. This is the foundation.
  2. Phase 2: Fixed byte budget sub-batching — implement max_batch_bytes in PipelineParallel, partitioning tensors into sub-batches based on pre-read header metadata (see the sketch after this list).
  3. Phase 3 (optional): Adaptive batching and batch read-order optimization.
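For Phase 2, the sketch below shows one way a single shard could be partitioned without ever loading its tensor data, assuming only the standard safetensors layout (an 8-byte little-endian header length followed by a JSON table with per-tensor data_offsets). max_batch_bytes and both helper functions are hypothetical names, not existing fastsafetensors options.

# Sketch of fixed byte-budget sub-batching for one shard file.
# Assumes only the standard safetensors header layout; names are hypothetical.
import json
import struct
from typing import Dict, List, Tuple

def read_safetensors_header(path: str) -> Dict[str, dict]:
    """Read only the JSON header of a .safetensors file (no tensor data)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional metadata entry, not a tensor
    return header

def make_sub_batches(path: str, max_batch_bytes: int) -> List[List[Tuple[str, int, int]]]:
    """Group a shard's tensors into sub-batches of at most max_batch_bytes.

    Returns lists of (tensor_name, begin, end) byte ranges relative to the
    data section; a single tensor larger than the budget gets its own batch.
    """
    header = read_safetensors_header(path)
    batches: List[List[Tuple[str, int, int]]] = []
    current: List[Tuple[str, int, int]] = []
    current_bytes = 0
    # Walk tensors in file order so each sub-batch reads a contiguous region.
    for name, info in sorted(header.items(), key=lambda kv: kv[1]["data_offsets"][0]):
        begin, end = info["data_offsets"]
        size = end - begin
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append((name, begin, end))
        current_bytes += size
    if current:
        batches.append(current)
    return batches

Each sub-batch could then be registered through a byte-range interface like the one sketched earlier and loaded into a device buffer no larger than max_batch_bytes.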

Questions for discussion

  • Have you encountered similar GPU memory bottlenecks? How large are your typical shard files?
  • For extending the add_filenames / copy_files_to_device interface semantics, are there better approaches?
  • Are there complexity issues or edge cases that might have been overlooked?

Feedback welcome 🙏
