
[Discussion] Sub-file batching — Decouple GPU memory usage from model file size #71

@ABNER-1


Problem

Current state: Loading granularity is tightly coupled with model file size

Currently, fastsafetensors requires each rank to load an entire safetensors shard file into GPU memory (device buffer) at once, then broadcast the tensors within it to all other ranks before moving on to the next batch.

This means the peak extra GPU memory per rank is at least the size of a single safetensors shard file (for the device buffer), plus additional overhead from clone/empty tensors allocated during broadcast.

For models with small shards (e.g., 2–5 GB per file), this overhead is manageable. However, state-of-the-art open-source models are shipping increasingly large individual shard files:

Model                 Shard size   Params
DeepSeek-V3           ~4 GB        671B
Qwen3.5-397B-A17B     ~8.6 GB      397B
DeepSeek-V4 (base)    ~25 GB       861B
DeepSeek-V4 (fp4)     ~13.9 GB     861B

A 25 GB shard means 25 GB of extra GPU memory just for the file buffer — on top of the memory needed for the model weights themselves. On an 80 GB GPU, this alone consumes 31% of total VRAM, making OOM very likely.
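As a quick back-of-the-envelope check, the sketch below recomputes that fraction. It is purely illustrative: the shard sizes come from the table above, and broadcast_overhead_bytes is an assumed placeholder for the extra clone/empty tensors, not a measured value.

# Rough estimate of peak *extra* GPU memory per rank during loading.
# Illustrative helper only; broadcast_overhead_bytes is an assumption.
GiB = 1 << 30

def extra_vram_fraction(shard_bytes: int, total_vram_bytes: int,
                        broadcast_overhead_bytes: int = 0) -> float:
    """Fraction of VRAM taken by the whole-file device buffer alone."""
    return (shard_bytes + broadcast_overhead_bytes) / total_vram_bytes

# A single ~25 GB shard (DeepSeek-V4 base) on an 80 GB GPU:
print(f"{extra_vram_fraction(25 * GiB, 80 * GiB):.0%}")  # -> 31%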

This effectively narrows fastsafetensors' applicability, even though achieving optimal storage performance does not necessarily require loading an entire file at once.

Root cause: The Loader's core interface semantics operate at file granularity

The root cause of this limitation is not in any particular upper-layer caller (such as PipelineParallel), but in the fact that the two core interfaces of BaseSafeTensorsFileLoader themselves have "entire file" as their semantic granularity:

# fastsafetensors/loader.py

def add_filenames(self, filenames: Dict[int, List[str]]):
    """Register files to ranks to be copied at copy_file_to_device()."""
    # Registers by filename, fully parsing metadata for each file and associating it with a rank
    for rank in filenames.keys():
        realpath = filenames[rank][next_idx]
        metadata = SafeTensorsMetadata.from_file(realpath, self.framework)
        self.meta[realpath] = (metadata, rank)
        ...

def copy_files_to_device(self, ...) -> FilesBufferOnDevice:
    """trigger copying all the files to device buffers."""
    # Iterates over each registered file, creates a copier and allocates a device buffer for the entire file
    for _, (meta, rank) in sorted(self.meta.items(), ...):
        copier = self.copier_constructor(meta, self.device, self.framework)
        factory = LazyTensorFactory(meta, ...)
        factory.submit_io(use_buf_register, max_copy_block_size)
        ...
  • add_filenames: Input is Dict[int, List[str]] (rank → file path list), with granularity at the whole-file level.
  • copy_files_to_device: Iterates over all registered files, creates a CopierInterface instance per file, and submits I/O. Internally, CopierInterface.submit_io allocates a device buffer the size of the entire file (e.g., alloc_tensor_memory(total_length, ...) in NoGdsFileCopier).

Since the Loader's interface semantics are at the file level, all upper-layer callers that depend on this interface — whether PipelineParallel, UnifiedLoader, or user code directly using SafeTensorsFileLoader — can only perform batch partitioning at the "file" granularity, unable to split tensors from a single file across multiple batches.

As an example, PipelineParallel's batching logic is constrained by this to group exactly world_size files per batch:

# fastsafetensors/parallel_loader.py
def _create_batches(self, pg) -> List[List[str]]:
    batch_size = pg.size()  # == world_size
    return [
        self.hf_weights_files[i : i + batch_size]
        for i in range(0, len(self.hf_weights_files), batch_size)
    ]

To enable sub-file batching, the core requirement is to extend or redesign the Loader's interface semantics to support tensor-level (or byte-range-level) registration and loading.
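One possible shape for such an extension is sketched below. It is purely illustrative: TensorRange, add_tensor_ranges, and copy_ranges_to_device are hypothetical names, not part of the current fastsafetensors API; the point is only that registration and buffer allocation move from whole files to byte ranges.

# Hypothetical interface sketch -- NOT the existing fastsafetensors API.
# Idea: register (file, byte-range) slices per rank instead of whole files,
# so no device buffer ever needs to cover more than one sub-batch.
from dataclasses import dataclass
from typing import Dict, List, Protocol

@dataclass(frozen=True)
class TensorRange:
    filename: str            # shard file this slice comes from
    tensor_names: List[str]  # tensors fully contained in the byte range
    offset: int              # start of the range within the file's data section
    length: int              # number of bytes to read and buffer on device

class SubFileLoader(Protocol):
    def add_tensor_ranges(self, ranges: Dict[int, List[TensorRange]]) -> None:
        """Register rank -> byte-range assignments instead of rank -> files."""
        ...

    def copy_ranges_to_device(self, max_batch_bytes: int):
        """Allocate device buffers per registered range, never per whole file,
        so peak extra GPU memory is bounded by max_batch_bytes."""
        ...

A compatibility path could keep add_filenames / copy_files_to_device as thin wrappers that register each file as a single range, so existing callers keep working unchanged.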

Related

  • The existing max_copy_block_size (default 16 GB) controls the maximum single I/O submission size, but does not limit the total device buffer allocation per file — the entire file's buffer is allocated upfront regardless of this setting.
  • queue_size controls the number of concurrent batches in the pipeline, but each batch still loads a full file per rank.

Discussion

The above is an idea based on GPU memory pressure we observed while deploying large model inference in production. We're sharing it to open a discussion and gauge whether this direction has broad value for the community.

This is not a proposal that needs to land in the short term, nor does it have to be completed in a single PR. If the direction is considered valuable, it could be approached incrementally, for example:

  1. Phase 1: Loader interface extension — enable BaseSafeTensorsFileLoader to support tensor-level (or byte-range-level) registration and loading. This is the foundation.
  2. Phase 2: Fixed byte budget sub-batching — implement max_batch_bytes in PipelineParallel, partitioning tensors into sub-batches based on pre-read header metadata (see the sketch after this list).
  3. Phase 3 (optional): Adaptive batching and batch read-order optimization.
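For Phase 2, the sketch below shows one way a single shard could be partitioned without ever loading its tensor data, assuming only the standard safetensors layout (an 8-byte little-endian header length followed by a JSON table with per-tensor data_offsets). max_batch_bytes and both helper functions are hypothetical names, not existing fastsafetensors options.

# Sketch of fixed byte-budget sub-batching for one shard file.
# Assumes only the standard safetensors header layout; names are hypothetical.
import json
import struct
from typing import Dict, List, Tuple

def read_safetensors_header(path: str) -> Dict[str, dict]:
    """Read only the JSON header of a .safetensors file (no tensor data)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional metadata entry, not a tensor
    return header

def make_sub_batches(path: str, max_batch_bytes: int) -> List[List[Tuple[str, int, int]]]:
    """Group a shard's tensors into sub-batches of at most max_batch_bytes.

    Returns lists of (tensor_name, begin, end) byte ranges relative to the
    data section; a single tensor larger than the budget gets its own batch.
    """
    header = read_safetensors_header(path)
    batches: List[List[Tuple[str, int, int]]] = []
    current: List[Tuple[str, int, int]] = []
    current_bytes = 0
    # Walk tensors in file order so each sub-batch reads a contiguous region.
    for name, info in sorted(header.items(), key=lambda kv: kv[1]["data_offsets"][0]):
        begin, end = info["data_offsets"]
        size = end - begin
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append((name, begin, end))
        current_bytes += size
    if current:
        batches.append(current)
    return batches

Each sub-batch could then be registered through a byte-range interface like the one sketched earlier and loaded into a device buffer no larger than max_batch_bytes.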

Questions for discussion

  • Have you encountered similar GPU memory bottlenecks? How large are your typical shard files?
  • For extending the add_filenames / copy_files_to_device interface semantics, are there better approaches?
  • Are there complexity issues or edge cases that might have been overlooked?

Feedback welcome 🙏
