
[FEAT] Windows support through DirectX 12 DirectStorage interop#72

Open
SystemPanic wants to merge 3 commits into foundation-model-stack:main from SystemPanic:fastsafetensors-windows

Conversation


@SystemPanic SystemPanic commented Apr 30, 2026

Windows Support: DirectStorage backend + platform fixes

Summary

This PR adds Windows support (#37) for fastsafetensors, including a DirectStorage-based NVMe→GPU loading path (the Windows equivalent of Linux GDS/cuFile) and fixes for several platform-specific issues that caused build failures, data corruption, and metadata validation errors on Windows.

Tested with vLLM using --load-format fastsafetensors; this support will be included in the upcoming v0.20.0 release of vLLM for Windows.

DirectStorage Backend


On Linux, GDS uses cuFile to DMA data directly from NVMe into GPU memory. Windows has no cuFile — instead, it offers DirectStorage, a DirectX 12 API designed for the same purpose.

Since DirectStorage writes into D3D12 resources (not CUDA buffers), we bridge the two APIs through CUDA external memory interop:

NVMe -> [DirectStorage] -> D3D12 shared buffer -> [cudaImportExternalMemory] -> CUDA device pointer

The key steps are:

  1. Create a D3D12 committed resource with D3D12_HEAP_FLAG_SHARED so it can be exported
  2. DirectStorage reads from NVMe into this D3D12 buffer via IDStorageQueue
  3. Export the D3D12 resource as an NT handle via CreateSharedHandle
  4. Import into CUDA via cudaImportExternalMemory + cudaExternalMemoryGetMappedBuffer to get a regular CUDA device pointer
  5. Synchronize using a D3D12 fence imported as a cudaExternalSemaphore
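
The steps above can be condensed into a call-level sketch. This is illustrative pseudocode only: descriptor setup and error handling are omitted, and the names come from the public D3D12, DirectStorage, and CUDA runtime APIs rather than this PR's exact code.

```
// 1. Shareable D3D12 buffer
device->CreateCommittedResource(heapProps /* D3D12_HEAP_TYPE_DEFAULT */,
                                D3D12_HEAP_FLAG_SHARED, bufferDesc,
                                ..., &resource)

// 2. Enqueue the NVMe read into that buffer, then signal a fence
queue->EnqueueRequest(request /* file -> resource, offset, length */)
queue->EnqueueSignal(fence, fenceValue)
queue->Submit()

// 3. Export the resource as an NT handle
device->CreateSharedHandle(resource, ..., &ntHandle)

// 4. Import into CUDA and map a plain device pointer
handleDesc.type = cudaExternalMemoryHandleTypeD3D12Resource
handleDesc.handle.win32.handle = ntHandle
cudaImportExternalMemory(&extMem, &handleDesc)
cudaExternalMemoryGetMappedBuffer(&devPtr, extMem, &mapDesc)

// 5. Wait on the D3D12 fence from the CUDA side
semDesc.type = cudaExternalSemaphoreHandleTypeD3D12Fence
cudaImportExternalSemaphore(&extSem, &semDesc)
cudaWaitExternalSemaphoresAsync(&extSem, &waitParams /* fenceValue */, 1, stream)
```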

All DirectStorage, D3D12, and DXGI libraries are loaded at runtime via LoadLibrary/GetProcAddress — no link-time SDK dependency on DirectStorage is required.
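
The same "try to load at runtime, fall back gracefully" pattern can be sketched in Python with ctypes (illustrative only; the actual loader lives in the C++ extension, and `load_first_available` is a hypothetical helper, not this PR's code):

```python
import ctypes

def load_first_available(candidates):
    """Try a list of library names in order; return the first handle
    that loads, or None if none are present on this system."""
    for name in candidates:
        try:
            # ctypes.CDLL uses LoadLibrary on Windows and dlopen elsewhere;
            # symbols are then resolved lazily via attribute access,
            # mirroring GetProcAddress.
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None

# E.g. probe for DirectStorage without a link-time dependency:
dstorage = load_first_available(["dstorage.dll"])
```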

Binary file I/O (O_BINARY)

On Windows, os.open() without os.O_BINARY opens files in text mode, which translates \r\n to \n and treats 0x1A (Ctrl-Z) as EOF. This silently corrupted tensor data, causing models to produce garbage output.
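
A minimal cross-platform sketch of the fix (`open_raw` is a hypothetical name for illustration): OR in os.O_BINARY on Windows, and fall back to 0 on POSIX, where the flag does not exist and there is no text mode.

```python
import os

# os.O_BINARY exists only on Windows; on POSIX there is no text-mode
# translation, so the flag can safely default to 0.
O_BINARY = getattr(os, "O_BINARY", 0)

def open_raw(path):
    """Open a file for reading with no newline or 0x1A/EOF translation."""
    return os.open(path, os.O_RDONLY | O_BINARY)
```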

Library name resolution

  • CUDA runtime DLL name (cudart64_12.dll, cudart64_13.dll, etc.) is now resolved at runtime from CUDA_HOME instead of being hardcoded at compile time. load_library_functions() accepts an optional cudart_lib_name parameter.
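
A sketch of how such resolution might look (the function name and exact search rule are assumptions for illustration, not the PR's code):

```python
import glob
import os

def resolve_cudart_name(cuda_home):
    """Pick the versioned CUDA runtime DLL (cudart64_12.dll,
    cudart64_13.dll, ...) under CUDA_HOME/bin instead of hardcoding
    one name at compile time. Newest version wins (lexicographic
    order is sufficient for the 12.x/13.x era)."""
    pattern = os.path.join(cuda_home, "bin", "cudart64_*.dll")
    candidates = sorted(glob.glob(pattern))
    return os.path.basename(candidates[-1]) if candidates else None
```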

Metadata validation

Relaxed the strict start + header_length != size_bytes check to allow trailing padding bytes, which occur with sub-byte dtypes (FP4/NF4) used in quantized models.
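
The relaxed check amounts to replacing an exact-size equality with an upper bound, sketched below (the helper name is hypothetical; only the comparison mirrors the change described):

```python
def header_region_ok(start, header_length, size_bytes):
    """Relaxed metadata validation.

    Strict form rejected any mismatch:
        start + header_length != size_bytes  ->  invalid
    Relaxed form tolerates trailing padding bytes, as produced by
    sub-byte FP4/NF4 dtypes in quantized models, while still
    rejecting truncated files.
    """
    return start + header_length <= size_bytes
```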

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
@takeshi-yoshimura takeshi-yoshimura self-requested a review May 3, 2026 08:42
@takeshi-yoshimura takeshi-yoshimura (Collaborator) commented

@SystemPanic
This is super exciting, and thanks for working on Windows support!

I may be missing something, but it looks like the Python-side copier integration for dstorage_file_reader may be missing.

I see dstorage_file_reader implemented and exported through pybind as a Windows equivalent of gds_file_reader. However, I do not see where it is selected from the Python copier layer. gds.py still appears to use gds_file_reader, and nogds.py uses nogds_file_reader.

Was there supposed to be an additional copier implementation, e.g. copier/win_dstorage.py or copier/dstorage.py, that uses fstcpp.dstorage_file_handle and fstcpp.dstorage_file_reader?

If so, it may be missing from this PR. If not, I think DirectStorage should probably be introduced through such a separate copier backend, rather than being mixed into the existing gds/nogds path.

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
@SystemPanic SystemPanic (Author) commented

@takeshi-yoshimura The original commit was silently falling back to the NoGDS path and not using DirectStorage at all. Since model loading worked and speed was good, I assumed DS was working correctly.

The original implementation (based on a llama.cpp draft that was never merged) was totally wrong, so I had to review the DirectStorage documentation and start from scratch.

I spent the last two days on this, and making it work has been a headache.

Anyway, here it is. Fully working, and 25-30% faster than NoGDS, which is already fast.

The implementation maxes out my WD SN850X; it's even faster than NVMe -> CPU memory with a 64M block size and FILE_FLAG_NO_BUFFERING (no OS cache).

