
[FEAT] Windows support through DirectX 12 DirectStorage interop#72

Open
SystemPanic wants to merge 3 commits into foundation-model-stack:main from SystemPanic:fastsafetensors-windows

Conversation


@SystemPanic SystemPanic commented Apr 30, 2026

Windows Support: DirectStorage backend + platform fixes

Summary

This PR adds Windows support (#37) for fastsafetensors, including a DirectStorage-based NVMe→GPU loading path (the Windows equivalent of Linux GDS/cuFile) and fixes for several platform-specific issues that caused build failures, data corruption, and metadata validation errors on Windows.

Tested with vLLM using --load-format fastsafetensors; this support will be included in the upcoming v0.20.0 release of vLLM for Windows.

DirectStorage Backend


On Linux, GDS uses cuFile to DMA data directly from NVMe into GPU memory. Windows has no cuFile — instead, it offers DirectStorage, a DirectX 12 API designed for the same purpose.

Since DirectStorage writes into D3D12 resources (not CUDA buffers), we bridge the two APIs through CUDA external memory interop:

NVMe -> [DirectStorage] -> D3D12 shared buffer -> [cudaImportExternalMemory] -> CUDA device pointer

The key steps are:

  1. Create a D3D12 committed resource with D3D12_HEAP_FLAG_SHARED so it can be exported
  2. DirectStorage reads from NVMe into this D3D12 buffer via IDStorageQueue
  3. Export the D3D12 resource as an NT handle via CreateSharedHandle
  4. Import into CUDA via cudaImportExternalMemory + cudaExternalMemoryGetMappedBuffer to get a regular CUDA device pointer
  5. Synchronize using a D3D12 fence imported as a cudaExternalSemaphore
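
The steps above can be condensed into a call-level sketch. This is illustrative pseudocode only: descriptor setup and error handling are omitted, and the names come from the public D3D12, DirectStorage, and CUDA runtime APIs rather than this PR's exact code.

```
// 1. Shareable D3D12 buffer
device->CreateCommittedResource(heapProps /* D3D12_HEAP_TYPE_DEFAULT */,
                                D3D12_HEAP_FLAG_SHARED, bufferDesc,
                                ..., &resource)

// 2. Enqueue the NVMe read into that buffer, then signal a fence
queue->EnqueueRequest(request /* file -> resource, offset, length */)
queue->EnqueueSignal(fence, fenceValue)
queue->Submit()

// 3. Export the resource as an NT handle
device->CreateSharedHandle(resource, ..., &ntHandle)

// 4. Import into CUDA and map a plain device pointer
handleDesc.type = cudaExternalMemoryHandleTypeD3D12Resource
handleDesc.handle.win32.handle = ntHandle
cudaImportExternalMemory(&extMem, &handleDesc)
cudaExternalMemoryGetMappedBuffer(&devPtr, extMem, &mapDesc)

// 5. Wait on the D3D12 fence from the CUDA side
semDesc.type = cudaExternalSemaphoreHandleTypeD3D12Fence
cudaImportExternalSemaphore(&extSem, &semDesc)
cudaWaitExternalSemaphoresAsync(&extSem, &waitParams /* fenceValue */, 1, stream)
```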

All DirectStorage, D3D12, and DXGI libraries are loaded at runtime via LoadLibrary/GetProcAddress — no link-time SDK dependency on DirectStorage is required.
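
The same "try to load at runtime, fall back gracefully" pattern can be sketched in Python with ctypes (illustrative only; the actual loader lives in the C++ extension, and `load_first_available` is a hypothetical helper, not this PR's code):

```python
import ctypes

def load_first_available(candidates):
    """Try a list of library names in order; return the first handle
    that loads, or None if none are present on this system."""
    for name in candidates:
        try:
            # ctypes.CDLL uses LoadLibrary on Windows and dlopen elsewhere;
            # symbols are then resolved lazily via attribute access,
            # mirroring GetProcAddress.
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None

# E.g. probe for DirectStorage without a link-time dependency:
dstorage = load_first_available(["dstorage.dll"])
```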

Binary file I/O (O_BINARY)

On Windows, os.open() without os.O_BINARY opens files in text mode, which translates \r\n to \n and treats 0x1A (Ctrl-Z) as EOF. This silently corrupted tensor data, causing models to produce garbage output.
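
A minimal cross-platform sketch of the fix (`open_raw` is a hypothetical name for illustration): OR in os.O_BINARY on Windows, and fall back to 0 on POSIX, where the flag does not exist and there is no text mode.

```python
import os

# os.O_BINARY exists only on Windows; on POSIX there is no text-mode
# translation, so the flag can safely default to 0.
O_BINARY = getattr(os, "O_BINARY", 0)

def open_raw(path):
    """Open a file for reading with no newline or 0x1A/EOF translation."""
    return os.open(path, os.O_RDONLY | O_BINARY)
```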

Library name resolution

  • CUDA runtime DLL name (cudart64_12.dll, cudart64_13.dll, etc.) is now resolved at runtime from CUDA_HOME instead of being hardcoded at compile time. load_library_functions() accepts an optional cudart_lib_name parameter.
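
A sketch of how such resolution might look (the function name and exact search rule are assumptions for illustration, not the PR's code):

```python
import glob
import os

def resolve_cudart_name(cuda_home):
    """Pick the versioned CUDA runtime DLL (cudart64_12.dll,
    cudart64_13.dll, ...) under CUDA_HOME/bin instead of hardcoding
    one name at compile time. Newest version wins (lexicographic
    order is sufficient for the 12.x/13.x era)."""
    pattern = os.path.join(cuda_home, "bin", "cudart64_*.dll")
    candidates = sorted(glob.glob(pattern))
    return os.path.basename(candidates[-1]) if candidates else None
```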

Metadata validation

Relaxed the strict start + header_length != size_bytes check to allow trailing padding bytes, which occur with sub-byte dtypes (FP4/NF4) used in quantized models.
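
The relaxed check amounts to replacing an exact-size equality with an upper bound, sketched below (the helper name is hypothetical; only the comparison mirrors the change described):

```python
def header_region_ok(start, header_length, size_bytes):
    """Relaxed metadata validation.

    Strict form rejected any mismatch:
        start + header_length != size_bytes  ->  invalid
    Relaxed form tolerates trailing padding bytes, as produced by
    sub-byte FP4/NF4 dtypes in quantized models, while still
    rejecting truncated files.
    """
    return start + header_length <= size_bytes
```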

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
@takeshi-yoshimura takeshi-yoshimura self-requested a review May 3, 2026 08:42
@takeshi-yoshimura takeshi-yoshimura (Collaborator) commented

@SystemPanic
This is super exciting, and thanks for working on Windows support!

I may be missing something, but it looks like the Python-side copier integration for dstorage_file_reader may be missing.

I see dstorage_file_reader implemented and exported through pybind as a Windows equivalent of gds_file_reader. However, I do not see where it is selected from the Python copier layer. gds.py still appears to use gds_file_reader, and nogds.py uses nogds_file_reader.

Was there supposed to be an additional copier implementation, e.g. copier/win_dstorage.py or copier/dstorage.py, that uses fstcpp.dstorage_file_handle and fstcpp.dstorage_file_reader?

If so, it may be missing from this PR. If not, I think DirectStorage should probably be introduced through such a separate copier backend, rather than being mixed into the existing gds/nogds path.

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
@SystemPanic SystemPanic (Author) commented

@takeshi-yoshimura The original commit was silently falling back to the NoGDS path and not using DirectStorage at all. Since model loading worked and speed was good, I assumed DS was working correctly.

The original implementation (based on a llama.cpp draft that was never merged) was totally wrong, so I had to review the DirectStorage documentation and start from scratch.

I spent the last two days on this, and making it work has been a headache.

Anyway, here it is. Fully working, and 25-30% faster than NoGDS, which is already fast.

The implementation maxes out my WD SN850X; it's even faster than NVMe -> CPU memory with a 64M block size and FILE_FLAG_NO_BUFFERING (no OS cache).

