[FEAT] Windows support through DirectX 12 DirectStorage interop #72
SystemPanic wants to merge 3 commits into foundation-model-stack:main
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
@SystemPanic I may be missing something, but it looks like the Python-side copier integration for dstorage_file_reader may be missing. I see dstorage_file_reader implemented and exported through pybind as a Windows equivalent of gds_file_reader. However, I do not see where it is selected from the Python copier layer. gds.py still appears to use gds_file_reader, and nogds.py uses nogds_file_reader.
Was there supposed to be an additional copier implementation, e.g. copier/win_dstorage.py or copier/dstorage.py, that uses fstcpp.dstorage_file_handle and fstcpp.dstorage_file_reader? If so, it may be missing from this PR. If not, I think DirectStorage should probably be introduced through such a separate copier backend, rather than being mixed into the existing gds/nogds path.
@takeshi-yoshimura The original commit was silently falling back to the NoGDS path and not using DirectStorage at all. Since model loading was working and speed was good, I thought DS was working correctly. The original implementation (based on llama.cpp; it was a draft, never merged) was totally wrong, so I needed to review the DS documentation and start from scratch. I spent the last two days on this, and making it work has been a headache. Anyway, here it is. Fully working, 25-30% faster than NoGDS, which is already fast. The implementation maxes out my WD SN850X; it's even faster than NVMe -> CPU memory with block size 64M and FILE_FLAG_NO_BUFFERING (no OS cache).
Windows Support: DirectStorage backend + platform fixes
Summary
This PR adds Windows support (#37) for fastsafetensors, including a DirectStorage-based NVMe→GPU loading path (the Windows equivalent of Linux GDS/cuFile) and fixes for several platform-specific issues that caused build failures, data corruption, and metadata validation errors on Windows.
Tested on vLLM with --load-format fastsafetensors; it will be included in the new v0.20.0 release of vLLM for Windows.
DirectStorage Backend
On Linux, GDS uses cuFile to DMA data directly from NVMe into GPU memory. Windows has no cuFile — instead, it offers DirectStorage, a DirectX 12 API designed for the same purpose.
Since DirectStorage writes into D3D12 resources (not CUDA buffers), we bridge the two APIs through CUDA external memory interop:
The key steps are:
1. Create the D3D12 resource with D3D12_HEAP_FLAG_SHARED so it can be exported
2. Read the file into that resource through an IDStorageQueue
3. Export the resource with CreateSharedHandle
4. Import it with cudaImportExternalMemory + cudaExternalMemoryGetMappedBuffer to get a regular CUDA device pointer
5. Synchronize the two APIs through cudaExternalSemaphore
All DirectStorage, D3D12, and DXGI libraries are loaded at runtime via LoadLibrary/GetProcAddress, so no link-time SDK dependency on DirectStorage is required.
Binary file I/O (O_BINARY)
On Windows, os.open() without os.O_BINARY opens files in text mode, which translates \r\n → \n and treats 0x1A as EOF. This silently corrupted tensor data, causing models to produce garbage output.
Library name resolution
The CUDA runtime library name (cudart64_12.dll, cudart64_13.dll, etc.) is now resolved at runtime from CUDA_HOME instead of being hardcoded at compile time. load_library_functions() accepts an optional cudart_lib_name parameter.
Metadata validation
Relaxed the strict start + header_length != size_bytes check to allow trailing padding bytes, which occur with sub-byte dtypes (FP4/NF4) used in quantized models.
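For reference, a minimal sketch of what the relaxed metadata check could look like. This is not the code from this PR. The function name trailing_padding is hypothetical, and the sketch assumes start is the end offset of the last tensor within the data section, header_length is the full header size including its length prefix, and size_bytes is the total file size.

```python
def trailing_padding(start: int, header_length: int, size_bytes: int) -> int:
    """Return the number of padding bytes after the last tensor.

    Assumed meanings (not confirmed by the PR text):
      start         - end offset of the last tensor inside the data section
      header_length - bytes taken by the header, including its length prefix
      size_bytes    - total size of the file on disk
    """
    end_of_data = start + header_length
    # The strict check rejected any file where end_of_data != size_bytes.
    # The relaxed form only rejects files whose tensors overrun the file.
    if end_of_data > size_bytes:
        raise ValueError(
            f"tensor data ends at byte {end_of_data}, "
            f"but the file has only {size_bytes} bytes"
        )
    return size_bytes - end_of_data
```

With this shape, a file whose data ends exactly at size_bytes reports 0 padding bytes, a file with trailing padding reports the padding length, and only a genuine overrun is treated as invalid.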
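The binary file I/O fix described above can be illustrated the same way. The helper below is a hypothetical sketch, not the project's actual code: it adds os.O_BINARY when the platform defines it (the flag exists only on Windows) and otherwise falls back to a no-op value of 0, so the same call works on Linux.

```python
import os


def read_file_binary(path: str) -> bytes:
    """Read a whole file through os.open() without newline or EOF translation."""
    # os.O_BINARY is only defined on Windows; getattr keeps this portable.
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 1 << 20)  # read up to 1 MiB per call
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

A quick way to confirm the behavior is to write a payload containing \r\n and 0x1A bytes, read it back with this helper, and compare the result byte for byte with what was written.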