feat: Add mock servers for DGX H100, H200, and B200 systems (#163)
fabiendupont wants to merge 6 commits into NVIDIA:main from
Conversation
Pull Request Overview
This PR adds comprehensive mock implementations for three additional NVIDIA GPU architectures (H100, H200, and B200) to support multi-architecture testing without requiring physical hardware.
- Implements complete mock servers for DGX H100, H200, and B200 systems with 8-GPU configurations
- Provides architecture-specific MIG profiles with realistic memory allocations and compute capabilities
- Adds fabric management integration and enhanced topology features for newer GPU architectures
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/nvml/mock/dgxh100/dgxh100.go | Main mock server implementation for H100 with Hopper architecture (80GB memory, 9.0 compute capability) |
| pkg/nvml/mock/dgxh100/mig-profile.go | H100-specific MIG profiles with enhanced memory allocations compared to A100 |
| pkg/nvml/mock/dgxh200/dgxh200.go | H200 mock server with enhanced memory (141GB) and improved fabric capabilities |
| pkg/nvml/mock/dgxh200/mig-profile.go | H200 MIG profiles with 76% more memory than H100 profiles |
| pkg/nvml/mock/dgxb200/dgxb200.go | B200 mock server with Blackwell architecture (192GB memory, 10.0 compute capability) |
| pkg/nvml/mock/dgxb200/mig-profile.go | B200 MIG profiles with massive memory allocations for next-generation AI workloads |
Force-pushed from e892f1c to 9252b27
ArangoGutierrez
left a comment
First pass looks good, I'll run some manual tests and provide feedback
Force-pushed from 9252b27 to 697c1d4
	return device
}

func (d *Device) GetUUID() (string, nvml.Return) {
Why do we not set the mock functions as we do for the a100?
Agree @fabiendupont, let's do as https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/mock/dgxa100/dgxa100.go#L195 and https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/mock/dgxa100/dgxa100.go#L136 for the three new packages (B200, H100, and H200).
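For context, the dgxa100 pattern those links point to looks roughly like this (a simplified, self-contained sketch, not the actual go-nvml code; `nvml.Return` is replaced by `error` and all names are stand-ins for the real mock types): the mock `Device` exposes overridable func fields, and `SetMockFuncs` installs defaults that answer from the device's own state.

```go
package main

import "fmt"

// Device is a simplified stand-in for the mock device type: callers can
// override any *Func field in a test, but defaults are wired up front.
type Device struct {
	UUID        string
	Name        string
	GetUUIDFunc func() (string, error)
	GetNameFunc func() (string, error)
}

// SetMockFuncs installs default implementations backed by the struct fields,
// mirroring the dgxa100 approach of wiring mocks at construction time.
func (d *Device) SetMockFuncs() {
	d.GetUUIDFunc = func() (string, error) { return d.UUID, nil }
	d.GetNameFunc = func() (string, error) { return d.Name, nil }
}

func main() {
	d := &Device{UUID: "GPU-0", Name: "Mock NVIDIA H100 80GB HBM3"}
	d.SetMockFuncs()
	name, _ := d.GetNameFunc()
	fmt.Println(name)
}
```

A test can then replace a single func field to simulate errors without touching the rest of the device.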
	CudaDriverVersion int
}

type Device struct {
Question: Why do we need a new type. Are the properties not expected to be the same across different devices?
I think @fabiendupont's intent was to create a folder for each GPU type (given that the current A100 implementation is a folder named a100). If we were to add more GPU types, then we should rename that folder to a generic name; that way we don't need to duplicate this much code, simply a file per GPU type and MIG profile, reusing a core set of defined variables.
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
	MigMode:            nvml.DEVICE_MIG_ENABLE,
	GpuInstances:       make(map[*GpuInstance]struct{}),
	GpuInstanceCounter: 0,
	MemoryInfo:         nvml.Memory{Total: 151397597184}, // 141GB for H200 (vs 80GB for H100)
The literal 151397597184 is in fact exactly 141 × 1024³ bytes, so the value is correct; writing it as a calculation (141 * 1024 * 1024 * 1024) would make that explicit.
Suggested change:
-	MemoryInfo: nvml.Memory{Total: 151397597184}, // 141GB for H200 (vs 80GB for H100)
+	MemoryInfo: nvml.Memory{Total: 141 * 1024 * 1024 * 1024}, // 141GB for H200 (vs 80GB for H100)
	MigMode:            nvml.DEVICE_MIG_ENABLE,
	GpuInstances:       make(map[*GpuInstance]struct{}),
	GpuInstanceCounter: 0,
	MemoryInfo:         nvml.Memory{Total: 206158430208}, // 192GB for B200 (massive memory)
The value 206158430208 is exactly 192 × 1024³ bytes, so it is correct; it should still be written as a calculation for consistency with the other memory totals.
Suggested change:
-	MemoryInfo: nvml.Memory{Total: 206158430208}, // 192GB for B200 (massive memory)
+	MemoryInfo: nvml.Memory{Total: 192 * 1024 * 1024 * 1024}, // 192GB for B200 (massive memory)
	MigMode:            nvml.DEVICE_MIG_ENABLE,
	GpuInstances:       make(map[*GpuInstance]struct{}),
	GpuInstanceCounter: 0,
	MemoryInfo:         nvml.Memory{Total: 85899345920}, // 80GB for H100
The literal 85899345920 is exactly 80 × 1024³ bytes; writing it as a calculation makes the intended 80GB explicit.
Suggested change:
-	MemoryInfo: nvml.Memory{Total: 85899345920}, // 80GB for H100
+	MemoryInfo: nvml.Memory{Total: 80 * 1024 * 1024 * 1024}, // 80GB for H100
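If the literals are kept as calculations, a single shared constant makes every total self-documenting and easy to verify. A minimal runnable sketch (the constant name `GiB` is illustrative; the PR itself later introduces a `GBtoMB` constant instead):

```go
package main

import "fmt"

// GiB expresses memory totals as a calculation rather than a raw byte
// literal, so the intended capacity is obvious at a glance.
const GiB uint64 = 1024 * 1024 * 1024

// totalBytes converts a GiB capacity to the byte value NVML reports.
func totalBytes(gib uint64) uint64 { return gib * GiB }

func main() {
	fmt.Println(totalBytes(80))  // H100: 85899345920
	fmt.Println(totalBytes(141)) // H200: 151397597184
	fmt.Println(totalBytes(192)) // B200: 206158430208
}
```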
Force-pushed from 697c1d4 to 51f3c75
Added mock tests to provide basic regression testing.
Pull Request Overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
	},
	nvml.GPU_INSTANCE_PROFILE_1_SLICE_REV1: {
		nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE: {
			Id: nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE_REV1,
The compute instance profile ID should be nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE not nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE_REV1. The REV1 suffix applies to GPU instance profiles, not compute instance profiles.
Suggested change:
-	Id: nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE_REV1,
+	Id: nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE,
Force-pushed from 51f3c75 to 30e5dae
elezar
left a comment
Some quick thoughts on this. I will make a more critical pass later this week.
	require.Implements(t, (*nvml.Interface)(nil), server)
	require.Implements(t, (*nvml.ExtendedInterface)(nil), server)
We can test these at compile time with something like:
	var _ nvml.Interface = (*server)(nil)
	var _ nvml.ExtendedInterface = (*server)(nil)
    Add new configurations to the appropriate file in `shared/gpus/`:
    ```go
    var A100_PCIE_24GB = shared.Config{
There was a problem hiding this comment.
nit: Do we want a more Go-like name?
    ```
    pkg/nvml/mock/
    ├── shared/
Should shared be internal? Do we expect that users interact with this package?
Moved it to internal. Users still have an external API through package-specific wrappers.
	PciDeviceId: config.PciDeviceId,
	MIGProfiles: config.MIGProfiles,
}
device.SetMockFuncs()
Note: Since a device is only functional after setting these mocks, should we be exposing it as a public concrete type?
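One way to address this, sketched with simplified stand-in types (not the PR's actual `shared.Device`): keep the wiring step unexported and expose only a constructor, so a caller can never observe a device without its mock funcs set.

```go
package main

import "fmt"

// Device is a simplified stand-in; only the constructor can produce one,
// so every Device in circulation is fully wired.
type Device struct {
	uuid        string
	GetUUIDFunc func() (string, error)
}

// NewDevice returns a ready-to-use device; setMockFuncs stays unexported,
// removing the "functional only after SetMockFuncs" footgun.
func NewDevice(uuid string) *Device {
	d := &Device{uuid: uuid}
	d.setMockFuncs()
	return d
}

func (d *Device) setMockFuncs() {
	d.GetUUIDFunc = func() (string, error) { return d.uuid, nil }
}

func main() {
	d := NewDevice("GPU-00000000-0000-0000-0000-000000000000")
	uuid, _ := d.GetUUIDFunc()
	fmt.Println(uuid)
}
```

The trade-off is that tests can still override exported func fields, but can no longer construct a half-initialized value.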
type Server = shared.Server
type Device = shared.Device
type GpuInstance = shared.GpuInstance
type ComputeInstance = shared.ComputeInstance
type CudaComputeCapability = shared.CudaComputeCapability
Since this is a new package, we probably don't need backward compatible types.
func New() *Server {
	return shared.NewServerFromConfig(shared.ServerConfig{
		Config:   gpus.H100_SXM5_80GB,
		GPUCount: 8,
Question: Instead of specifying the count, why not pass in a list of devices? This will also make it simpler to handle heterogenous configs.
Great point. Modified to passing a list and added test for heterogeneous GPUs.
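The resulting API, sketched with simplified stand-ins for `shared.Config` and `shared.Server` (the real signatures may differ): passing a list makes the GPU count implicit and heterogeneous systems need no special casing.

```go
package main

import "fmt"

// Config is a simplified stand-in for a per-GPU configuration.
type Config struct{ Name string }

// Server holds one entry per GPU; no fixed-size array, no count field.
type Server struct{ Devices []Config }

// NewServerWithGPUs accepts any mix of GPU configs.
func NewServerWithGPUs(gpus ...Config) *Server {
	return &Server{Devices: append([]Config(nil), gpus...)}
}

func main() {
	h100 := Config{Name: "Mock NVIDIA H100 80GB HBM3"}
	h200 := Config{Name: "Mock NVIDIA H200"}
	s := NewServerWithGPUs(h100, h100, h200) // mixed GPU types in one server
	fmt.Println(len(s.Devices))
}
```

A homogeneous 8-GPU server is then just the same config repeated eight times, which the old `(config, count)` form can delegate to.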
elezar
left a comment
Sorry for the delay here. I am revisiting this due to my wanting to add a specific mock for something that I'm working on.
Looking at the organisation, I think the specific GPU implementations are useful, as is the scaffolding to create a server with a specific list of GPUs or a count of a single GPU type.
In my case, I want to mock a Tegra-based system which includes an iGPU and a dGPU. Here I would imagine instantiating this as:
NewServer(WithGPUs(gpus.IGX_THOR, gpus.RTX_PRO_6000))
Let's assume that I don't care about the driver version and/or the CUDA version in this case.
With this in mind, I have some proposed changes under https://github.com/elezar/go-nvml/tree/feat/add-dgxh100-dgxh200-dgxb200-mocks
I also have a branch, https://github.com/elezar/go-nvml/tree/add-tegra-mocks, that adds a Tegra (Thor) device.
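The `WithGPUs` call above maps naturally onto Go's functional-options pattern. A hypothetical sketch (option names, types, and defaults are illustrative, not the actual go-nvml mock API): unspecified settings such as the driver version fall back to defaults, which matches the "I don't care about the driver version" case.

```go
package main

import "fmt"

// GPU is a simplified stand-in for a GPU configuration.
type GPU struct{ Name string }

type server struct {
	gpus          []GPU
	driverVersion string
}

// Option mutates the server under construction.
type Option func(*server)

// WithGPUs appends any mix of GPU configs.
func WithGPUs(gpus ...GPU) Option {
	return func(s *server) { s.gpus = append(s.gpus, gpus...) }
}

// WithDriverVersion overrides the default driver version.
func WithDriverVersion(v string) Option {
	return func(s *server) { s.driverVersion = v }
}

// NewServer applies options over sensible defaults, so callers only
// specify what they care about.
func NewServer(opts ...Option) *server {
	s := &server{driverVersion: "999.99"} // illustrative default
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	s := NewServer(WithGPUs(GPU{"IGX_THOR"}, GPU{"RTX_PRO_6000"}))
	fmt.Println(len(s.gpus), s.driverVersion)
}
```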
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 13 comments.
}

gi.GetComputeInstancePossiblePlacementsFunc = func(info *nvml.ComputeInstanceProfileInfo) ([]nvml.ComputeInstancePlacement, nvml.Return) {
	return gi.MIGProfiles.ComputeInstancePlacements[int(gi.Info.Id)][int(info.Id)], nvml.SUCCESS
	d.MigMode = mode
	return nvml.SUCCESS, nvml.SUCCESS
}

d.GetMigModeFunc = func() (int, int, nvml.Return) {
	return d.MigMode, d.MigMode, nvml.SUCCESS
nvml.GPU_INSTANCE_PROFILE_1_SLICE_REV1: {
	nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE: {
		Id:                  nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE_REV1,
		SliceCount:          1,
		InstanceCount:       1,
		MultiprocessorCount: 14,
	},
var a100_ComputeInstanceProfiles = map[int]map[int]nvml.ComputeInstanceProfileInfo{
	nvml.GPU_INSTANCE_PROFILE_1_SLICE: {
		nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE: {
			Id:                  nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE,
			SliceCount:          1,
			InstanceCount:       1,
			MultiprocessorCount: 16,
		},
	},
	nvml.GPU_INSTANCE_PROFILE_2_SLICE: {
		nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE: {
			Id:                  nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE,
			SliceCount:          1,
			InstanceCount:       2,
			MultiprocessorCount: 16,
		},
		nvml.COMPUTE_INSTANCE_PROFILE_2_SLICE: {
			Id: nvml.COMPUTE_INSTANCE_PROFILE_2_SLICE,
var a100_GpuInstancePlacements = map[int][]nvml.GpuInstancePlacement{
	nvml.GPU_INSTANCE_PROFILE_1_SLICE: {
		{Start: 0, Size: 1},
		{Start: 1, Size: 1},
		{Start: 2, Size: 1},
		{Start: 3, Size: 1},
		{Start: 4, Size: 1},
		{Start: 5, Size: 1},
		{Start: 6, Size: 1},
	},
	nvml.GPU_INSTANCE_PROFILE_2_SLICE: {
		{Start: 0, Size: 2},
		{Start: 2, Size: 2},
		{Start: 4, Size: 2},
	},
	nvml.GPU_INSTANCE_PROFILE_3_SLICE: {
		{Start: 0, Size: 3},
		{Start: 4, Size: 3},
	},
	nvml.GPU_INSTANCE_PROFILE_4_SLICE: {
		{Start: 0, Size: 4},
	},
	nvml.GPU_INSTANCE_PROFILE_7_SLICE: {
		{Start: 0, Size: 8}, // Test expects Size 8
	},
}
	},
	nvml.GPU_INSTANCE_PROFILE_2_SLICE: {
		{Start: 0, Size: 2},
		{Start: 2, Size: 2},
		{Start: 4, Size: 2},
	},
	nvml.GPU_INSTANCE_PROFILE_3_SLICE: {
		{Start: 0, Size: 3},
		{Start: 4, Size: 3},
	},
	nvml.GPU_INSTANCE_PROFILE_4_SLICE: {
		{Start: 0, Size: 4},
	},
	nvml.GPU_INSTANCE_PROFILE_7_SLICE: {
		{Start: 0, Size: 7},
	},
```go
import (
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100"
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxh100"
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxh200"
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxb200"
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/internal/shared/gpus"
)
```
- All existing `dgxa100.New()`, `dgxh100.New()`, `dgxh200.New()`, `dgxb200.New()` calls continue to work unchanged
- Legacy global variables (`MIGProfiles`, `MIGPlacements`) are preserved for all generations
- Device names maintain the "Mock" prefix for test compatibility
├── shared/
│   ├── shared.go       # Core shared factory and types
│   └── gpus/           # GPU configuration definitions
│       ├── a100.go     # A100 GPU variants (Ampere)
│       ├── a30.go      # A30 GPU variants (Ampere)
│       ├── h100.go     # H100 GPU variants (Hopper)
│       ├── h200.go     # H200 GPU variants (Hopper)
│       └── b200.go     # B200 GPU variants (Blackwell)
var a100_ComputeInstanceProfiles = map[int]map[int]nvml.ComputeInstanceProfileInfo{
	nvml.GPU_INSTANCE_PROFILE_1_SLICE: {
		nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE: {
			Id:                  nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE,
			SliceCount:          1,
			InstanceCount:       1,
			MultiprocessorCount: 16,
		},
	},
	nvml.GPU_INSTANCE_PROFILE_2_SLICE: {
Establish baseline tests for the original A100 implementation before refactoring to shared architecture. This enables a proper TDD approach where we can detect regressions during refactoring.

Tests cover:
- Server creation and basic properties
- Device handling and indexing (8 devices)
- Device properties (name, architecture, memory, etc.)
- Device access by UUID and PCI bus ID
- MIG mode operations
- MIG profile configurations and access
- GPU instance lifecycle and placements
- Compute instance lifecycle
- Init/shutdown behavior
- Multiple device uniqueness
- A100-specific characteristics

All tests pass with the existing implementation at fd3e42f.

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
- Implement shared factory system in pkg/nvml/mock/shared/ to eliminate code duplication
- Add comprehensive A30 GPU configurations with MIG profiles (56 SMs, 1/2/4-slice support)
- Refactor dgxa100 to use shared factory while maintaining backward compatibility
- Create modular GPU configurations in shared/gpus/ for A100 and A30 families
- Add comprehensive documentation covering architecture and usage examples
- Maintain thread safety and proper NVML return codes
- Support all A100 variants (SXM4 40GB/80GB, PCIe 40GB/80GB) and A30 PCIe 24GB

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements DGX H100 and H200 GPU mocks following the established shared factory pattern for consistency with existing A100/A30 implementations.

- Add H100 SXM5 80GB configuration with complete MIG profile support
- Add H200 SXM5 141GB configuration with complete MIG profile support
- Implement dgxh100 and dgxh200 packages using shared factory pattern
- Include all 7 MIG profiles (standard, REV1 media, REV2 double memory)
- Maintain full backward compatibility with legacy globals and type aliases
- Use NVIDIA-spec compliant memory allocations and SM distributions

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements DGX B200 mock following the established shared factory pattern:

- Add B200 SXM5 180GB GPU configuration with Blackwell architecture
- Comprehensive MIG profiles matching NVIDIA specifications:
  * Memory allocations: 23GB, 45GB, 90GB, 180GB per NVIDIA MIG User Guide
  * REV1 (media extensions) and REV2 (expanded memory) profiles
  * Full P2P support in MIG mode (IsP2pSupported: 1)
  * 144 SMs total with 18 SMs per slice
- Complete DGX B200 server implementation with 8 GPUs
- Driver version 560.28.03, NVML 12.560.28.03, CUDA 12060
- Comprehensive test suite covering server, device, and MIG operations
- Backward compatible legacy global variables (MIGProfiles, MIGPlacements)

Memory values corrected from initial 192GB to 180GB based on official NVIDIA MIG User Guide specifications.

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
…ntation

Updates the mock framework documentation to include all GPU generations:
- Add H100, H200, and B200 to architecture diagram and file structure
- Document all GPU specifications:
  * H100 SXM5 80GB (Hopper, 132 SMs, CUDA 9.0, P2P MIG support)
  * H200 SXM5 141GB (Hopper, 132 SMs, CUDA 9.0, P2P MIG support)
  * B200 SXM5 180GB (Blackwell, 144 SMs, CUDA 10.0, P2P MIG support)
- Add complete server model documentation with driver versions
- Expand MIG support section with detailed SM allocations and P2P capabilities
- Update usage examples to include all four GPU generations
- Add comprehensive testing instructions for all mock implementations
- Update backward compatibility section to reflect all generations

The documentation now accurately reflects the complete shared factory implementation with comprehensive coverage of Ampere, Hopper, and Blackwell GPU architectures.

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
This commit addresses feedback from PR review #3356107365 by @elezar, implementing architectural improvements and code quality enhancements.

- Added compile-time interface assertions to all test files
  * `var _ nvml.Interface = (*shared.Server)(nil)` pattern
  * Catches interface compliance issues at compile time vs runtime
- Added missing trailing newline to dgxa100_test.go
- Removed unnecessary type aliases from new packages (dgxh100, dgxh200, dgxb200)
  * New packages now use shared.Server directly instead of local aliases
- Moved pkg/nvml/mock/shared → pkg/nvml/mock/internal/shared
  * Hides implementation details from external users
  * Public API remains unchanged through package-specific wrappers
- Reduced field duplication (Name, Brand, Architecture, etc.)
  * Clearer relationship between configuration and device
  * Fields now accessed via d.Config.Name, d.Config.Brand, etc.
  * Removed 5+ duplicate fields from Device struct
- Changed `Devices [8]nvml.Device` → `Devices []nvml.Device`
  * Removes hardcoded 8-device limit
  * Enables flexible GPU counts and heterogeneous configurations
- Added `const GBtoMB = 1024` for clear memory calculations
  * Improves readability: `24 * GBtoMB` vs `24576`
- Implemented `NewServerWithGPUs(...Config)` with variadic parameters
  * Supports mixed GPU types in a single server
  * Example: `NewServerWithGPUs(gpus.A100_SXM4_40GB, gpus.A100_SXM4_80GB)`
  * Added comprehensive test: TestHeterogeneousGPUs
- Added deprecation comments to backward-compatible type aliases
  * Deprecated single-GPU functions in favor of flexible alternatives
  * Maintains backward compatibility while guiding users to better APIs
- All existing tests pass (dgxa100, dgxh100, dgxh200, dgxb200)
- Added TestHeterogeneousGPUs demonstrating mixed GPU configurations
- No breaking changes to public API

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Force-pushed from dc6b844 to ee9071d
This commit adds comprehensive mock implementations for three additional NVIDIA GPU architectures to support multi-architecture testing:
Each mock server provides:
Key Features:
This enables comprehensive testing across multiple GPU architectures without requiring physical hardware.