Description
On a DGX Spark / NVIDIA GB10 host, a NemoClaw/OpenShell sandbox created with GPU passthrough can see the NVIDIA GPU with nvidia-smi, but CUDA initialization fails when the sandbox is created with the default Landlock policy.
Expected: a GPU-enabled OpenShell sandbox should allow CUDA workloads to initialize successfully, or onboarding should fail with a clear validation error instead of accepting a passing nvidia-smi check as proof.
Actual: nvidia-smi succeeds, but cuInit(0) through the normal openshell sandbox exec path returns 304. Recreating the sandbox from the same image, with the same GPU devices and the same policy minus the landlock block, fixes CUDA initialization.
This makes nvidia-smi an insufficient GPU proof for CUDA workloads on this setup.
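A stricter GPU proof could require both checks to pass. The following is only a sketch of that idea, not NemoClaw's actual onboarding code: the names GpuProof, gpu_proof and run_exit_code are hypothetical, and it assumes that openshell sandbox exec propagates the exit code of the command it runs.

```python
import subprocess
from typing import Callable, List, NamedTuple

# Same one-liner as in the reproduction steps: exits 0 only if cuInit(0) == 0.
CUINIT_SNIPPET = (
    'import ctypes,sys; lib=ctypes.CDLL("libcuda.so.1"); '
    'rc=lib.cuInit(0); print("cuInit(0)=%s" % rc); sys.exit(0 if rc == 0 else 1)'
)


class GpuProof(NamedTuple):
    smi_ok: bool   # nvidia-smi exited 0 inside the sandbox
    cuda_ok: bool  # cuInit(0) returned 0 inside the sandbox

    @property
    def verdict(self) -> str:
        if self.smi_ok and self.cuda_ok:
            return "ok"
        if self.smi_ok:
            return "gpu-visible-but-cuda-init-failed"
        return "gpu-not-visible"


def run_exit_code(argv: List[str]) -> int:
    """Run a command and return its exit code (127 if the binary is missing)."""
    try:
        return subprocess.run(argv, capture_output=True).returncode
    except FileNotFoundError:
        return 127


def gpu_proof(sandbox: str, run: Callable[[List[str]], int] = run_exit_code) -> GpuProof:
    """Run both checks through the OpenShell exec path (hypothetical helper)."""
    prefix = ["openshell", "sandbox", "exec", "-n", sandbox, "--"]
    smi_ok = run(prefix + ["nvidia-smi"]) == 0
    cuda_ok = run(prefix + ["python3", "-c", CUINIT_SNIPPET]) == 0
    return GpuProof(smi_ok, cuda_ok)
```

On the setup described here, gpu_proof("drclaw").verdict would be expected to report "gpu-visible-but-cuda-init-failed" with the default policy.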
Reproduction Steps
- Configure NVIDIA CDI on the Spark host:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
sudo systemctl restart docker
nvidia-ctk cdi list
- Onboard a NemoClaw sandbox with direct sandbox GPU enabled:
nemoclaw onboard --name drclaw --gpu --sandbox-gpu --sandbox-gpu-device nvidia.com/gpu=all --non-interactive --yes --yes-i-accept-third-party-software
- Verify nvidia-smi works via OpenShell:
openshell sandbox exec -n drclaw -- nvidia-smi
- Test CUDA initialization via OpenShell:
openshell sandbox exec -n drclaw -- python3 -c 'import ctypes,sys; lib=ctypes.CDLL("libcuda.so.1"); rc=lib.cuInit(0); print("cuInit(0)=%s" % rc); sys.exit(0 if rc == 0 else 1)'
Result with default Landlock policy:
cuInit(0)=304
- Test direct Docker exec into the same GPU-enabled sandbox container:
docker exec --user sandbox <openshell-drclaw-container> python3 -c 'import ctypes,sys; lib=ctypes.CDLL("libcuda.so.1"); rc=lib.cuInit(0); print("cuInit(0)=%s" % rc); sys.exit(0 if rc == 0 else 1)'
Result:
cuInit(0)=0
- Create a temporary GPU sandbox from the same image and the same policy, but omit the landlock: section. CUDA initialization through OpenShell then succeeds:
cuInit(0)=0
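To produce the policy variant used in the last step, the landlock: section can be removed from a copy of the policy file. The helper below is a rough sketch and rests on an assumption: that the policy is a YAML file in which landlock: is a top-level key whose body is indented beneath it. The function name strip_top_level_block is hypothetical, and the code works line by line rather than with a real YAML parser.

```python
def strip_top_level_block(yaml_text: str, key: str = "landlock") -> str:
    """Return yaml_text without the top-level `key:` block.

    Assumes the block starts at column 0 and that everything belonging to it
    is indented (blank lines inside or right after the block are dropped too).
    """
    kept = []
    skipping = False
    for line in yaml_text.splitlines():
        if line.startswith(key + ":"):
            # Start of the block to remove.
            skipping = True
            continue
        if skipping:
            # Indented or blank lines still belong to the removed block.
            if line.startswith((" ", "\t")) or not line.strip():
                continue
            skipping = False
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Because OpenShell rejects Landlock changes on a live sandbox, the stripped policy only helps when it is used to create a new sandbox.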
Environment
- Hardware: DGX Spark / NVIDIA GB10
- OS: Linux aarch64
- NVIDIA driver: 580.126.09
- CUDA reported by nvidia-smi: 13.0
- GPU: NVIDIA GB10
- NemoClaw: v0.0.41
- OpenShell: 0.0.39
- NemoClaw source revision installed locally: 5818cfa8962084717f281bfff5c08ae0435a30a7
- Docker GPU mode selected by onboarding: --gpus all
- CDI devices present:
  nvidia.com/gpu=0
  nvidia.com/gpu=GPU-96e354d9-34ac-8927-0cbb-d761e87ba109
  nvidia.com/gpu=all
Debug Output
The relevant split is:
# OpenShell exec with default Landlock policy
openshell sandbox exec -n drclaw -- python3 /sandbox/.openclaw/drclaw-cuda-probe.py
cuInit(0)=304
# Direct Docker exec into same container
docker exec --user sandbox <openshell-drclaw-container> python3 /sandbox/.openclaw/drclaw-cuda-probe.py
cuInit(0)=0
# OpenShell exec after recreating sandbox without the landlock block
openshell sandbox exec -n drclaw -- python3 /sandbox/.openclaw/drclaw-cuda-probe.py
cuInit(0)=0
nvidia-smi succeeds through OpenShell in both cases.
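The contents of drclaw-cuda-probe.py are not included in this report. A probe along the following lines prints the same cuInit(0)=<rc> value and also resolves the symbolic error name through the driver's cuGetErrorName call. Treat it as a sketch: the function names cu_init and main are illustrative, not the actual script.

```python
import ctypes

# CUresult values from the CUDA driver API header (cuda.h).
CUDA_SUCCESS = 0
CUDA_ERROR_OPERATING_SYSTEM = 304


def cu_init(lib_name="libcuda.so.1"):
    """Return (rc, name) for cuInit(0).

    rc is None when the driver library cannot be loaded at all.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, "load failed: %s" % exc
    rc = lib.cuInit(0)
    name_ptr = ctypes.c_char_p()
    # cuGetErrorName(CUresult, const char **) fills in the enum name for rc.
    if lib.cuGetErrorName(rc, ctypes.byref(name_ptr)) == 0 and name_ptr.value:
        return rc, name_ptr.value.decode()
    return rc, "unknown"


def main():
    """Print the result and return a shell-style exit code."""
    rc, name = cu_init()
    print("cuInit(0)=%s (%s)" % (rc, name))
    return 0 if rc == CUDA_SUCCESS else 1

# As a script, finish with: import sys; sys.exit(main())
```

Result 304 corresponds to CUDA_ERROR_OPERATING_SYSTEM, which indicates that an OS call made by the driver failed. That is consistent with a sandbox restriction rather than a missing device.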
Logs
With the default policy, nvidia-smi succeeds but cuInit(0) fails with CUDA result 304 (CUDA_ERROR_OPERATING_SYSTEM).
I also observed that /proc/<pid>/task/<tid>/comm writes fail under the default OpenShell execution path:
sh: 1: cannot create /proc/<pid>/task/<pid>/comm: Permission denied
Allowing /proc read-write in the policy did not fix CUDA while Landlock remained enabled. Removing the landlock: section at sandbox creation time did fix CUDA. OpenShell rejects changing Landlock on a live sandbox, so this had to be tested by recreating a temporary sandbox.
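The comm write failure can also be checked from Python inside the sandbox. This is a minimal sketch under one assumption: that rewriting the calling thread's own comm entry with its current value exercises the same restriction as the sh redirect above. The function name check_comm_write is illustrative.

```python
import threading


def check_comm_write():
    """Try to rewrite the calling thread's comm with its current value.

    Returns (ok, detail). The write is harmless because the name is unchanged.
    """
    path = "/proc/self/task/%d/comm" % threading.get_native_id()
    try:
        with open(path, "r") as fh:
            current = fh.read().strip()
        with open(path, "w") as fh:
            fh.write(current)
    except OSError as exc:
        # Covers PermissionError (the case reported above) and a missing /proc.
        return False, "%s: %s" % (path, exc)
    return True, "%s writable (comm=%r)" % (path, current)
```

Running it through openshell sandbox exec and through docker exec would show whether the two execution paths differ on this write as well.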
Checklist