
OpenShell GPU sandbox: CUDA cuInit fails under Landlock on Spark/GB10 despite nvidia-smi working #4016

@mcragun

Description

On a DGX Spark / NVIDIA GB10 host, a NemoClaw/OpenShell sandbox created with GPU passthrough can see the NVIDIA GPU with nvidia-smi, but CUDA initialization fails when the sandbox is created with the default Landlock policy.

Expected: a GPU-enabled OpenShell sandbox should allow CUDA workloads to initialize successfully, or onboarding should fail with a clear validation error instead of accepting a passing nvidia-smi check as proof that the GPU works.

Actual: nvidia-smi succeeds, but cuInit(0) through the normal openshell sandbox exec path returns 304 (CUDA_ERROR_OPERATING_SYSTEM). Recreating the sandbox from the same image, with the same GPU devices and the same policy minus the landlock block, fixes CUDA initialization.

This makes nvidia-smi an insufficient GPU proof for CUDA workloads on this setup.
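
The stricter proof requested above can be sketched as a check that passes only when both probes pass. This is a hypothetical illustration, not NemoClaw's onboarding code: the `gpu_proof` name and the message wording are assumptions.

```python
from typing import Optional, Tuple


def gpu_proof(nvidia_smi_ok: bool, cuinit_rc: Optional[int]) -> Tuple[bool, str]:
    """Combine an nvidia-smi result and a cuInit(0) return code into one verdict.

    Hypothetical helper: the name and messages are illustrative only.
    cuinit_rc is None when the CUDA probe could not be run at all.
    """
    if not nvidia_smi_ok:
        return False, "nvidia-smi failed: GPU is not visible in the sandbox"
    if cuinit_rc is None:
        return False, "nvidia-smi passed, but the CUDA probe did not run"
    if cuinit_rc != 0:
        return False, (
            "nvidia-smi passed, but cuInit(0) returned %d: "
            "CUDA workloads will not start in this sandbox" % cuinit_rc
        )
    return True, "nvidia-smi and cuInit(0) both succeeded"
```

With the results from this report, `gpu_proof(True, 304)` would fail onboarding with a message that names the CUDA return code, while `gpu_proof(True, 0)` would pass.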

Reproduction Steps

  1. Configure NVIDIA CDI on the Spark host:

    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    sudo systemctl restart docker
    nvidia-ctk cdi list
  2. Onboard a NemoClaw sandbox with direct sandbox GPU enabled:

    nemoclaw onboard --name drclaw --gpu --sandbox-gpu --sandbox-gpu-device nvidia.com/gpu=all --non-interactive --yes --yes-i-accept-third-party-software
  3. Verify nvidia-smi works via OpenShell:

    openshell sandbox exec -n drclaw -- nvidia-smi
  4. Test CUDA initialization via OpenShell:

    openshell sandbox exec -n drclaw -- python3 -c 'import ctypes,sys; lib=ctypes.CDLL("libcuda.so.1"); rc=lib.cuInit(0); print("cuInit(0)=%s" % rc); sys.exit(0 if rc == 0 else 1)'

    Result with default Landlock policy:

    cuInit(0)=304
  5. Test direct Docker exec into the same GPU-enabled sandbox container:

    docker exec --user sandbox <openshell-drclaw-container> python3 -c 'import ctypes,sys; lib=ctypes.CDLL("libcuda.so.1"); rc=lib.cuInit(0); print("cuInit(0)=%s" % rc); sys.exit(0 if rc == 0 else 1)'

    Result:

    cuInit(0)=0
  6. Create a temporary GPU sandbox from the same image and same policy, but omit the landlock: section. CUDA initialization through OpenShell succeeds:

    cuInit(0)=0
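
For repeated runs, the one-liner from steps 4 and 5 can be expanded into a small script. The following is a minimal sketch, not the actual contents of drclaw-cuda-probe.py; the function names, the exit status 2 for a missing library, and the short lookup table of CUresult names are assumptions added for readability.

```python
import ctypes

# A few CUresult values from the CUDA driver API header (cuda.h).
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    304: "CUDA_ERROR_OPERATING_SYSTEM",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe(rc):
    """Format a cuInit return code like the one-liner does, plus a name."""
    name = CU_RESULT_NAMES.get(rc, "unlisted CUresult")
    return "cuInit(0)=%s (%s)" % (rc, name)


def run_probe(load=ctypes.CDLL):
    """Load libcuda, call cuInit(0), print the result, return an exit status."""
    try:
        lib = load("libcuda.so.1")
    except OSError as exc:
        # Assumed convention: 2 means the driver library could not be loaded.
        print("could not load libcuda.so.1: %s" % exc)
        return 2
    rc = lib.cuInit(0)
    print(describe(rc))
    return 0 if rc == 0 else 1
```

Saved as a file, the script would end with `import sys; sys.exit(run_probe())` so that the exit status matches the one-liner: 0 on success and nonzero otherwise.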

Environment

  • Hardware: DGX Spark / NVIDIA GB10
  • OS: Linux aarch64
  • NVIDIA driver: 580.126.09
  • CUDA reported by nvidia-smi: 13.0
  • GPU: NVIDIA GB10
  • NemoClaw: v0.0.41
  • OpenShell: 0.0.39
  • NemoClaw source revision installed locally: 5818cfa8962084717f281bfff5c08ae0435a30a7
  • Docker GPU mode selected by onboarding: --gpus all
  • CDI devices present:
    • nvidia.com/gpu=0
    • nvidia.com/gpu=GPU-96e354d9-34ac-8927-0cbb-d761e87ba109
    • nvidia.com/gpu=all

Debug Output

The relevant split is:

    # OpenShell exec with default Landlock policy
    openshell sandbox exec -n drclaw -- python3 /sandbox/.openclaw/drclaw-cuda-probe.py
    cuInit(0)=304

    # Direct Docker exec into same container
    docker exec --user sandbox <openshell-drclaw-container> python3 /sandbox/.openclaw/drclaw-cuda-probe.py
    cuInit(0)=0

    # OpenShell exec after recreating sandbox without the landlock block
    openshell sandbox exec -n drclaw -- python3 /sandbox/.openclaw/drclaw-cuda-probe.py
    cuInit(0)=0

nvidia-smi succeeds through OpenShell in both cases.
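
The comparison between the two exec paths could be automated along these lines. This is a sketch under stated assumptions: the helper names are hypothetical, the container name has to be supplied by the caller (it is a placeholder in this report), and the probe is assumed to print a `cuInit(0)=<n>` line as shown above.

```python
import re


def build_commands(sandbox, container, probe_path):
    """Return the argv lists for the two exec paths compared in this report."""
    probe = ["python3", probe_path]
    return {
        "openshell": ["openshell", "sandbox", "exec", "-n", sandbox, "--"] + probe,
        "docker": ["docker", "exec", "--user", "sandbox", container] + probe,
    }


def parse_cuinit(output):
    """Extract the integer from a 'cuInit(0)=<n>' line, or None if absent."""
    match = re.search(r"cuInit\(0\)=(-?\d+)", output)
    return int(match.group(1)) if match else None


def paths_disagree(results):
    """True when the exec paths report different cuInit codes."""
    return len(set(results.values())) > 1
```

Each argv list can be passed to `subprocess.run(argv, capture_output=True, text=True)` and its stdout fed to `parse_cuinit`. For the results in this report, `paths_disagree({"openshell": 304, "docker": 0})` is true, which is the split described above.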

Logs

With the default policy, nvidia-smi succeeds but cuInit(0) fails with CUDA result 304.

I also observed that /proc/<pid>/task/<tid>/comm writes fail under the default OpenShell execution path:

    sh: 1: cannot create /proc/<pid>/task/<pid>/comm: Permission denied

Allowing /proc read-write in the policy did not fix CUDA while Landlock remained enabled. Removing the landlock: section at sandbox creation time did fix CUDA. OpenShell rejects changing Landlock on a live sandbox, so this had to be tested by recreating a temporary sandbox.
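
One way to narrow down which path the Landlock policy blocks would be to try opening candidate files from both exec paths and compare the errors. The sketch below is an assumption-laden diagnostic, not a confirmed root cause: the device node names are the conventional NVIDIA ones and may not match what exists in the sandbox, and `check_open` is a hypothetical helper.

```python
import errno
import os


def check_open(paths, flags=os.O_RDWR):
    """Try to open each path; map it to None on success or the errno name."""
    results = {}
    for path in paths:
        try:
            fd = os.open(path, flags)
        except OSError as exc:
            results[path] = errno.errorcode.get(exc.errno, str(exc.errno))
        else:
            os.close(fd)
            results[path] = None
    return results


# Assumed candidates: conventional NVIDIA device node names; adjust to the
# nodes actually present in the sandbox (for example from `ls /dev/nvidia*`).
CANDIDATE_PATHS = [
    "/dev/nvidiactl",
    "/dev/nvidia0",
    "/dev/nvidia-uvm",
    "/dev/nvidia-uvm-tools",
]
```

Running `check_open(CANDIDATE_PATHS)` through openshell sandbox exec and again through docker exec would show whether any path opens in one case and fails in the other. A path that returns EACCES only under OpenShell would be a candidate for the policy's allow list. The comm write failure above can be checked the same way by passing the comm path with `os.O_WRONLY`.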

Checklist

  • I confirmed this bug is reproducible
  • I searched existing issues and this is not a duplicate

Labels

  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • platform: dgx-spark (Affects DGX Spark hardware or workflows)
