Hi, I'm trying to run a simple PyTorch tensor add on the GPU under nsjail on a GCP nvidia-tesla-t4 node, and I'm getting the error below.
nsjail_pytorch.cfg
mount {
src: "/home/current_user_ldap/pytorch_env"
dst: "/home/current_user_ldap/pytorch_env"
is_bind: true
}
mount {
src: "/dev/nvidia0"
dst: "/dev/nvidia0"
is_bind: true
rw: true
}
mount {
src: "/dev/nvidiactl"
dst: "/dev/nvidiactl"
is_bind: true
rw: true
}
mount {
src: "/dev/nvidia-uvm"
dst: "/dev/nvidia-uvm"
is_bind: true
rw: true
}
mount {
src: "/usr"
dst: "/usr"
is_bind: true
rw: true
}
# for libs
mount {
src: "/lib64"
dst: "/lib64"
is_bind: true
}
mount {
src: "/lib"
dst: "/lib"
is_bind: true
rw: true
}
cwd: "/home/current_user_ldap/pytorch_env/"
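As a sanity check on the device bind-mounts in the config above, here is a small Python sketch that lists the NVIDIA character-device nodes actually present on the host. This is my own diagnostic helper, not part of nsjail; note that on some driver setups additional nodes such as /dev/nvidia-uvm-tools or /dev/nvidia-modeset may also exist and would not be covered by the three mounts in the config.

```python
import glob
import os
import stat

def list_nvidia_devices(pattern="/dev/nvidia*"):
    """Return (path, major, minor) for each NVIDIA character-device node found."""
    nodes = []
    for path in sorted(glob.glob(pattern)):
        st = os.stat(path)
        if stat.S_ISCHR(st.st_mode):
            nodes.append((path, os.major(st.st_rdev), os.minor(st.st_rdev)))
    return nodes

if __name__ == "__main__":
    found = list_nvidia_devices()
    if not found:
        print("no /dev/nvidia* character devices found")
    for path, major, minor in found:
        print(f"{path} (char {major}:{minor})")
```

Running this on the host lets you compare the full set of device nodes against the ones bind-mounted into the jail.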
Running simple PyTorch Tensor Add on CPU works.
nsjail -Mo --chroot / --rlimit_nproc 6553 --rlimit_fsize inf --rlimit_as inf -- /usr/bin/python3 -c "import torch; a = torch.tensor([1.0, 2.0], device='cpu') + torch.tensor([3.0, 4.0], device='cpu'); print(a)"
This prints the expected tensor output, tensor([4., 6.]).
Running simple PyTorch Tensor Add on GPU fails
nsjail -Mo --config nsjail_pytorch.cfg --chroot / --rlimit_nproc 6553 --rlimit_fsize inf --rlimit_as inf -- /usr/bin/python3 -c "import torch; print(torch.cuda.is_available());"
[I][2024-08-10T02:03:04+0000] Mode: STANDALONE_ONCE
[I][2024-08-10T02:03:04+0000] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/usr/bin/python3', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:600, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2024-08-10T02:03:04+0000] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/home/current_user_ldap/pytorch_env' -> '/home/current_user_ldap/pytorch_env' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia0' -> '/dev/nvidia0' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidiactl' -> '/dev/nvidiactl' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia-uvm' -> '/dev/nvidia-uvm' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/usr' -> '/usr' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib64' -> '/lib64' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib' -> '/lib' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Uid map: inside_uid:1002 outside_uid:1002 count:1 newuidmap:false
[I][2024-08-10T02:03:04+0000] Gid map: inside_gid:1003 outside_gid:1003 count:1 newgidmap:false
[I][2024-08-10T02:03:06+0000] Executing '/usr/bin/python3' for '[STANDALONE MODE]'
/home/current_user_ldap/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
[I][2024-08-10T02:03:08+0000] pid=28434 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
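To narrow down whether the failure is in PyTorch or in the driver userspace itself, one can probe cuInit directly via ctypes inside the jail. This is a hedged diagnostic sketch of mine, assuming libcuda.so.1 is on the loader path; it is written so it also runs harmlessly on a machine without the driver.

```python
import ctypes

def try_cuinit():
    """Attempt cuInit(0) via the CUDA driver API and return its status code.

    Returns None when libcuda.so.1 cannot be loaded at all (e.g. no driver
    installed), so the probe is safe to run anywhere.
    """
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None
    return libcuda.cuInit(0)  # 0 means CUDA_SUCCESS

if __name__ == "__main__":
    status = try_cuinit()
    if status is None:
        print("libcuda.so.1 not loadable")
    else:
        print(f"cuInit returned {status}")
```

If this returns a nonzero code under nsjail but 0 outside it, the problem is below PyTorch, in the driver's interaction with the jail's namespaces or device nodes.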
NVIDIA-SMI runs fine under nsjail
nsjail -Mo --config nsjail_pytorch.cfg --chroot / --rlimit_nproc 6553 --rlimit_as inf -- /bin/nvidia-smi
The above successfully prints the actual nvidia-smi output.
Notes
- PyTorch on CPU works fine under nsjail (no issues)
- nvidia-smi works under nsjail
- Running PyTorch without nsjail on GPU succeeds.
This doesn't look like a PyTorch or host issue, given that PyTorch works on the GPU without nsjail. Any help appreciated.