How to use with SLURM #41

@dillonmsandhu

Description

Any guidance for using this with SLURM? Certain actors are failing.

When I run

srun -p compsci-gpu --gres=gpu:4 --cpus-per-gpu=5 --mem=24G --pty bash

Followed by:

python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force

I get the following warning:

WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 135095644160 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
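
For reference, this warning means Ray asked for a plasma object store larger than the space free in /dev/shm, so it fell back to disk-backed /tmp, which is much slower. If you can edit the ray.init call in main.py, one common workaround is to cap the object store explicitly so it fits in shared memory; the 32 GB below is an illustrative assumption, not a tuned value:

import ray

# Cap Ray's plasma object store so it fits in /dev/shm rather than
# spilling to disk-backed /tmp. 32 GB is a placeholder; size it to
# the shared memory actually available on your node.
ray.init(object_store_memory=32 * 1024 ** 3)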

Shortly after that warning, the task fails:

2022-12-22 10:38:02,577 WARNING worker.py:1072 -- The node with node id 67f743d808b7bd16d45063d18dadf1b5cbb39e7d has been marked dead because the detector has missed too many heartbeats from it.

E1222 10:38:02.612172 8087 8433 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=core.reanalyze_worker, class_name=BatchWorker_CPU, function_name=run, function_hash=}, task_id=d251967856448ceb88866c7d01000000, task_name=BatchWorker_CPU.run(), job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=88866c7d01000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=0}
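
Reading the trace: Ray's failure detector missed too many heartbeats from the raylet on that node, marked the node dead, and tore down the actors running on it, which is why the BatchWorker_CPU reanalyze actor dies with a closed-socket IOError rather than a Python exception. If the node is merely overloaded rather than actually down, one workaround sometimes suggested is to raise the heartbeat tolerance on the head node; whether _system_config and this key are honored depends on the Ray version you have pinned, so treat the sketch below as an assumption to verify:

import ray

# Allow more missed heartbeats before Ray declares a node dead.
# The _system_config key name varies across Ray releases; check it
# against the Ray version this repo pins before relying on it.
ray.init(_system_config={"num_heartbeats_timeout": 300})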

I am not sure how to parse this error; any advice? What #SBATCH directives do you recommend using in the provided train.sh? Thank you!
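
For a starting point, a batch-mode train.sh header that mirrors the interactive srun request above might look like the sketch below; the partition, GPU, CPU, and memory values are copied straight from the srun flags shown earlier, while the job name is a made-up placeholder:

#!/bin/bash
#SBATCH --job-name=breakout-train
#SBATCH --partition=compsci-gpu
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-gpu=5
#SBATCH --mem=24G

python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train \
    --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force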
