Add configuration file for Graham (#169)#172
Open
bouthilx wants to merge 36 commits intoSMART-Lab:masterfrom
Open
Add configuration file for Graham (#169)#172bouthilx wants to merge 36 commits intoSMART-Lab:masterfrom
bouthilx wants to merge 36 commits intoSMART-Lab:masterfrom
Conversation
…andle slurm clusters
Why: For each option, add_sbatch_option would add the option in both the form --[OPTION_NAME] and [OPTION_NAME].
It will need many conversions, not only on resources, so better make it clean.
Slurm has no queues, so PBS option -q is invalid and non-convertible.
$PBS_JOBID was used to set the stdout/err of the job as well as in the commands. Replace them with $SLURM_JOB_ID. Also, workers were accessing os.environ[PBS_JOBID] so we added a second fetch on SLURM_JOB_ID in case os.environ[PBS_JOBID] gave undefined.
Slurm cannot be passed environment variables defined locally on command-line like PBS_FILENAME is. To bypass this, we add a definition in the prolog, making PBS_FILENAME available to all commands and epilog. NOTE: We leave PBS_FILENAME definition in command-line too such that any user using $PBS_FILENAME inside a custom pbsFlag can still do so.
PBS options -V is not converted properly to SBATCH --export ALL. We remove it and replace it with --export=ALL is the sbatch options.
Slurm does not have a equivalent environment variable set like PBS_WALLTIME. To avoid confusion, all variables PBS_WALLTIME are renamed to SBATCH_TIMELIMIT (the environment variable one would use to set --time with sbatch). As SBATCH_TIMELIMIT is not set automatically, we add it to the prolog to make it available to all commands and epilog. NOTE: PBS_WALLTIME is set in seconds, but we only have HH:MM:SS-like strings at the time of building the PBS file. We needed to add a walltime_to_seconds helper function to convert HH:MM:SS like strings into seconds, so that SBATCH_TIMELIMIT is set with seconds like PBS_WALLTIME.
It is possible to query the system to see if some commands are available
using distutils.spawn.find_executable(command_name).
Clusters where more than one launcher are available will still get
launchers selected based on string matching. For instance,
get_launcher("helios") would always return msub no matter what is
available on the system.
It is difficult to debug resuming while important process are taking place in the pbs script automatically built by SmartDispatch. We add verbose to smart-dispatch script and add debugging prints in epilog.
JobGenerators are selected by job_generator_factory based on the cluster's name. We use a more flexible, duck typing approach for Slurm clusters. If cluster name is not known, or not any of the if-case clauses in the factory, then we look at which launchers are available in the system. If it is sbatch, then a SlurmJobGenerator is built, a JobGenerator otherwise.
The command `sacctmgr` fails on some computers (mila01 namely), but the current behavior gives the impression sbatch is simply not available. Printing the stderr makes it more obvious that sbatch should be available, but something is broken behind sacctmgr. It only appears when using -vv options nevertheless.
Adding a script to do automatic verifications to assert validity of the current code. The verifications are not automatic unit-tests, they need automatically checks that the process executed successfully, but the administrator still needs to verify manually, reading the logs, that the requested resources were provided. Verifications can easily be combined, building on top of each others, from complex ones to simpler ones. Here is a list of all the verification currently implemented for slurm clusters: 1. very_simple_task (1 CPU) 2. verify_simple_task_with_one_gpu (1 CPU 1 GPU) 3. verify_simple_task_with_many_gpus (1 CPU X GPU) 4. verify_many_task (X CPU) 5. verify_many_task_with_many_cores (XY CPU) 6. verify_many_task_with_one_gpu (X CPU X GPU) 7. verify_many_task_with_many_gpus (X CPU Y GPU) 8. verify_simple_task_with_autoresume_unneeded (1 CPU) 9. verify_simple_task_with_autoresume_needed (1 CPU) 10. verify_many_task_with_autoresume_needed (X CPU)
My initial though was that get_launcher should raise an error when no launcher is found on the system since there cannot be any job launcher. I realized that this would break the --doNotLaunch option that users may want to use on system with no launcher, just to create the files.
The tests were failing because the account was not specified.
The tests were failing because the account was not specified
Collaborator
|
Does this completly fix #169 ? |
Collaborator
Author
There was a missing parentheses which was causing a bad conversion of "DD:HH:MM:SS" to seconds. The unit-test was also missing the same parentheses. I added a unit-test to make sure such error could not occur again.
090309b to
cac2f08
Compare
mgermain
reviewed
Oct 20, 2017
| ### SLURM clusters | ||
|
|
||
| Smartdispatch can also run on SLURM clusters. | ||
| All features like `--gpusPerNode` or `--coresPerNode` are supported. |
Member
There was a problem hiding this comment.
All features including the ones like ?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Wait for #166 before merging
Fix issue #169