
Slurm JobID cannot always be used as file name #42

@bhmevik

bart-logger uses job_id as part of the file name for its record files, and the slurm plugin takes job_id from the sacct JobID field. If a job array is cancelled while many of its array tasks are still pending, sacct reports a single record for all of the pending tasks, and the JobID of that record is the full list of task indices:

sacct -j 3460472 -o start,state,end,jobid%400
              Start      State                 End                                                                                                                                                                                                                                                                                                                                                                                                            JobID 
------------------- ---------- -------------------                                                                                                                                                                                                                                                                                                                                                            ----------------------------------------------------- 
2021-09-09T12:06:50 CANCELLED+ 2021-09-09T12:06:50   3460472_[1,11,21,31,41,51,61,71,81,91,101,111,121,131,141,151,161,171,181,191,201,211,221,231,241,251,261,271,281,291,301,311,321,331,341,351,361,371,381,391,401,411,421,431,441,451,461,471,481,491,501,511,521,531,541,551,561,571,581,591,601,611,621,631,641,651,661,671,681,691,701,711,721,731,741,751,761,771,781,791,801,811,821,831,841,851,861,871,881,891,901,911,921,931,941,951,961,971,981,991] 
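
The report does not say exactly how writing the record file fails. One plausible cause is the length of the name, so here is a minimal Python sketch that rebuilds the JobID from the record above and compares it with a 255-byte limit. The limit and both helper names (array_job_id, fits_in_file_name) are assumptions made for illustration; they are not part of bart-logger.

```python
# Assumption: 255 bytes is the usual per-file-name limit (NAME_MAX) on common
# Linux filesystems such as ext4 and XFS; other filesystems may differ.
NAME_MAX = 255


def array_job_id(base_id, task_ids):
    """Build a JobID in the bracketed form sacct printed in the record above."""
    return f"{base_id}_[{','.join(str(t) for t in task_ids)}]"


def fits_in_file_name(job_id, limit=NAME_MAX):
    """Return True if job_id on its own is short enough to be a file name."""
    return len(job_id.encode("utf-8")) <= limit


# The cancelled array from the report: tasks 1, 11, 21, ..., 991.
job_id = array_job_id(3460472, range(1, 1000, 10))
print(len(job_id), fits_in_file_name(job_id))  # 398 False
print(fits_in_file_name("3460472"))            # True
```

Any prefix or suffix that bart-logger adds to the file name would make the full name longer still.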

The problem would have been avoided with #26, since the start time and end time are both set to the cancellation time when pending jobs are cancelled, leading to zero walltime.
But perhaps a cleaner fix is to use sacct -o JobIDRaw,... instead of sacct -o JobID,.... We have tested this on one of our clusters, and I'll create a pull request for it:

sacct -j 3460472 -o start,state,end,jobidraw
              Start      State                 End     JobIDRaw 
------------------- ---------- ------------------- ------------ 
2021-09-09T12:06:50 CANCELLED+ 2021-09-09T12:06:50 3460472 
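
As a rough illustration of the proposed change, the Python sketch below requests JobIDRaw from sacct and parses the result. It is not the actual slurm plugin code, which is not shown in this issue. The helper names (sacct_command, parse_records, fetch_records) are hypothetical, and the sample line is made up to mirror the record above; the exact state text sacct prints in parsable mode may differ.

```python
import subprocess

# Same fields as in the example above, with JobIDRaw in place of JobID.
FIELDS = ["start", "state", "end", "jobidraw"]


def sacct_command(job_id):
    """Build a sacct call with '|'-separated output (-P) and no header (-n)."""
    return ["sacct", "-j", job_id, "-P", "-n", "-o", ",".join(FIELDS)]


def parse_records(text):
    """Turn '|'-separated sacct output into one dict per record."""
    records = []
    for line in text.splitlines():
        if line.strip():
            records.append(dict(zip(FIELDS, line.split("|"))))
    return records


def fetch_records(job_id):
    """Run sacct (needs a Slurm installation) and return the parsed records."""
    result = subprocess.run(
        sacct_command(job_id), check=True, capture_output=True, text=True
    )
    return parse_records(result.stdout)


# Illustrative input only, shaped like the JobIDRaw record shown above.
sample = "2021-09-09T12:06:50|CANCELLED|2021-09-09T12:06:50|3460472\n"
for record in parse_records(sample):
    print(record["jobidraw"])  # 3460472
```

With JobIDRaw the identifier stays a short numeric string, so it can go into a file name directly.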
