Skip to content

Work in progress: Support PBS scheduler#50

Draft
ignatiusm wants to merge 11 commits intoSvenMarcus:mainfrom
Garvan-Data-Science-Platform:add-pbs-controller
Draft

Work in progress: Support PBS scheduler#50
ignatiusm wants to merge 11 commits intoSvenMarcus:mainfrom
Garvan-Data-Science-Platform:add-pbs-controller

Conversation

@ignatiusm
Copy link
Copy Markdown

Related to issue #37.

Thanks for developing HPC Rocket! I'd like to help with adding features (specifically around adding PBS support). Before I go too far, I thought I'd submit this draft Pull Request to start a discussion.

This PR so far includes:

  • Slight refactoring of config to generalise from sbatch to job
  • Refactoring of tests to work with the new config arrangement
  • Adding config to allow specifying a scheduler
  • Adding pbscontroller and pbsbatchjob
  • Tweaks to UI to partially handle different logs from PBS
  • Some partial edits to docs with new config settings

As noted in the title, this is very much 'work in progress'. Please excuse the partial implementation (particularly for config parsing, and conditions in hpcrocket/core/application.py). I wanted to get something started and then create the PR so we had a place to discuss things.

Please let me know if you are interested in contributions from the other side of the world, and if so, feel free to make any suggestions for how best to proceed. :)

@SvenMarcus
Copy link
Copy Markdown
Owner

Thanks a lot for the contribution!
Work towards supporting PBS is very welcome as I don't have access to a cluster using it and haven't the time to set up a test environment myself.

I'll review your PR within the next week and will get back to you as soon as I can!

@ignatiusm
Copy link
Copy Markdown
Author

Yay! I'm pleased this PR is welcome. I realise that I need to get the tests passing again, so I'll do some more work on that.

Also, I know a good source for dockerised PBS set up. Let me know if you'd like the details (I guess that would require #33 to be resolved first).

@SvenMarcus
Copy link
Copy Markdown
Owner

Yay! I'm pleased this PR is welcome. I realise that I need to get the tests passing again, so I'll do some more work on that.

I have a pretty good idea of what needs to be done for that, but it'll require some work to build that flexibility and still satisfy mypy. I think I can work on getting the existing tests working today, but we still need some for the PbsController and PbsBatchJob. Could you implement some test cases for those classes? We don't need a real PBS system for that, if you have a look in the test folder, there is a slurmoutput sub directory containing examples of common slurm output. It would be great if you could provide something similar for PBS.

Also, I know a good source for dockerised PBS set up. Let me know if you'd like the details (I guess that would require #33 to be resolved first).

That would be amazing! I've had some trouble with this guide, but maybe that's because I'm working with an Apple Silicon processor.
I think we can also get around implementing the local executor if we simply enable the SSH daemon in the container.

@SvenMarcus
Copy link
Copy Markdown
Owner

SvenMarcus commented Jun 12, 2024

I extended your PR to make the tests pass. Please rebase onto this branch before making more changes: https://github.com/SvenMarcus/hpc-rocket/tree/Garvan-Data-Science-Platform-add-pbs-controller

There are lots of changed files, but the changes mostly deal with introducing generalized Scheduler, BatchJob, JobStatus and TaskStatus classes and protocols (now found in the new core/schedulers package).
After refactoring a bit, PbsBatchJob and SlurmBatchJob turned out to be exactly the same, so I deleted those and we can simply use the generalized BatchJob class now.

I have also slightly changed your proposed format of the config file by renaming file to script. It is now also a yaml object instead of an array.

job:
    script: jobfile.sh
    scheduler: slurm

That's because hpc-rocket already supported this (undocumented) format to avoid adding an extra copy instruction for the slurm script:

job:
   script: remote_jobfile.sh
   from: local_jobfile_path.sh

This is useful for passing in CI secrets
Comment thread hpcrocket/cli/_yaml.py Outdated
@ignatiusm
Copy link
Copy Markdown
Author

Hi @SvenMarcus ,

Also, I know a good source for dockerised PBS set up. Let me know if you'd like the details (I guess that would require #33 to be resolved first).

That would be amazing! I've had some trouble with this guide, but maybe that's because I'm working with an Apple Silicon processor. I think we can also get around implementing the local executor if we simply enable the SSH daemon in the container.

I have borrowed the set up from the dask-jobqueue repo ci test set up. They have PBS, SLURM and SGE configurations. They are a bit more complex than is needed for here, but I've typically just followed the steps in the jobqueue_before_install function in pbs.sh. I haven't attempted to simplify this more yet.

@SvenMarcus
Copy link
Copy Markdown
Owner

I have borrowed the set up from the dask-jobqueue repo ci test set up. They have PBS, SLURM and SGE configurations. They are a bit more complex than is needed for here, but I've typically just followed the steps in the jobqueue_before_install function in pbs.sh. I haven't attempted to simplify this more yet.

Thanks! I'll have a look at it as soon as I find some time. That might take a while, though.
But I think we can still move ahead with the PR if we add some tests with dummy PBS output.

@ignatiusm
Copy link
Copy Markdown
Author

I have borrowed the set up from the dask-jobqueue repo ci test set up. They have PBS, SLURM and SGE configurations. They are a bit more complex than is needed for here, but I've typically just followed the steps in the jobqueue_before_install function in pbs.sh. I haven't attempted to simplify this more yet.

Thanks! I'll have a look at it as soon as I find some time. That might take a while, though. But I think we can still move ahead with the PR if we add some tests with dummy PBS output.

Apologies for the super slow progress with this. I've been caught up in other things too alas. I've pushed some dummy pbs output. I'll work on tests early next week.

@SvenMarcus
Copy link
Copy Markdown
Owner

Apologies for the super slow progress with this. I've been caught up in other things too alas. I've pushed some dummy pbs output. I'll work on tests early next week.

No need to rush, take all the time you need! I really appreciate that you're taking the time to contribute to hpc-rocket! 🙂

Let me know if you run into any trouble getting things working!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants