[skills,docs] docs: add KeepGPU repository workflow skill #65
Merged
7 commits:
- e5e4334 docs(skills): add keepgpu repository workflow skill (Wangmerlyn)
- b95170d docs(skills): refocus keepgpu skill on CLI usage (Wangmerlyn)
- dfd8530 docs(skills): clarify source checkout install path (Wangmerlyn)
- 22c79ff docs(skills): prefer git URL install path (Wangmerlyn)
- 65b8d7c docs(skills): prefer tmux for remote sessions (Wangmerlyn)
- e9eef73 docs(skills): drop response template section (Wangmerlyn)
- a49a618 docs(skills): rename skill to gpu-keepalive-with-keepgpu (Wangmerlyn)
docs/skills/gpu-keepalive-with-keepgpu/skillcheck-free-report.md (22 additions, 0 deletions):

# SkillCheck Free Report: gpu-keepalive-with-keepgpu

Validation method: applied the SkillCheck Free rule set from `skill-check/SKILL.md` in <https://github.com/olgasafonova/skillcheck-free> after updating the skill to focus on KeepGPU CLI usage and installation instructions.

## Summary

- Critical: 0
- Warnings: 0
- Suggestions: 0

## Strengths detected

- Description includes activation triggers (8.3)
- Description includes negative triggers (8.7)
- Skill includes example section (8.1)
- Skill documents error handling or limitations (8.2)
- Skill uses structured instructions (8.5)
- Skill documents prerequisites (8.6)

## Checked files

- `skills/gpu-keepalive-with-keepgpu/SKILL.md`
New file (156 additions, 0 deletions):

---
name: gpu-keepalive-with-keepgpu
description: Install and operate the KeepGPU CLI to keep reserved GPUs active during data prep, debugging, and orchestration downtime. Use when users ask for keep-gpu command construction, tuning (--vram, --interval, --busy-threshold), installation from this repository, or runtime troubleshooting of keep-gpu sessions; do not use for repository development, code refactoring, or unrelated Python tooling.
---

# KeepGPU CLI Operator

Use this workflow to run `keep-gpu` safely and effectively.

## Prerequisites

- Confirm at least one GPU is visible (`python -c "import torch; print(torch.cuda.device_count())"`).
- Run commands in a shell where CUDA/ROCm drivers are already available.
- Use `Ctrl+C` to stop KeepGPU and release memory cleanly.

## Install KeepGPU

Install PyTorch first for your platform, then install KeepGPU.

### Option A: Install from package index

```bash
# CUDA example (change cu121 to your CUDA version)
pip install --index-url https://download.pytorch.org/whl/cu121 torch
pip install keep-gpu
```

```bash
# ROCm example (change rocm6.1 to your ROCm version)
pip install --index-url https://download.pytorch.org/whl/rocm6.1 torch
pip install keep-gpu[rocm]
```

### Option B: Install directly from Git URL (no local clone)

Prefer this option when users only need the CLI and do not need local source edits. This avoids checkout directory and cleanup overhead.

```bash
pip install "git+https://github.com/Wangmerlyn/KeepGPU.git"
```

If SSH access is configured:

```bash
pip install "git+ssh://git@github.com/Wangmerlyn/KeepGPU.git"
```

ROCm variant from a Git URL:

```bash
pip install "keep_gpu[rocm] @ git+https://github.com/Wangmerlyn/KeepGPU.git"
```

### Option C: Install from a local source checkout (explicit path)

Use this option only when users already have a local checkout or plan to edit source.

```bash
git clone https://github.com/Wangmerlyn/KeepGPU.git
cd KeepGPU
pip install -e .
```

If the checkout already exists somewhere else, install by absolute path:

```bash
pip install -e /absolute/path/to/KeepGPU
```

For ROCm users installing from a local checkout:

```bash
pip install -e ".[rocm]"
```

Verify the installation:

```bash
keep-gpu --help
```

## Command model

CLI options to tune:

- `--gpu-ids`: comma-separated IDs (`0`, `0,1`). If omitted, KeepGPU uses all visible GPUs.
- `--vram`: VRAM to hold per GPU (`512MB`, `1GiB`, or raw bytes).
- `--interval`: seconds between keep-alive cycles.
- `--busy-threshold` (alias `--util-threshold`): if utilization is above this percent, KeepGPU backs off.
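The interaction between `--interval` and `--busy-threshold` can be sketched as a simple loop. This is an illustrative model only, not KeepGPU's actual implementation; `read_utilization` and `burn_cycle` are hypothetical stand-ins for the utilization query and the VRAM-touching work.

```python
import time

def keepalive_loop(read_utilization, burn_cycle, busy_threshold=25,
                   interval=60, cycles=3, sleep=time.sleep):
    """Illustrative keep-alive loop: run a burn cycle while the GPU is idle,
    back off (skip the burn) when utilization exceeds the busy threshold."""
    actions = []
    for _ in range(cycles):
        if read_utilization() > busy_threshold:
            actions.append("backoff")  # real work is running; stay out of the way
        else:
            burn_cycle()
            actions.append("burn")     # touch the held VRAM so the GPU looks active
        sleep(interval)                # wait out one keep-alive cycle
    return actions
```

For example, with simulated utilization readings of 10%, 90%, and 5%, the loop burns, backs off, then burns again.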

Legacy compatibility:

- `--threshold` is deprecated but still accepted.
- Numeric `--threshold` maps to busy threshold.
- String `--threshold` maps to VRAM.
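The legacy `--threshold` disambiguation could be implemented along these lines; `interpret_legacy_threshold` is a hypothetical helper that sketches the rule, not KeepGPU's actual parser.

```python
def interpret_legacy_threshold(value):
    """Sketch of the deprecated --threshold mapping: values that parse as a
    number become the busy threshold percentage; anything else (e.g. '1GiB')
    is treated as a VRAM size string."""
    try:
        return ("busy_threshold", float(value))
    except ValueError:
        return ("vram", value)
```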

## Agent workflow

1. Collect workload intent: target GPU IDs, expected hold duration, and whether the node is shared.
2. Choose safe defaults when unspecified: `--vram 1GiB`, `--interval 60-120`, `--busy-threshold 25` for shared nodes.
3. Build one concrete command.
4. Provide a stop instruction (`Ctrl+C`) and a verification step.
5. If the command fails, provide one minimal troubleshooting command at a time.
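The workflow above can be condensed into a small helper that assembles a run-ready command from workload intent. `build_keep_gpu_command` is a hypothetical illustration using the skill's safe defaults; only the flag names come from the skill text.

```python
def build_keep_gpu_command(gpu_ids=None, vram="1GiB", interval=60, busy_threshold=25):
    """Assemble a keep-gpu invocation from workload intent, falling back to
    the skill's safe defaults when values are unspecified."""
    parts = ["keep-gpu"]
    if gpu_ids:  # omit --gpu-ids to let KeepGPU use all visible GPUs
        parts += ["--gpu-ids", ",".join(str(i) for i in gpu_ids)]
    parts += ["--vram", vram,
              "--interval", str(interval),
              "--busy-threshold", str(busy_threshold)]
    return " ".join(parts)
```

Calling it with `gpu_ids=[0]` reproduces the single-GPU template below.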

## Command templates

Single GPU while preprocessing:

```bash
keep-gpu --gpu-ids 0 --vram 1GiB --interval 60 --busy-threshold 25
```

All visible GPUs with lighter load:

```bash
keep-gpu --vram 512MB --interval 180
```

Remote sessions (preferred: `tmux` for visibility and control):

```bash
tmux new -s keepgpu
keep-gpu --gpu-ids 0 --vram 1GiB --interval 300
# Detach with Ctrl+b then d; reattach with: tmux attach -t keepgpu
```

Fallback when `tmux` is unavailable:

```bash
nohup keep-gpu --gpu-ids 0 --vram 1GiB --interval 300 > keepgpu.log 2>&1 &
echo $! > keepgpu.pid
# Monitor: tail -f keepgpu.log
# Stop: kill "$(cat keepgpu.pid)"
```

## Troubleshooting

- Invalid `--gpu-ids`: ensure comma-separated integers only.
- Allocation failure / OOM: reduce `--vram` or free memory first.
- No utilization telemetry: ensure `nvidia-ml-py` works and `nvidia-smi` is available.
- No GPUs detected: verify drivers, the CUDA/ROCm runtime, and `torch.cuda.device_count()`.

## Example

User request: "Install KeepGPU from GitHub and keep GPU 0 alive while I preprocess."

Suggested response shape:

1. Install: `pip install "git+https://github.com/Wangmerlyn/KeepGPU.git"`
2. Run: `keep-gpu --gpu-ids 0 --vram 1GiB --interval 60 --busy-threshold 25`
3. Verify: check the CLI logs for keep-loop activity; stop with `Ctrl+C` when done.

## Limitations

- KeepGPU is not a scheduler; it only keeps already accessible GPUs active.
- KeepGPU behavior depends on cluster policy; some schedulers require higher VRAM or tighter intervals.
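The size grammar that `--vram` accepts (`512MB`, `1GiB`, or raw bytes) could be parsed along these lines; `parse_vram` is a hypothetical sketch of that grammar, not KeepGPU's actual parser.

```python
import re

# Decimal (KB/MB/GB) and binary (KiB/MiB/GiB) multipliers, plus plain bytes.
_UNITS = {
    "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30,
}

def parse_vram(spec):
    """Parse a VRAM size like '512MB', '1GiB', or '1073741824' into bytes."""
    spec = spec.strip()
    if spec.isdigit():  # raw byte count
        return int(spec)
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMG]?i?B)", spec, re.IGNORECASE)
    if not m:
        raise ValueError(f"unrecognized VRAM spec: {spec!r}")
    number, unit = m.groups()
    return int(float(number) * _UNITS[unit.upper()])
```

Note the distinction the skill's examples rely on: `512MB` is decimal (512,000,000 bytes) while `1GiB` is binary (2^30 bytes).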
New file (3 additions, 0 deletions):

display_name: KeepGPU CLI Operator
short_description: Install and operate keep-gpu CLI commands safely on shared GPU machines.
default_prompt: Use this skill to install KeepGPU and build run-ready keep-gpu commands with sensible defaults, verification steps, and troubleshooting.
Review comment: The "Strengths detected" section lists rule numbers (e.g., 8.3) without a brief explanation of what each rule signifies. To make this report more self-contained and immediately understandable for readers who may not have direct access to, or familiarity with, the `skill-check/SKILL.md` document, consider adding a concise description for each strength. This would improve the clarity and overall usefulness of the report as a standalone document.