diff --git a/README.md b/README.md
index 3dcceb7..2dddc3e 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,7 @@
 [![DOI](https://zenodo.org/badge/987167271.svg)](https://doi.org/10.5281/zenodo.17129114)
 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/Wangmerlyn/KeepGPU)
 [![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/Wangmerlyn/KeepGPU)](https://coderabbit.ai/dashboard/gh/Wangmerlyn/KeepGPU)
+[![SkillCheck Passed](https://raw.githubusercontent.com/olgasafonova/skillcheck-free/main/skill-check/passed.svg)](https://github.com/olgasafonova/skillcheck-free)
 
 **Keep GPU** keeps shared GPUs from being reclaimed while you prep data, debug, or coordinate multi-stage pipelines. It allocates just enough VRAM and issues lightweight CUDA work so schedulers observe an “active” device—without running a full training job.
diff --git a/docs/skills/gpu-keepalive-with-keepgpu/skillcheck-free-report.md b/docs/skills/gpu-keepalive-with-keepgpu/skillcheck-free-report.md
new file mode 100644
index 0000000..d17cc13
--- /dev/null
+++ b/docs/skills/gpu-keepalive-with-keepgpu/skillcheck-free-report.md
@@ -0,0 +1,22 @@
+# SkillCheck Free Report: gpu-keepalive-with-keepgpu
+
+Validation method: applied the SkillCheck Free rule set from `skill-check/SKILL.md` after updating the skill to focus on KeepGPU CLI usage and installation instructions.
+
+## Summary
+
+- Critical: 0
+- Warnings: 0
+- Suggestions: 0
+
+## Strengths detected
+
+- Description includes activation triggers (8.3)
+- Description includes negative triggers (8.7)
+- Skill includes example section (8.1)
+- Skill documents error handling or limitations (8.2)
+- Skill uses structured instructions (8.5)
+- Skill documents prerequisites (8.6)
+
+## Checked files
+
+- `skills/gpu-keepalive-with-keepgpu/SKILL.md`
diff --git a/skills/gpu-keepalive-with-keepgpu/SKILL.md b/skills/gpu-keepalive-with-keepgpu/SKILL.md
new file mode 100644
index 0000000..cbd9544
--- /dev/null
+++ b/skills/gpu-keepalive-with-keepgpu/SKILL.md
@@ -0,0 +1,156 @@
+---
+name: gpu-keepalive-with-keepgpu
+description: Install and operate the KeepGPU CLI to keep reserved GPUs active during data prep, debugging, and orchestration downtime. Use when users ask for keep-gpu command construction, tuning (--vram, --interval, --busy-threshold), installation from this repository, or runtime troubleshooting of keep-gpu sessions; do not use for repository development, code refactoring, or unrelated Python tooling.
+---
+
+# KeepGPU CLI Operator
+
+Use this workflow to run `keep-gpu` safely and effectively.
+
+## Prerequisites
+
+- Confirm at least one GPU is visible (`python -c "import torch; print(torch.cuda.device_count())"`).
+- Run commands in a shell where CUDA/ROCm drivers are already available.
+- Use `Ctrl+C` to stop KeepGPU and release memory cleanly.
+
+## Install KeepGPU
+
+Install PyTorch first for your platform, then install KeepGPU.
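When scripting an install, the platform-specific wheel index can be chosen programmatically. A minimal sketch, assuming a hypothetical helper name; the CUDA and ROCm index URLs mirror the examples in this guide, while the CPU-only index URL is an assumption:

```python
def torch_index_url(backend: str, version: str = "") -> str:
    """Return the pip --index-url for installing torch.

    backend: "cuda", "rocm", or "cpu"; version: a tag such as "cu121"
    or "rocm6.1" (defaults mirror the examples in this guide).
    """
    base = "https://download.pytorch.org/whl/"
    if backend == "cuda":
        return base + (version or "cu121")
    if backend == "rocm":
        return base + (version or "rocm6.1")
    return base + "cpu"  # assumption: CPU-only wheel index

# Example: build the full install command for a CUDA 12.1 host.
cmd = f"pip install --index-url {torch_index_url('cuda', 'cu121')} torch"
```

Pair the resulting command with the KeepGPU install step for your platform from the options below.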
+
+### Option A: Install from the package index
+
+```bash
+# CUDA example (change cu121 to your CUDA version)
+pip install --index-url https://download.pytorch.org/whl/cu121 torch
+pip install keep-gpu
+```
+
+```bash
+# ROCm example (change rocm6.1 to your ROCm version)
+pip install --index-url https://download.pytorch.org/whl/rocm6.1 torch
+pip install "keep-gpu[rocm]"
+```
+
+### Option B: Install directly from a Git URL (no local clone)
+
+Prefer this option when users only need the CLI and do not need to edit the source locally. It avoids checkout-directory and cleanup overhead.
+
+```bash
+pip install "git+https://github.com/Wangmerlyn/KeepGPU.git"
+```
+
+If SSH access is configured:
+
+```bash
+pip install "git+ssh://git@github.com/Wangmerlyn/KeepGPU.git"
+```
+
+ROCm variant from a Git URL:
+
+```bash
+pip install "keep_gpu[rocm] @ git+https://github.com/Wangmerlyn/KeepGPU.git"
+```
+
+### Option C: Install from a local source checkout (explicit path)
+
+Use this option only when users already have a local checkout or plan to edit the source.
+
+```bash
+git clone https://github.com/Wangmerlyn/KeepGPU.git
+cd KeepGPU
+pip install -e .
+```
+
+If the checkout already exists somewhere else, install by absolute path:
+
+```bash
+pip install -e /absolute/path/to/KeepGPU
+```
+
+For ROCm users installing from a local checkout:
+
+```bash
+pip install -e ".[rocm]"
+```
+
+Verify the installation:
+
+```bash
+keep-gpu --help
+```
+
+## Command model
+
+CLI options to tune:
+
+- `--gpu-ids`: comma-separated IDs (`0`, `0,1`). If omitted, KeepGPU uses all visible GPUs.
+- `--vram`: VRAM to hold per GPU (`512MB`, `1GiB`, or raw bytes).
+- `--interval`: seconds between keep-alive cycles.
+- `--busy-threshold` (alias `--util-threshold`): if utilization is above this percentage, KeepGPU backs off.
+
+Legacy compatibility:
+
+- `--threshold` is deprecated but still accepted.
+- A numeric `--threshold` maps to the busy threshold.
+- A string `--threshold` maps to VRAM.
+
+## Agent workflow
+
+1. Collect workload intent: target GPU IDs, expected hold duration, and whether the node is shared.
+2. Choose safe defaults when unspecified: `--vram 1GiB`, `--interval 60-120`, and `--busy-threshold 25` for shared nodes.
+3. Build one concrete command.
+4. Provide a stop instruction (`Ctrl+C`) and a verification step.
+5. If the command fails, provide one minimal troubleshooting command at a time.
+
+## Command templates
+
+Single GPU while preprocessing:
+
+```bash
+keep-gpu --gpu-ids 0 --vram 1GiB --interval 60 --busy-threshold 25
+```
+
+All visible GPUs with a lighter load:
+
+```bash
+keep-gpu --vram 512MB --interval 180
+```
+
+Remote sessions (preferred: `tmux` for visibility and control):
+
+```bash
+tmux new -s keepgpu
+keep-gpu --gpu-ids 0 --vram 1GiB --interval 300
+# Detach with Ctrl+b then d; reattach with: tmux attach -t keepgpu
+```
+
+Fallback when `tmux` is unavailable:
+
+```bash
+nohup keep-gpu --gpu-ids 0 --vram 1GiB --interval 300 > keepgpu.log 2>&1 &
+echo $! > keepgpu.pid
+# Monitor: tail -f keepgpu.log
+# Stop: kill "$(cat keepgpu.pid)"
+```
+
+## Troubleshooting
+
+- Invalid `--gpu-ids`: ensure comma-separated integers only.
+- Allocation failure / OOM: reduce `--vram` or free memory first.
+- No utilization telemetry: ensure `nvidia-ml-py` works and `nvidia-smi` is available.
+- No GPUs detected: verify drivers, the CUDA/ROCm runtime, and `torch.cuda.device_count()`.
+
+## Example
+
+User request: "Install KeepGPU from GitHub and keep GPU 0 alive while I preprocess."
+
+Suggested response shape:
+
+1. Install: `pip install "git+https://github.com/Wangmerlyn/KeepGPU.git"`
+2. Run: `keep-gpu --gpu-ids 0 --vram 1GiB --interval 60 --busy-threshold 25`
+3. Verify: check the CLI logs for keep-loop activity; stop with `Ctrl+C` when done.
+
+## Limitations
+
+- KeepGPU is not a scheduler; it only keeps already-accessible GPUs active.
+- KeepGPU's effectiveness depends on cluster policy; some schedulers require higher VRAM or tighter intervals.
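The `--vram` size formats and the legacy `--threshold` mapping described above can be sketched as a small parser. This is a hypothetical illustration, not KeepGPU's actual implementation; the helper names and the exact unit table are assumptions:

```python
import re

# Multipliers for size suffixes (assumption: decimal for MB/GB,
# binary for MiB/GiB, following common conventions).
_UNITS = {"b": 1, "kb": 10**3, "mb": 10**6, "gb": 10**9,
          "kib": 2**10, "mib": 2**20, "gib": 2**30}

def parse_vram(value: str) -> int:
    """Parse a --vram value like '512MB', '1GiB', or raw bytes into bytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]*)", value.strip())
    if not m:
        raise ValueError(f"unrecognized VRAM size: {value!r}")
    number, unit = float(m.group(1)), m.group(2).lower() or "b"
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in VRAM size: {value!r}")
    return int(number * _UNITS[unit])

def interpret_legacy_threshold(value: str) -> tuple:
    """Map a deprecated --threshold value onto the newer options:
    numeric -> busy threshold (percent); string with units -> VRAM bytes."""
    try:
        return ("busy-threshold", float(value))
    except ValueError:
        return ("vram", parse_vram(value))
```

Under these assumptions, `interpret_legacy_threshold("25")` yields `("busy-threshold", 25.0)` and `interpret_legacy_threshold("1GiB")` yields `("vram", 1073741824)`.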
diff --git a/skills/gpu-keepalive-with-keepgpu/agents/openai.yaml b/skills/gpu-keepalive-with-keepgpu/agents/openai.yaml
new file mode 100644
index 0000000..67fbe49
--- /dev/null
+++ b/skills/gpu-keepalive-with-keepgpu/agents/openai.yaml
@@ -0,0 +1,3 @@
+display_name: KeepGPU CLI Operator
+short_description: Install and operate keep-gpu CLI commands safely on shared GPU machines.
+default_prompt: Use this skill to install KeepGPU and build run-ready keep-gpu commands with sensible defaults, verification steps, and troubleshooting.