Merged
1 change: 1 addition & 0 deletions README.md
@@ -5,6 +5,7 @@
[![DOI](https://zenodo.org/badge/987167271.svg)](https://doi.org/10.5281/zenodo.17129114)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/Wangmerlyn/KeepGPU)
[![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/Wangmerlyn/KeepGPU)](https://coderabbit.ai/dashboard/gh/Wangmerlyn/KeepGPU)
[![SkillCheck Passed](https://raw.githubusercontent.com/olgasafonova/skillcheck-free/main/skill-check/passed.svg)](https://github.com/olgasafonova/skillcheck-free)

**Keep GPU** keeps shared GPUs from being reclaimed while you prep data, debug, or coordinate multi-stage pipelines. It allocates just enough VRAM and issues lightweight CUDA work so schedulers observe an “active” device—without running a full training job.

22 changes: 22 additions & 0 deletions docs/skills/gpu-keepalive-with-keepgpu/skillcheck-free-report.md
@@ -0,0 +1,22 @@
# SkillCheck Free Report: gpu-keepalive-with-keepgpu

Validation method: applied the SkillCheck Free rule set from `skill-check/SKILL.md` in <https://github.com/olgasafonova/skillcheck-free> after updating the skill to focus on KeepGPU CLI usage and installation instructions.

## Summary

- Critical: 0
- Warnings: 0
- Suggestions: 0

## Strengths detected

- Description includes activation triggers (8.3)
- Description includes negative triggers (8.7)
- Skill includes example section (8.1)
- Skill documents error handling or limitations (8.2)
- Skill uses structured instructions (8.5)
- Skill documents prerequisites (8.6)
Comment on lines +13 to +18

Contributor

medium

The "Strengths detected" section lists rule numbers (e.g., 8.3) without explaining what each rule signifies. To keep the report self-contained for readers who may not have access to `skill-check/SKILL.md`, consider adding a concise description of each strength. This would make the report more useful as a standalone document.


## Checked files

- `skills/gpu-keepalive-with-keepgpu/SKILL.md`
156 changes: 156 additions & 0 deletions skills/gpu-keepalive-with-keepgpu/SKILL.md
@@ -0,0 +1,156 @@
---
name: gpu-keepalive-with-keepgpu
description: Install and operate the KeepGPU CLI to keep reserved GPUs active during data prep, debugging, and orchestration downtime. Use when users ask for keep-gpu command construction, tuning (--vram, --interval, --busy-threshold), installation from this repository, or runtime troubleshooting of keep-gpu sessions; do not use for repository development, code refactoring, or unrelated Python tooling.
---

# KeepGPU CLI Operator

Use this workflow to run `keep-gpu` safely and effectively.

## Prerequisites

- Confirm at least one GPU is visible (`python -c "import torch; print(torch.cuda.device_count())"`).
- Run commands in a shell where CUDA/ROCm drivers are already available.
- Use `Ctrl+C` to stop KeepGPU and release memory cleanly.
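
The visibility check above can also be done without PyTorch. The sketch below is illustrative and not part of KeepGPU: it assumes an NVIDIA system where `nvidia-smi -L` prints one `GPU N: ...` line per device (ROCm systems would use `rocm-smi` instead).

```bash
# Illustrative preflight helper (not shipped with KeepGPU): count devices by
# parsing `nvidia-smi -L`-style output, where each device appears as "GPU N: ...".
count_gpus() {
  grep -c '^GPU' || true   # prints the number of matching lines; 0 if none
}

# Real usage would be: nvidia-smi -L | count_gpus
# Demonstration with captured sample output:
printf 'GPU 0: NVIDIA A100 (UUID: GPU-xxxx)\nGPU 1: NVIDIA A100 (UUID: GPU-yyyy)\n' | count_gpus
# prints: 2
```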

## Install KeepGPU

Install PyTorch first for your platform, then install KeepGPU.

### Option A: Install from package index

```bash
# CUDA example (change cu121 to your CUDA version)
pip install --index-url https://download.pytorch.org/whl/cu121 torch
pip install keep-gpu
```

```bash
# ROCm example (change rocm6.1 to your ROCm version)
pip install --index-url https://download.pytorch.org/whl/rocm6.1 torch
pip install "keep-gpu[rocm]"
```

### Option B: Install directly from Git URL (no local clone)

Prefer this option when users only need the CLI and do not plan to edit the source. It avoids the overhead of creating and later cleaning up a local checkout.

```bash
pip install "git+https://github.com/Wangmerlyn/KeepGPU.git"
```

If SSH access is configured:

```bash
pip install "git+ssh://git@github.com/Wangmerlyn/KeepGPU.git"
```

ROCm variant from Git URL:

```bash
pip install "keep-gpu[rocm] @ git+https://github.com/Wangmerlyn/KeepGPU.git"
```

### Option C: Install from a local source checkout (explicit path)

Use this option only when users already have a local checkout or plan to edit source.

```bash
git clone https://github.com/Wangmerlyn/KeepGPU.git
cd KeepGPU
pip install -e .
```

If the checkout already exists somewhere else, install by absolute path:

```bash
pip install -e /absolute/path/to/KeepGPU
```

For ROCm users from local checkout:

```bash
pip install -e ".[rocm]"
```

Verify installation:

```bash
keep-gpu --help
```

## Command model

CLI options to tune:

- `--gpu-ids`: comma-separated IDs (`0`, `0,1`). If omitted, KeepGPU uses all visible GPUs.
- `--vram`: VRAM to hold per GPU (`512MB`, `1GiB`, or raw bytes).
- `--interval`: seconds between keep-alive cycles.
- `--busy-threshold` (alias `--util-threshold`): if GPU utilization exceeds this percentage, KeepGPU backs off.
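
The accepted `--vram` forms can be normalized to bytes with a small helper. This is a caller-side sketch, assuming decimal megabytes for `MB` and binary gibibytes for `GiB`; KeepGPU's own parser may interpret units differently.

```bash
# Hypothetical size normalizer (not KeepGPU internals): converts the
# documented --vram forms (512MB, 1GiB, raw bytes) to a byte count.
vram_to_bytes() {
  case "$1" in
    *GiB) echo $(( ${1%GiB} * 1024 * 1024 * 1024 )) ;;      # binary gibibytes
    *MB)  echo $(( ${1%MB} * 1000 * 1000 )) ;;               # decimal megabytes
    *[!0-9]*|'') echo "unrecognized size: $1" >&2; return 1 ;;
    *)    echo "$1" ;;                                       # already raw bytes
  esac
}

vram_to_bytes 1GiB    # prints: 1073741824
vram_to_bytes 512MB   # prints: 512000000
```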

Legacy compatibility:

- `--threshold` is deprecated but still accepted.
- A numeric value (e.g. `25`) maps to the busy threshold.
- A size string (e.g. `1GiB`) maps to VRAM.
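
The legacy mapping can be expressed as a small dispatch rule. The function below is illustrative, not KeepGPU's implementation:

```bash
# Sketch of the documented legacy mapping: a purely numeric --threshold is
# treated as the busy threshold, any other string as a VRAM size.
classify_threshold() {
  case "$1" in
    ''|*[!0-9]*) echo "vram=$1" ;;           # e.g. --threshold 1GiB
    *)           echo "busy-threshold=$1" ;; # e.g. --threshold 25
  esac
}

classify_threshold 25    # prints: busy-threshold=25
classify_threshold 1GiB  # prints: vram=1GiB
```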

## Agent workflow

1. Collect workload intent: target GPU IDs, expected hold duration, and whether node is shared.
2. Choose safe defaults when unspecified: `--vram 1GiB`, an `--interval` of 60 to 120 seconds, and `--busy-threshold 25` on shared nodes.
3. Build one concrete command.
4. Provide stop instruction (`Ctrl+C`) and a verification step.
5. If command fails, provide one minimal troubleshooting command at a time.
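
The steps above can be sketched as a command builder that falls back to the shared-node defaults from step 2. The helper name and structure are hypothetical; only the flag names come from the documented CLI:

```bash
# Illustrative command builder (not part of KeepGPU): applies the shared-node
# defaults (--vram 1GiB, --interval 120, --busy-threshold 25) when unset.
build_keepgpu_cmd() {
  gpu_ids="$1"               # e.g. "0" or "0,1"; empty selects all GPUs
  vram="${2:-1GiB}"
  interval="${3:-120}"
  busy="${4:-25}"
  cmd="keep-gpu"
  [ -n "$gpu_ids" ] && cmd="$cmd --gpu-ids $gpu_ids"
  echo "$cmd --vram $vram --interval $interval --busy-threshold $busy"
}

build_keepgpu_cmd 0
# prints: keep-gpu --gpu-ids 0 --vram 1GiB --interval 120 --busy-threshold 25
```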

## Command templates

Single GPU while preprocessing:

```bash
keep-gpu --gpu-ids 0 --vram 1GiB --interval 60 --busy-threshold 25
```

All visible GPUs with lighter load:

```bash
keep-gpu --vram 512MB --interval 180
```

Remote sessions (preferred: `tmux` for visibility and control):

```bash
tmux new -s keepgpu
keep-gpu --gpu-ids 0 --vram 1GiB --interval 300
# Detach with Ctrl+b then d; reattach with: tmux attach -t keepgpu
```

Fallback when `tmux` is unavailable:

```bash
nohup keep-gpu --gpu-ids 0 --vram 1GiB --interval 300 > keepgpu.log 2>&1 &
echo $! > keepgpu.pid
# Monitor: tail -f keepgpu.log
# Stop: kill "$(cat keepgpu.pid)"
```

## Troubleshooting

- Invalid `--gpu-ids`: ensure comma-separated integers only.
- Allocation failure / OOM: reduce `--vram` or free memory first.
- No utilization telemetry: ensure `nvidia-ml-py` works and `nvidia-smi` is available.
- No GPUs detected: verify drivers, CUDA/ROCm runtime, and `torch.cuda.device_count()`.
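
For the first item, a quick format check can catch malformed `--gpu-ids` values before launching anything. This validator is illustrative only:

```bash
# Accept only comma-separated non-negative integers, e.g. "0" or "0,1,3".
valid_gpu_ids() {
  case "$1" in
    ''|*[!0-9,]*|*,,*|,*|*,) return 1 ;;  # reject empties, stray chars, bad commas
    *) return 0 ;;
  esac
}

valid_gpu_ids "0,1"  && echo ok    # prints: ok
valid_gpu_ids "0, 1" || echo bad   # prints: bad (spaces are not allowed)
```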

## Example

User request: "Install KeepGPU from GitHub and keep GPU 0 alive while I preprocess."

Suggested response shape:

1. Install: `pip install "git+https://github.com/Wangmerlyn/KeepGPU.git"`
2. Run: `keep-gpu --gpu-ids 0 --vram 1GiB --interval 60 --busy-threshold 25`
3. Verify: check CLI logs for keep loop activity; stop with `Ctrl+C` when done.

## Limitations

- KeepGPU is not a scheduler; it only keeps GPUs you already have access to looking active.
- Effectiveness depends on cluster policy; some schedulers only count a device as active above a larger VRAM footprint or with shorter keep-alive intervals.
3 changes: 3 additions & 0 deletions skills/gpu-keepalive-with-keepgpu/agents/openai.yaml
@@ -0,0 +1,3 @@
display_name: KeepGPU CLI Operator
short_description: Install and operate keep-gpu CLI commands safely on shared GPU machines.
default_prompt: Use this skill to install KeepGPU and build run-ready keep-gpu commands with sensible defaults, verification steps, and troubleshooting.