42 changes: 34 additions & 8 deletions README.md
@@ -20,7 +20,7 @@ On many clusters, idle GPUs are reaped or silently shared after a short grace pe
- **Polite** – Uses NVML to read utilization and backs off when the GPU is busy.
- **Portable** – Typer/Rich CLI for humans; Python API for orchestrators and notebooks.
- **Observable** – Structured logging and optional file logs for auditing what kept the GPU alive.
- **Power-aware** – Uses intervalled elementwise ops instead of heavy matmul floods to present “busy” utilization while keeping power and thermals lower (see `CudaGPUController._run_mat_batch` for the loop).
- **Power-aware** – Uses intervalled elementwise ops instead of heavy matmul floods to present “busy” utilization while keeping power and thermals lower (see `CudaGPUController._run_relu_batch` for the loop).
- **NVML-backed** – GPU telemetry comes from `nvidia-ml-py` (the `pynvml` module), with optional `rocm-smi` support when you install the `rocm` extra.
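
A minimal sketch of the idea behind the power-aware and NVML-backed bullets (illustrative only, not KeepGPU's actual loop; GPU 0, a 25% threshold, and a 60 s interval are assumptions for the example):

```python
# Illustrative keep-alive loop, NOT KeepGPU's implementation: allocate a tensor,
# read utilization via NVML, and run a cheap elementwise op when the GPU is idle.
import time
import pynvml
import torch

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)           # GPU 0 assumed
buf = torch.empty(256 * 1024 * 1024, device="cuda:0")   # ~1 GiB of float32 on GPU 0

while True:
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    if util < 25:                 # back off when the card is already busy
        torch.relu_(buf)          # light elementwise op instead of a matmul flood
        torch.cuda.synchronize()
    time.sleep(60)
```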

## Quick start (CLI)
@@ -30,6 +30,18 @@ pip install keep-gpu

# Hold GPU 0 with 1 GiB VRAM and throttle if utilization exceeds 25%
keep-gpu --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60

# Non-blocking mode for agent workflows (auto-starts local service)
keep-gpu start --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60
keep-gpu status
keep-gpu stop --all
keep-gpu service-stop
```

Open the dashboard while service mode is running:

```text
http://127.0.0.1:8765/
```

### Platform installs at a glance
@@ -52,10 +64,18 @@ keep-gpu --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60

Flags that matter:

- `--vram` (`1GiB`, `750MB`, or bytes): how much memory to pin.
- `--interval` (seconds): sleep between keep-alive bursts.
- `--busy-threshold`: skip work when NVML reports higher utilization.
- `--gpu-ids`: target a subset; otherwise all visible GPUs are guarded.
- Blocking mode knobs:
- `--vram` (`1GiB`, `750MB`, or bytes): how much memory to pin.
- `--interval` (seconds): sleep between keep-alive bursts.
- `--busy-threshold`: skip work when NVML reports higher utilization.
- `--gpu-ids`: target a subset; otherwise all visible GPUs are guarded.
- Service mode commands:
- `keep-gpu serve`: run local service (HTTP + dashboard).
- `keep-gpu start`: create keep session and return immediately.
- `keep-gpu status`: inspect active sessions.
- `keep-gpu stop --job-id <id>` or `keep-gpu stop --all`: release sessions.
- `keep-gpu service-stop`: stop auto-started local daemon.
- `keep-gpu list-gpus`: fetch telemetry from local service.
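
The size strings accepted by `--vram` map to bytes roughly as in this hedged sketch (illustrative only, not the project's actual parser):

```python
# Hedged sketch of how size strings such as "1GiB" or "750MB" map to bytes.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9,
         "KIB": 2**10, "MIB": 2**20, "GIB": 2**30}

def parse_vram(value: str) -> int:
    v = value.strip().upper()
    # Check longer suffixes first so "GiB" is not mistaken for "B"/"GB".
    for suffix, factor in sorted(UNITS.items(), key=lambda kv: -len(kv[0])):
        if v.endswith(suffix):
            return int(float(v[: -len(suffix)]) * factor)
    return int(v)  # plain byte count, e.g. "1073741824"

print(parse_vram("1GiB"))   # 1073741824
print(parse_vram("750MB"))  # 750000000
```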

## Embed in Python
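
A minimal sketch of the context-manager usage (the import path is an assumption, not confirmed by the docs; constructor arguments follow the documented call):

```python
# Hedged sketch: the import path below is assumed.
from keep_gpu import GlobalGPUController

def prepare_dataset():
    ...  # placeholder for your own preprocessing

# Hold GPUs 0 and 1 while preprocessing runs; VRAM is released when the block exits.
with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=90):
    prepare_dataset()
```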

@@ -91,21 +111,27 @@ with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=90, busy
- ROCm-only tests carry `@pytest.mark.rocm`; run with `pytest --run-rocm tests/rocm_controller`.
- Markers: `rocm` (needs ROCm stack) and `large_memory` (opt-in locally).

### MCP endpoint (experimental)
### MCP and service API

- Start a simple JSON-RPC server on stdin/stdout (default):
```bash
keep-gpu-mcp-server
```
- Or expose it over HTTP (JSON-RPC 2.0 by way of POST):
- Or expose it over HTTP (JSON-RPC + REST + dashboard):
```bash
keep-gpu-mcp-server --mode http --host 0.0.0.0 --port 8765
```
- Example request (one per line):
- JSON-RPC request example:
```json
{"id": 1, "method": "start_keep", "params": {"gpu_ids": [0], "vram": "512MB", "interval": 60, "busy_threshold": 20}}
```
- REST examples:
```bash
curl http://127.0.0.1:8765/health
curl http://127.0.0.1:8765/api/sessions
```
- Methods: `start_keep`, `stop_keep` (optional `job_id`, default stops all), `status` (optional `job_id`), `list_gpus` (basic info).
- Dashboard: `http://127.0.0.1:8765/`
- Minimal client config (stdio MCP):
```yaml
servers:
18 changes: 18 additions & 0 deletions docs/getting-started.md
@@ -86,6 +86,24 @@ keep-gpu --interval 120 --gpu-ids 0 --vram 1GiB
Leave the command running while you prepare data or review notebooks. When you are
ready to hand the GPU back, hit `Ctrl+C`—controllers will release VRAM and exit.

## Non-blocking workflow for agents

Use service mode when you need the terminal for follow-up commands:

```bash
keep-gpu start --gpu-ids 0 --interval 120 --vram 1GiB --busy-threshold 25
keep-gpu status
```

You can inspect and control sessions in a browser by starting the service and
opening:

```bash
keep-gpu serve --host 127.0.0.1 --port 8765
```

`http://127.0.0.1:8765/`
Contributor comment (medium):

For consistency with other documentation files (README.md, guides/cli.md, etc.), it would be better to wrap this URL in a text code block. This will ensure uniform rendering of URLs across the documentation.

Suggested change:
`http://127.0.0.1:8765/`
http://127.0.0.1:8765/


## KeepGPU inside Python

Prefer code-level control? Import the controllers directly (full recipes in
129 changes: 68 additions & 61 deletions docs/guides/cli.md
@@ -1,100 +1,107 @@
# CLI Playbook

Practical examples for running `keep-gpu` on shared clusters, workstations, and
Jupyter environments.
KeepGPU now supports two operational styles:

## Command anatomy
- **Blocking mode** (`keep-gpu ...`) for traditional shell workflows.
- **Service mode** (`keep-gpu start/status/stop`) for agent workflows that must continue after arming keep-alive.

## 1) Blocking mode (compatibility)

```bash
keep-gpu --interval 120 --gpu-ids 0,1 --vram 2GiB --busy-threshold 25
```

| Flag | Meaning | Default |
| --- | --- | --- |
| `--interval` | Sleep between keep-alive cycles (seconds). Lower = tighter lock. | `300` |
| `--gpu-ids` | Comma-separated visible IDs. Leave unset to keep every detected GPU busy. | all |
| `--vram` | Amount of memory each controller allocates. Accepts `800MB`, `1GiB`, `1073741824`, etc. | `1GiB` |
| `--busy-threshold` | Skip work when utilization is already above this percentage (`--threshold` still works for legacy scripts). | `-1` (never skip) |
This command blocks until you press `Ctrl+C`.

## 2) Non-blocking service mode (recommended for agents)

!!! note "Still using `--threshold`?"
Values passed to `--threshold` are auto-detected: numbers override
`--busy-threshold`, while strings such as `1GiB` override `--vram`. Prefer the explicit
flags going forward, but old commands continue to run.
### Start a keep session

!!! info "What happens under the hood?"
Each GPU gets a `CudaGPUController` that allocates one tensor sized by
`--vram` and runs a lightweight matmul loop. Controllers use NVML
(`nvidia-ml-py` / `pynvml` module) to read utilization so they back off when a device is already
busy (see `--busy-threshold`).
```bash
keep-gpu start --gpu-ids 0 --vram 1GiB --interval 60 --busy-threshold 25
```

## Scenarios
`start` auto-starts the local service if needed and returns immediately with a
`job_id`. The command also prints:

### 1. Hold a single GPU while preprocessing
- dashboard URL (`http://<host>:<port>/`),
- follow-up status/stop command hints,
- daemon shutdown hint (`keep-gpu service-stop`).

### Check status

```bash
keep-gpu --gpu-ids 0 --interval 60 --vram 2GiB
keep-gpu status
keep-gpu status --job-id <job_id>
```

- Keeps card `0` alive with moderate VRAM allocation.
- Suitable for local experiments where you just need the scheduler to see
sustained activity.
### Stop sessions

```bash
keep-gpu stop --job-id <job_id>
keep-gpu stop --all
```

### 2. Park every GPU on the node overnight
### Stop local daemon

```bash
keep-gpu --interval 180 --vram 512MB
keep-gpu service-stop
```

- Launch inside `tmux`/`screen` or your cluster’s “interactive session.”
- Use a long interval to reduce background load while still holding the cards.
If sessions are still active, stop them first or use `--force`.

### 3. Share the node without starving teammates
### List telemetry

```bash
keep-gpu --gpu-ids 0,1 --interval 90 --busy-threshold 35
keep-gpu list-gpus
```

- Controllers pause their work whenever utilization exceeds 35%.
- Lets you reserve GPUs 0 and 1 while GPUs 2+ remain untouched.

### 4. Run from Jupyter or VS Code terminal
### Run service explicitly

```bash
!keep-gpu --interval 45 --vram 768MB --busy-threshold 50
keep-gpu serve --host 127.0.0.1 --port 8765
```

- Prefix with `!` (Jupyter) or use the integrated terminal.
- Stop with `Ctrl+C` when you are ready to start the actual CUDA workload.
## 3) Dashboard UI

### 5. Background job by way of scheduler
When service mode is running, open:

```bash
nohup keep-gpu --interval 300 --gpu-ids 0 --vram 1GiB > keepgpu.log 2>&1 &
```text
http://127.0.0.1:8765/
```

- `nohup` keeps the process alive after you disconnect.
- Combine with your cluster’s reservation commands (for example, `srun`, `bsub`, `qsub`).
The dashboard provides:

- live GPU memory/utilization telemetry,
- active keep sessions,
- session creation form,
- single-session and stop-all controls.
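
The data backing the dashboard can also be fetched from the service's REST endpoints (paths as documented in the README):

```bash
# Same information the dashboard renders, fetched from the local service
curl http://127.0.0.1:8765/health        # service liveness
curl http://127.0.0.1:8765/api/sessions  # active keep sessions
```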

## Observability and safety
## Command knobs

| Option | Meaning | Default |
| --- | --- | --- |
| `--gpu-ids` | Comma-separated GPU IDs. Omit to use all visible devices. | all |
| `--vram` | Per-GPU memory target (`512MB`, `1GiB`, bytes). | `1GiB` |
| `--interval` | Seconds between keep-alive cycles. | `300` |
| `--busy-threshold` / `--util-threshold` | Back off when utilization exceeds this value. | `-1` |

- **Logging levels** – Set `CONSOLE_LOG_LEVEL=DEBUG` or `FILE_LOG_LEVEL=INFO`
to capture detailed metrics. Logs land in `logs/*.log`.
- **VRAM tuning** – Start with 1–2 GiB. Some schedulers only inspect “memory in
use,” so going below 500 MB may not fool them.
- **Graceful exit** – Use `Ctrl+C`; KeepGPU releases tensors and clears the CUDA
cache so the next job starts with a clean slate.
- **Failure recovery** – If allocation fails (for example, VRAM already full), the CLI
retries after `--interval` and logs the error. Adjust `--vram` downwards or
free memory manually.
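
For example, to raise verbosity for a single run (the variables are read from the environment as described above):

```bash
# Verbose console output plus INFO-level records in logs/*.log for this run only
CONSOLE_LOG_LEVEL=DEBUG FILE_LOG_LEVEL=INFO keep-gpu --gpu-ids 0 --vram 1GiB --interval 120
```
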
## Remote sessions

Preferred workflow for remote shells:

```bash
tmux new -s keepgpu
keep-gpu start --gpu-ids 0 --vram 1GiB --interval 120 --busy-threshold 25
```

Ready to embed this behavior inside your training scripts? Head to the
[Python API Recipes](python.md) next.
Then run follow-up commands in the same shell (non-blocking), or monitor via `keep-gpu status`.
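
For a continuously refreshing view (assuming the standard `watch` utility is available on the host):

```bash
# Refresh the session table every 30 seconds; Ctrl+C to stop watching
watch -n 30 keep-gpu status
```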

## FAQ
## Troubleshooting

- **NVML not available?** KeepGPU falls back to assuming 0% utilization. Install
`nvidia-ml-py` (provides `pynvml`), ensure `nvidia-smi` works, and rerun. On
managed clusters, load the CUDA/NVML module first.
- **Running under a scheduler?** Use the background recipe above (`nohup ... &`)
or wrap `keep-gpu` in your scheduler’s job script so it stays tied to the
reservation. Increase `--interval` if you want lower background load.
- **`--gpu-ids` parse error**: use only comma-separated integers (`0,1`).
- **Start cannot reach service**: run `keep-gpu serve --host 127.0.0.1 --port 8765`.
- **Need to close background service**: run `keep-gpu stop --all` first, then `keep-gpu service-stop` (or use `keep-gpu service-stop --force`).
- **OOM during keep**: reduce `--vram` or free GPU memory before starting.
- **No utilization data**: ensure `nvidia-ml-py` works and `nvidia-smi` is available.
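
A quick way to confirm the NVML binding and driver are usable (standard `pynvml` and `nvidia-smi` calls):

```bash
# Print the driver version through pynvml, then query utilization via nvidia-smi
python -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlSystemGetDriverVersion())"
nvidia-smi --query-gpu=utilization.gpu --format=csv
```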