[cli,mcp,dashboard,docs,skills] feat: add non-blocking service workflow and UI #66
Merged
7 commits:
- `135a442` feat(cli): add non-blocking service workflow and dashboard (Wangmerlyn)
- `6e57ef5` feat(cli): improve service lifecycle UX and dashboard polish (Wangmerlyn)
- `8435280` fix(cli): harden service errors and session release UX (Wangmerlyn)
- `b8d7f42` fix(cli): make stop-all resilient to service timeouts (Wangmerlyn)
- `34bceb4` fix(controller): make keep loops interruptible for fast stop (Wangmerlyn)
- `a15315b` fix(mcp): harden service API and stop handling (Wangmerlyn)
- `641a56b` chore(ci): retrigger readthedocs build (Wangmerlyn)
# CLI Playbook

KeepGPU now supports two operational styles:

- **Blocking mode** (`keep-gpu ...`) for traditional shell workflows.
- **Service mode** (`keep-gpu start/status/stop`) for agent workflows that must continue after arming keep-alive.
## 1) Blocking mode (compatibility)

```bash
keep-gpu --interval 120 --gpu-ids 0,1 --vram 2GiB --busy-threshold 25
```

| Flag | Meaning | Default |
| --- | --- | --- |
| `--interval` | Sleep between keep-alive cycles (seconds). Lower = tighter lock. | `300` |
| `--gpu-ids` | Comma-separated visible IDs. Leave unset to keep every detected GPU busy. | all |
| `--vram` | Amount of memory each controller allocates. Accepts `800MB`, `1GiB`, `1073741824`, etc. | `1GiB` |
| `--busy-threshold` | Skip work when utilization is already above this percentage (`--threshold` still works for legacy scripts). | `-1` (never skip) |

This command blocks until you press `Ctrl+C`.
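The size strings accepted by `--vram` mix binary and decimal suffixes. The helper below is illustrative only (it is not part of keep-gpu) and mirrors the formats listed in the table above, assuming `GiB`/`MiB` are binary multiples and `GB`/`MB` are decimal:

```shell
# Illustrative only (not part of keep-gpu): convert a --vram style size
# string (MB/MiB, GB/GiB, or raw bytes) to a byte count.
to_bytes() {
  case "$1" in
    *GiB) echo $(( ${1%GiB} * 1024 * 1024 * 1024 )) ;;
    *MiB) echo $(( ${1%MiB} * 1024 * 1024 )) ;;
    *GB)  echo $(( ${1%GB} * 1000 * 1000 * 1000 )) ;;
    *MB)  echo $(( ${1%MB} * 1000 * 1000 )) ;;
    *)    echo "$1" ;;  # already a raw byte count
  esac
}

to_bytes 1GiB    # prints 1073741824
to_bytes 800MB   # prints 800000000
```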
## 2) Non-blocking service mode (recommended for agents)

### Start a keep session
```bash
keep-gpu start --gpu-ids 0 --vram 1GiB --interval 60 --busy-threshold 25
```

`start` auto-starts the local service if needed and returns immediately with a
`job_id`. The command also prints:
- dashboard URL (`http://<host>:<port>/`),
- follow-up status/stop command hints,
- daemon shutdown hint (`keep-gpu service-stop`).
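Scripts that want to reuse the printed `job_id` can parse it out of the `start` output. The parser below is a hypothetical sketch: it assumes the id appears on a line like `job_id: abc123`, which may not match the exact output of your keep-gpu version, so verify the pattern before relying on it.

```shell
# Hypothetical parser (not part of keep-gpu): pull a job_id out of text
# piped in on stdin, assuming a line of the form "job_id: <id>".
extract_job_id() {
  sed -n 's/^.*job_id[:=][[:space:]]*\([A-Za-z0-9_-]*\).*$/\1/p' | head -n 1
}

# Demo with canned output; in practice: keep-gpu start ... | extract_job_id
printf 'Started keep session\njob_id: abc123\nDashboard: http://127.0.0.1:8765/\n' \
  | extract_job_id    # prints abc123
```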
### Check status
```bash
keep-gpu status
keep-gpu status --job-id <job_id>
```
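Because `start` returns immediately, agent scripts often poll `status` until the session is visible. A generic retry wrapper keeps that loop tidy; the helper below is a sketch (not shipped with keep-gpu), and the suggested usage assumes `keep-gpu status --job-id` exits nonzero for unknown jobs, which you should confirm for your version.

```shell
# Generic retry helper (not part of keep-gpu): rerun a command until it
# succeeds or the attempt budget is exhausted.
retry() {
  max="$1"; shift
  i=1
  while ! "$@"; do
    [ "$i" -ge "$max" ] && return 1
    i=$((i + 1))
    sleep 1
  done
}

# Assumed usage, if `status --job-id` fails for unknown jobs:
#   retry 5 keep-gpu status --job-id <job_id>
retry 3 true && echo ready   # prints ready
```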
### Stop sessions
```bash
keep-gpu stop --job-id <job_id>
keep-gpu stop --all
```

### Stop local daemon
```bash
keep-gpu service-stop
```

If sessions are still active, stop them first or use `--force`.
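The "stop gracefully first, force only on failure" advice above can be scripted with a small wrapper. This is a generic sketch, not part of keep-gpu, and it assumes the command in question fails with a nonzero exit code when a graceful stop is refused:

```shell
# Generic "graceful, then force" pattern (sketch, not shipped with
# keep-gpu): run a command, and rerun it with --force only if it fails.
graceful_then_force() {
  "$@" || "$@" --force
}

# Assumed usage: graceful_then_force keep-gpu service-stop
graceful_then_force true && echo stopped   # prints stopped
```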
### List telemetry

```bash
keep-gpu list-gpus
```
### Run service explicitly

```bash
keep-gpu serve --host 127.0.0.1 --port 8765
```
## 3) Dashboard UI

When service mode is running, open:

```text
http://127.0.0.1:8765/
```
The dashboard provides:

- live GPU memory/utilization telemetry,
- active keep sessions,
- a session creation form,
- single-session and stop-all controls.

## Command knobs
| Option | Meaning | Default |
| --- | --- | --- |
| `--gpu-ids` | Comma-separated GPU IDs. Omit to use all visible devices. | all |
| `--vram` | Per-GPU memory target (`512MB`, `1GiB`, bytes). | `1GiB` |
| `--interval` | Seconds between keep-alive cycles. | `300` |
| `--busy-threshold` / `--util-threshold` | Back off when utilization exceeds this value. | `-1` |
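Since `--gpu-ids` accepts only comma-separated integers (a malformed value is a parse error, see Troubleshooting), a pre-flight check can catch bad input before it reaches the CLI. The helper below is illustrative only, not part of keep-gpu:

```shell
# Illustrative pre-flight check (not part of keep-gpu): accept only
# comma-separated integers such as "0" or "0,1,3".
valid_gpu_ids() {
  printf '%s' "$1" | grep -Eq '^[0-9]+(,[0-9]+)*$'
}

valid_gpu_ids "0,1"  && echo "ok"        # prints ok
valid_gpu_ids "0, 1" || echo "rejected"  # prints rejected (space not allowed)
```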
## Remote sessions

Preferred workflow for remote shells:
```bash
tmux new -s keepgpu
keep-gpu start --gpu-ids 0 --vram 1GiB --interval 120 --busy-threshold 25
```

Then run follow-up commands in the same shell (non-blocking), or monitor with
`keep-gpu status`.
## Troubleshooting

- **`--gpu-ids` parse error**: use only comma-separated integers (`0,1`).
- **Start cannot reach service**: run `keep-gpu serve --host 127.0.0.1 --port 8765`.
- **Need to close the background service**: run `keep-gpu stop --all` first, then `keep-gpu service-stop` (or use `keep-gpu service-stop --force`).
- **OOM during keep**: reduce `--vram` or free GPU memory before starting.
- **No utilization data**: ensure `nvidia-ml-py` works and `nvidia-smi` is available.
Review comment: For consistency with other documentation files (`README.md`, `guides/cli.md`, etc.), it would be better to wrap this URL in a `text` code block. This will ensure uniform rendering of URLs across the documentation.