42 changes: 34 additions & 8 deletions README.md
@@ -20,7 +20,7 @@ On many clusters, idle GPUs are reaped or silently shared after a short grace pe
- **Polite** – Uses NVML to read utilization and backs off when the GPU is busy.
- **Portable** – Typer/Rich CLI for humans; Python API for orchestrators and notebooks.
- **Observable** – Structured logging and optional file logs for auditing what kept the GPU alive.
- **Power-aware** – Uses intervalled elementwise ops instead of heavy matmul floods to present “busy” utilization while keeping power and thermals lower (see `CudaGPUController._run_mat_batch` for the loop).
- **Power-aware** – Uses intervalled elementwise ops instead of heavy matmul floods to present “busy” utilization while keeping power and thermals lower (see `CudaGPUController._run_relu_batch` for the loop).
- **NVML-backed** – GPU telemetry comes from `nvidia-ml-py` (the `pynvml` module), with optional `rocm-smi` support when you install the `rocm` extra.
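
A minimal sketch of the idea behind the power-aware and NVML-backed bullets (illustrative only, not KeepGPU's actual loop; GPU 0, a 25% threshold, and a 60 s interval are assumptions for the example):

```python
# Illustrative keep-alive loop, NOT KeepGPU's implementation: allocate a tensor,
# read utilization via NVML, and run a cheap elementwise op when the GPU is idle.
import time
import pynvml
import torch

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)           # GPU 0 assumed
buf = torch.empty(256 * 1024 * 1024, device="cuda:0")   # ~1 GiB of float32 on GPU 0

while True:
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    if util < 25:                 # back off when the card is already busy
        torch.relu_(buf)          # light elementwise op instead of a matmul flood
        torch.cuda.synchronize()
    time.sleep(60)
```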

## Quick start (CLI)
@@ -30,6 +30,18 @@ pip install keep-gpu

# Hold GPU 0 with 1 GiB VRAM and throttle if utilization exceeds 25%
keep-gpu --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60

# Non-blocking mode for agent workflows (auto-starts local service)
keep-gpu start --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60
keep-gpu status
keep-gpu stop --all
keep-gpu service-stop
```

Open the dashboard while service mode is running:

```text
http://127.0.0.1:8765/
```

### Platform installs at a glance
@@ -52,10 +64,18 @@ keep-gpu --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60

Flags that matter:

- `--vram` (`1GiB`, `750MB`, or bytes): how much memory to pin.
- `--interval` (seconds): sleep between keep-alive bursts.
- `--busy-threshold`: skip work when NVML reports higher utilization.
- `--gpu-ids`: target a subset; otherwise all visible GPUs are guarded.
- Blocking mode knobs:
- `--vram` (`1GiB`, `750MB`, or bytes): how much memory to pin.
- `--interval` (seconds): sleep between keep-alive bursts.
- `--busy-threshold`: skip work when NVML reports higher utilization.
- `--gpu-ids`: target a subset; otherwise all visible GPUs are guarded.
- Service mode commands:
- `keep-gpu serve`: run local service (HTTP + dashboard).
- `keep-gpu start`: create keep session and return immediately.
- `keep-gpu status`: inspect active sessions.
- `keep-gpu stop --job-id <id>` or `keep-gpu stop --all`: release sessions.
- `keep-gpu service-stop`: stop auto-started local daemon.
- `keep-gpu list-gpus`: fetch telemetry from local service.
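
The size strings accepted by `--vram` map to bytes roughly as in this hedged sketch (illustrative only, not the project's actual parser):

```python
# Hedged sketch of how size strings such as "1GiB" or "750MB" map to bytes.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9,
         "KIB": 2**10, "MIB": 2**20, "GIB": 2**30}

def parse_vram(value: str) -> int:
    v = value.strip().upper()
    # Check longer suffixes first so "GiB" is not mistaken for "B"/"GB".
    for suffix, factor in sorted(UNITS.items(), key=lambda kv: -len(kv[0])):
        if v.endswith(suffix):
            return int(float(v[: -len(suffix)]) * factor)
    return int(v)  # plain byte count, e.g. "1073741824"

print(parse_vram("1GiB"))   # 1073741824
print(parse_vram("750MB"))  # 750000000
```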

## Embed in Python
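
A minimal sketch of the context-manager usage (the import path is an assumption, not confirmed by the docs; constructor arguments follow the documented call):

```python
# Hedged sketch: the import path below is assumed.
from keep_gpu import GlobalGPUController

def prepare_dataset():
    ...  # placeholder for your own preprocessing

# Hold GPUs 0 and 1 while preprocessing runs; VRAM is released when the block exits.
with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=90):
    prepare_dataset()
```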

@@ -91,21 +111,27 @@ with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=90, busy
- ROCm-only tests carry `@pytest.mark.rocm`; run with `pytest --run-rocm tests/rocm_controller`.
- Markers: `rocm` (needs ROCm stack) and `large_memory` (opt-in locally).

### MCP endpoint (experimental)
### MCP and service API

- Start a simple JSON-RPC server on stdin/stdout (default):
```bash
keep-gpu-mcp-server
```
- Or expose it over HTTP (JSON-RPC 2.0 by way of POST):
- Or expose it over HTTP (JSON-RPC + REST + dashboard):
```bash
keep-gpu-mcp-server --mode http --host 0.0.0.0 --port 8765
```
- Example request (one per line):
- JSON-RPC request example:
```json
{"id": 1, "method": "start_keep", "params": {"gpu_ids": [0], "vram": "512MB", "interval": 60, "busy_threshold": 20}}
```
- REST examples:
```bash
curl http://127.0.0.1:8765/health
curl http://127.0.0.1:8765/api/sessions
```
- Methods: `start_keep`, `stop_keep` (optional `job_id`, default stops all), `status` (optional `job_id`), `list_gpus` (basic info).
- Dashboard: `http://127.0.0.1:8765/`
- Minimal client config (stdio MCP):
```yaml
servers:
18 changes: 18 additions & 0 deletions docs/getting-started.md
@@ -86,6 +86,24 @@ keep-gpu --interval 120 --gpu-ids 0 --vram 1GiB
Leave the command running while you prepare data or review notebooks. When you are
ready to hand the GPU back, hit `Ctrl+C`—controllers will release VRAM and exit.

## Non-blocking workflow for agents

Use service mode when you need the terminal for follow-up commands:

```bash
keep-gpu start --gpu-ids 0 --interval 120 --vram 1GiB --busy-threshold 25
keep-gpu status
```

You can inspect and control sessions in a browser by starting the service and
opening:

```bash
keep-gpu serve --host 127.0.0.1 --port 8765
```

`http://127.0.0.1:8765/`
Contributor comment (medium):

For consistency with other documentation files (README.md, guides/cli.md, etc.), it would be better to wrap this URL in a text code block. This will ensure uniform rendering of URLs across the documentation.

Suggested change:
`http://127.0.0.1:8765/`
http://127.0.0.1:8765/


## KeepGPU inside Python

Prefer code-level control? Import the controllers directly (full recipes in
129 changes: 68 additions & 61 deletions docs/guides/cli.md
@@ -1,100 +1,107 @@
# CLI Playbook

Practical examples for running `keep-gpu` on shared clusters, workstations, and
Jupyter environments.
KeepGPU now supports two operational styles:

## Command anatomy
- **Blocking mode** (`keep-gpu ...`) for traditional shell workflows.
- **Service mode** (`keep-gpu start/status/stop`) for agent workflows that must continue after arming keep-alive.

## 1) Blocking mode (compatibility)

```bash
keep-gpu --interval 120 --gpu-ids 0,1 --vram 2GiB --busy-threshold 25
```

| Flag | Meaning | Default |
| --- | --- | --- |
| `--interval` | Sleep between keep-alive cycles (seconds). Lower = tighter lock. | `300` |
| `--gpu-ids` | Comma-separated visible IDs. Leave unset to keep every detected GPU busy. | all |
| `--vram` | Amount of memory each controller allocates. Accepts `800MB`, `1GiB`, `1073741824`, etc. | `1GiB` |
| `--busy-threshold` | Skip work when utilization is already above this percentage (`--threshold` still works for legacy scripts). | `-1` (never skip) |
This command blocks until you press `Ctrl+C`.

## 2) Non-blocking service mode (recommended for agents)

!!! note "Still using `--threshold`?"
Values passed to `--threshold` are auto-detected: numbers override
`--busy-threshold`, while strings such as `1GiB` override `--vram`. Prefer the explicit
flags going forward, but old commands continue to run.
### Start a keep session

!!! info "What happens under the hood?"
Each GPU gets a `CudaGPUController` that allocates one tensor sized by
`--vram` and runs a lightweight matmul loop. Controllers use NVML
(`nvidia-ml-py` / `pynvml` module) to read utilization so they back off when a device is already
busy (see `--busy-threshold`).
```bash
keep-gpu start --gpu-ids 0 --vram 1GiB --interval 60 --busy-threshold 25
```

## Scenarios
`start` auto-starts the local service if needed and returns immediately with a
`job_id`. The command also prints:

### 1. Hold a single GPU while preprocessing
- dashboard URL (`http://<host>:<port>/`),
- follow-up status/stop command hints,
- daemon shutdown hint (`keep-gpu service-stop`).

### Check status

```bash
keep-gpu --gpu-ids 0 --interval 60 --vram 2GiB
keep-gpu status
keep-gpu status --job-id <job_id>
```

- Keeps card `0` alive with moderate VRAM allocation.
- Suitable for local experiments where you just need the scheduler to see
sustained activity.
### Stop sessions

```bash
keep-gpu stop --job-id <job_id>
keep-gpu stop --all
```

### 2. Park every GPU on the node overnight
### Stop local daemon

```bash
keep-gpu --interval 180 --vram 512MB
keep-gpu service-stop
```

- Launch inside `tmux`/`screen` or your cluster’s “interactive session.”
- Use a long interval to reduce background load while still holding the cards.
If sessions are still active, stop them first or use `--force`.

### 3. Share the node without starving teammates
### List telemetry

```bash
keep-gpu --gpu-ids 0,1 --interval 90 --busy-threshold 35
keep-gpu list-gpus
```

- Controllers pause their work whenever utilization exceeds 35%.
- Lets you reserve GPUs 0 and 1 while GPUs 2+ remain untouched.

### 4. Run from Jupyter or VS Code terminal
### Run service explicitly

```bash
!keep-gpu --interval 45 --vram 768MB --busy-threshold 50
keep-gpu serve --host 127.0.0.1 --port 8765
```

- Prefix with `!` (Jupyter) or use the integrated terminal.
- Stop with `Ctrl+C` when you are ready to start the actual CUDA workload.
## 3) Dashboard UI

### 5. Background job by way of scheduler
When service mode is running, open:

```bash
nohup keep-gpu --interval 300 --gpu-ids 0 --vram 1GiB > keepgpu.log 2>&1 &
```text
http://127.0.0.1:8765/
```

- `nohup` keeps the process alive after you disconnect.
- Combine with your cluster’s reservation commands (for example, `srun`, `bsub`, `qsub`).
The dashboard provides:

- live GPU memory/utilization telemetry,
- active keep sessions,
- session creation form,
- single-session and stop-all controls.
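
The data backing the dashboard can also be fetched from the service's REST endpoints (paths as documented in the README):

```bash
# Same information the dashboard renders, fetched from the local service
curl http://127.0.0.1:8765/health        # service liveness
curl http://127.0.0.1:8765/api/sessions  # active keep sessions
```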

## Observability and safety
## Command knobs

| Option | Meaning | Default |
| --- | --- | --- |
| `--gpu-ids` | Comma-separated GPU IDs. Omit to use all visible devices. | all |
| `--vram` | Per-GPU memory target (`512MB`, `1GiB`, bytes). | `1GiB` |
| `--interval` | Seconds between keep-alive cycles. | `300` |
| `--busy-threshold` / `--util-threshold` | Back off when utilization exceeds this value. | `-1` |

- **Logging levels** – Set `CONSOLE_LOG_LEVEL=DEBUG` or `FILE_LOG_LEVEL=INFO`
to capture detailed metrics. Logs land in `logs/*.log`.
- **VRAM tuning** – Start with 1–2 GiB. Some schedulers only inspect “memory in
use,” so going below 500 MB may not fool them.
- **Graceful exit** – Use `Ctrl+C`; KeepGPU releases tensors and clears the CUDA
cache so the next job starts with a clean slate.
- **Failure recovery** – If allocation fails (for example, VRAM already full), the CLI
retries after `--interval` and logs the error. Adjust `--vram` downwards or
free memory manually.
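
For example, to raise verbosity for a single run (the variables are read from the environment as described above):

```bash
# Verbose console output plus INFO-level records in logs/*.log for this run only
CONSOLE_LOG_LEVEL=DEBUG FILE_LOG_LEVEL=INFO keep-gpu --gpu-ids 0 --vram 1GiB --interval 120
```
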
## Remote sessions

Preferred workflow for remote shells:

```bash
tmux new -s keepgpu
keep-gpu start --gpu-ids 0 --vram 1GiB --interval 120 --busy-threshold 25
```

Ready to embed this behavior inside your training scripts? Head to the
[Python API Recipes](python.md) next.
Then run follow-up commands in the same shell (non-blocking), or monitor via `keep-gpu status`.
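
For a continuously refreshing view (assuming the standard `watch` utility is available on the host):

```bash
# Refresh the session table every 30 seconds; Ctrl+C to stop watching
watch -n 30 keep-gpu status
```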

## FAQ
## Troubleshooting

- **NVML not available?** KeepGPU falls back to assuming 0% utilization. Install
`nvidia-ml-py` (provides `pynvml`), ensure `nvidia-smi` works, and rerun. On
managed clusters, load the CUDA/NVML module first.
- **Running under a scheduler?** Use the background recipe above (`nohup ... &`)
or wrap `keep-gpu` in your scheduler’s job script so it stays tied to the
reservation. Increase `--interval` if you want lower background load.
- **`--gpu-ids` parse error**: use only comma-separated integers (`0,1`).
- **Start cannot reach service**: run `keep-gpu serve --host 127.0.0.1 --port 8765`.
- **Need to close background service**: run `keep-gpu stop --all` first, then `keep-gpu service-stop` (or use `keep-gpu service-stop --force`).
- **OOM during keep**: reduce `--vram` or free GPU memory before starting.
- **No utilization data**: ensure `nvidia-ml-py` works and `nvidia-smi` is available.
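
A quick way to confirm the NVML binding and driver are usable (standard `pynvml` and `nvidia-smi` calls):

```bash
# Print the driver version through pynvml, then query utilization via nvidia-smi
python -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlSystemGetDriverVersion())"
nvidia-smi --query-gpu=utilization.gpu --format=csv
```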