From dec7a340035d8bc2c3be0ab1f6754d260b116bd4 Mon Sep 17 00:00:00 2001 From: Wang Siyuan Date: Thu, 5 Mar 2026 16:18:55 +0800 Subject: [PATCH] chore(git): remove planning files from git tracking --- .gitignore | 8 +++ docs/plans/cli-service-dashboard/findings.md | 72 ------------------- docs/plans/cli-service-dashboard/progress.md | 59 --------------- docs/plans/cli-service-dashboard/task_plan.md | 48 ------------- 4 files changed, 8 insertions(+), 179 deletions(-) delete mode 100644 docs/plans/cli-service-dashboard/findings.md delete mode 100644 docs/plans/cli-service-dashboard/progress.md delete mode 100644 docs/plans/cli-service-dashboard/task_plan.md diff --git a/.gitignore b/.gitignore index 68860dd..4c8a2ec 100644 --- a/.gitignore +++ b/.gitignore @@ -147,9 +147,17 @@ venv.bak/ # mkdocs documentation /site + +# Planning files (created by planning-with-files skill) /task_plan.md /findings.md /progress.md +task_plan.md +findings.md +progress.md +**/task_plan.md +**/findings.md +**/progress.md # mypy .mypy_cache/ diff --git a/docs/plans/cli-service-dashboard/findings.md b/docs/plans/cli-service-dashboard/findings.md deleted file mode 100644 index 872dcb7..0000000 --- a/docs/plans/cli-service-dashboard/findings.md +++ /dev/null @@ -1,72 +0,0 @@ -# Findings: Non-blocking CLI + Dashboard - -## Repository Observations - -- Blocking CLI entrypoint is `src/keep_gpu/cli.py` (`main` Typer command). -- Existing remote/session model already exists in `src/keep_gpu/mcp/server.py` (`start_keep`, `stop_keep`, `status`, `list_gpus`). -- Existing MCP HTTP mode only accepts JSON-RPC `POST` and does not expose REST or web UI. -- Current tests for MCP are in `tests/mcp/test_server.py` and cover core session methods. - -## Design Direction (Frontend Skill) - -- Aesthetic direction: **retro-futuristic control room**. -- Visual memory hook: glowing telemetry grid + layered glass panels + high-contrast amber/cyan instrumentation. -- UX priority: one-screen control loop (start session, inspect health, stop session) for agent operators. - -## Implementation Notes - -- Keep backward compatibility: default `keep-gpu` remains blocking. -- Non-blocking usage should be first-class through subcommands and local auto-start. -- Dashboard should be served by the same local backend to avoid extra runtime coordination. - -## Implemented Architecture - -- `src/keep_gpu/cli.py` now exposes: - - blocking default callback (backward compatible), - - `serve`, `start`, `status`, `stop`, and `list-gpus` subcommands, - - local service auto-start in `start` by launching `python -m keep_gpu.mcp.server`. -- `src/keep_gpu/mcp/server.py` now exposes REST routes and static UI serving: - - `GET /health`, `GET /api/gpus`, `GET /api/sessions`, `GET /api/sessions/{job_id}` - - `POST /api/sessions`, `DELETE /api/sessions`, `DELETE /api/sessions/{job_id}` - - dashboard assets on `/` with SPA fallback. -- Dashboard source lives in `web/dashboard/` (React + Vite), with build output in - `src/keep_gpu/mcp/static/`. - -## Validation Findings - -- New CLI and server tests pass (`13 passed`). -- Pre-commit hooks pass with no auto-fix needed after initial implementation. -- `mkdocs build` succeeds; informational note reports plan/skills docs excluded from nav. - -## Iteration Findings (Post-UX Feedback) - -- Added explicit daemon lifecycle guidance in CLI output and docs. -- Added `keep-gpu service-stop` command so users can terminate auto-started local daemon. -- Updated `keep-gpu start` output to always print dashboard URL and follow-up stop commands. -- Reworked dashboard visual language from neon/glassy to classic, restrained control-room style. -- Expanded CLI tests to cover start-output hints and `service-stop` guardrails. - -## Iteration Findings (Bug Report Follow-up) - -- Root cause for noisy `keep-gpu stop --all` failure output: - - service commands still imported `torch` at module import time, causing unrelated NumPy warning noise, - - network timeout from `urlopen` was not normalized, so users could see raw traceback-style failures. -- Fixes applied: - - moved `torch` import into blocking-only execution path, - - wrapped HTTP/timeout errors in friendly `RuntimeError` messages, - - extended command error handling to avoid traceback leaks to users, - - improved dashboard action state so releasing one session does not disable every release control globally. -- Additional tests now assert timeout/error behavior for `keep-gpu stop --all` and HTTP wrapper timeout handling. - -## Root-Cause Fix Findings (Stop-All Reliability) - -- Reproduced that frequent stop timeouts are mostly service-side latency under release, not network transport. -- Root causes identified: - - worker loops used `time.sleep(interval)`, which is non-interruptible and can delay shutdown for long intervals, - - stop RPC could block behind slow release execution, - - single-threaded HTTP server amplified perceived hangs under long requests. -- Fixes implemented: - - replaced sleep points in CUDA/ROCm keep loops with `Event.wait(interval)` so stop signals interrupt immediately, - - added bounded release join timeout in CUDA/ROCm controllers to avoid indefinite blocking, - - made release-all server responses timeout-aware (`timed_out` reporting) with background continuation, - - switched HTTP server to threaded mode to prevent one slow request from blocking others. diff --git a/docs/plans/cli-service-dashboard/progress.md b/docs/plans/cli-service-dashboard/progress.md deleted file mode 100644 index 05c9a4f..0000000 --- a/docs/plans/cli-service-dashboard/progress.md +++ /dev/null @@ -1,59 +0,0 @@ -# Progress Log: Non-blocking CLI + Dashboard - -## Session 2026-03-04 - -### Completed - -- Created implementation plan in `docs/plans/cli-service-dashboard/task_plan.md`. -- Captured initial repository findings in `docs/plans/cli-service-dashboard/findings.md`. -- Confirmed branch: `feat/cli-service-dashboard`. -- Implemented CLI subcommands and service auto-start in `src/keep_gpu/cli.py`. -- Extended server HTTP endpoints and static dashboard routing in `src/keep_gpu/mcp/server.py`. -- Added React/Vite dashboard source under `web/dashboard/` and built static assets. -- Added tests in `tests/mcp/test_http_api.py` and `tests/test_cli_service_commands.py`. -- Updated README/docs/skill content for new service and dashboard workflow. -- Addressed UX feedback: - - refined dashboard styling to a classic high-quality aesthetic, - - improved `keep-gpu start` messaging with dashboard URL and shutdown hints, - - added `keep-gpu service-stop` command for daemon shutdown. -- Expanded tests for output/help lifecycle behavior. -- Applied follow-up reliability fixes from real-user reproduction: - - moved `torch` import into blocking-only path, - - normalized service timeout/HTTP failures into user-friendly errors, - - fixed dashboard release-button state handling to avoid global lockouts. -- Implemented root-cause shutdown fixes: - - converted CUDA/ROCm loop sleeps to interruptible waits, - - added bounded join behavior in controller `release()`, - - updated server stop behavior to surface timeout-aware payloads, - - moved HTTP server to threaded mode. - -### In Progress - -- PR is open; addressing review feedback and final merge readiness checks. - -### Next - -1. Finish open review comments and verify final CI pass. -2. Merge PR after review completion. - -### Validation Results - -- `pytest tests/mcp/test_server.py tests/mcp/test_http_api.py tests/test_cli_thresholds.py tests/test_cli_service_commands.py` -> 13 passed. -- `pre-commit run --all-files` -> all hooks passed. -- `mkdocs build` -> success (non-blocking info about docs pages outside nav). - -### Validation Results (Follow-up) - -- `pytest tests/test_cli_service_commands.py tests/mcp/test_http_api.py tests/mcp/test_server.py tests/test_cli_thresholds.py` -> 18 passed. -- `pre-commit run --all-files` -> all hooks passed. -- `mkdocs build` -> success. - -### Validation Results (Root-Cause Fix) - -- `pytest tests/test_cli_service_commands.py tests/mcp/test_server.py tests/mcp/test_http_api.py tests/test_cli_thresholds.py tests/cuda_controller/test_keep_and_release.py tests/cuda_controller/test_throttle.py` -> 26 passed. -- `pre-commit run --all-files` -> all hooks passed. -- `mkdocs build` -> success. - -## Error Log (This Iteration) - -- `pre-commit run --all-files` reformatted `tests/test_cli_service_commands.py` by way of Black on first run; second run passed. diff --git a/docs/plans/cli-service-dashboard/task_plan.md b/docs/plans/cli-service-dashboard/task_plan.md deleted file mode 100644 index 749bba1..0000000 --- a/docs/plans/cli-service-dashboard/task_plan.md +++ /dev/null @@ -1,48 +0,0 @@ -# Task Plan: Non-blocking CLI + Local Service + Dashboard - -## Background - -`keep-gpu` currently runs as a blocking command. This works for manual shells but is awkward for LLM agent workflows that need to continue with follow-up commands. - -## Goal - -Add an agent-friendly, non-blocking CLI workflow while preserving the existing blocking behavior, then provide a production-grade web dashboard backed by the same local service. - -## Proposed Solution - -1. Keep existing `keep-gpu` (blocking) unchanged for compatibility. -2. Extend CLI with service-driven subcommands: `serve`, `start`, `status`, `stop`, and `list-gpus`. -3. Implement service auto-start in `start` when local service is unavailable. -4. Extend HTTP server with REST endpoints and serve a React/Vite-built dashboard. -5. Update docs and the KeepGPU skill to reflect the new preferred agent workflow. - -## Phases - -### Phase 1: Design and Scaffolding -- [x] Define CLI subcommand UX and local service contract -- [x] Add service client/autostart helpers -- [x] Add REST handlers and static file serving skeleton -- **Status:** complete - -### Phase 2: Dashboard Frontend (React/Vite) -- [x] Add Vite app and distinctive UI design -- [x] Implement live GPU/session views and session controls -- [x] Build and bundle static assets for Python serving -- **Status:** complete - -### Phase 3: Tests and Documentation -- [x] Add/extend tests for new CLI and server behavior -- [x] Update CLI/MCP docs and README -- [x] Update `skills/gpu-keepalive-with-keepgpu/SKILL.md` -- **Status:** complete - -### Phase 4: Validation -- [x] Run targeted pytest suites -- [x] Run `pre-commit run --all-files` -- [x] Run `mkdocs build` -- **Status:** complete - -## Risks / Open Questions - -1. How to keep auto-start robust without introducing zombie processes. -2. How to package dashboard assets cleanly for source and installed usage.