[cli,mcp,dashboard,docs,skills] feat: add non-blocking service workflow and UI #66

Wangmerlyn merged 7 commits into main
Conversation
📝 Walkthrough

Introduces a comprehensive service-oriented architecture for KeepGPU, replacing the blocking-only CLI with a multi-command interface (serve, start, status, stop, list-gpus, service-stop) supporting non-blocking sessions. Expands the MCP server to expose HTTP/REST endpoints alongside JSON-RPC and adds a React-based web dashboard UI.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/Agent
    participant CLI as CLI Interface
    participant Service as Local Service
    participant Controller as GPU Controller
    User->>CLI: keep-gpu start --gpu-ids 0
    CLI->>CLI: Check if service running
    CLI->>Service: Auto-start service (spawn process)
    Service->>Service: Initialize HTTP/RPC server
    Service-->>CLI: Service ready
    CLI->>Service: RPC call: start_keep(gpu_ids, vram, ...)
    Service->>Controller: Acquire GPU resource
    Controller-->>Service: Session created (job_id)
    Service-->>CLI: job_id + metadata
    CLI-->>User: Print job_id & dashboard URL
    Note over User,CLI: Non-blocking: User continues
    User->>CLI: keep-gpu status
    CLI->>Service: RPC/HTTP: status(job_id?)
    Service->>Controller: Query session state
    Controller-->>Service: Session metrics
    Service-->>CLI: Status JSON
    CLI-->>User: Display status + utilization
    User->>CLI: keep-gpu stop --job-id <id>
    CLI->>Service: RPC/HTTP: stop_keep(job_id)
    Service->>Controller: Release GPU session
    Controller-->>Service: Released
    Service-->>CLI: Stopped confirmation
    CLI-->>User: Session stopped
```

```mermaid
sequenceDiagram
    participant Browser as Browser/Dashboard UI
    participant Dashboard as React App
    participant API as HTTP REST Server
    participant Service as Keep Service
    Browser->>Dashboard: Load /index.html
    Dashboard->>API: GET /api/gpus
    API->>Service: Query GPU telemetry
    Service-->>API: GPU list + utilization
    API-->>Dashboard: JSON response
    Dashboard->>API: GET /api/sessions
    API->>Service: Query active sessions
    Service-->>API: Sessions list
    API-->>Dashboard: Sessions JSON
    Dashboard-->>Browser: Render GPU grid + session list
    Browser->>Dashboard: Click "Start Session" form
    Dashboard->>API: POST /api/sessions (gpu_ids, vram, ...)
    API->>Service: start_keep RPC
    Service-->>API: job_id + session metadata
    API-->>Dashboard: Session created
    Dashboard-->>Browser: Update sessions list, show status
    Note over Browser,Service: Auto-refresh every 3s
    Dashboard->>API: GET /api/gpus
    API-->>Dashboard: Updated metrics
    Dashboard-->>Browser: Refresh telemetry display
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the

Highlights
/gemini review
@coderabbitai review
🧠 Learnings used · ✅ Actions performed · Review triggered.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 135a4423e5
web/dashboard/src/App.jsx (Outdated)

```jsx
.map((part) => Number(part.trim()))
.filter((value) => Number.isFinite(value))
```
Reject empty GPU ID tokens in dashboard parsing
The dashboard parser coerces each comma-separated token with Number(...), and Number("") becomes 0, so inputs with a trailing/extra comma (for example "1," or "0,,2") silently add GPU 0 to the request. This can arm sessions on the wrong device and unexpectedly occupy GPU 0, which is especially risky on shared machines; the parser should treat empty/invalid tokens as an input error instead of coercing/filtering them.
```python
if path.startswith("/api/sessions/"):
    job_id = unquote(path.rsplit("/", 1)[-1])
    self._json_response(200, server_ref.stop_keep(job_id=job_id))
```
Prevent trailing-slash deletes from stopping all sessions
The DELETE route extracts the job id with path.rsplit("/", 1)[-1], which yields an empty string for /api/sessions/<job_id>/; passing that empty value into stop_keep triggers the "stop all" branch because falsy job_id means no specific target. A single trailing slash can therefore terminate every active session instead of one session, so this path should reject empty IDs (or normalize path handling before dispatch).
Code Review
This pull request introduces a new non-blocking service-based workflow with a web dashboard for KeepGPU, maintaining backward compatibility with the existing CLI. A critical security concern is the lack of authentication in the new local service API, which makes it vulnerable if exposed to the network and susceptible to memory exhaustion and crashes from malformed JSON payloads. Furthermore, the implementation has a potential crash on Windows when auto-starting the service and an unhandled error case in the REST API, suggesting opportunities for more robust error handling and a more user-friendly UI.
src/keep_gpu/cli.py (Outdated)
```python
subprocess.Popen(
    [
        sys.executable,
        "-m",
        "keep_gpu.mcp.server",
        "--mode",
        "http",
        "--host",
        host,
        "--port",
        str(port),
    ],
    stdin=subprocess.DEVNULL,
    stdout=log_file,
    stderr=log_file,
    start_new_session=True,
)
```
The use of start_new_session=True is specific to POSIX-like systems and will cause an error on Windows. To ensure cross-platform compatibility for detaching the service process, you should use creationflags on Windows. Here is a suggested implementation that handles both cases.
```python
popen_kwargs = {
    "stdin": subprocess.DEVNULL,
    "stdout": log_file,
    "stderr": log_file,
}
if sys.platform == "win32":
    # DETACHED_PROCESS makes the new process not inherit the parent's console.
    popen_kwargs["creationflags"] = subprocess.DETACHED_PROCESS
else:
    # start_new_session creates a new session, detaching it from the
    # parent's terminal, which is what we want for a background service.
    popen_kwargs["start_new_session"] = True
subprocess.Popen(
    [
        sys.executable,
        "-m",
        "keep_gpu.mcp.server",
        "--mode",
        "http",
        "--host",
        host,
        "--port",
        str(port),
    ],
    **popen_kwargs,
)
```

```python
def do_GET(self):  # noqa: N802
    parsed = urlparse(self.path)
    path = parsed.path
    server_ref = self.server.keepgpu_server  # type: ignore[attr-defined]

    if path == "/health":
        self._json_response(200, {"ok": True})
        return
    if path == "/api/gpus":
        self._json_response(200, server_ref.list_gpus())
        return
    if path == "/api/sessions":
        self._json_response(200, server_ref.status())
        return
    if path.startswith("/api/sessions/"):
        job_id = unquote(path.rsplit("/", 1)[-1])
        self._json_response(200, server_ref.status(job_id=job_id))
        return

    self._serve_static(path)

def do_POST(self):  # noqa: N802
    """
    Handle an HTTP JSON-RPC request and write a JSON response.

    Expects application/json bodies containing {"method", "params", "id"}.
    Returns 400 with an error object if parsing fails.
    """
    parsed = urlparse(self.path)
    path = parsed.path
    server_ref = self.server.keepgpu_server  # type: ignore[attr-defined]
    try:
        payload = self._read_json_body()
        if path == "/api/sessions":
            result = server_ref.start_keep(**payload)
            self._json_response(200, result)
            return

        # JSON-RPC compatibility endpoint.
        if path in ("/", "/rpc"):
            response = _handle_request(server_ref, payload)
            self._json_response(200, response)
            return

        self._json_response(404, {"error": {"message": "Unknown endpoint"}})
    except (json.JSONDecodeError, ValueError, UnicodeDecodeError) as exc:
        self._json_response(400, {"error": {"message": f"Bad request: {exc}"}})

def do_DELETE(self):  # noqa: N802
    parsed = urlparse(self.path)
    path = parsed.path
    server_ref = self.server.keepgpu_server  # type: ignore[attr-defined]

    if path == "/api/sessions":
        self._json_response(200, server_ref.stop_keep(job_id=None))
        return
    if path.startswith("/api/sessions/"):
        job_id = unquote(path.rsplit("/", 1)[-1])
        self._json_response(200, server_ref.stop_keep(job_id=job_id))
        return
    self._json_response(404, {"error": {"message": "Unknown endpoint"}})
```
The KeepGPUServer exposes sensitive GPU control endpoints (e.g., starting/stopping sessions) without authentication or authorization, making it vulnerable if exposed to the network. An attacker could manipulate GPU resources. Additionally, the /api/sessions POST endpoint's try...except block is incomplete, failing to handle TypeError for unexpected payload keys, which can lead to unhandled exceptions and 500 errors, further degrading API robustness.
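One way to address the `TypeError` half of this finding is to drop unknown keys before unpacking. A hedged sketch using `inspect.signature`; the helper and the stand-in `start_keep` signature here are illustrative, not the PR's code:

```python
import inspect


def filtered_kwargs(func, payload: dict) -> dict:
    """Keep only payload keys that match func's keyword parameters,
    so unexpected JSON keys cannot raise TypeError via func(**payload)."""
    allowed = {
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind in (param.POSITIONAL_OR_KEYWORD, param.KEYWORD_ONLY)
    }
    return {key: value for key, value in payload.items() if key in allowed}


def start_keep(gpu_ids=None, vram="80%", interval=5.0):
    # Stand-in for the real server method, used only to demonstrate filtering.
    return {"gpu_ids": gpu_ids, "vram": vram, "interval": interval}


safe = filtered_kwargs(start_keep, {"gpu_ids": [0], "bogus": True})
result = start_keep(**safe)  # "bogus" is dropped instead of raising TypeError
```

Whether to filter silently or reject unknown keys with a 400 is a design choice; rejecting is stricter and surfaces client bugs earlier.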
```python
def _read_json_body(self) -> Dict[str, Any]:
    length = int(self.headers.get("content-length", "0"))
    body = self.rfile.read(length).decode("utf-8")
    return json.loads(body)
```
src/keep_gpu/mcp/server.py (Outdated)
```python
payload = self._read_json_body()
if path == "/api/sessions":
    result = server_ref.start_keep(**payload)
```
In the do_POST method, user-supplied JSON payloads are unpacked as keyword arguments to the start_keep method using the **payload syntax. If the payload contains keys that do not match the method signature, a TypeError is raised. This exception is not caught by the specific try...except block in the handler, which only catches ValueError and json.JSONDecodeError. This could lead to failed requests and server instability when processing malformed input.
```python
try:
    payload = _http_json_request("GET", f"{_service_base_url(host, port)}/health")
    return bool(payload.get("ok"))
except Exception:
```
web/dashboard/src/App.jsx (Outdated)
```jsx
GPUs {JSON.stringify(session.params.gpu_ids)} / {session.params.vram} /
{" "}
{session.params.interval}s / threshold {session.params.busy_threshold}%
```
The UI currently displays gpu_ids as a raw JSON string (e.g., null or [0,1]). This could be made more user-friendly. For example, you could display "all" when gpu_ids is null or empty, and a simple comma-separated string like "0, 1" otherwise.
```diff
-GPUs {JSON.stringify(session.params.gpu_ids)} / {session.params.vram} /
-{" "}
-{session.params.interval}s / threshold {session.params.busy_threshold}%
+GPUs {session.params.gpu_ids ? session.params.gpu_ids.join(", ") : "all"} / {session.params.vram} /
+{" "}
+{session.params.interval}s / threshold {session.params.busy_threshold}%
```
@Wangmerlyn Thanks for the explicit invocation! I'll proceed with a full code review of the pull request now.
/gemini review
@coderabbitai review
🧠 Learnings used · ✅ Actions performed · Review triggered.
Code Review
This pull request introduces a significant new feature: a non-blocking service-based workflow for "keep-gpu", complete with a REST API and a React-based web dashboard. The implementation preserves backward compatibility by keeping the original blocking CLI behavior as the default. However, it introduces several security concerns, primarily the lack of authentication on the local HTTP service, which allows any user on a multi-user system to control GPU reservations. The service is also vulnerable to memory exhaustion via large request bodies and potential thread exhaustion through unvalidated GPU ID lists. Additionally, a potential robustness issue in the new API server and a minor inconsistency in the documentation were identified. Implementing basic authentication and input validation is recommended to ensure the service is safe for use in shared environments.
The KeepGPU local service lacks authentication, allowing any user on a multi-user system to control GPU reservations and potentially cause resource exhaustion. This is a critical security vulnerability. Additionally, the current handling of the payload dictionary by passing it directly with ** can lead to TypeError if unexpected keys are sent, resulting in unhandled exceptions and 500 errors, further contributing to robustness issues and potential resource exhaustion. It is recommended to implement a token-based authentication mechanism for the service and to filter the payload to only include expected keys for methods like start_keep to improve both security and robustness.
```python
def _read_json_body(self) -> Dict[str, Any]:
    length = int(self.headers.get("content-length", "0"))
    body = self.rfile.read(length).decode("utf-8")
    return json.loads(body)
```
The _read_json_body method reads the entire request body into memory based on the Content-Length header without any size limits. An attacker could send a request with a very large Content-Length to cause memory exhaustion on the host machine. It is recommended to implement a maximum allowed size for the request body and verify the Content-Length header against this limit before reading the data.
```diff
 def _read_json_body(self) -> Dict[str, Any]:
     length = int(self.headers.get("content-length", "0"))
+    if length > 1_000_000:  # Example 1MB limit
+        raise ValueError("Request body too large")
     body = self.rfile.read(length).decode("utf-8")
     return json.loads(body)
```
src/keep_gpu/mcp/server.py (Outdated)
```python
status = 200
payload = self._read_json_body()
if path == "/api/sessions":
    result = server_ref.start_keep(**payload)
```
The /api/sessions endpoint passes the user-supplied payload directly to server_ref.start_keep(**payload). The start_keep method (and the underlying GlobalGPUController) creates a new thread for each GPU ID provided in the gpu_ids parameter. An attacker could send a request with an excessively large list of GPU IDs, causing the service to attempt to spawn a massive number of threads, potentially leading to process or system instability. The gpu_ids parameter should be validated against the actual number of available GPUs on the system before proceeding.
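A sketch of the suggested validation, assuming the controller can report a device count; the function name and error messages are illustrative:

```python
def validate_gpu_ids(gpu_ids, available: int):
    """Reject gpu_ids that are not a short list of valid device indices.

    `available` would come from the controller's device count; bounding the
    list length caps how many keep-alive threads one request can spawn.
    """
    if gpu_ids is None:
        return None  # "all GPUs" default
    if not isinstance(gpu_ids, list) or len(gpu_ids) > available:
        raise ValueError("gpu_ids must be a list no longer than the GPU count")
    ids = sorted(set(gpu_ids))
    if any(
        not isinstance(i, int) or isinstance(i, bool) or i < 0 or i >= available
        for i in ids
    ):
        raise ValueError(f"gpu_ids must be integers in [0, {available})")
    return ids
```

Deduplicating before spawning also means a payload like `[0, 0, 0]` arms a single session on GPU 0 rather than three.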
```bash
keep-gpu serve --host 127.0.0.1 --port 8765
```

`http://127.0.0.1:8765/`
Actionable comments posted: 13
🧹 Nitpick comments (5)
docs/getting-started.md (1)
93-106: Add the service teardown command to complete the workflow. This quickstart shows how to start/serve, but not how to stop the daemon. Adding `keep-gpu service-stop` here would prevent orphaned local services.

📝 Proposed doc tweak:

````diff
 keep-gpu serve --host 127.0.0.1 --port 8765

 http://127.0.0.1:8765/
+
+When finished:
+
+```bash
+keep-gpu service-stop
+```
````

Based on learnings: the CLI interface at `keep-gpu` is a first-class interface that must be kept in sync with related documentation and tests when behavior changes.

docs/plans/cli-service-dashboard/findings.md (1)

1-47: Add an explicit TODO checklist section to match plan-doc conventions. Given this file lives under `docs/plans/`, include a clear todo/status block (done / in-progress / pending) so execution tracking remains explicit. As per coding guidelines for `docs/plans/**/*.md`: before starting complex work, capture background, goal, solution, and todo items in a Markdown plan under `docs/plans/`.
docs/index.md (1)

35-36: Make the "Dashboard + API" entry clickable for consistency. This is the only item in the list without a link, which makes navigation slightly inconsistent.

🔗 Proposed tweak:

```diff
-- :material-monitor-dashboard: **Dashboard + API** – Run `keep-gpu serve` and open
+- :material-monitor-dashboard: **[Dashboard + API](guides/mcp.md)** – Run `keep-gpu serve` and open
   `http://127.0.0.1:8765/` for session controls and telemetry.
```
tests/mcp/test_http_api.py (1)
40-68: Optional: extract HTTP server spin-up/cleanup into a shared fixture/helper. Both tests repeat identical server/thread lifecycle code. A small fixture would reduce maintenance and make future endpoint tests easier to add.
Also applies to: 70-109
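The lifecycle extraction could look like the following sketch, shown as a plain context manager that a pytest fixture can wrap in one line; the handler is a stand-in for the repo's `_JSONRPCHandler`, answering every GET with `{"ok": true}`:

```python
import json
import threading
import urllib.request
from contextlib import contextmanager
from http.server import BaseHTTPRequestHandler, HTTPServer


class _Handler(BaseHTTPRequestHandler):
    # Stand-in for the real handler; only the start/stop lifecycle matters here.
    def do_GET(self):
        body = b'{"ok": true}'
        self.send_response(200)
        self.send_header("content-type", "application/json")
        self.send_header("content-length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass


@contextmanager
def http_server():
    """Start a throwaway HTTP server on an ephemeral port, yield its base URL,
    and guarantee shutdown/join on exit."""
    httpd = HTTPServer(("127.0.0.1", 0), _Handler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{httpd.server_address[1]}"
    finally:
        httpd.shutdown()
        httpd.server_close()
        thread.join()
```

Binding to port 0 lets the OS pick a free port, which also avoids `allow_reuse_address` races between consecutive tests.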
tests/test_cli_service_commands.py (1)
76-80: Optional lint-hygiene cleanup for test doubles. To avoid ARG001/ARG005 noise, consider prefixing intentionally unused mock parameters with `_` (for example `_host`, `_port`, `_params`).

Also applies to: 94-95, 104-114
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/guides/cli.md`:
- Around line 101-106: Update the Troubleshooting entry for daemon shutdown to
mention active sessions: add guidance that if sessions are active you must run
the session shutdown command (keep-gpu stop --all) first or use the forced
service stop (keep-gpu service-stop --force) before trying keep-gpu
service-stop, so users aren't blocked by active sessions; update the line under
"Need to close background service" to include these two commands and a brief
note about active sessions.
In `@docs/plans/cli-service-dashboard/progress.md`:
- Around line 23-27: Update the progress section that currently reads "Preparing
final summary and PR after successful validation." and the "Next" list item
"Create commit and open PR for review." to reflect that the PR is already open:
replace those lines with a current-status statement (e.g., "PR open: awaiting
review/feedback") and update the "Next" items to actions for an open PR such as
"Address review feedback," "Respond to reviewer comments," and "Prepare
merge-ready changes"; ensure the heading "Next" and the list content are
consistent with the PR-open lifecycle.
In `@docs/reference/cli.md`:
- Around line 7-14: Update the command summary lines to include the host/port
flags for the status and service-stop commands: modify the entries for "keep-gpu
status" and "keep-gpu service-stop" so they show "[--host 127.0.0.1] [--port
8765]" (matching the detailed sections) to ensure the summary reflects the
documented flags for keep-gpu status and keep-gpu service-stop.
In `@skills/gpu-keepalive-with-keepgpu/SKILL.md`:
- Around line 94-105: Add one explicit sentence clarifying that Ctrl+C only
stops foreground runs and does not affect background service sessions; for
service mode users must run keep-gpu stop (to stop sessions) and optionally
keep-gpu service-stop (to stop the service). Mention the specific commands
keep-gpu start, keep-gpu status, keep-gpu stop, and keep-gpu service-stop so
readers can locate relevant behavior and that this applies after the
non-blocking vs service mode split.
In `@src/keep_gpu/cli.py`:
- Around line 186-206: The stop routine (_stop_service_process) blindly signals
whatever PID is read from disk; add an ownership check before sending signals:
implement a helper _is_managed_keepgpu_pid(pid: int) that verifies the PID
corresponds to our managed keep_gpu process (e.g., inspect process command/args
for "keep_gpu.mcp.server" and "--mode http" or equivalent), call this helper
after _read_service_pid and before any _pid_alive/os.kill actions, and if the
check fails treat the PID as invalid (clear the PID with _clear_service_pid and
return False) to avoid killing unrelated processes.
- Around line 125-127: The code currently assumes the HTTP response is JSON and
calls json.loads(body) directly after urlopen(request...), which raises on
non-JSON replies and bypasses command error handling; wrap the json.loads(body)
call in a try/except that catches json.JSONDecodeError (and ValueError as a
fallback), and in the except path convert the raw response into a controlled CLI
error (or raise the same command-level exception type used elsewhere) including
the response body and HTTP status/context from response, so non-KeepGPU
endpoints produce a handled error instead of an unhandled exception; keep the
happy path returning the parsed JSON or {} when body is empty.
In `@src/keep_gpu/mcp/server.py`:
- Around line 261-264: The handler that extracts job_id with
unquote(path.rsplit("/", 1)[-1]) (in the GET branch that calls server_ref.status
and the DELETE branch that calls server_ref.stop_keep) can produce an empty
string and thus trigger a global "stop all" behavior; add a guard after
extracting job_id to validate it is non-empty (e.g., if not job_id: return a
400/404 via self._json_response) and only call server_ref.status(job_id=job_id)
or server_ref.stop_keep(job_id=job_id) when job_id is present; update both
branches that use unquote(path.rsplit("/", 1)[-1]) to perform this check before
invoking server_ref methods.
- Around line 279-293: The handler only converts JSON parse errors into
structured JSON responses, but exceptions thrown by
server_ref.start_keep(**payload) or during JSON-RPC handling (e.g., TypeError,
KeyError, RuntimeError) bypass that and return non-JSON responses; wrap the
calls to server_ref.start_keep(...) and _handle_request(...) (and the
surrounding request handling block that reads payload via _read_json_body()) so
that any non-parse exception is caught and sent through _json_response with a
500 status and a structured body like {"error": {"message": str(exc), "type":
exc.__class__.__name__}}; keep the existing
JSONDecodeError/ValueError/UnicodeDecodeError handling but add a broader except
Exception as exc branch that uses _json_response(500, ...) to ensure consistent
JSON error formatting for start_keep, _handle_request, and other runtime
failures.
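The shape of that broader net can be sketched as a wrapper; `json_response` and the handler callable are stand-ins for the real handler attributes named in the review:

```python
# Hedged sketch: any runtime failure becomes a structured JSON error body.
import json


def dispatch_with_json_errors(handler_fn, json_response):
    """Run a handler, converting any failure into a structured JSON error."""
    try:
        return handler_fn()
    except (json.JSONDecodeError, ValueError, UnicodeDecodeError) as exc:
        # Existing parse-error handling: malformed request bodies -> 400.
        return json_response(
            400, {"error": {"message": str(exc), "type": exc.__class__.__name__}}
        )
    except Exception as exc:  # broader net: TypeError, KeyError, RuntimeError...
        return json_response(
            500, {"error": {"message": str(exc), "type": exc.__class__.__name__}}
        )
```

(`json.JSONDecodeError` and `UnicodeDecodeError` both subclass `ValueError`; listing them separately just mirrors the review's wording.)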
- Line 266: The GET handler currently funnels all unmatched paths to
self._serve_static(path), causing /api/* unknown endpoints to return the SPA
HTML; update the GET handling (e.g., in do_GET or the method that calls
self._serve_static) to intercept paths that start with "/api/" and return a
proper JSON 404 instead of falling through: send a 404 status, set
"Content-Type: application/json", write a small JSON body like {"error":"Not
Found"} (or similar) and return early; leave non-/api/ paths to continue to
self._serve_static(path).
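The intercept can be reduced to one routing check; `serve_static` and `json_response` stand in for the handler methods named above, and real routing would of course match known `/api/` endpoints first:

```python
# Sketch only: unknown /api/ paths get a JSON 404 instead of the SPA HTML.
def route_get(path: str, serve_static, json_response):
    """Return a JSON 404 for unmatched /api/ paths; fall through otherwise."""
    if path.startswith("/api/"):
        return json_response(404, {"error": "Not Found"})
    return serve_static(path)
```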
In `@src/keep_gpu/mcp/static/assets/index.css`:
- Line 1: The bundled CSS (index.css) is triggering stylelint errors (e.g.,
font-family-name-quotes, keyframe-selector-notation); update the stylelint
configuration to exclude generated assets and point linting at the source
stylesheet instead: add an ignoreFiles/ignorePatterns entry for the MCP static
assets bundle (the generated index.css) in your .stylelintrc / stylelint config
and ensure the canonical source (the dashboard styles.css) remains in the lint
target list so only the authoring file is linted.
In `@web/dashboard/src/App.jsx`:
- Around line 25-33: The parseGpuIds function currently coerces inputs with
Number(...) causing empty strings to become 0 and allowing negatives/decimals;
update parseGpuIds to reject non-integer or negative tokens by trimming,
splitting on "," and validating each token with a strict integer regex (e.g.
/^\d+$/) or parseInt combined with Number.isInteger and >= 0, then return an
array of integers or null for empty input; also apply the same strict validation
where the session payload is built before calling /api/sessions (the code that
assembles gpu_ids for the POST) so you never send 0 for empty fields or
fractional/negative IDs.
In `@web/dashboard/src/styles.css`:
- Line 75: Replace the quoted font-family declaration "font-family: "Fraunces",
serif;" with an unquoted font family name so it reads font-family: Fraunces,
serif; — update both occurrences of the exact declaration string ("font-family:
"Fraunces", serif;") so they conform to stylelint's font-family-name-quotes
rule.
- Line 1: The `@import` uses url(...) which violates the Stylelint rule expecting
string notation; update the `@import` statement (the line that currently imports
fonts with url(...)) to use a quoted string instead (e.g., `@import`
"https://fonts.googleapis.com/..." ) so the import uses string notation while
preserving the same URL and parameters.
---
Nitpick comments:
In `@docs/getting-started.md`:
- Around line 93-106: The docs' quickstart shows starting and serving the daemon
but omits how to stop it; update the getting-started.md section that lists
`keep-gpu start` and `keep-gpu serve` to include the teardown command `keep-gpu
service-stop` with a short "When finished:" lead-in so users can cleanly stop
the daemon; locate the block containing `keep-gpu start --gpu-ids` and the
`keep-gpu serve --host` example and append the service-stop instruction and
example invocation.
In `@docs/index.md`:
- Around line 35-36: The list entry for ":material-monitor-dashboard:
**Dashboard + API** – Run `keep-gpu serve` and open `http://127.0.0.1:8765/` is
the only item missing a link; update that entry to make the "Dashboard + API"
text (or the whole entry) a clickable Markdown link to the dashboard URL (e.g.,
`[Dashboard + API](http://127.0.0.1:8765/)`) so it matches the other linked
items and improves navigation.
In `@docs/plans/cli-service-dashboard/findings.md`:
- Around line 1-47: Add a "TODO / Status" checklist section to the existing plan
document "Findings: Non-blocking CLI + Dashboard" that follows docs/plans
conventions: under the top-level findings heading add a "## TODO / Status" (or
"## Roadmap / TODO") subsection that explicitly lists Background, Goal,
Solution, and a checklist of actionable items with statuses (done / in-progress
/ pending) and owner/ETA where useful; ensure it references implemented
artifacts like src/keep_gpu/cli.py, src/keep_gpu/mcp/server.py, and
web/dashboard/ so readers can map tasks to code (e.g., entries for "serve
subcommand implemented (done)", "service-stop command tests (done/in-progress)",
"dashboard build pipeline (pending)"); keep the checklist concise and use
Markdown task list syntax so it's easy to track progress.
In `@tests/mcp/test_http_api.py`:
- Around line 40-68: Refactor the duplicated HTTP server spin-up/cleanup in
test_http_health_and_static_index (and the block at lines 70-109) into a
reusable pytest fixture or helper that constructs
KeepGPUServer(controller_factory=dummy_factory), creates the _Server subclassing
TCPServer with allow_reuse_address=True, instantiates it with _JSONRPCHandler,
sets httpd.keepgpu_server, starts a daemon thread running httpd.serve_forever,
yields the base URL (or httpd and thread) to tests, and ensures proper teardown
by calling httpd.shutdown(), httpd.server_close(), server.shutdown(), and
joining the thread; update test_http_health_and_static_index to use that fixture
and keep calls to _request_json, Request, and urlopen unchanged.
In `@tests/test_cli_service_commands.py`:
- Around line 76-80: The test doubles fake_ensure and fake_rpc are triggering
linter ARG001/ARG005 because unused parameters are named without a leading
underscore; update their signatures to prefix unused parameters (e.g., change
host, port, auto_start to _host, _port, _auto_start in fake_ensure and method,
params, host, port to _method, _params, _host, _port in fake_rpc) and apply the
same renaming pattern to the other test doubles referenced around lines 94-95
and 104-114 to silence the lint warnings while keeping behavior identical.
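The rename pattern is simple enough to show in two lines; the parameter names below mirror the test doubles described above, and the leading underscore is what silences ruff's ARG001 (functions) and ARG005 (lambdas) checks:

```python
# Illustrative only: unused parameters prefixed with "_" to satisfy the linter.
def fake_ensure(_host, _port, _auto_start=True):
    """Pretend the service is already running, ignoring how it was addressed."""
    return True


def fake_rpc(_method, _params=None, _host=None, _port=None):
    """Return a canned RPC payload without inspecting the call."""
    return {"ok": True}
```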
ℹ️ Review info
Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 38c4186d-9560-405e-be2d-cc385aa938c3
⛔ Files ignored due to path filters (1)
`web/dashboard/package-lock.json` is excluded by `!**/package-lock.json`
📒 Files selected for processing (27)
README.md
docs/getting-started.md
docs/guides/cli.md
docs/guides/mcp.md
docs/index.md
docs/plans/cli-service-dashboard/findings.md
docs/plans/cli-service-dashboard/progress.md
docs/plans/cli-service-dashboard/task_plan.md
docs/reference/cli.md
docs/skills/gpu-keepalive-with-keepgpu/skillcheck-free-report.md
mkdocs.yml
skills/gpu-keepalive-with-keepgpu/SKILL.md
skills/gpu-keepalive-with-keepgpu/agents/openai.yaml
src/keep_gpu/cli.py
src/keep_gpu/mcp/server.py
src/keep_gpu/mcp/static/assets/dashboard.js
src/keep_gpu/mcp/static/assets/index.css
src/keep_gpu/mcp/static/index.html
tests/mcp/test_http_api.py
tests/test_cli_service_commands.py
web/dashboard/.gitignore
web/dashboard/index.html
web/dashboard/package.json
web/dashboard/src/App.jsx
web/dashboard/src/main.jsx
web/dashboard/src/styles.css
web/dashboard/vite.config.js
@@ -0,0 +1 @@
(+1 line: the minified dashboard bundle emitted on a single line — the Google Fonts @import, the :root color-variable palette, the deck/panel/telemetry layout rules, and the @keyframes pulse animation)
Generated CSS currently violates stylelint rules.
Line 1 triggers multiple stylelint errors (font-family-name-quotes, keyframe-selector-notation). Since this file is bundled output, consider linting web/dashboard/src/styles.css as the source of truth and excluding generated artifacts under src/keep_gpu/mcp/static/assets/ from stylelint checks.
🧰 Tools
🪛 Stylelint (17.3.0)
[error] 1-1: Expected quotes around "IBM Plex Sans" (font-family-name-quotes)
(font-family-name-quotes)
[error] 1-1: Expected quotes around "Segoe UI" (font-family-name-quotes)
(font-family-name-quotes)
[error] 1-1: Expected quotes around "IBM Plex Mono" (font-family-name-quotes)
(font-family-name-quotes)
(…nine further identical "IBM Plex Mono" font-family-name-quotes errors elided…)
[error] 1-1: Expected "to" to be "100%" (keyframe-selector-notation)
(keyframe-selector-notation)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/keep_gpu/mcp/static/assets/index.css` at line 1, The bundled CSS
(index.css) is triggering stylelint errors (e.g., font-family-name-quotes,
keyframe-selector-notation); update the stylelint configuration to exclude
generated assets and point linting at the source stylesheet instead: add an
ignoreFiles/ignorePatterns entry for the MCP static assets bundle (the generated
index.css) in your .stylelintrc / stylelint config and ensure the canonical
source (the dashboard styles.css) remains in the lint target list so only the
authoring file is linted.
```jsx
function parseGpuIds(raw) {
  if (!raw.trim()) {
    return null
  }
  return raw
    .split(",")
    .map((part) => Number(part.trim()))
    .filter((value) => Number.isFinite(value))
}
```
Harden payload validation before calling /api/sessions.
Lines 25-33 and 112-117 currently allow invalid values (e.g., gpu_ids with negative/decimal IDs, or empty numeric inputs coercing to 0). This can send malformed session parameters to the server.
🛡️ Proposed fix
function parseGpuIds(raw) {
- if (!raw.trim()) {
+ const text = raw.trim()
+ if (!text) {
return null
}
- return raw
- .split(",")
- .map((part) => Number(part.trim()))
- .filter((value) => Number.isFinite(value))
+ const ids = text
+ .split(",")
+ .map((part) => part.trim())
+ .filter(Boolean)
+ .map((part) => Number(part))
+
+ if (ids.some((value) => !Number.isInteger(value) || value < 0)) {
+ throw new Error("GPU IDs must be non-negative integers separated by commas.")
+ }
+ return ids
}
@@
async function onStartSession(event) {
event.preventDefault()
setBusy(true)
try {
+ const interval = Number(form.interval)
+ const busyThreshold = Number(form.busyThreshold)
+ if (!Number.isInteger(interval) || interval < 1) {
+ throw new Error("Interval must be an integer >= 1.")
+ }
+ if (!Number.isInteger(busyThreshold) || busyThreshold < -1) {
+ throw new Error("Busy threshold must be an integer >= -1.")
+ }
+
const payload = {
gpu_ids: parseGpuIds(form.gpuIds),
vram: form.vram,
- interval: Number(form.interval),
- busy_threshold: Number(form.busyThreshold)
+ interval,
+ busy_threshold: busyThreshold
      }

Also applies to: 112-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web/dashboard/src/App.jsx` around lines 25 - 33, The parseGpuIds function
currently coerces inputs with Number(...) causing empty strings to become 0 and
allowing negatives/decimals; update parseGpuIds to reject non-integer or
negative tokens by trimming, splitting on "," and validating each token with a
strict integer regex (e.g. /^\d+$/) or parseInt combined with Number.isInteger
and >= 0, then return an array of integers or null for empty input; also apply
the same strict validation where the session payload is built before calling
/api/sessions (the code that assembles gpu_ids for the POST) so you never send 0
for empty fields or fractional/negative IDs.
@@ -0,0 +1,387 @@
@import url("https://fonts.googleapis.com/css2?family=IBM+Plex+Sans:wght@400;500;600;700&family=IBM+Plex+Mono:wght@400;500&family=Fraunces:opsz,wght@9..144,500;9..144,700&display=swap");
Fix @import notation to satisfy configured Stylelint rule.
Line 1 uses url(...), but your lint config expects string import notation.
🔧 Proposed fix
-@import url("https://fonts.googleapis.com/css2?family=IBM+Plex+Sans:wght@400;500;600;700&family=IBM+Plex+Mono:wght@400;500&family=Fraunces:opsz,wght@9..144,500;9..144,700&display=swap");
+@import "https://fonts.googleapis.com/css2?family=IBM+Plex+Sans:wght@400;500;600;700&family=IBM+Plex+Mono:wght@400;500&family=Fraunces:opsz,wght@9..144,500;9..144,700&display=swap";📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@import "https://fonts.googleapis.com/css2?family=IBM+Plex+Sans:wght@400;500;600;700&family=IBM+Plex+Mono:wght@400;500&family=Fraunces:opsz,wght@9..144,500;9..144,700&display=swap";
🧰 Tools
🪛 Stylelint (17.3.0)
[error] 1-1: Expected "url("https://fonts.googleapis.com/css2?family=IBM+Plex+Sans:wght@400;500;600;700&family=IBM+Plex+Mono:wght@400;500&family=Fraunces:opsz,wght@9..144,500;9..144,700&display=swap")" to be ""https://fonts.googleapis.com/css2?family=IBM+Plex+Sans:wght@400;500;600;700&family=IBM+Plex+Mono:wght@400;500&family=Fraunces:opsz,wght@9..144,500;9..144,700&display=swap"" (import-notation)
(import-notation)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web/dashboard/src/styles.css` at line 1, The `@import` uses url(...) which
violates the Stylelint rule expecting string notation; update the `@import`
statement (the line that currently imports fonts with url(...)) to use a quoted
string instead (e.g., `@import` "https://fonts.googleapis.com/..." ) so the import
uses string notation while preserving the same URL and parameters.
.masthead h1 {
  margin: 0.55rem 0 0.35rem;
  font-family: "Fraunces", serif;
Remove quotes around Fraunces in font-family.
Lines 75 and 150 violate font-family-name-quotes and will fail Stylelint.
🔧 Proposed fix
- font-family: "Fraunces", serif;
+ font-family: Fraunces, serif;

Also applies to: 150-150
🧰 Tools
🪛 Stylelint (17.3.0)
[error] 75-75: Unexpected quotes around "Fraunces" (font-family-name-quotes)
(font-family-name-quotes)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web/dashboard/src/styles.css` at line 75, Replace the quoted font-family
declaration "font-family: "Fraunces", serif;" with an unquoted font family name
so it reads font-family: Fraunces, serif; — update both occurrences of the exact
declaration string ("font-family: "Fraunces", serif;") so they conform to
stylelint's font-family-name-quotes rule.
/gemini review
@coderabbitai review

@Wangmerlyn I'm already performing a full code review of this pull request based on your previous invocation. There's no need to re-issue the review command.

🧠 Learnings used

✅ Actions performed

Full review triggered.
Actionable comments posted: 1
♻️ Duplicate comments (3)
src/keep_gpu/cli.py (2)
123-131: ⚠️ Potential issue | 🟠 Major

Handle non-JSON success responses as controlled CLI errors.
Line [125] assumes every 2xx body is JSON. A non-JSON body raises and escapes current command-level error handling.
🛠️ Suggested fix
 try:
     with urlopen(request, timeout=timeout) as response:  # nosec B310
         body = response.read().decode("utf-8")
-        return json.loads(body) if body else {}
+        if not body:
+            return {}
+        try:
+            return json.loads(body)
+        except json.JSONDecodeError as exc:
+            raise RuntimeError(
+                f"Non-JSON response from service endpoint: {url}"
+            ) from exc

#!/bin/bash
# Verify that successful-response JSON parsing is guarded.
rg -n -C3 'return json\.loads\(body\) if body else \{\}' src/keep_gpu/cli.py
rg -n -C3 'JSONDecodeError' src/keep_gpu/cli.py

Expected result: the first command currently matches; the second should show a handler added in _http_json_request after the fix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/keep_gpu/cli.py` around lines 123 - 131, The code currently assumes every successful HTTP body is JSON (the urlopen(request, timeout=timeout) block that does "return json.loads(body) if body else {}"), which lets a JSONDecodeError escape; update the _http_json_request implementation to catch json.JSONDecodeError around json.loads(body) and raise a controlled RuntimeError (e.g., "Invalid JSON response from KeepGPU service" including the url and a snippet of body) so callers see a clear CLI-level error; keep the existing HTTPError and URLError handlers intact and reference the urlopen(request, timeout=timeout) call and the json.loads(body) invocation when making the change.
190-210: ⚠️ Potential issue | 🔴 Critical

Do not signal persisted PID without ownership check.

Line [191] trusts a PID file blindly. A reused/stale PID can target an unrelated process. Also guard os.kill for process-exit races.

🛠️ Suggested fix

 def _stop_service_process(host: str, port: int, timeout: float = 3.0) -> bool:
     pid = _read_service_pid(host, port)
     if pid is None:
         return False
+    if not _is_managed_keepgpu_pid(pid):
+        _clear_service_pid(host, port)
+        return False
     if not _pid_alive(pid):
         _clear_service_pid(host, port)
         return True
-    os.kill(pid, signal.SIGTERM)
+    try:
+        os.kill(pid, signal.SIGTERM)
+    except OSError:
+        _clear_service_pid(host, port)
+        return False
 @@
-    os.kill(pid, signal.SIGKILL)
+    try:
+        os.kill(pid, signal.SIGKILL)
+    except OSError:
+        _clear_service_pid(host, port)
+        return False

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/keep_gpu/cli.py` around lines 190 - 210, The _stop_service_process function currently trusts a PID from _read_service_pid and sends signals directly; change it to first verify ownership before signaling by adding a check (e.g. new helper like _process_matches_service or reuse _pid_alive plus an ownership check using process cmdline/argv or UID via /proc or psutil) to ensure the PID belongs to our service; only proceed to send SIGTERM/SIGKILL if that check passes, and wrap os.kill calls in try/except to handle ProcessLookupError and PermissionError (treat ProcessLookupError as already-exited and avoid clearing the PID unless ownership was confirmed); keep using _clear_service_pid and _read_service_pid, but do not clear the file if ownership verification fails.docs/reference/cli.md (1)
11-13: ⚠️ Potential issue | 🟡 Minor

Keep command summary flags aligned with detailed sections.

Line [11]-[13] omit --host/--port in the quick summary while the detailed command sections include them.

Suggested doc patch

-keep-gpu status [--job-id ID]
-keep-gpu stop (--job-id ID | --all)
-keep-gpu service-stop [--force]
+keep-gpu status [--job-id ID] [--host 127.0.0.1] [--port 8765]
+keep-gpu stop (--job-id ID | --all) [--host 127.0.0.1] [--port 8765]
+keep-gpu service-stop [--force] [--host 127.0.0.1] [--port 8765]

As per coding guidelines: "Update documentation when changing CLI flags, controller behavior, platform support, or MCP methods".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/reference/cli.md` around lines 11 - 13, The quick summary lines for the CLI commands (the entries for "keep-gpu status", "keep-gpu stop", and "keep-gpu service-stop") are missing the global flags shown in the detailed sections; update the summary to include the --host and --port flags (e.g., "keep-gpu status [--job-id ID] [--host HOST] [--port PORT]", likewise for "keep-gpu stop" and "keep-gpu service-stop") so the summary aligns with the detailed command docs and coding guidelines.
🧹 Nitpick comments (3)
docs/getting-started.md (1)
89-106: Add teardown commands in the non-blocking quickstart.

Lines [94]-[106] explain how to start and inspect service mode, but they should also show how to stop sessions and optionally stop the daemon in the same flow.
Suggested doc patch
 keep-gpu start --gpu-ids 0 --interval 120 --vram 1GiB --busy-threshold 25
 keep-gpu status
+keep-gpu stop --all
+keep-gpu service-stop

Based on learnings: CLI interface at keep-gpu is a first-class interface that must be kept in sync with related documentation and tests when behavior changes.
Verify each finding against the current code and only fix it if needed. In `@docs/getting-started.md` around lines 89 - 106, Add explicit teardown commands to the "Non-blocking workflow for agents" quickstart: after the serve/status examples (keep-gpu start, keep-gpu status, keep-gpu serve), show how to stop individual sessions and the service/daemon (e.g., keep-gpu stop [--session-id | --all] and keep-gpu daemon stop) so users can clean up in the same flow; update examples to include the exact CLI flags and an optional note to verify session IDs via keep-gpu status and to ensure docs/tests stay in sync with the keep-gpu CLI behavior.tests/mcp/test_http_api.py (1)
40-68: Extract shared HTTP server harness into a fixture/context helper.

Both tests duplicate nearly identical server/thread startup and shutdown logic; consolidating it will reduce drift and cleanup mistakes over time.
Also applies to: 70-109
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/mcp/test_http_api.py` around lines 40 - 68, Extract the duplicated HTTP server startup/shutdown into a reusable pytest fixture or context manager (e.g., http_server) to be used by test_http_health_and_static_index and the other test(s) around lines 70-109: encapsulate creation of KeepGPUServer(controller_factory=dummy_factory), the _Server subclass with allow_reuse_address, binding to ("127.0.0.1", 0) with _JSONRPCHandler, assigning httpd.keepgpu_server, starting the serve_forever thread, and returning the base URL plus references to httpd/server/thread; ensure teardown logic (httpd.shutdown(), httpd.server_close(), server.shutdown(), thread.join(timeout=2)) runs reliably in the fixture finalizer or context manager so tests simply call the fixture/context to get the base URL and avoid duplicating setup/cleanup code.tests/test_cli_service_commands.py (1)
183-187: Consider pytest.fail instead of assert False.

Using pytest.fail() is more explicit for indicating an expected exception path. The current pattern works, but python -O would strip the assertion.

💡 Suggested improvement

+import pytest
+
 try:
     cli._http_json_request("GET", "http://127.0.0.1:8765/health")
-    assert False, "Expected RuntimeError"
+    pytest.fail("Expected RuntimeError")
 except RuntimeError as exc:
     assert "Cannot reach KeepGPU service" in str(exc)
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli_service_commands.py` around lines 183 - 187, Replace the use of a bare assertion to indicate test failure in the exception path with pytest.fail to avoid issues when assertions are stripped; specifically, in the test around cli._http_json_request("GET", "http://127.0.0.1:8765/health") catch block, call pytest.fail("Expected RuntimeError") (and add an import for pytest at the top of the test module if missing) instead of using assert False so the test failure is explicit and not affected by optimization flags.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@web/dashboard/src/App.jsx`:
- Line 313: The span in App.jsx uses class "refresh-tag" but that selector is
missing from styles.css; add a .refresh-tag rule to styles.css (targeting the
same class used in App.jsx) that sets font-size (~0.68rem), color to
var(--text-muted), and a monospace font-family fallback (e.g., "IBM Plex Mono",
monospace) so the element is styled as intended.
---
Duplicate comments:
In `@docs/reference/cli.md`:
- Around line 11-13: The quick summary lines for the CLI commands (the entries
for "keep-gpu status", "keep-gpu stop", and "keep-gpu service-stop") are missing
the global flags shown in the detailed sections; update the summary to include
the --host and --port flags (e.g., "keep-gpu status [--job-id ID] [--host HOST]
[--port PORT]", likewise for "keep-gpu stop" and "keep-gpu service-stop") so the
summary aligns with the detailed command docs and coding guidelines.
In `@src/keep_gpu/cli.py`:
- Around line 123-131: The code currently assumes every successful HTTP body is
JSON (the urlopen(request, timeout=timeout) block that does "return
json.loads(body) if body else {}"), which lets a JSONDecodeError escape; update
the _http_json_request implementation to catch json.JSONDecodeError around
json.loads(body) and raise a controlled RuntimeError (e.g., "Invalid JSON
response from KeepGPU service" including the url and a snippet of body) so
callers see a clear CLI-level error; keep the existing HTTPError and URLError
handlers intact and reference the urlopen(request, timeout=timeout) call and the
json.loads(body) invocation when making the change.
- Around line 190-210: The _stop_service_process function currently trusts a PID
from _read_service_pid and sends signals directly; change it to first verify
ownership before signaling by adding a check (e.g. new helper like
_process_matches_service or reuse _pid_alive plus an ownership check using
process cmdline/argv or UID via /proc or psutil) to ensure the PID belongs to
our service; only proceed to send SIGTERM/SIGKILL if that check passes, and wrap
os.kill calls in try/except to handle ProcessLookupError and PermissionError
(treat ProcessLookupError as already-exited and avoid clearing the PID unless
ownership was confirmed); keep using _clear_service_pid and _read_service_pid,
but do not clear the file if ownership verification fails.
---
Nitpick comments:
In `@docs/getting-started.md`:
- Around line 89-106: Add explicit teardown commands to the "Non-blocking
workflow for agents" quickstart: after the serve/status examples (keep-gpu
start, keep-gpu status, keep-gpu serve), show how to stop individual sessions
and the service/daemon (e.g., keep-gpu stop [--session-id | --all] and keep-gpu
daemon stop) so users can clean up in the same flow; update examples to include
the exact CLI flags and an optional note to verify session IDs via keep-gpu
status and to ensure docs/tests stay in sync with the keep-gpu CLI behavior.
In `@tests/mcp/test_http_api.py`:
- Around line 40-68: Extract the duplicated HTTP server startup/shutdown into a
reusable pytest fixture or context manager (e.g., http_server) to be used by
test_http_health_and_static_index and the other test(s) around lines 70-109:
encapsulate creation of KeepGPUServer(controller_factory=dummy_factory), the
_Server subclass with allow_reuse_address, binding to ("127.0.0.1", 0) with
_JSONRPCHandler, assigning httpd.keepgpu_server, starting the serve_forever
thread, and returning the base URL plus references to httpd/server/thread;
ensure teardown logic (httpd.shutdown(), httpd.server_close(),
server.shutdown(), thread.join(timeout=2)) runs reliably in the fixture
finalizer or context manager so tests simply call the fixture/context to get the
base URL and avoid duplicating setup/cleanup code.
In `@tests/test_cli_service_commands.py`:
- Around line 183-187: Replace the use of a bare assertion to indicate test
failure in the exception path with pytest.fail to avoid issues when assertions
are stripped; specifically, in the test around cli._http_json_request("GET",
"http://127.0.0.1:8765/health") catch block, call pytest.fail("Expected
RuntimeError") (and add an import for pytest at the top of the test module if
missing) instead of using assert False so the test failure is explicit and not
affected by optimization flags.
ℹ️ Review info
Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c43b2fac-7fe7-4825-91b3-9823c691d674
⛔ Files ignored due to path filters (1)
`web/dashboard/package-lock.json` is excluded by `!**/package-lock.json`
📒 Files selected for processing (27)
`README.md`, `docs/getting-started.md`, `docs/guides/cli.md`, `docs/guides/mcp.md`, `docs/index.md`, `docs/plans/cli-service-dashboard/findings.md`, `docs/plans/cli-service-dashboard/progress.md`, `docs/plans/cli-service-dashboard/task_plan.md`, `docs/reference/cli.md`, `docs/skills/gpu-keepalive-with-keepgpu/skillcheck-free-report.md`, `mkdocs.yml`, `skills/gpu-keepalive-with-keepgpu/SKILL.md`, `skills/gpu-keepalive-with-keepgpu/agents/openai.yaml`, `src/keep_gpu/cli.py`, `src/keep_gpu/mcp/server.py`, `src/keep_gpu/mcp/static/assets/dashboard.js`, `src/keep_gpu/mcp/static/assets/index.css`, `src/keep_gpu/mcp/static/index.html`, `tests/mcp/test_http_api.py`, `tests/test_cli_service_commands.py`, `web/dashboard/.gitignore`, `web/dashboard/index.html`, `web/dashboard/package.json`, `web/dashboard/src/App.jsx`, `web/dashboard/src/main.jsx`, `web/dashboard/src/styles.css`, `web/dashboard/vite.config.js`
```jsx
<section className="panel span-all">
  <div className="panel-heading">
    <h2>GPU Telemetry</h2>
    <span className="refresh-tag">refresh 3s</span>
```
Missing CSS class refresh-tag.
The refresh-tag class referenced here is not defined in styles.css. The element will render without styling.
💄 Proposed fix - add to styles.css
```css
.refresh-tag {
  font-size: 0.68rem;
  color: var(--text-muted);
  font-family: "IBM Plex Mono", monospace;
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web/dashboard/src/App.jsx` at line 313, The span in App.jsx uses class
"refresh-tag" but that selector is missing from styles.css; add a .refresh-tag
rule to styles.css (targeting the same class used in App.jsx) that sets
font-size (~0.68rem), color to var(--text-muted), and a monospace font-family
fallback (e.g., "IBM Plex Mono", monospace) so the element is styled as
intended.
/gemini review
@coderabbitai review

🧠 Learnings used · ✅ Actions performed: Review triggered.

@Wangmerlyn I'm still actively performing a full code review of this pull request. Repeated invocations of `@coderabbitai review` will not trigger another review while one is already in progress.
Actionable comments posted: 1
♻️ Duplicate comments (4)
src/keep_gpu/mcp/server.py (3)
308-310: ⚠️ Potential issue | 🔴 Critical: Guard empty `job_id` to prevent accidental global stop.

A trailing slash (`/api/sessions/`) yields an empty `job_id`; in DELETE this flows into `stop_keep(job_id="")`, which is treated as falsy and becomes "stop all".

🐛 Proposed fix

```diff
 if path.startswith("/api/sessions/"):
-    job_id = unquote(path.rsplit("/", 1)[-1])
+    job_id = unquote(path.rsplit("/", 1)[-1]).strip()
+    if not job_id:
+        self._json_response(400, {"error": {"message": "Missing job_id"}})
+        return
     self._json_response(200, server_ref.status(job_id=job_id))
     return
@@
 if path.startswith("/api/sessions/"):
-    job_id = unquote(path.rsplit("/", 1)[-1])
+    job_id = unquote(path.rsplit("/", 1)[-1]).strip()
+    if not job_id:
+        self._json_response(400, {"error": {"message": "Missing job_id"}})
+        return
     self._json_response(200, server_ref.stop_keep(job_id=job_id))
     return
```

Also applies to: 350-353
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/keep_gpu/mcp/server.py` around lines 308 - 310, The route handling for "/api/sessions/" extracts job_id with job_id = unquote(path.rsplit("/", 1)[-1]) but doesn't guard for an empty string, allowing a trailing slash to produce job_id == "" which leads DELETE to call stop_keep(job_id="") (treated as falsy and stopping all). Update both the status branch (where server_ref.status(job_id=job_id) is called) and the delete branch (where stop_keep is invoked) to check for an empty job_id (e.g., if not job_id: return a 400/404 JSON response via self._json_response and do not call server_ref.status or stop_keep) so empty job IDs are rejected rather than interpreted as “stop all.”
325-341: ⚠️ Potential issue | 🟠 Major: Normalize non-parse POST failures into structured JSON errors.

Only parse failures are handled; runtime errors from `start_keep(**payload)` can bypass consistent JSON error responses.

🛠️ Proposed fix

```diff
         except (json.JSONDecodeError, ValueError, UnicodeDecodeError) as exc:
             self._json_response(400, {"error": {"message": f"Bad request: {exc}"}})
+        except TypeError as exc:
+            self._json_response(400, {"error": {"message": f"Bad request: {exc}"}})
+        except Exception as exc:
+            logger.exception("POST request failed for path %s", path)
+            self._json_response(
+                500,
+                {
+                    "error": {
+                        "message": str(exc),
+                        "type": exc.__class__.__name__,
+                    }
+                },
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/keep_gpu/mcp/server.py` around lines 325 - 341, The POST handling currently only catches JSON parse errors so runtime exceptions from server_ref.start_keep or _handle_request escape the JSON error API; wrap the calls to server_ref.start_keep(**payload) and the JSON-RPC _handle_request(server_ref, payload) in a try/except Exception as exc and use self._json_response(500, {"error": {"message": str(exc)}}) to return a consistent structured JSON error; keep the existing except (json.JSONDecodeError, ValueError, UnicodeDecodeError) for 400s and avoid re-raising so all failures produce JSON responses.
313-314: ⚠️ Potential issue | 🟠 Major: Return JSON 404 for unknown `/api/*` GET routes.

Unmatched API GET paths currently fall through to SPA static serving and can return HTML 200 instead of an API 404.

🔧 Proposed fix

```diff
-        self._serve_static(path)
+        if path.startswith("/api/"):
+            self._json_response(404, {"error": {"message": "Unknown endpoint"}})
+            return
+        self._serve_static(path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/keep_gpu/mcp/server.py` around lines 313 - 314, When falling back to self._serve_static(path) for unmatched GET requests, detect API paths (path.startswith("/api/")) and return a JSON 404 instead of serving the SPA HTML. Modify the request handler that currently calls self._serve_static(path) so that for GET and path.startswith("/api/") it responds with a 404 status, Content-Type: application/json and a small body like {"error":"Not Found"} (use the existing response helper on the class or add a simple JSON response path), otherwise continue to call self._serve_static(path).

docs/plans/cli-service-dashboard/progress.md (1)
30-37: ⚠️ Potential issue | 🟡 Minor: Refresh progress state to match the current PR lifecycle.
As of March 4, 2026, this PR is already open, so “Preparing final summary and PR” and “Create commit and open PR” are stale.
📝 Suggested update
```diff
 ### In Progress
-- Preparing final summary and PR after successful validation.
+- PR is open; addressing review feedback and merge-readiness checks.

 ### Next
-1. Create commit and open PR for review.
+1. Address reviewer feedback.
+2. Land follow-up fixes and re-run targeted validation.
+3. Prepare merge once approvals are complete.
```

As per coding guidelines, "Implementations must follow the plan document and its todo items; update the plan document when tasks or scope change".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/plans/cli-service-dashboard/progress.md` around lines 30 - 37, The plan's "In Progress" and "Next" sections are stale because the PR is already open; update the "In Progress" bullets (replace "Preparing final summary and PR after successful validation" with "Final summary prepared; PR opened on 2026-03-04" or similar) and replace the "Next" list (remove "Create commit and open PR for review") with current actionable items like "Address review feedback", "Run final validation", and "Merge after approval"; ensure headings "In Progress" and "Next" remain and include a timestamp or note about the PR being open to keep the plan in sync with the PR lifecycle.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/keep_gpu/mcp/server.py`:
- Around line 379-384: The threaded HTTP server created by _Server
(ThreadingMixIn, TCPServer) exposes httpd.keepgpu_server which accesses and
mutates self._sessions from handlers (methods start_keep, stop_keep, status)
without synchronization; add a threading.Lock (e.g., self._sessions_lock) to the
keepgpu server instance during initialization and wrap every read/write of
self._sessions in start_keep, stop_keep and status with the lock (use
acquire/release or a context manager) to prevent race conditions. Ensure the
lock is referenced by the handlers via httpd.keepgpu_server._sessions_lock (or
an accessor) so all concurrent handler threads synchronize on the same lock.
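The locking pattern the prompt asks for, reduced to a standalone sketch; the real state lives on `KeepGPUServer._sessions`, and the `SessionRegistry` class here is illustrative only:

```python
import threading
import uuid


class SessionRegistry:
    """Lock-guarded session state, mirroring the suggested _sessions_lock."""

    def __init__(self) -> None:
        self._sessions: dict[str, dict] = {}
        self._sessions_lock = threading.Lock()

    def start_keep(self, gpu_ids: list[int]) -> str:
        job_id = uuid.uuid4().hex
        with self._sessions_lock:
            self._sessions[job_id] = {"gpu_ids": gpu_ids, "state": "running"}
        return job_id

    def stop_keep(self, job_id: str):
        with self._sessions_lock:
            return self._sessions.pop(job_id, None)

    def status(self) -> dict:
        # Copy under the lock so callers never observe a half-updated view.
        with self._sessions_lock:
            return {jid: dict(meta) for jid, meta in self._sessions.items()}
```

Because every handler thread goes through the same lock, concurrent `start_keep`/`stop_keep` calls cannot corrupt the dict or race each other's reads.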
---
Duplicate comments:
In `@docs/plans/cli-service-dashboard/progress.md`:
- Around line 30-37: The plan's "In Progress" and "Next" sections are stale
because the PR is already open; update the "In Progress" bullets (replace
"Preparing final summary and PR after successful validation" with "Final summary
prepared; PR opened on 2026-03-04" or similar) and replace the "Next" list
(remove "Create commit and open PR for review") with current actionable items
like "Address review feedback", "Run final validation", and "Merge after
approval"; ensure headings "In Progress" and "Next" remain and include a
timestamp or note about the PR being open to keep the plan in sync with the PR
lifecycle.
In `@src/keep_gpu/mcp/server.py`:
- Around line 308-310: The route handling for "/api/sessions/" extracts job_id
with job_id = unquote(path.rsplit("/", 1)[-1]) but doesn't guard for an empty
string, allowing a trailing slash to produce job_id == "" which leads DELETE to
call stop_keep(job_id="") (treated as falsy and stopping all). Update both the
status branch (where server_ref.status(job_id=job_id) is called) and the delete
branch (where stop_keep is invoked) to check for an empty job_id (e.g., if not
job_id: return a 400/404 JSON response via self._json_response and do not call
server_ref.status or stop_keep) so empty job IDs are rejected rather than
interpreted as “stop all.”
- Around line 325-341: The POST handling currently only catches JSON parse
errors so runtime exceptions from server_ref.start_keep or _handle_request
escape the JSON error API; wrap the calls to server_ref.start_keep(**payload)
and the JSON-RPC _handle_request(server_ref, payload) in a try/except Exception
as exc and use self._json_response(500, {"error": {"message": str(exc)}}) to
return a consistent structured JSON error; keep the existing except
(json.JSONDecodeError, ValueError, UnicodeDecodeError) for 400s and avoid
re-raising so all failures produce JSON responses.
- Around line 313-314: When falling back to self._serve_static(path) for
unmatched GET requests, detect API paths (path.startswith("/api/")) and return a
JSON 404 instead of serving the SPA HTML. Modify the request handler that
currently calls self._serve_static(path) so that for GET and
path.startswith("/api/") it responds with a 404 status, Content-Type:
application/json and a small body like {"error":"Not Found"} (use the existing
response helper on the class or add a simple JSON response path), otherwise
continue to call self._serve_static(path).
ℹ️ Review info
Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f5d3baf9-fd7d-4b6a-b0ca-f6985f5fbacd
📒 Files selected for processing (6)
`docs/plans/cli-service-dashboard/findings.md`, `docs/plans/cli-service-dashboard/progress.md`, `src/keep_gpu/mcp/server.py`, `src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py`, `src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py`, `tests/mcp/test_server.py`
/gemini review
@coderabbitai review

🧠 Learnings used · ✅ Actions performed: Review triggered.

@Wangmerlyn I'm still actively performing a full code review of this pull request. Repeated invocations of `@coderabbitai review` will not trigger another review while one is already in progress.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py (1)
84-99: ⚠️ Potential issue | 🟡 Minor: Log message emitted even when cache is not cleared.

When the thread doesn't stop within the timeout (lines 84-90), the method returns early without clearing the cache. However, line 99 unconditionally logs "keep thread stopped & cache cleared", which is misleading for the early-return path where `torch.cuda.empty_cache()` was never called.

🐛 Proposed fix
```diff
         if self._thread.is_alive():
             logger.warning(
                 "rank %s: ROCm keep thread did not stop within %.1fs",
                 self.rank,
                 join_timeout,
             )
             return
         torch.cuda.empty_cache()
+        logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
     else:
         logger.warning("rank %s: keep thread not running", self.rank)
     if self._rocm_smi:
         try:
             self._rocm_smi.rsmi_shut_down()
         except Exception as exc:  # pragma: no cover - best effort
             logger.debug("rsmi_shut_down failed: %s", exc)
-    logger.info("rank %s: keep thread stopped & cache cleared", self.rank)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py` around lines 84 - 99, The final info log incorrectly states "keep thread stopped & cache cleared" even when the method returns early because self._thread.is_alive() timed out and torch.cuda.empty_cache() was not called; update the control flow so the logger.info("rank %s: keep thread stopped & cache cleared", self.rank) is only emitted after the cache clear and rsmi shutdown succeed (i.e., move or guard that log behind the branch where torch.cuda.empty_cache() is executed and any rsmi_shut_down handling), and ensure the early-return path logs a distinct warning (using logger.warning with join_timeout) without claiming the cache was cleared.
🧹 Nitpick comments (3)
tests/mcp/test_http_api.py (1)
50-77: Consider extracting server setup/teardown into a fixture.

The server setup pattern (TCPServer creation, thread start, base URL computation) and teardown (shutdown, server_close, thread.join) is duplicated across all test functions. A pytest fixture would reduce boilerplate and ensure consistent cleanup.
♻️ Suggested fixture pattern
```python
import pytest


@pytest.fixture
def http_server():
    server = make_server()

    class _Server(TCPServer):
        allow_reuse_address = True

    httpd = _Server(("127.0.0.1", 0), _JSONRPCHandler)
    httpd.keepgpu_server = server
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    base = f"http://127.0.0.1:{httpd.server_address[1]}"
    yield base, server
    httpd.shutdown()
    httpd.server_close()
    server.shutdown()
    thread.join(timeout=2)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/mcp/test_http_api.py` around lines 50 - 77, The test duplicates HTTP server setup/teardown (see test_http_health_and_static_index using make_server, _Server subclass, _JSONRPCHandler, httpd and thread) so extract that logic into a pytest fixture (e.g., http_server) that creates make_server(), instantiates the _Server(("127.0.0.1", 0), _JSONRPCHandler), sets httpd.keepgpu_server, starts the thread running httpd.serve_forever, yields the base URL and server for tests, and in teardown calls httpd.shutdown(), httpd.server_close(), server.shutdown(), and thread.join(timeout=2); update test_http_health_and_static_index and other tests to accept the fixture and use the yielded base and server instead of in-test setup/teardown.

tests/test_cli_service_commands.py (1)
183-187: Use `pytest.fail()` instead of `assert False`.

Lines 185 and 205 use `assert False`, which can be optimized away with `python -O`. Consider using `pytest.fail()` for explicit test failures.

♻️ Proposed fix
```diff
+import pytest
+
     try:
         cli._http_json_request("GET", "http://127.0.0.1:8765/health")
-        assert False, "Expected RuntimeError"
+        pytest.fail("Expected RuntimeError")
     except RuntimeError as exc:
         assert "Cannot reach KeepGPU service" in str(exc)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli_service_commands.py` around lines 183 - 187, Replace the fragile "assert False, 'Expected RuntimeError'" usages with explicit pytest.fail calls: import pytest in tests/test_cli_service_commands.py if it's not already imported, and change the failing assertions around the cli._http_json_request("GET", "http://127.0.0.1:8765/health") call (and the other occurrence later in the file) to pytest.fail("Expected RuntimeError") so the test failure cannot be optimized away by python -O and clearly signals the intended failure.

tests/mcp/test_server.py (1)
116-130: Test relies on job iteration order matching creation order.

The assertion `result["stopped"] == [job_a]` assumes the server iterates jobs in insertion order. If the underlying data structure (e.g., a dict prior to Python 3.7 or any reordering logic) changes, this test becomes flaky. Consider using set comparisons instead:

```python
assert set(result["stopped"]) == {job_a}
assert set(result["timed_out"]) == {job_b}
```

Alternatively, verify the server implementation guarantees insertion-order iteration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/mcp/test_server.py` around lines 116 - 130, The test test_stop_all_tracks_timeouts assumes insertion-order iteration of jobs which makes it flaky; update the assertions in test_stop_all_tracks_timeouts to compare as unordered collections (e.g., use set(result["stopped"]) == {job_a} and set(result["timed_out"]) == {job_b}) or otherwise assert membership (job_a in result["stopped"], job_b in result["timed_out"]) so the test no longer depends on server.start_keep/stop_keep iteration order when monkeypatching _release_with_timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/reference/cli.md`:
- Around line 70-72: The `keep-gpu list-gpus` command documentation is missing
an options table; add a consistent options section showing the service
connection flags (e.g., --host and --port) for the `keep-gpu list-gpus` heading
so readers know how to point the command at the service. Under the `###
`keep-gpu list-gpus`` block insert an options table listing `--host <host>` and
`--port <port>` with brief descriptions and defaults consistent with other
service commands in the docs.
- Around line 12-14: Update the command summaries for the CLI doc so the "stop"
and "list-gpus" entries include the --host and --port flags like the "status"
and "service-stop" lines do; specifically, modify the summary lines that
currently read "keep-gpu stop (--job-id ID | --all)" and "keep-gpu list-gpus" to
include "[--host 127.0.0.1] [--port 8765]" so they match the detailed sections
and the implemented CLI behavior (commands: stop, list-gpus, and reference
service-stop/status for flag format).
---
Outside diff comments:
In `@src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py`:
- Around line 84-99: The final info log incorrectly states "keep thread stopped
& cache cleared" even when the method returns early because
self._thread.is_alive() timed out and torch.cuda.empty_cache() was not called;
update the control flow so the logger.info("rank %s: keep thread stopped & cache
cleared", self.rank) is only emitted after the cache clear and rsmi shutdown
succeed (i.e., move or guard that log behind the branch where
torch.cuda.empty_cache() is executed and any rsmi_shut_down handling), and
ensure the early-return path logs a distinct warning (using logger.warning with
join_timeout) without claiming the cache was cleared.
---
Nitpick comments:
In `@tests/mcp/test_http_api.py`:
- Around line 50-77: The test duplicates HTTP server setup/teardown (see
test_http_health_and_static_index using make_server, _Server subclass,
_JSONRPCHandler, httpd and thread) so extract that logic into a pytest fixture
(e.g., http_server) that creates make_server(), instantiates the
_Server(("127.0.0.1", 0), _JSONRPCHandler), sets httpd.keepgpu_server, starts
the thread running httpd.serve_forever, yields the base URL and server for
tests, and in teardown calls httpd.shutdown(), httpd.server_close(),
server.shutdown(), and thread.join(timeout=2); update
test_http_health_and_static_index and other tests to accept the fixture and use
the yielded base and server instead of in-test setup/teardown.
In `@tests/mcp/test_server.py`:
- Around line 116-130: The test test_stop_all_tracks_timeouts assumes
insertion-order iteration of jobs which makes it flaky; update the assertions in
test_stop_all_tracks_timeouts to compare as unordered collections (e.g., use
set(result["stopped"]) == {job_a} and set(result["timed_out"]) == {job_b}) or
otherwise assert membership (job_a in result["stopped"], job_b in
result["timed_out"]) so the test no longer depends on
server.start_keep/stop_keep iteration order when monkeypatching
_release_with_timeout.
In `@tests/test_cli_service_commands.py`:
- Around line 183-187: Replace the fragile "assert False, 'Expected
RuntimeError'" usages with explicit pytest.fail calls: import pytest in
tests/test_cli_service_commands.py if it's not already imported, and change the
failing assertions around the cli._http_json_request("GET",
"http://127.0.0.1:8765/health") call (and the other occurrence later in the
file) to pytest.fail("Expected RuntimeError") so the test failure cannot be
optimized away by python -O and clearly signals the intended failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a39269fc-37ba-450f-8102-eacdac6699ee
📒 Files selected for processing (12)
`docs/guides/cli.md`, `docs/plans/cli-service-dashboard/findings.md`, `docs/plans/cli-service-dashboard/progress.md`, `docs/reference/cli.md`, `skills/gpu-keepalive-with-keepgpu/SKILL.md`, `src/keep_gpu/cli.py`, `src/keep_gpu/mcp/server.py`, `src/keep_gpu/single_gpu_controller/cuda_gpu_controller.py`, `src/keep_gpu/single_gpu_controller/rocm_gpu_controller.py`, `tests/mcp/test_http_api.py`, `tests/mcp/test_server.py`, `tests/test_cli_service_commands.py`
```text
keep-gpu stop (--job-id ID | --all)
keep-gpu service-stop [--host 127.0.0.1] [--port 8765] [--force]
keep-gpu list-gpus
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, find and read the CLI documentation file
find . -name "cli.md" -type f | head -5
```

Repository: Wangmerlyn/KeepGPU
Length of output: 106
🏁 Script executed:
```shell
# Also search for the CLI implementation to verify flag support
find . -name "cli.py" -type f | head -5
```

Repository: Wangmerlyn/KeepGPU
Length of output: 83
🏁 Script executed:
```shell
# Search for list-gpus and stop command implementations
rg -n "def.*list.*gpu|def.*stop" --type=py -B 2 -A 10 | head -100
```

Repository: Wangmerlyn/KeepGPU
Length of output: 6858
🏁 Script executed:
```shell
# Read the CLI documentation file to see the summary and detailed sections
cat -n ./docs/reference/cli.md | head -80
```

Repository: Wangmerlyn/KeepGPU
Length of output: 3198
🏁 Script executed:
```shell
# Read the CLI implementation to check the actual command definitions
cat -n ./src/keep_gpu/cli.py | head -150
```

Repository: Wangmerlyn/KeepGPU
Length of output: 5521
🏁 Script executed:
```shell
# Search for the stop and list-gpus command definitions in cli.py
rg -n "def stop|def list.?gpus|@app.command" ./src/keep_gpu/cli.py -A 15 | head -150
```

Repository: Wangmerlyn/KeepGPU
Length of output: 3790
stop and list-gpus command summaries are missing --host/--port flags.
The detailed sections for both commands (lines 64-68 and 70-72) document --host, --port options, but the summary omits them. The implementations confirm both commands support these flags. Update the summaries to be consistent with status and service-stop.
📝 Suggested fix
```diff
-keep-gpu stop (--job-id ID | --all)
-keep-gpu list-gpus
+keep-gpu stop (--job-id ID | --all) [--host 127.0.0.1] [--port 8765]
+keep-gpu list-gpus [--host 127.0.0.1] [--port 8765]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/reference/cli.md` around lines 12 - 14, Update the command summaries for
the CLI doc so the "stop" and "list-gpus" entries include the --host and --port
flags like the "status" and "service-stop" lines do; specifically, modify the
summary lines that currently read "keep-gpu stop (--job-id ID | --all)" and
"keep-gpu list-gpus" to include "[--host 127.0.0.1] [--port 8765]" so they match
the detailed sections and the implemented CLI behavior (commands: stop,
list-gpus, and reference service-stop/status for flag format).
```text
### `keep-gpu list-gpus`

Returns GPU telemetry from service.

## Exit codes
```
list-gpus section lacks an options table.
Other service commands document their --host/--port options. If list-gpus contacts the service (which it must, to return telemetry), add the options table for consistency.
📝 Suggested addition after line 72
```diff
 Returns GPU telemetry from service.
+
+| Option | Description |
+| --- | --- |
+| `--host`, `--port` | Service host/port. |
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/reference/cli.md` around lines 70 - 72, The `keep-gpu list-gpus` command
documentation is missing an options table; add a consistent options section
showing the service connection flags (e.g., --host and --port) for the `keep-gpu
list-gpus` heading so readers know how to point the command at the service.
Under the `### `keep-gpu list-gpus`` block insert an options table listing
`--host <host>` and `--port <port>` with brief descriptions and defaults
consistent with other service commands in the docs.
Summary
Adds `keep-gpu` service commands (`start`, `status`, `stop`, `list-gpus`, `serve`) while preserving default blocking behavior for backward compatibility.

Validation

```shell
pytest tests/mcp/test_server.py tests/mcp/test_http_api.py tests/test_cli_thresholds.py tests/test_cli_service_commands.py
pre-commit run --all-files
mkdocs build
```

Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests