Conversation
|
/hold |
|
/hold cancel |
|
/hold for fixing CI issues. |
pkg/toolsets/core/nodes.go
Outdated
| internalk8s "github.com/containers/kubernetes-mcp-server/pkg/kubernetes" | ||
| ) | ||
|
|
||
| func initNodes(_ internalk8s.Openshift) []api.ServerTool { |
There was a problem hiding this comment.
Nit: maybe we can drop the internalk8s.Openshift parameter here since we don't need it? For the initXYZ methods, there's no requirement on this parameter existing - it seems to be present in only some of them
| func initNodes(_ internalk8s.Openshift) []api.ServerTool { | |
| func initNodes() []api.ServerTool { |
pkg/kubernetes/nodes.go
Outdated
| // | ||
| // When namespace is empty, the configured namespace (or "default" if none) is used. When image is empty the | ||
| // default debug image is used. Timeout controls how long we wait for the pod to complete. | ||
| func (k *Kubernetes) NodesDebugExec( |
There was a problem hiding this comment.
Would it be possible to split this function up a little bit? IMO it is getting quite large and is responsible for too much
Maybe we can create functions that:
- create the debug pod
- poll for debug completion
- Retrieve the logs
pkg/kubernetes/nodes.go
Outdated
| // nodeDebugContainerName is the name used for the debug container, matching oc debug defaults. | ||
| nodeDebugContainerName = "debug" | ||
| // defaultNodeDebugTimeout is the maximum time to wait for the debug pod to finish executing. | ||
| defaultNodeDebugTimeout = 5 * time.Minute |
There was a problem hiding this comment.
We may want to lower this timeout as by default this will significantly exceed the client tool call connection timeout: https://github.com/modelcontextprotocol/typescript-sdk/blob/e0de0829019a4eab7af29c05f9a7ec13364f121e/src/shared/protocol.ts#L60
We probably also want to add support for progress notifications to be sent to the client for longer running tool calls like this one (cc @mrunalp @ardaguclu @matzew @manusa )
pkg/kubernetes/nodes.go
Outdated
| grace := int64(0) | ||
| _ = podsClient.Delete(deleteCtx, created.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace}) |
There was a problem hiding this comment.
nit: let's use ptr.To here like we do elsewhere
There was a problem hiding this comment.
The refactored file at pkg/ocp/nodes_debug.go, now uses ptr.To. Thanks.
| defaultNodeDebugTimeout = 5 * time.Minute | ||
| ) | ||
|
|
||
| // NodesDebugExec mimics `oc debug node/<name> -- <command...>` by creating a privileged pod on the target |
There was a problem hiding this comment.
Since this is ocp specific;
would it make sense, to group this functionality into some pkg/ocp package?
There was a problem hiding this comment.
File pkg/ocp/nodes_debug.go is part of ocp package.
There was a problem hiding this comment.
Hold on, I will move everything this PR adds from pkg/kubernetes to pkg/ocp so future rebasing avoids the conflicts.
Cali0707
left a comment
There was a problem hiding this comment.
@harche would it be possible to improve the error messages we provide?
When running this server with claude code claude was able to call the tool, but it frequently got errors such as:
⏺ ocp-debug - Nodes: Debug Exec (MCP)(node: "ip-10-0-112-253.us-east-2.compute.internal", command: ["systemctl","status","kubelet"])
⎿ Error: command exited with code 1 (Error)
I'm not sure if there is a way to get more info about what went wrong, but with the current lack of error messages it was hard for the agent to figure out what went wrong and how to fix it
pkg/kubernetes/kubernetes.go
Outdated
| type Kubernetes struct { | ||
| manager *Manager | ||
| manager *Manager | ||
| podClientFactory func(namespace string) (corev1client.PodInterface, error) |
There was a problem hiding this comment.
Wouldn't these changes in here cause the divergence from upstream?. I think, in this repository we shouldn't allow any changes touching these packages such as kubernetes/, mcp/, etc.
My suggestion is to add first in upstream and simply use it in downstream. Downstream repository should touch only ocp/ directory.
c99af16 to
694501c
Compare
|
/test test |
| return slices.Concat( | ||
| initEvents(), | ||
| initNamespaces(o), |
There was a problem hiding this comment.
IMO - let's not add any openshift-specific tools to the core toolset. I'm worried this may lead to conflicts in the future if we make any upstream changes.
Instead, let's create a new openshift specific toolset (maybe openshift-core) that will hold all the openshift only tools
There was a problem hiding this comment.
@Cali0707 thanks for that feedback. Created new pkg/toolsets/openshift/ package with the openshift-core toolset, also moved nodes_debug_exec from core to openshift-core
|
/retest |
pkg/toolsets/openshift/toolset.go
Outdated
| var _ api.Toolset = (*Toolset)(nil) | ||
|
|
||
| func (t *Toolset) GetName() string { | ||
| return "openshift-core" |
There was a problem hiding this comment.
Let's rename this to be more specific like openshift-node-debug
There was a problem hiding this comment.
how about openshift-debug? I was thinking of adding oc rsync kind of tool too, which may not be limited to just node related debugging.
There was a problem hiding this comment.
That tool in turn can be used in future for copying pprofs from the nodes.
|
/hold for containers#391 |
|
@harche the containers#391 is merged now wanna rebase ? |
Sure, I was planning to get back to this after openshift shiftweek (i.e. from upcoming Monday) |
|
I am looking into the rebasing it, need to resolve this merge conflict. |
Done. Thanks. /hold cancel |
pkg/config/config_test.go
Outdated
| s.Require().Lenf(config.Toolsets, 4, "Expected 4 toolsets, got %d", len(config.Toolsets)) | ||
| for _, toolset := range []string{"core", "config", "helm", "openshift-core"} { |
There was a problem hiding this comment.
I'll need to look into this upstream
There was a problem hiding this comment.
Do you think we should wait here for these upstream changes you are looking at?
There was a problem hiding this comment.
I added a skip to this test since it relies on default values that might be overridden downstream.
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
The core/upstream tests should not need updating now. |
| // IMPORTANT: this file is used to override default config values in downstream builds. | ||
| // This is intentionally left blank. | ||
| // OpenShift-specific defaults: add openshift-core toolset | ||
| Toolsets: []string{"core", "config", "helm", "openshift-core"}, |
There was a problem hiding this comment.
we should perhaps now place that into the openshift toolset: https://github.com/openshift/openshift-mcp-server/blob/main/pkg/toolsets/openshift/toolset.go
📝 WalkthroughWalkthroughAdded an OpenShift toolset (enabled by default) with Changes
Sequence DiagramsequenceDiagram
participant Client
participant Handler as nodes_debug_exec Handler
participant K8sClient as NodeDebugClient
participant API as Kubernetes API
participant Node as Target Node
Client->>Handler: Invoke nodes_debug_exec(node, command, ...)
Handler->>K8sClient: Validate inputs, resolve namespace/image/timeout
Handler->>K8sClient: Create privileged debug pod on node
K8sClient->>API: Create Pod
API->>Node: Schedule debug pod
loop Polling
Handler->>K8sClient: Get pod status
K8sClient->>API: Fetch pod
API-->>Handler: Pod status
alt Image Pull Error
Handler->>Handler: Fail fast on ErrImagePull/ImagePullBackOff
end
end
Handler->>K8sClient: Retrieve container logs (30s bounded)
K8sClient->>API: Get logs
API-->>Handler: Container output
Handler->>K8sClient: Delete debug pod
K8sClient->>API: Delete Pod
alt Container exit code == 0
Handler-->>Client: Return logs as output
else Container exit code != 0
Handler-->>Client: Return error with exit code + logs
else Termination info missing
Handler-->>Client: Return derived error from pod/container status
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip CodeRabbit can generate a title for your PR based on the changes.Add |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: harche The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
pkg/ocp/nodes_debug.go (1)
111-113:⚠️ Potential issue | 🟠 MajorTimeout path still drops logs and last pod status.
Line 112 returns on poll errors before any log collection, and Line 258 throws away
lastPod/waitMsgon timeout. A command that produces output and then hangs—or a pod that stays pending because the node/scheduling is wrong—will currently collapse to a generic timeout with none of the useful diagnostics.💡 Suggested direction
terminated, lastPod, waitMsg, err := pollForCompletion(ctx, k, ns, created.Name, timeout) if err != nil { - return "", err + logs, _ := retrieveLogs(context.Background(), k, ns, created.Name) + if logs != "" { + return "", fmt.Errorf("%w\nOutput:\n%s", err, logs) + } + if waitMsg != "" { + return "", fmt.Errorf("%w\nLast status: %s", err, waitMsg) + } + return "", err }+ if current.Status.Message != "" { + waitMsg = current.Status.Message + } else if current.Status.Reason != "" { + waitMsg = current.Status.Reason + } + // Wait for the next tick interval before checking pod status again, or timeout if context is done. select { case <-pollCtx.Done(): - return nil, nil, "", fmt.Errorf("timed out waiting for debug pod %s to complete: %w", podName, pollCtx.Err()) + return nil, lastPod, waitMsg, fmt.Errorf("timed out waiting for debug pod %s to complete: %w", podName, pollCtx.Err()) case <-ticker.C: }Also applies to: 230-259
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/ocp/nodes_debug.go` around lines 111 - 113, The code currently returns immediately on errors from pollForCompletion (variables terminated, lastPod, waitMsg) which discards useful diagnostics; modify the error/timeout handling in the functions using pollForCompletion so that when pollForCompletion returns an error (including timeout) you first gather logs and status from lastPod and include waitMsg and lastPod status/output in the returned error or in the logged output before returning; update both the early-return site around pollForCompletion and the timeout path around lines 230-259 to call the existing log/status collection helpers (or add a small helper) to capture pod logs and status and attach them to the error message instead of throwing them away.
🧹 Nitpick comments (1)
pkg/toolsets/openshift/nodes_test.go (1)
22-45: Consider adding test coverage for additional validation cases.The current tests cover missing
nodeand invalidcommandtype. Consider adding tests for:
- Missing
commandargument- Empty command array
These edge cases are handled in
toStringSlice(lines 108-117 innodes.go) but aren't exercised by these tests.💡 Additional test cases
+ s.Run("missing command", func() { + params := api.ToolHandlerParams{ + ToolCallRequest: staticRequest{args: map[string]any{ + "node": "worker-0", + }}, + } + result, err := nodesDebugExec(params) + s.Require().NoError(err) + s.Require().NotNil(result.Error) + s.Contains(result.Error.Error(), "command is required") + }) + + s.Run("empty command array", func() { + params := api.ToolHandlerParams{ + ToolCallRequest: staticRequest{args: map[string]any{ + "node": "worker-0", + "command": []interface{}{}, + }}, + } + result, err := nodesDebugExec(params) + s.Require().NoError(err) + s.Require().NotNil(result.Error) + s.Contains(result.Error.Error(), "command array cannot be empty") + })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/toolsets/openshift/nodes_test.go` around lines 22 - 45, Add two subtests inside NodesHandlerSuite.TestValidatesInput to exercise the toStringSlice validation: one "missing command" where params.ToolCallRequest has "node" set but no "command" key, and one "empty command array" where "command" is an empty []any{} or []string{}. Call nodesDebugExec(params) in each, require no call error, require result.Error is not nil, and assert the returned error message matches the toStringSlice error for missing command and for empty command array respectively; reference the existing TestValidatesInput structure and the toStringSlice behavior in nodes.go when naming/asserting the expected messages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/ocp/nodes_debug.go`:
- Around line 152-156: The label value construction for
kubernetes.AppKubernetesName uses sanitizedNode which can end with a trailing
hyphen after truncation (via sanitizeForName/substring), producing an invalid
label; update the code that builds sanitizedNode (and the other occurrence
around lines 314-332) to re-trim any trailing hyphens (and whitespace) after
truncation—i.e., run a final trim that removes '-' characters from the end of
sanitizedNode before using it in fmt.Sprintf("node-debug-%s", sanitizedNode) so
the generated label never ends with a hyphen.
- Around line 267-275: The current retrieveLogs function calls strings.TrimSpace
which strips leading indentation and blank lines and corrupts exec-style debug
output; change the return to preserve verbatim logs by removing the
strings.TrimSpace call and return the raw logs string from PodsLog (i.e., in
retrieveLogs replace the trimmed return with returning logs unchanged), ensuring
NodeDebugContainerName and the PodsLog invocation remain unchanged.
---
Duplicate comments:
In `@pkg/ocp/nodes_debug.go`:
- Around line 111-113: The code currently returns immediately on errors from
pollForCompletion (variables terminated, lastPod, waitMsg) which discards useful
diagnostics; modify the error/timeout handling in the functions using
pollForCompletion so that when pollForCompletion returns an error (including
timeout) you first gather logs and status from lastPod and include waitMsg and
lastPod status/output in the returned error or in the logged output before
returning; update both the early-return site around pollForCompletion and the
timeout path around lines 230-259 to call the existing log/status collection
helpers (or add a small helper) to capture pod logs and status and attach them
to the error message instead of throwing them away.
---
Nitpick comments:
In `@pkg/toolsets/openshift/nodes_test.go`:
- Around line 22-45: Add two subtests inside
NodesHandlerSuite.TestValidatesInput to exercise the toStringSlice validation:
one "missing command" where params.ToolCallRequest has "node" set but no
"command" key, and one "empty command array" where "command" is an empty []any{}
or []string{}. Call nodesDebugExec(params) in each, require no call error,
require result.Error is not nil, and assert the returned error message matches
the toStringSlice error for missing command and for empty command array
respectively; reference the existing TestValidatesInput structure and the
toStringSlice behavior in nodes.go when naming/asserting the expected messages.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: de697222-ba30-46fa-b7b6-7b17168b63ee
📒 Files selected for processing (10)
README.mddocs/configuration.mdinternal/tools/update-readme/main.gopkg/config/config_default_overrides.gopkg/ocp/nodes_debug.gopkg/ocp/nodes_debug_test.gopkg/ocp/nodes_debug_testhelpers_test.gopkg/toolsets/openshift/nodes.gopkg/toolsets/openshift/nodes_test.gopkg/toolsets/openshift/toolset.go
| func retrieveLogs(ctx context.Context, k NodeDebugClient, namespace, podName string) (string, error) { | ||
| logCtx, logCancel := context.WithTimeout(ctx, 30*time.Second) | ||
| defer logCancel() | ||
| logs, logErr := k.PodsLog(logCtx, namespace, podName, NodeDebugContainerName, false, 0) | ||
| if logErr != nil { | ||
| return "", fmt.Errorf("failed to retrieve debug pod logs: %w", logErr) | ||
| } | ||
| return strings.TrimSpace(logs), nil | ||
| } |
There was a problem hiding this comment.
Preserve command output verbatim here.
strings.TrimSpace removes leading indentation and blank lines, not just the trailing newline. For an exec-style debug tool, that's observable output corruption.
💡 Suggested fix
- return strings.TrimSpace(logs), nil
+ return strings.TrimRight(logs, "\r\n"), nil📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func retrieveLogs(ctx context.Context, k NodeDebugClient, namespace, podName string) (string, error) { | |
| logCtx, logCancel := context.WithTimeout(ctx, 30*time.Second) | |
| defer logCancel() | |
| logs, logErr := k.PodsLog(logCtx, namespace, podName, NodeDebugContainerName, false, 0) | |
| if logErr != nil { | |
| return "", fmt.Errorf("failed to retrieve debug pod logs: %w", logErr) | |
| } | |
| return strings.TrimSpace(logs), nil | |
| } | |
| func retrieveLogs(ctx context.Context, k NodeDebugClient, namespace, podName string) (string, error) { | |
| logCtx, logCancel := context.WithTimeout(ctx, 30*time.Second) | |
| defer logCancel() | |
| logs, logErr := k.PodsLog(logCtx, namespace, podName, NodeDebugContainerName, false, 0) | |
| if logErr != nil { | |
| return "", fmt.Errorf("failed to retrieve debug pod logs: %w", logErr) | |
| } | |
| return strings.TrimRight(logs, "\r\n"), nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/ocp/nodes_debug.go` around lines 267 - 275, The current retrieveLogs
function calls strings.TrimSpace which strips leading indentation and blank
lines and corrupts exec-style debug output; change the return to preserve
verbatim logs by removing the strings.TrimSpace call and return the raw logs
string from PodsLog (i.e., in retrieveLogs replace the trimmed return with
returning logs unchanged), ensuring NodeDebugContainerName and the PodsLog
invocation remain unchanged.
Add a privileged debug pod tool that mimics `oc debug node/<name>` for running commands on OpenShift nodes. The tool creates a temporary pod with the UBI9 toolbox image, executes the command, collects output, and cleans up automatically. Key design decisions based on PR review feedback: - All new code lives in pkg/ocp/ and pkg/toolsets/openshift/ only, no upstream packages (pkg/kubernetes/, pkg/mcp/) are modified - Tool is in the "openshift" toolset (not "core") to avoid conflicts - NodeDebugClient interface decouples from concrete k8s client for testability - Error messages include command output/logs for better diagnostics - Default timeout is 1 minute (within MCP client timeout limits) - Gated behind ReadOnly config (read_only=false required to use) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/ocp/nodes_debug.go (1)
315-334:⚠️ Potential issue | 🟡 MinorTrailing hyphen may remain after truncation in
sanitizeForName, causing invalid label values.The
sanitizeForNamefunction trims leading/trailing hyphens at line 326, but then truncates at lines 330-331 without re-trimming. If the 40th character is a hyphen, the returned value will have a trailing hyphen. This creates invalid Kubernetes label values (used at line 156 forkubernetes.AppKubernetesName), since label values must match the pattern([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9].While the pod name is correctly handled (line 145 re-trims after its truncation), the label value uses
sanitizedNodewhich may still contain a trailing hyphen fromsanitizeForName.🛠️ Proposed fix
if len(sanitized) > 40 { - sanitized = sanitized[:40] + sanitized = strings.TrimRight(sanitized[:40], "-") + } + if sanitized == "" { + sanitized = "node" } return sanitized🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/ocp/nodes_debug.go` around lines 315 - 334, sanitizeForName can return a trailing hyphen when truncation cuts at a hyphen, producing invalid Kubernetes label values (used for kubernetes.AppKubernetesName). Fix sanitizeForName by re-trimming/truncation-sanitizing after the length cut: after you apply sanitized = sanitized[:40] ensure you trim leading/trailing hyphens (e.g., strings.Trim(sanitized, "-")) and if that yields an empty string reset to "node"; additionally ensure the final character is alphanumeric (remove trailing non-alnum like '-') so the returned value always matches label rules.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/ocp/nodes_debug.go`:
- Around line 315-334: sanitizeForName can return a trailing hyphen when
truncation cuts at a hyphen, producing invalid Kubernetes label values (used for
kubernetes.AppKubernetesName). Fix sanitizeForName by
re-trimming/truncation-sanitizing after the length cut: after you apply
sanitized = sanitized[:40] ensure you trim leading/trailing hyphens (e.g.,
strings.Trim(sanitized, "-")) and if that yields an empty string reset to
"node"; additionally ensure the final character is alphanumeric (remove trailing
non-alnum like '-') so the returned value always matches label rules.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7c9ff437-5500-46a9-bfe5-9726b9ff099c
📒 Files selected for processing (10)
README.mddocs/configuration.mdinternal/tools/update-readme/main.gopkg/config/config_default_overrides.gopkg/ocp/nodes_debug.gopkg/ocp/nodes_debug_test.gopkg/ocp/nodes_debug_testhelpers_test.gopkg/toolsets/openshift/nodes.gopkg/toolsets/openshift/nodes_test.gopkg/toolsets/openshift/toolset.go
✅ Files skipped from review due to trivial changes (2)
- pkg/config/config_default_overrides.go
- docs/configuration.md
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/toolsets/openshift/toolset.go
- pkg/toolsets/openshift/nodes.go
- pkg/toolsets/openshift/nodes_test.go
- README.md
|
@harche: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Tested with claude code,
Summary by CodeRabbit
New Features
Documentation
Chores
Tests