Add KEP-936: Introduce Kubeflow-MCP proposal for AI-Powered Training … by abhijeet-dhumal · Pull Request #937 · kubeflow/community

abhijeet-dhumal · 2026-01-28T09:16:11Z

Resolves #936

Summary

Introduces KEP-936: Kubeflow MCP Server - an AI-powered interface enabling LLM agents to interact with Kubeflow through the Model Context Protocol (MCP).

Key Features

Feature	Description
Dedicated Training Tools	`fine_tune()`, `run_custom_training()`, `run_container_training()` mapping to SDK trainer types
Modular Architecture	`--clients trainer,optimizer,hub` for selective tool loading
Persona-Based Access	`readonly`, `data-scientist`, `ml-engineer`, `platform-admin`
SDK Integration	Wraps `TrainerClient`, `OptimizerClient`, `ModelRegistryClient`

Implementation Plan

Phase	Scope	Tools
1. Core MCP Server	Trainer tools + mcp-tef validation	16
2. Resource Management	Pre-flight checks, Mellea exploration	17
3. Multi-Tenancy	Istio/impersonation, AGNTCY Identity	17
4. Job Lifecycle	Suspend/resume, checkpoints	19
5. Optimizer & Hub	Katib + Model Registry modules	33
6. Future Clients	Pipelines, Spark (when SDK available)	TBD

To address security constraints, I have added dedicated section with :

Authentication: Kubeconfig, ServiceAccount, ServiceAccount + Impersonation, OIDC

Authorization: K8s RBAC flow, required verbs per resource

Multi-User In-Cluster: Istio integration with x-forwarded-user header (see identity-flow diagram)

Secret Management: HF tokens, S3 credentials handling

Multi-Tenancy: Namespace isolation, policy enforcement, ResourceQuota validation

Audit Logging: Structured JSON logs with redaction

astefanutti · 2026-01-30T07:48:05Z

+| Method | Description | Use Case |
+|--------|-------------|----------|
+| **Kubeconfig** | Uses `~/.kube/config` or `KUBECONFIG` env var | Local development, CI/CD |
+| **ServiceAccount Token** | Mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token` | In-cluster deployment |


In the case of in-cluster deployment, should the MCP server rather impersonate users?

Ah good point - I hadn't fully thought through the multi-user in-cluster scenario.

Please correct me here, So the flow would be:

MCP server runs with a ServiceAccount that has impersonate permissions

AI agent authenticates user and passes identity to MCP server

MCP server can use K8s impersonation (--as / Impersonate-User header) for API calls

This way RBAC is enforced per-user even with a shared MCP deployment.

Does this align with how you'd expect it to work? I'm curious if there's a standard pattern for this - perhaps similar to how the Notebooks controller handles user identity? 🤔

@astefanutti @andreyvelich
Please correct me here but what if we use Istio/OAuth2Proxy for this case to inject user identity ? Kubeflow already uses Istio + OIDC right?

To address this constraint, updated Authentication section with three deployment modes:

ServiceAccount Token : Single-user in-cluster
-ServiceAccount + Impersonation : Multi-user in-cluster

OIDC : Enterprise SSO

For multi-user, the flow uses Kubeflow's existing Istio layer:

User authenticates via Kubeflow dashboard

Istio Gateway validates JWT, adds x-forwarded-user header

MCP server reads header, impersonates user for K8s API calls

Also added RBAC example for the impersonator ClusterRole.
This now aligns with how Kubeflow Notebooks handles user identity. ✅

astefanutti · 2026-01-30T07:48:33Z

+
+![Multi-MCP Ecosystem](multi-mcp.png)
+
+**Design Principle:** No overlap. `kubeflow-mcp` should handle training; `kubernetes-mcp-server` should handle generic K8s operations.


kubeflow-mcp would eventually handle more than training.

Ah right - I was thinking the principle should be: kubeflow-mcp owns Kubeflow-specific CRDs (TrainJob, Experiment, ModelVersion, etc.), while kubernetes-mcp-server handles generic K8s resources (PVCs, ConfigMaps, Secrets).

So the row should probably say "Kubeflow CRDs" rather than "TrainJob CRDs"?

Or do you think there's value in having separate MCP servers per Kubeflow component (trainer-mcp, katib-mcp, model-registry-mcp)?
I see Model Registry is building its own MCP server (model-registry#2029). This raises a design question 🤔

I leaned toward unified since it mirrors the unified SDK structure, but curious what you think? Separate servers might give component teams more autonomy.

IIUC, the goal of kubeflow/model-registry#2029 is to provide searchable catalog of MCP servers that users can install via Hub UI. @ederign @kubeflow/kubeflow-hub-team please correct me if I miss-understood it.

Or do you think there's value in having separate MCP servers per Kubeflow component (trainer-mcp, katib-mcp, model-registry-mcp)?

This is good question, we should talk more about it. Since each project has a different scope and set of requirements, a broad range of tools is often necessary to support end-to-end LLMOps. However, exposing too many tools directly to an agent can quickly overwhelm its context window, leading to degraded performance and less reliable reasoning

As a workaround, we can think of introducing special Agent Skills: https://agentskills.io/home, which expose sub-set of MCP tools agent can talk to. These skills can act as curated capability bundles, exposing only a relevant subset of MCP tools to the agent

Similar to this diagram: anthropics/claude-agent-sdk-python#544

Or we can design sub-agents which can be Trainer or Spark experts.

Great pointers! I looked into Agent Skills and the sub-agent pattern. Here's my take:

Agent Skills - These are essentially curated tool bundles. Our modular client loading (--clients trainer) and persona filtering (--persona data-scientist) achieve similar scoping at the MCP level. The difference is we filter at server startup rather than requiring an intermediary skill layer.

Sub-agents - Interesting pattern for complex multi-step workflows. For Phase 1, we're keeping tools granular so the primary agent (Claude/Cursor) can orchestrate. If workflows get complex enough to need dedicated "Trainer expert" sub-agents, i think that's more of a client-side concern - the MCP server just needs to expose the right tool subsets, which we support via --clients.

That said, I added a note in Design Decisions about Speakeasy's dynamic toolsets pattern for Phase 5+ (33+ tools) - meta-tools like "search_tools", "describe_tool", "execute_tool" that let the LLM discover capabilities without loading all schemas upfront.

Open to discussing further if you think we should explicitly support Agent Skills integration!

Sounds good, any thoughts on how local agents (e.g. Claude) can configure MCP server with the desired scope (e.g. --clients trainer) ?
Are we going to provide this argument as an entrypoint to start the FastAPI server?

Yes, via entrypoint args.
Example for Claude Desktop (~/.claude/claude_desktop_config.json):

{ "mcpServers": { "kubeflow": { "command": "kubeflow-mcp", "args": ["serve", "--clients", "trainer", "--persona", "data-scientist"] } } }

astefanutti · 2026-01-30T07:49:53Z

+
+## Design Details
+
+### MCP Tool Inventory


Given the large number of tools, do you have an estimate of the size that it takes in the context based on your prototype?

Is there any concern w.r.t. to scaling the number of tools for all the Kubeflow components?

From the prototype, ~24 tools with docstrings comes to roughly 8-10K tokens.

My intuition is that this is manageable for current MCP clients, but I'm curious about your perspective. A few questions:

Do you know if there's a recommended upper bound for MCP tool count? I haven't found guidance on this in the spec.

For full Kubeflow coverage (Trainer + Katib + Model Registry + Pipelines), we might hit ~40-50 tools. Would lazy loading (only expose tools for installed components) be a reasonable mitigation?

The other approach is policy-based filtering - a readonly user sees only ~7 discovery tools, not all 50. Does this feel like the right direction?

And the discussion from above thread, should we plan separate MCP servers for each kubeflow component ? I think it's good to keep everything under 1 roof as a Unified kubeflow-mcp simialr to kubeflow-sdk which can wrap all SDK clients, single install.. but I would be happy to get more discussion or inputs on this aspect 💭

Happy to add a "Scalability Considerations" section if you think it's worth calling out explicitly.

cc: @andreyvelich @jaiakash @dhanishaphadate

Added "Tool Scalability" section with detailed analysis:

Token estimates:

Phase 1 (trainer only): 16 tools, nearly ~5.5K tokens

Full stack (trainer + optimizer + hub): 30 tools, ~11K tokens

Scalability strategies:

Modular Client Loading (--clients trainer) : 40-70%

Persona Filtering (--persona data-scientist) : 50-70%

Combined : 70-85%

Research shows LLM accuracy degrades beyond 20-25 tools :

With modular loading + personas, we stay within optimal range.

Also tracking MCP Lazy Tool Hydration proposal which could reduce overhead by 90%+ via deferred schema loading

Added mcp-tef validation in each implementation phase to ensure tool descriptions remain distinguishable as we scale.

Pleas review if this approach seems correct?
cc: @astefanutti @andreyvelich

I think, the long term approach would be to build specialized agent personas to reduce context.
For example, if I am only interested in PyTorch fine-tuning, I don't need Spark tools.

Exactly what I have added in proposal ! The proposal includes:

Persona Tools Use Case

readonly 7 Monitoring

data-scientist 12 Fine-tuning

ml-engineer 16 Full training

Combined with --clients trainer (only load trainer tools), a PyTorch fine-tuning user never sees Spark tools. The 70-85% reduction keeps us in the optimal 10-15 tool range per research.

Great discussion! I'd like to offer a different angle on the tool-count concern.

Rather than hard-capping the number of tools through personas or modular loading, I think we should be careful not to artificially constrain the API surface of kubeflow-mcp, especially as the SDK grows and new components get added. Limiting tools solves today's problem but creates friction tomorrow.

Instead, we can delegate the context-optimization problem to the MCP layer itself. There's an open-source project called mcp-optimizer that tackles exactly this, it dynamically prunes and prioritizes tools based on the current task/query, so the LLM only sees the relevant subset at inference time, without us having to bake filtering logic into the server.

I think there are two paths worth considering:

Integrate mcp-optimizer-style logic directly into kubeflow-mcp — so the server itself handles intelligent tool surfacing.

Provide official documentation/guidance on how to use mcp-optimizer (or similar middleware) alongside kubeflow-mcp for users who hit context limits.

This way, the server stays complete and unopinionated, and the optimization lives at the right layer. Happy to explore either direction.

That's interesting, thanks for sharing @Sanskarzz!
What are the main benefits of delegating tool selection to an mcp-optimizer, instead of managing access through multiple sub-agents that statically define Skills that have access to MCP tools by defining them in allowed-tools section?

Great point @Sanskarzz! I agree we shouldn't artificially constrain the API surface.
The way I see it, we have two complementary strategies:

Static filtering (--clients, --persona) - deterministic, predictable, zero overhead. Good for: "I'm a data scientist, I only need training tools." Admin sets it once at deployment.

Dynamic pruning (mcp-optimizer style) - intelligent, query-aware. Good for: "I have 50 tools loaded but this query only needs 5." Adds latency but handles the long tail.

For Phase 1-4, static filtering keeps us in the 10-16 tool range which works well. For Phase 5+ (33+ tools), I think you're right that dynamic approaches become necessary.
I'd suggest:

Phase 1-4: Ship with --clients + --persona (already in proposal)

Phase 5: Add --mode dynamic option that enables semantic tool discovery (the Speakeasy pattern I mentioned)

Document mcp-optimizer as a recommended middleware for users who want external optimization

This way the server stays complete and unopinionated (as you said), and optimization can happen at whichever layer makes sense for the deployment.
Does this approach work? Happy to add explicit guidance on mcp-optimizer integration to the docs.

…ar architecture Address PR kubeflow#937 review feedback: - Security section with Istio/impersonation for multi-tenant - Tool scalability via --clients flag and persona filtering - Dedicated training tools for granular permissions - 6-phase implementation with mcp-tef validation - Mellea and AGNTCY Identity integrations Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

…Interface Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

…trainer-specific estimation Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

…ar architecture Address PR kubeflow#937 review feedback: - Security section with Istio/impersonation for multi-tenant - Tool scalability via --clients flag and persona filtering - Dedicated training tools for granular permissions - 6-phase implementation with mcp-tef validation - Mellea and AGNTCY Identity integrations Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal · 2026-02-21T10:27:28Z

Hi folks 👋
apologies for the silence on this, we've had a busy few weeks.

I have added some more aspects in the proposal :

Security Considerations section (auth, impersonation, audit)
Tool Scalability with modular loading and persona filtering
6-phase implementation plan with mcp-tef validation
HF Skills comparison in Alternatives
Mellea (Phase 2) and AGNTCY Identity (Phase 3) integrations
Multi-MCP scope to cover all Kubeflow CRDs
Tool counts and persona visibility consistency

I've still some bits to work through, but I'd appreciate any early feedback on the approach I've taken. I'm hoping I can raise something soon so folk can start taking a look, even if it's only a draft 🤞

cc @andreyvelich @astefanutti @franciscojavierarceo @jaiakash

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

…ry in phase 3 Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

astefanutti

I left some small comments, otherwise looks good.

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal · 2026-02-24T09:17:52Z

I left some small comments, otherwise looks good.

Thanks a lot @astefanutti for reviewing the PR, I have addressed the comments ✅

astefanutti

Thanks @abhijeet-dhumal!

/lgtm

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal · 2026-03-12T06:12:01Z

Thanks @andreyvelich @thesuperzapper ! I've added the Ownership section as requested ✅
The section now includes:

WG ML Experience as the owning working group
@abhijeet-dhumal as primary maintainer
Proposed repository: kubeflow/mcp-server
Maintainer onboarding plan for future contributors
Experimental status disclaimer (addressing @thesuperzapper's feedback as well)

abhijeet-dhumal · 2026-03-12T06:14:25Z

Thanks for the guidance @thesuperzapper! I've addressed both points:

Experimental status - Added explicit disclaimer that the project will be marked experimental and not intended for production until graduation criteria are met.
Regarding a separate KEP for repo creation - I'm open to this approach. Would a minimal KEP similar to KEP-913-components-repo work? The scope would be:
- Request creation of kubeflow/mcp-server repository
- Assign WG ML Experience leads + @abhijeet-dhumal as initial OWNERS
- Mark as experimental

Happy to create this if you'd prefer that path before proceeding with the technical details in this KEP.
Please let me know if the current Ownership section works, or if you'd like me to split this into a separate repo-creation KEP first.

…rename Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal · 2026-03-12T08:04:19Z

@thesuperzapper Following your suggestion, I've created a separate repo-creation KEP: 938: Create kubeflow/mcp-server repository ✅

andreyvelich · 2026-03-12T11:43:59Z

@thesuperzapper @abhijeet-dhumal Can we just include ownership information in the KEP-936?
It might be hard to track this in a two separate documents.
Check this as a reference: https://github.com/andreyvelich/community/blob/76e58f1dddf064ee3bab7a990566e5bd7a9bb19c/proposals/819-kubeflow-sdk/README.md#ownership-of-kubeflow-sdk

abhijeet-dhumal · 2026-03-13T09:23:50Z

@andreyvelich Hey I have added ownership section and related details already in this PR KEP-936 ✅ : https://github.com/kubeflow/community/pull/937/changes#diff-6088d3fe5c7209fb9d22f384545afe417304387ddf7a0c0cb7d962b5d704f746R28
@thesuperzapper If not needed i can close the outstanding PR - KEP: #945 🔴

andreyvelich

Thanks @abhijeet-dhumal, overall lgtm, let's continue the design discussion in the future meetings, but we can at least unblock the initial work.
/lgtm
/assign @chasecadet @thesuperzapper @franciscojavierarceo @juliusvonkohout

andreyvelich · 2026-03-13T11:15:56Z

+@mcp.tool()
+def fine_tune(


I think, we should consolidate all Create TrainJob calls to a single MCP tool which might allow agents to use CustomTrainer or BuiltinTrainer. That will reduce context.
Not a blocker, but we can start with some simple flows and interate.

cc @kubeflow/kubeflow-sdk-team

Thanks @andreyvelich! The separation was intentional !
A unified tool would need a complex union schema with conditional required fields, which LLMs struggle to navigate correctly.
I have already included this concern in design decision section and described reasoning in detail : https://github.com/kubeflow/community/pull/937/changes#diff-6088d3fe5c7209fb9d22f384545afe417304387ddf7a0c0cb7d962b5d704f746R353

devjpt23 · 2026-03-17T02:08:30Z

@andreyvelich @abhijeet-dhumal
The instructions field in the DESIGN.md (line 589-607) could be expanded into a structured, file-backed document. The current instructions string has basic workflow hints and tool selection guidance, which is great. But from my experience building MCP servers, I've found that LLMs still pick the wrong tool even with good descriptions, they need cross-tool workflow context that per-tool descriptions can't encode. Would it be worth loading the instructions from a versioned markdown file instead of an inline string? This would make it easier for the community to contribute scenarios, anti-patterns, and edge case playbooks over time. Happy to elaborate if useful.

devjpt23 · 2026-03-19T04:50:16Z

Hey @abhijeet-dhumal @andreyvelich, I have been working with implementing the MCP server, given my experience with MCP server. here is link to my rep.

I'd like to hear your feedback.

andreyvelich · 2026-03-19T21:05:29Z

We’ve reached quorum from KSC to move this forward, thanks to @abhijeet-dhumal for driving this! 🎉

/lgtm
/approve

Sanskarzz

Thanks for working through this proposal so thoroughly @abhijeet-dhumal — really impressive.
I've read through the full proposal, DESIGN.md, and the prior discussion, and I have a few specific comments below that I think are worth addressing before or during implementation.

Sanskarzz · 2026-03-21T21:19:51Z

+        spec = importlib.util.spec_from_file_location("training_module", temp_path)
+        module = importlib.util.module_from_spec(spec)
+        spec.loader.exec_module(module)
+        train_func = module.train


One thing worth clarifying in the security model here: spec.loader.exec_module(module) actually executes any module-level code in func_code on the MCP server process itself, not just inside the K8s pod. The pod only receives the extracted train function later. So if func_code contains something like:

x = open("/etc/passwd").read() # runs on MCP server at load time! def train(**kwargs): pass

…the open() call runs before any K8s pod is created.

The Security Note says "The func_code is executed within K8s pods, not on the MCP server host" — this is accurate for the train() function body, but not for module-level code. I think it's worth adding a clarifying sentence like:

"Module-level statements in func_code execute on the MCP server host at load time. Only the train() function body is serialized and sent to the K8s worker pod."

The AST checks catch a lot of the obvious cases, but as a defense-in-depth note this distinction matters — especially for a shared in-cluster/gateway deployment

You're right! exec_module() does execute module-level code on the MCP server before the train function is extracted and sent to K8s.
The AST checks (Phase 1) catch obvious cases like open(), subprocess,..., but as you note - defense-in-depth matters for shared gateway deployments.
I'll update the security note during implementation to clarify:

Module-level statements execute on MCP server at load time

Only the train() function body is serialised to K8s worker pods
Thanks for the thorough review!

@Sanskarzz I have added a security note accordingly !

Sanskarzz · 2026-03-21T21:23:47Z

+    try:
+        client = TrainerClient()
+        trainer = BuiltinTrainer(
+            config=TorchTuneConfig(
+                epochs=epochs, batch_size=batch_size, num_nodes=num_nodes,
+                resources_per_node=resources_per_node,
+                peft_config=LoraConfig(lora_rank=16, lora_alpha=32) if peft_method == "lora" else None,
+            )
+        )
+        initializer = Initializer(
+            model=HuggingFaceModelInitializer(storage_uri=f"hf://{model}"),
+            dataset=HuggingFaceDatasetInitializer(storage_uri=f"hf://{dataset}"),
+        )
+        job_name = client.train(runtime=runtime, trainer=trainer, initializer=initializer)
+        return {"success": True, "job_id": job_name, "trainer_type": "BuiltinTrainer"}
+    except ApiException as e:
+        return {"success": False, "error": f"Kubernetes API error: {e.reason}", "status_code": e.status}
+    except Exception as e:
+        return {"success": False, "error": str(e)}
+


Small but potentially important gap: each tool does client = TrainerClient() inline with no arguments, which defaults to KubernetesBackendConfig() and re-reads kubeconfig every call.

For the multi-user in-cluster scenario (where auth.py's create_k8s_client_for_user() creates an impersonated kubernetes.client.ApiClient), there's currently no shown path to pass that into TrainerClient. The SDK's constructor signature accepts:

TrainerClient(backend_config: KubernetesBackendConfig | ...)

Could we add a design note showing how the impersonated client gets wired in? Something like:

# in each tool, after auth: backend_config = KubernetesBackendConfig( client=create_k8s_client_for_user(user_identity), namespace=user_identity["namespace"], ) client = TrainerClient(backend_config=backend_config)

Without this, the per-user namespace isolation and impersonation described in the auth section doesn't actually connect to the tool execution path.

Valid point - here the pseudocode shows TrainerClient() with no args, which doesn't connect to the impersonation path in auth.py. The implementation will wire this properly 👍

I'll add this pattern to the implementation docs. The KEP pseudocode is intentionally simplified for readability..

Sanskarzz · 2026-03-21T21:28:24Z

+---
+
+## Resource Estimation Algorithm
+
+```python
+def estimate_resources(
+    model: str, 
+    peft_method: str,
+    batch_size: int = 4,
+    sequence_length: int = 2048,
+    quantization: str = "bf16",  # "fp32", "bf16", "fp16", "int8", "int4"
+    num_nodes: int = 1,
+) -> dict:


The get_model_info(model) call hits HuggingFace Hub API synchronously in the tool hot path. A few concerns:

No timeout — if HF Hub is slow/unreachable, this blocks indefinitely.

No caching — same model queried repeatedly on every estimate_resources() call.

Air-gapped environments — enterprise Kubeflow deployments often have no external internet access. The tool would always return "confidence": "low" and be essentially useless.

The user_provided_params approach was already discussed in the thread (great suggestion from @MansiSingh17, agreed to by @abhijeet-dhumal for Phase 2), but it's not yet reflected in the algorithm code. Could we update the pseudocode snippet to show:

# Prefer user-provided params for private/air-gapped models if user_provided_params: param_count = user_provided_params.get("param_count") ... else: model_info = get_model_info(model, timeout=10) # explicit timeout

Even a TODO comment would help set expectations for implementors.

When HF Hub is unreachable, we return "confidence": "low" with a message suggesting user_provided_params. This is already the intent - I'll make sure the implementation pseudocode reflects it.

For enterprise deployments, we could also support a local model registry endpoint as an alternative to HF Hub.

Sanskarzz · 2026-03-21T21:29:44Z

+### `wait_for_training` Timeout
+
+```python
+@mcp.tool()
+def wait_for_training(
+    job_id: str,
+    timeout: int = 3600,      # Default: 1 hour
+    poll_interval: int = 30,
+) -> dict:
+    """Returns current status on timeout (doesn't cancel job)."""
+```
+


The default timeout=3600 (1 hour) will be problematic for the Streamable HTTP transport. Most reverse proxies (nginx default: 60s, AWS ALB: 60s, GCP: 30 min) will terminate idle connections well before 1 hour. The SDK's own wait_for_job_status() defaults to 600s.

One option: keep the long timeout for stdio mode (Claude Desktop/Cursor) but document that for HTTP deployments the recommended pattern is polling via get_training_job() at intervals, rather than holding a long-lived connection open. Could add a transport note here:

"For StreamableHTTP deployments, prefer using get_training_job() polling over wait_for_training() to avoid proxy timeout issues."

Good point on transport differences. The 1-hour default makes sense for stdio (Claude Desktop, Cursor) but breaks behind reverse proxies.
but again .. . The KEP pseudocode is intentionally simplified for readability..
We will document the recommended pattern for each transport mode 👍

abhijeet-dhumal · 2026-04-01T18:11:12Z

We’ve reached quorum from KSC to move this forward, thanks to @abhijeet-dhumal for driving this! 🎉

/lgtm /approve

@andreyvelich @thesuperzapper May I request and update on the request here ?

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

andreyvelich

/lgtm
/approve
/hold cancel

google-oss-prow · 2026-04-08T14:49:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot requested review from franciscojavierarceo and juliusvonkohout January 28, 2026 09:16

google-oss-prow bot added the size/XL label Jan 28, 2026

abhijeet-dhumal mentioned this pull request Jan 28, 2026

Built an MCP Server for Kubeflow SDK kubeflow/sdk#238

Open

astefanutti reviewed Jan 29, 2026

View reviewed changes

Comment thread proposals/936-kubeflow-mcp-server/README.md Outdated

Comment thread proposals/936-kubeflow-mcp-server/README.md Outdated

abhijeet-dhumal requested a review from astefanutti January 29, 2026 17:49

astefanutti reviewed Jan 30, 2026

View reviewed changes

andreyvelich mentioned this pull request Feb 5, 2026

feat: add MCP server foundation with discovery tools kubeflow/sdk#265

Closed

1 task

google-oss-prow bot added size/XXL and removed size/XL labels Feb 21, 2026

abhijeet-dhumal added 4 commits February 21, 2026 15:46

Add KEP-936: Introduce Kubeflow-MCP proposal for AI-Powered Training …

542c963

…Interface Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

fix: add security section, HF Skills comparison, KEP-2839 links, and …

65d37de

…trainer-specific estimation Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

fix: adjust diagrams

1accc8f

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal force-pushed the kep-kubeflow-mcp branch from 6379e45 to f75db71 Compare February 21, 2026 10:17

fix: Move all diagrams to dedicated assets dir

3e35b7e

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal requested a review from astefanutti February 21, 2026 11:57

abhijeet-dhumal added 3 commits February 21, 2026 17:58

fix: clarify func_code serialization for MCP JSON transport

06e5879

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

refactor: split into lean KEP and DESIGN.md

3f16bda

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

Add AST security checks, enhanced resource estimation and opentelemet…

b3890d9

…ry in phase 3 Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal mentioned this pull request Feb 23, 2026

KEP: Kubeflow MCP Server - AI-Powered Training Interface #936

Closed

astefanutti reviewed Feb 23, 2026

View reviewed changes

Comment thread proposals/936-kubeflow-mcp-server/README.md Outdated

Comment thread proposals/936-kubeflow-mcp-server/README.md Outdated

Comment thread proposals/936-kubeflow-mcp-server/README.md Outdated

Comment thread proposals/936-kubeflow-mcp-server/README.md Outdated

fix: use Kubeflow Trainer naming and move Package Structure section

59ef751

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal requested a review from astefanutti February 24, 2026 09:18

astefanutti reviewed Feb 24, 2026

View reviewed changes

google-oss-prow bot assigned astefanutti Feb 24, 2026

docs: rename proposed repo to kubeflow/mcp-server

75b7877

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

docs: add design decision for Agent Skills placement and future repo …

dca948c

…rename Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal mentioned this pull request Mar 12, 2026

KEP-938: Create kubeflow/mcp-server repository #945

Closed

andreyvelich reviewed Mar 13, 2026

View reviewed changes

google-oss-prow bot assigned chasecadet, franciscojavierarceo, juliusvonkohout, thesuperzapper and andreyvelich Mar 13, 2026

google-oss-prow bot added the lgtm label Mar 13, 2026

google-oss-prow bot added the approved label Mar 19, 2026

Sanskarzz reviewed Mar 21, 2026

View reviewed changes

abhijeet-dhumal requested review from Sanskarzz and andreyvelich April 1, 2026 18:14

fix: clarify func_code executes on MCP server host at load time

c39da2c

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

google-oss-prow bot removed the lgtm label Apr 1, 2026

andreyvelich reviewed Apr 8, 2026

View reviewed changes

google-oss-prow bot added lgtm and removed do-not-merge/hold labels Apr 8, 2026

google-oss-prow bot merged commit fbf0a14 into kubeflow:master Apr 8, 2026
2 checks passed


		### Trainer Selection Logic

		![Trainer Selection](trainer-selection.png)

+              | **Unauthorized Access** | Policy layer enforces RBAC at tool level |
+              | **Scope Creep** | Clear delegation to `kubernetes-mcp-server` for generic K8s ops |
+              ## Design Details


		![Multi-MCP Ecosystem](multi-mcp.png)

		Design Principle: No overlap. `kubeflow-mcp` should handle training; `kubernetes-mcp-server` should handle generic K8s operations.

Persona	Tools	Use Case
readonly	7	Monitoring
data-scientist	12	Fine-tuning
ml-engineer	16	Full training

Conversation

abhijeet-dhumal commented Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Features

Implementation Plan

Related

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

abhijeet-dhumal Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

abhijeet-dhumal commented Feb 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

astefanutti left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

abhijeet-dhumal commented Feb 24, 2026

Uh oh!

astefanutti left a comment

Choose a reason for hiding this comment

Uh oh!

abhijeet-dhumal commented Jan 28, 2026 •

edited

Loading

abhijeet-dhumal Feb 3, 2026 •

edited

Loading

abhijeet-dhumal commented Feb 21, 2026 •

edited

Loading

abhijeet-dhumal commented Mar 13, 2026 •

edited

Loading

abhijeet-dhumal Mar 13, 2026 •

edited

Loading