
Add KEP-936: Introduce Kubeflow-MCP proposal for AI-Powered Training …#937

Merged
google-oss-prow[bot] merged 25 commits into kubeflow:master from abhijeet-dhumal:kep-kubeflow-mcp
Apr 8, 2026

Conversation

@abhijeet-dhumal (Member) commented Jan 28, 2026

Resolves #936

Summary

Introduces KEP-936: Kubeflow MCP Server - an AI-powered interface enabling LLM agents to interact with Kubeflow through the Model Context Protocol (MCP).

Key Features

| Feature | Description |
|---------|-------------|
| Dedicated Training Tools | `fine_tune()`, `run_custom_training()`, `run_container_training()` mapping to SDK trainer types |
| Modular Architecture | `--clients trainer,optimizer,hub` for selective tool loading |
| Persona-Based Access | `readonly`, `data-scientist`, `ml-engineer`, `platform-admin` |
| SDK Integration | Wraps `TrainerClient`, `OptimizerClient`, `ModelRegistryClient` |

Implementation Plan

| Phase | Scope | Tools |
|-------|-------|-------|
| 1. Core MCP Server | Trainer tools + mcp-tef validation | 16 |
| 2. Resource Management | Pre-flight checks, Mellea exploration | 17 |
| 3. Multi-Tenancy | Istio/impersonation, AGNTCY Identity | 17 |
| 4. Job Lifecycle | Suspend/resume, checkpoints | 19 |
| 5. Optimizer & Hub | Katib + Model Registry modules | 33 |
| 6. Future Clients | Pipelines, Spark (when SDK available) | TBD |

Related

/cc @andreyvelich @astefanutti @franciscojavierarceo @juliusvonkohout

kubeflow-mcp-demo.mp4


### Trainer Selection Logic

![Trainer Selection](trainer-selection.png)
I think it would be useful to link to kubeflow/trainer#2839 so it's extensible and support the future trainers that will be added.

Member Author

Thanks a lot @astefanutti for reviewing 🙌
Awesome suggestion, I have added references to KEP-2839 accordingly ✅

| **Unauthorized Access** | Policy layer enforces RBAC at tool level |
| **Scope Creep** | Clear delegation to `kubernetes-mcp-server` for generic K8s ops |

## Design Details

A section that covers the security aspects would be very valuable.

Member Author

Added a dedicated "Security Considerations" section ✅

Member Author

To address the security constraints, I have added a dedicated section covering:

  • Authentication: Kubeconfig, ServiceAccount, ServiceAccount + Impersonation, OIDC
  • Authorization: K8s RBAC flow, required verbs per resource
  • Multi-User In-Cluster: Istio integration with x-forwarded-user header (see identity-flow diagram)
  • Secret Management: HF tokens, S3 credentials handling
  • Multi-Tenancy: Namespace isolation, policy enforcement, ResourceQuota validation
  • Audit Logging: Structured JSON logs with redaction
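
For illustration, the redaction idea behind the audit logging bullet could look roughly like this (a minimal sketch; the key names and event shape are hypothetical, not the KEP's actual schema):

```python
import json

# Illustrative set of sensitive keys to mask before logs are emitted.
REDACT_KEYS = {"hf_token", "aws_secret_access_key", "authorization"}

def audit_entry(event: dict) -> str:
    """Serialize an audit event as structured JSON, masking sensitive fields."""
    redacted = {k: "***" if k.lower() in REDACT_KEYS else v for k, v in event.items()}
    return json.dumps(redacted, sort_keys=True)
```

The real implementation would also need nested-field redaction and a stable log schema; this only shows the top-level masking step.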

| Method | Description | Use Case |
|--------|-------------|----------|
| **Kubeconfig** | Uses `~/.kube/config` or `KUBECONFIG` env var | Local development, CI/CD |
| **ServiceAccount Token** | Mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token` | In-cluster deployment |

In the case of in-cluster deployment, should the MCP server rather impersonate users?

Member Author

Ah good point - I hadn't fully thought through the multi-user in-cluster scenario.

Please correct me if I'm wrong, but the flow would be:

  1. MCP server runs with a ServiceAccount that has impersonate permissions
  2. AI agent authenticates user and passes identity to MCP server
  3. MCP server can use K8s impersonation (--as / Impersonate-User header) for API calls

This way RBAC is enforced per-user even with a shared MCP deployment.

Does this align with how you'd expect it to work? I'm curious if there's a standard pattern for this - perhaps similar to how the Notebooks controller handles user identity? 🤔
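
The impersonation step (3) above could be sketched roughly as follows. This is a hedged illustration: `impersonation_headers` is a hypothetical helper, but `Impersonate-User` is the standard Kubernetes impersonation header set alongside the caller's own credentials:

```python
def impersonation_headers(sa_token: str, username: str) -> dict:
    """Headers for a K8s API request made on behalf of `username`.

    The MCP server authenticates with its own ServiceAccount token, and
    the API server evaluates RBAC as the impersonated end user.
    """
    return {
        "Authorization": f"Bearer {sa_token}",  # MCP server's own identity
        "Impersonate-User": username,           # RBAC enforced as this user
    }
```

The MCP server's ServiceAccount would additionally need a ClusterRole granting the `impersonate` verb on `users` for this to be accepted.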

Member Author

@astefanutti @andreyvelich
Please correct me here, but what if we use Istio/OAuth2Proxy for this case to inject user identity? Kubeflow already uses Istio + OIDC, right?

Member Author

To address this constraint, updated the Authentication section with three deployment modes:

  • ServiceAccount Token: single-user in-cluster
  • ServiceAccount + Impersonation: multi-user in-cluster
  • OIDC: enterprise SSO

For multi-user, the flow uses Kubeflow's existing Istio layer:

  1. User authenticates via Kubeflow dashboard
  2. Istio Gateway validates JWT, adds x-forwarded-user header
  3. MCP server reads header, impersonates user for K8s API calls

Also added RBAC example for the impersonator ClusterRole.
This now aligns with how Kubeflow Notebooks handles user identity. ✅
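
The header-trust step in that flow could be sketched as below (a hypothetical helper; a real server must also ensure this header can only ever be set by the Istio Gateway, never by the client directly):

```python
def user_from_headers(headers: dict) -> str:
    """Extract the end-user identity injected by the Istio Gateway.

    The header is trusted only because the Gateway has already
    validated the user's JWT upstream and stripped any client-supplied
    copies of the header.
    """
    user = headers.get("x-forwarded-user")
    if not user:
        raise PermissionError("request did not pass through the Istio Gateway")
    return user
```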


![Multi-MCP Ecosystem](multi-mcp.png)

**Design Principle:** No overlap. `kubeflow-mcp` should handle training; `kubernetes-mcp-server` should handle generic K8s operations.

kubeflow-mcp would eventually handle more than training.

Member Author

Ah right - I was thinking the principle should be: kubeflow-mcp owns Kubeflow-specific CRDs (TrainJob, Experiment, ModelVersion, etc.), while kubernetes-mcp-server handles generic K8s resources (PVCs, ConfigMaps, Secrets).

So the row should probably say "Kubeflow CRDs" rather than "TrainJob CRDs"?

Or do you think there's value in having separate MCP servers per Kubeflow component (trainer-mcp, katib-mcp, model-registry-mcp)?
I see Model Registry is building its own MCP server (model-registry#2029). This raises a design question 🤔

Member Author

I leaned toward unified since it mirrors the unified SDK structure, but I'm curious what you think. Separate servers might give component teams more autonomy.

Member

IIUC, the goal of kubeflow/model-registry#2029 is to provide a searchable catalog of MCP servers that users can install via the Hub UI. @ederign @kubeflow/kubeflow-hub-team please correct me if I misunderstood it.

Or do you think there's value in having separate MCP servers per Kubeflow component (trainer-mcp, katib-mcp, model-registry-mcp)?

This is a good question; we should talk more about it. Since each project has a different scope and set of requirements, a broad range of tools is often necessary to support end-to-end LLMOps. However, exposing too many tools directly to an agent can quickly overwhelm its context window, leading to degraded performance and less reliable reasoning.

As a workaround, we can think of introducing special Agent Skills (https://agentskills.io/home), which expose a subset of MCP tools the agent can talk to. These skills can act as curated capability bundles, exposing only a relevant subset of MCP tools to the agent.

Similar to this diagram: anthropics/claude-agent-sdk-python#544

Or we can design sub-agents which can be Trainer or Spark experts.

Member Author

Great pointers! I looked into Agent Skills and the sub-agent pattern. Here's my take:

  • Agent Skills - These are essentially curated tool bundles. Our modular client loading (--clients trainer) and persona filtering (--persona data-scientist) achieve similar scoping at the MCP level. The difference is we filter at server startup rather than requiring an intermediary skill layer.

  • Sub-agents - Interesting pattern for complex multi-step workflows. For Phase 1, we're keeping tools granular so the primary agent (Claude/Cursor) can orchestrate. If workflows get complex enough to need dedicated "Trainer expert" sub-agents, I think that's more of a client-side concern - the MCP server just needs to expose the right tool subsets, which we support via --clients.

That said, I added a note in Design Decisions about Speakeasy's dynamic toolsets pattern for Phase 5+ (33+ tools) - meta-tools like "search_tools", "describe_tool", "execute_tool" that let the LLM discover capabilities without loading all schemas upfront.
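
The startup-time filtering described above (--clients plus --persona) could be sketched as follows; the tool registry and persona sets here are illustrative, not the actual KEP inventory:

```python
# Illustrative registry: each tool declares its owning client module
# and the personas allowed to see it.
TOOLS = {
    "fine_tune":         {"client": "trainer",   "personas": {"data-scientist", "ml-engineer"}},
    "get_training_job":  {"client": "trainer",   "personas": {"readonly", "data-scientist", "ml-engineer"}},
    "create_experiment": {"client": "optimizer", "personas": {"ml-engineer"}},
}

def select_tools(clients: set, persona: str) -> list:
    """Return the tool names visible under this deployment configuration.

    Filtering happens once at server startup, so the LLM never sees
    schemas for tools outside its scope.
    """
    return sorted(
        name for name, meta in TOOLS.items()
        if meta["client"] in clients and persona in meta["personas"]
    )
```

With `--clients trainer --persona readonly`, such a server would register only the discovery tools and none of the mutating ones.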

Member Author

Open to discussing further if you think we should explicitly support Agent Skills integration!

Member

Sounds good, any thoughts on how local agents (e.g. Claude) can configure MCP server with the desired scope (e.g. --clients trainer) ?
Are we going to provide this argument as an entrypoint to start the FastAPI server?

Member Author

Yes, via entrypoint args.
Example for Claude Desktop (~/.claude/claude_desktop_config.json):

```json
{
  "mcpServers": {
    "kubeflow": {
      "command": "kubeflow-mcp",
      "args": ["serve", "--clients", "trainer", "--persona", "data-scientist"]
    }
  }
}
```


## Design Details

### MCP Tool Inventory

Given the large number of tools, do you have an estimate of the size that it takes in the context based on your prototype?

Is there any concern w.r.t. scaling the number of tools for all the Kubeflow components?

@abhijeet-dhumal (Member Author) commented Feb 3, 2026

From the prototype, ~24 tools with docstrings come to roughly 8-10K tokens.

My intuition is that this is manageable for current MCP clients, but I'm curious about your perspective. A few questions:

  1. Do you know if there's a recommended upper bound for MCP tool count? I haven't found guidance on this in the spec.
  2. For full Kubeflow coverage (Trainer + Katib + Model Registry + Pipelines), we might hit ~40-50 tools. Would lazy loading (only expose tools for installed components) be a reasonable mitigation?
  3. The other approach is policy-based filtering - a readonly user sees only ~7 discovery tools, not all 50. Does this feel like the right direction?
  4. Following the discussion in the thread above, should we plan separate MCP servers for each Kubeflow component? I think it's good to keep everything under one roof as a unified kubeflow-mcp, similar to kubeflow-sdk, which can wrap all SDK clients in a single install, but I would be happy to get more discussion or inputs on this aspect 💭

Happy to add a "Scalability Considerations" section if you think it's worth calling out explicitly.

cc: @andreyvelich @jaiakash @dhanishaphadate

Member Author

Added "Tool Scalability" section with detailed analysis:

Token estimates:

  • Phase 1 (trainer only): 16 tools, ~5.5K tokens
  • Full stack (trainer + optimizer + hub): 30 tools, ~11K tokens

Scalability strategies:

  • Modular Client Loading (--clients trainer): 40-70% token reduction
  • Persona Filtering (--persona data-scientist): 50-70% token reduction
  • Combined: 70-85% token reduction

Research shows LLM accuracy degrades beyond 20-25 tools:

  1. With modular loading + personas, we stay within optimal range.
  2. Also tracking the MCP Lazy Tool Hydration proposal, which could reduce overhead by 90%+ via deferred schema loading.
  3. Added mcp-tef validation in each implementation phase to ensure tool descriptions remain distinguishable as we scale.

Member Author

Please review whether this approach seems correct.
cc: @astefanutti @andreyvelich

Member

I think, the long term approach would be to build specialized agent personas to reduce context.
For example, if I am only interested in PyTorch fine-tuning, I don't need Spark tools.

Member Author

Exactly what I have added in the proposal! It includes:

| Persona | Tools | Use Case |
|---------|-------|----------|
| readonly | 7 | Monitoring |
| data-scientist | 12 | Fine-tuning |
| ml-engineer | 16 | Full training |

Combined with --clients trainer (only load trainer tools), a PyTorch fine-tuning user never sees Spark tools. The 70-85% reduction keeps us in the optimal 10-15 tool range per research.


Great discussion! I'd like to offer a different angle on the tool-count concern.

Rather than hard-capping the number of tools through personas or modular loading, I think we should be careful not to artificially constrain the API surface of kubeflow-mcp, especially as the SDK grows and new components get added. Limiting tools solves today's problem but creates friction tomorrow.

Instead, we can delegate the context-optimization problem to the MCP layer itself. There's an open-source project called mcp-optimizer that tackles exactly this: it dynamically prunes and prioritizes tools based on the current task/query, so the LLM only sees the relevant subset at inference time, without us having to bake filtering logic into the server.

I think there are two paths worth considering:

  1. Integrate mcp-optimizer-style logic directly into kubeflow-mcp — so the server itself handles intelligent tool surfacing.
  2. Provide official documentation/guidance on how to use mcp-optimizer (or similar middleware) alongside kubeflow-mcp for users who hit context limits.

This way, the server stays complete and unopinionated, and the optimization lives at the right layer. Happy to explore either direction.

Member

That's interesting, thanks for sharing @Sanskarzz!
What are the main benefits of delegating tool selection to an mcp-optimizer, instead of managing access through multiple sub-agents that statically define Skills with access to MCP tools via the allowed-tools section?

Member Author

Great point @Sanskarzz! I agree we shouldn't artificially constrain the API surface.
The way I see it, we have two complementary strategies:

  1. Static filtering (--clients, --persona) - deterministic, predictable, zero overhead. Good for: "I'm a data scientist, I only need training tools." Admin sets it once at deployment.
  2. Dynamic pruning (mcp-optimizer style) - intelligent, query-aware. Good for: "I have 50 tools loaded but this query only needs 5." Adds latency but handles the long tail.

For Phase 1-4, static filtering keeps us in the 10-16 tool range which works well. For Phase 5+ (33+ tools), I think you're right that dynamic approaches become necessary.
I'd suggest:

  1. Phase 1-4: Ship with --clients + --persona (already in proposal)
  2. Phase 5: Add --mode dynamic option that enables semantic tool discovery (the Speakeasy pattern I mentioned)
  3. Document mcp-optimizer as a recommended middleware for users who want external optimization

This way the server stays complete and unopinionated (as you said), and optimization can happen at whichever layer makes sense for the deployment.
Does this approach work? Happy to add explicit guidance on mcp-optimizer integration to the docs.
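
The Phase 5 dynamic-discovery idea could be sketched with meta-tools along these lines (names follow the Speakeasy pattern mentioned above; the registry contents are purely illustrative):

```python
# Illustrative registry mapping tool names to their descriptions.
# In a real server these would be the full MCP tool schemas.
REGISTRY = {
    "fine_tune": "Launch a LoRA fine-tuning TrainJob from a HF model",
    "list_runtimes": "List available training runtimes in the cluster",
}

def search_tools(query: str) -> list:
    """Meta-tool: find tools by keyword without loading every schema."""
    q = query.lower()
    return sorted(name for name, desc in REGISTRY.items() if q in desc.lower())

def describe_tool(name: str) -> str:
    """Meta-tool: fetch one tool's description on demand."""
    return REGISTRY[name]
```

Only the two or three meta-tool schemas sit in the context window up front; full tool descriptions are pulled in lazily per query.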

abhijeet-dhumal added a commit to abhijeet-dhumal/community that referenced this pull request Feb 21, 2026
…ar architecture

Address PR kubeflow#937 review feedback:
- Security section with Istio/impersonation for multi-tenant
- Tool scalability via --clients flag and persona filtering
- Dedicated training tools for granular permissions
- 6-phase implementation with mcp-tef validation
- Mellea and AGNTCY Identity integrations

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
…Interface

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
…trainer-specific estimation

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal
Member Author

abhijeet-dhumal commented Feb 21, 2026

Hi folks 👋
Apologies for the silence on this; we've had a busy few weeks.

I have added some more aspects to the proposal:

  • Security Considerations section (auth, impersonation, audit)
  • Tool Scalability with modular loading and persona filtering
  • 6-phase implementation plan with mcp-tef validation
  • HF Skills comparison in Alternatives
  • Mellea (Phase 2) and AGNTCY Identity (Phase 3) integrations
  • Multi-MCP scope to cover all Kubeflow CRDs
  • Tool counts and persona visibility consistency

I still have some bits to work through, but I'd appreciate any early feedback on the approach I've taken. I'm hoping I can raise something soon so folks can start taking a look, even if it's only a draft 🤞

cc @andreyvelich @astefanutti @franciscojavierarceo @jaiakash

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
…ry in phase 3

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@astefanutti left a comment

I left some small comments, otherwise looks good.

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal
Member Author

I left some small comments, otherwise looks good.

Thanks a lot @astefanutti for reviewing the PR, I have addressed the comments ✅

@astefanutti left a comment

Thanks @abhijeet-dhumal!

/lgtm

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal
Member Author

Thanks @andreyvelich @thesuperzapper ! I've added the Ownership section as requested ✅
The section now includes:

  • WG ML Experience as the owning working group
  • @abhijeet-dhumal as primary maintainer
  • Proposed repository: kubeflow/mcp-server
  • Maintainer onboarding plan for future contributors
  • Experimental status disclaimer (addressing @thesuperzapper's feedback as well)

@abhijeet-dhumal
Member Author

Thanks for the guidance @thesuperzapper! I've addressed both points:

  • Experimental status - Added explicit disclaimer that the project will be marked experimental and not intended for production until graduation criteria are met.
  • Regarding a separate KEP for repo creation - I'm open to this approach. Would a minimal KEP similar to KEP-913-components-repo work? The scope would be:
    • Request creation of kubeflow/mcp-server repository
    • Assign WG ML Experience leads + @abhijeet-dhumal as initial OWNERS
    • Mark as experimental

Happy to create this if you'd prefer that path before proceeding with the technical details in this KEP.
Please let me know if the current Ownership section works, or if you'd like me to split this into a separate repo-creation KEP first.

…rename

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal
Member Author

@thesuperzapper Following your suggestion, I've created a separate repo-creation KEP: 938: Create kubeflow/mcp-server repository

@andreyvelich
Member

@thesuperzapper @abhijeet-dhumal Can we just include the ownership information in KEP-936?
It might be hard to track this in two separate documents.
Check this as a reference: https://github.com/andreyvelich/community/blob/76e58f1dddf064ee3bab7a990566e5bd7a9bb19c/proposals/819-kubeflow-sdk/README.md#ownership-of-kubeflow-sdk

@abhijeet-dhumal
Member Author

abhijeet-dhumal commented Mar 13, 2026

@andreyvelich Hey, I have already added the ownership section and related details in this PR (KEP-936) ✅: https://github.com/kubeflow/community/pull/937/changes#diff-6088d3fe5c7209fb9d22f384545afe417304387ddf7a0c0cb7d962b5d704f746R28
@thesuperzapper If not needed, I can close the outstanding KEP PR: #945 🔴

@andreyvelich (Member) left a comment

Thanks @abhijeet-dhumal, overall lgtm, let's continue the design discussion in the future meetings, but we can at least unblock the initial work.
/lgtm
/assign @chasecadet @thesuperzapper @franciscojavierarceo @juliusvonkohout

Comment on lines +43 to +44
@mcp.tool()
def fine_tune(
Member

I think we should consolidate all Create TrainJob calls into a single MCP tool, which might allow agents to use CustomTrainer or BuiltinTrainer. That would reduce context.
Not a blocker, but we can start with some simple flows and iterate.

cc @kubeflow/kubeflow-sdk-team

@abhijeet-dhumal (Member Author) commented Mar 13, 2026

Thanks @andreyvelich! The separation was intentional.
A unified tool would need a complex union schema with conditional required fields, which LLMs struggle to navigate correctly.
I have already included this concern in the Design Decisions section and described the reasoning in detail: https://github.com/kubeflow/community/pull/937/changes#diff-6088d3fe5c7209fb9d22f384545afe417304387ddf7a0c0cb7d962b5d704f746R353

@devjpt23

@andreyvelich @abhijeet-dhumal
The instructions field in the DESIGN.md (lines 589-607) could be expanded into a structured, file-backed document. The current instructions string has basic workflow hints and tool selection guidance, which is great. But from my experience building MCP servers, I've found that LLMs still pick the wrong tool even with good descriptions; they need cross-tool workflow context that per-tool descriptions can't encode. Would it be worth loading the instructions from a versioned markdown file instead of an inline string? This would make it easier for the community to contribute scenarios, anti-patterns, and edge-case playbooks over time. Happy to elaborate if useful.

@devjpt23

Hey @abhijeet-dhumal @andreyvelich, I have been working on implementing the MCP server, given my experience with MCP servers. Here is a link to my repo.

I'd like to hear your feedback.

@andreyvelich
Member

We’ve reached quorum from KSC to move this forward, thanks to @abhijeet-dhumal for driving this! 🎉

/lgtm
/approve

@Sanskarzz left a comment

Thanks for working through this proposal so thoroughly @abhijeet-dhumal — really impressive.
I've read through the full proposal, DESIGN.md, and the prior discussion, and I have a few specific comments below that I think are worth addressing before or during implementation.

Comment on lines +138 to +141
```python
spec = importlib.util.spec_from_file_location("training_module", temp_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
train_func = module.train
```

One thing worth clarifying in the security model here: spec.loader.exec_module(module) actually executes any module-level code in func_code on the MCP server process itself, not just inside the K8s pod. The pod only receives the extracted train function later. So if func_code contains something like:

```python
x = open("/etc/passwd").read()   # runs on MCP server at load time!
def train(**kwargs):
    pass
```

…the open() call runs before any K8s pod is created.

The Security Note says "The func_code is executed within K8s pods, not on the MCP server host" — this is accurate for the train() function body, but not for module-level code. I think it's worth adding a clarifying sentence like:

"Module-level statements in func_code execute on the MCP server host at load time. Only the train() function body is serialized and sent to the K8s worker pod."

The AST checks catch a lot of the obvious cases, but as a defense-in-depth note this distinction matters — especially for a shared in-cluster/gateway deployment
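
One possible defense-in-depth screen for this distinction, sketched with Python's ast module (hypothetical, not the KEP's actual AST checks): reject any func_code whose module level contains more than imports and function definitions, so nothing executes on the MCP server host at load time.

```python
import ast

def module_level_is_safe(func_code: str) -> bool:
    """Return True only if the module level of func_code is inert.

    Allows imports and function definitions; rejects assignments,
    expressions, and any other statement that would run at load time.
    """
    allowed = (ast.Import, ast.ImportFrom, ast.FunctionDef)
    return all(isinstance(node, allowed) for node in ast.parse(func_code).body)
```

This is intentionally strict; a real implementation might additionally whitelist simple constant assignments, at the cost of more analysis.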

Member Author

You're right! exec_module() does execute module-level code on the MCP server before the train function is extracted and sent to K8s.
The AST checks (Phase 1) catch obvious cases like open() and subprocess, but as you note, defense-in-depth matters for shared gateway deployments.
I'll update the security note during implementation to clarify:

  • Module-level statements execute on the MCP server at load time
  • Only the train() function body is serialized to K8s worker pods

Thanks for the thorough review!

Member Author

@Sanskarzz I have added a security note accordingly!

Comment on lines +65 to +84
```python
try:
    client = TrainerClient()
    trainer = BuiltinTrainer(
        config=TorchTuneConfig(
            epochs=epochs, batch_size=batch_size, num_nodes=num_nodes,
            resources_per_node=resources_per_node,
            peft_config=LoraConfig(lora_rank=16, lora_alpha=32) if peft_method == "lora" else None,
        )
    )
    initializer = Initializer(
        model=HuggingFaceModelInitializer(storage_uri=f"hf://{model}"),
        dataset=HuggingFaceDatasetInitializer(storage_uri=f"hf://{dataset}"),
    )
    job_name = client.train(runtime=runtime, trainer=trainer, initializer=initializer)
    return {"success": True, "job_id": job_name, "trainer_type": "BuiltinTrainer"}
except ApiException as e:
    return {"success": False, "error": f"Kubernetes API error: {e.reason}", "status_code": e.status}
except Exception as e:
    return {"success": False, "error": str(e)}
```


Small but potentially important gap: each tool does client = TrainerClient() inline with no arguments, which defaults to KubernetesBackendConfig() and re-reads kubeconfig every call.

For the multi-user in-cluster scenario (where auth.py's create_k8s_client_for_user() creates an impersonated kubernetes.client.ApiClient), there's currently no shown path to pass that into TrainerClient. The SDK's constructor signature accepts:

TrainerClient(backend_config: KubernetesBackendConfig | ...)

Could we add a design note showing how the impersonated client gets wired in? Something like:

```python
# in each tool, after auth:
backend_config = KubernetesBackendConfig(
    client=create_k8s_client_for_user(user_identity),
    namespace=user_identity["namespace"],
)
client = TrainerClient(backend_config=backend_config)
```

Without this, the per-user namespace isolation and impersonation described in the auth section doesn't actually connect to the tool execution path.

Member Author

Valid point - here the pseudocode shows TrainerClient() with no args, which doesn't connect to the impersonation path in auth.py. The implementation will wire this properly 👍

Member Author

I'll add this pattern to the implementation docs. The KEP pseudocode is intentionally simplified for readability.

Comment on lines +204 to +216
---

## Resource Estimation Algorithm

```python
def estimate_resources(
    model: str,
    peft_method: str,
    batch_size: int = 4,
    sequence_length: int = 2048,
    quantization: str = "bf16",  # "fp32", "bf16", "fp16", "int8", "int4"
    num_nodes: int = 1,
) -> dict:
```

The get_model_info(model) call hits HuggingFace Hub API synchronously in the tool hot path. A few concerns:

  1. No timeout — if HF Hub is slow/unreachable, this blocks indefinitely.
  2. No caching — same model queried repeatedly on every estimate_resources() call.
  3. Air-gapped environments — enterprise Kubeflow deployments often have no external internet access. The tool would always return "confidence": "low" and be essentially useless.

The user_provided_params approach was already discussed in the thread (great suggestion from @MansiSingh17, agreed to by @abhijeet-dhumal for Phase 2), but it's not yet reflected in the algorithm code. Could we update the pseudocode snippet to show:

```python
# Prefer user-provided params for private/air-gapped models
if user_provided_params:
    param_count = user_provided_params.get("param_count")
    ...
else:
    model_info = get_model_info(model, timeout=10)  # explicit timeout
```

Even a TODO comment would help set expectations for implementors.

Member Author

When HF Hub is unreachable, we return "confidence": "low" with a message suggesting user_provided_params. This is already the intent - I'll make sure the implementation pseudocode reflects it.

For enterprise deployments, we could also support a local model registry endpoint as an alternative to HF Hub.
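A sketch of that fallback order (user-provided params → in-process cache → remote lookup with a timeout). Here `fetch_fn` stands in for the actual HF Hub or registry call; the cache and confidence labels follow the thread's discussion, not a finalized design:

```python
_PARAM_CACHE: dict[str, int] = {}

def resolve_param_count(model, user_provided_params=None, fetch_fn=None, timeout=10):
    """Return (param_count, confidence) without blocking indefinitely."""
    # 1. Prefer explicit user-provided params (private / air-gapped models).
    if user_provided_params and user_provided_params.get("param_count"):
        return user_provided_params["param_count"], "high"
    # 2. Serve repeated queries for the same model from the in-process cache.
    if model in _PARAM_CACHE:
        return _PARAM_CACHE[model], "medium"
    # 3. Fall back to a remote lookup with an explicit timeout.
    try:
        count = fetch_fn(model, timeout=timeout)
    except Exception:
        return None, "low"  # hub unreachable: suggest user_provided_params
    _PARAM_CACHE[model] = count
    return count, "medium"
```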

Comment on lines +448 to +459
### `wait_for_training` Timeout

```python
@mcp.tool()
def wait_for_training(
    job_id: str,
    timeout: int = 3600,  # Default: 1 hour
    poll_interval: int = 30,
) -> dict:
    """Returns current status on timeout (doesn't cancel job)."""
```


The default timeout=3600 (1 hour) will be problematic for the Streamable HTTP transport. Most reverse proxies (nginx default: 60s, AWS ALB: 60s, GCP: 30 min) will terminate idle connections well before 1 hour. The SDK's own wait_for_job_status() defaults to 600s.

One option: keep the long timeout for stdio mode (Claude Desktop/Cursor) but document that for HTTP deployments the recommended pattern is polling via get_training_job() at intervals, rather than holding a long-lived connection open. Could add a transport note here:

"For StreamableHTTP deployments, prefer using get_training_job() polling over wait_for_training() to avoid proxy timeout issues."

Member Author

Good point on transport differences. The 1-hour default makes sense for stdio (Claude Desktop, Cursor) but breaks behind reverse proxies.
Again, the KEP pseudocode is intentionally simplified for readability.
We will document the recommended pattern for each transport mode 👍
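On the agent side, the recommended HTTP-transport pattern could look like the following. `get_training_job` is the existing tool from the proposal; the helper name and loop are illustrative:

```python
import time

def poll_until_done(get_training_job, job_id: str,
                    timeout: int = 3600, poll_interval: int = 30) -> dict:
    """Poll with short requests instead of one long-lived connection,
    so no reverse proxy sees an idle stream longer than poll_interval."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_training_job(job_id)
        if status.get("state") in ("Succeeded", "Failed"):
            return status
        if time.monotonic() >= deadline:
            return status  # mirror wait_for_training: report, don't cancel
        time.sleep(poll_interval)
```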

@abhijeet-dhumal
Member Author

We’ve reached quorum from KSC to move this forward, thanks to @abhijeet-dhumal for driving this! 🎉

/lgtm /approve

@andreyvelich @thesuperzapper May I request an update on this?

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Apr 1, 2026
Member

@andreyvelich left a comment

/lgtm
/approve
/hold cancel

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit fbf0a14 into kubeflow:master Apr 8, 2026
2 checks passed

Successfully merging this pull request may close these issues.

KEP: Kubeflow MCP Server - AI-Powered Training Interface