Add KEP-936: Introduce Kubeflow-MCP proposal for AI-Powered Training …#937
Conversation
| ### Trainer Selection Logic | ||
I think it would be useful to link to kubeflow/trainer#2839 so it's extensible and supports the future trainers that will be added.
Thanks a lot @astefanutti for reviewing 🙌
Awesome suggestion, I have added references to KEP-2839 accordingly ✅
| | **Unauthorized Access** | Policy layer enforces RBAC at tool level | | ||
| | **Scope Creep** | Clear delegation to `kubernetes-mcp-server` for generic K8s ops | | ||
| ## Design Details |
A section that covers the security aspects would be very valuable.
Added a dedicated "Security Considerations" section ✅
To address the security constraints, I have added a dedicated section with:
- Authentication: Kubeconfig, ServiceAccount, ServiceAccount + Impersonation, OIDC
- Authorization: K8s RBAC flow, required verbs per resource
- Multi-User In-Cluster: Istio integration with x-forwarded-user header (see identity-flow diagram)
- Secret Management: HF tokens, S3 credentials handling
- Multi-Tenancy: Namespace isolation, policy enforcement, ResourceQuota validation
- Audit Logging: Structured JSON logs with redaction
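To make the redaction point concrete, here's a minimal sketch of what a structured audit-log line with masking could look like — the field names, sensitive-key list, and token pattern are illustrative assumptions, not the KEP's actual schema:

```python
import json
import re

# Hypothetical illustration of "structured JSON logs with redaction";
# the key list and pattern below are assumptions, not the KEP's policy.
SENSITIVE_KEYS = {"hf_token", "aws_secret_access_key", "authorization"}
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]+")

def redact(value: str) -> str:
    """Mask HuggingFace-style tokens embedded in string values."""
    return TOKEN_PATTERN.sub("[REDACTED]", value)

def audit_entry(user: str, tool: str, args: dict) -> str:
    """Build one structured audit-log line with sensitive fields masked."""
    safe_args = {
        k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(str(v))
        for k, v in args.items()
    }
    return json.dumps({"user": user, "tool": tool, "args": safe_args})

line = audit_entry("alice", "fine_tune", {"model": "llama-3", "hf_token": "hf_abc123"})
```

The same helper would run for every tool invocation, so secrets never reach the log sink in clear text.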
| | Method | Description | Use Case | | ||
| |--------|-------------|----------| | ||
| | **Kubeconfig** | Uses `~/.kube/config` or `KUBECONFIG` env var | Local development, CI/CD | | ||
| | **ServiceAccount Token** | Mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token` | In-cluster deployment | |
In the case of in-cluster deployment, should the MCP server rather impersonate users?
Ah good point - I hadn't fully thought through the multi-user in-cluster scenario.
Please correct me if I'm wrong, but the flow would be:
- MCP server runs with a ServiceAccount that has `impersonate` permissions
- AI agent authenticates the user and passes identity to the MCP server
- MCP server uses K8s impersonation (the `--as` flag / `Impersonate-User` header) for API calls
This way RBAC is enforced per-user even with a shared MCP deployment.
Does this align with how you'd expect it to work? I'm curious if there's a standard pattern for this - perhaps similar to how the Notebooks controller handles user identity? 🤔
@astefanutti @andreyvelich
Please correct me here, but what if we use Istio/OAuth2Proxy to inject user identity for this case? Kubeflow already uses Istio + OIDC, right?
To address this constraint, I updated the Authentication section with three deployment modes:
- ServiceAccount Token: single-user in-cluster
- ServiceAccount + Impersonation: multi-user in-cluster
- OIDC: enterprise SSO
For multi-user, the flow uses Kubeflow's existing Istio layer:
- User authenticates via Kubeflow dashboard
- Istio Gateway validates JWT, adds x-forwarded-user header
- MCP server reads header, impersonates user for K8s API calls
Also added RBAC example for the impersonator ClusterRole.
This now aligns with how Kubeflow Notebooks handles user identity. ✅
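For reference, a minimal sketch of the header-to-impersonation step (the `x-forwarded-user` header name follows Kubeflow's Istio setup described above; the helper itself is illustrative, not from the KEP):

```python
# Minimal sketch, assuming Istio injects the authenticated identity as an
# `x-forwarded-user` header; the function name is illustrative.
def impersonation_headers(request_headers: dict) -> dict:
    """Translate the Istio-injected identity header into the Kubernetes
    impersonation header attached to outgoing API calls."""
    user = request_headers.get("x-forwarded-user")
    if not user:
        raise PermissionError("missing x-forwarded-user header")
    # The API server honors Impersonate-User when the MCP server's
    # ServiceAccount holds `impersonate` RBAC permissions.
    return {"Impersonate-User": user}

headers = impersonation_headers({"x-forwarded-user": "alice@example.com"})
```

With this in place, RBAC denials surface per-user even though every request goes through the one shared MCP ServiceAccount.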
| **Design Principle:** No overlap. `kubeflow-mcp` should handle training; `kubernetes-mcp-server` should handle generic K8s operations. |
kubeflow-mcp would eventually handle more than training.
Ah right - I was thinking the principle should be: kubeflow-mcp owns Kubeflow-specific CRDs (TrainJob, Experiment, ModelVersion, etc.), while kubernetes-mcp-server handles generic K8s resources (PVCs, ConfigMaps, Secrets).
So the row should probably say "Kubeflow CRDs" rather than "TrainJob CRDs"?
Or do you think there's value in having separate MCP servers per Kubeflow component (trainer-mcp, katib-mcp, model-registry-mcp)?
I see Model Registry is building its own MCP server (model-registry#2029). This raises a design question 🤔
I leaned toward unified since it mirrors the unified SDK structure, but I'm curious what you think? Separate servers might give component teams more autonomy.
IIUC, the goal of kubeflow/model-registry#2029 is to provide a searchable catalog of MCP servers that users can install via the Hub UI. @ederign @kubeflow/kubeflow-hub-team please correct me if I misunderstood it.
Or do you think there's value in having separate MCP servers per Kubeflow component (trainer-mcp, katib-mcp, model-registry-mcp)?
This is a good question; we should talk more about it. Since each project has a different scope and set of requirements, a broad range of tools is often necessary to support end-to-end LLMOps. However, exposing too many tools directly to an agent can quickly overwhelm its context window, leading to degraded performance and less reliable reasoning.
As a workaround, we can think of introducing special Agent Skills: https://agentskills.io/home. These skills can act as curated capability bundles, exposing only a relevant subset of MCP tools to the agent.
Similar to this diagram: anthropics/claude-agent-sdk-python#544
Or we can design sub-agents which can be Trainer or Spark experts.
Great pointers! I looked into Agent Skills and the sub-agent pattern. Here's my take:
- Agent Skills — these are essentially curated tool bundles. Our modular client loading (`--clients trainer`) and persona filtering (`--persona data-scientist`) achieve similar scoping at the MCP level. The difference is we filter at server startup rather than requiring an intermediary skill layer.
- Sub-agents — an interesting pattern for complex multi-step workflows. For Phase 1, we're keeping tools granular so the primary agent (Claude/Cursor) can orchestrate. If workflows get complex enough to need dedicated "Trainer expert" sub-agents, I think that's more of a client-side concern — the MCP server just needs to expose the right tool subsets, which we support via `--clients`.
That said, I added a note in Design Decisions about Speakeasy's dynamic toolsets pattern for Phase 5+ (33+ tools) - meta-tools like "search_tools", "describe_tool", "execute_tool" that let the LLM discover capabilities without loading all schemas upfront.
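To illustrate the meta-tool idea, a toy sketch — the registry shape and tool descriptions here are hypothetical stand-ins, not the proposal's inventory:

```python
# Hypothetical sketch of the meta-tool pattern: instead of loading all
# tool schemas upfront, the agent only sees search/describe meta-tools
# and hydrates individual schemas on demand.
REGISTRY = {
    "fine_tune": {"description": "Launch a LoRA fine-tuning TrainJob"},
    "list_runtimes": {"description": "List available training runtimes"},
}

def search_tools(query: str) -> list[str]:
    """Return tool names whose descriptions mention the query."""
    q = query.lower()
    return [n for n, meta in REGISTRY.items() if q in meta["description"].lower()]

def describe_tool(name: str) -> dict:
    """Load a single tool's schema on demand."""
    return REGISTRY[name]

matches = search_tools("fine-tuning")
```

The agent pays context cost only for the handful of schemas it actually describes, regardless of how many tools the server registers.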
Open to discussing further if you think we should explicitly support Agent Skills integration!
Sounds good — any thoughts on how local agents (e.g. Claude) can configure the MCP server with the desired scope (e.g. `--clients trainer`)?
Are we going to provide this argument as an entrypoint to start the FastAPI server?
Yes, via entrypoint args.
Example for Claude Desktop (`~/.claude/claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "kubeflow": {
      "command": "kubeflow-mcp",
      "args": ["serve", "--clients", "trainer", "--persona", "data-scientist"]
    }
  }
}
```
| ## Design Details | ||
| ### MCP Tool Inventory |
Given the large number of tools, do you have an estimate of the size that it takes in the context based on your prototype?
Is there any concern w.r.t. to scaling the number of tools for all the Kubeflow components?
From the prototype, ~24 tools with docstrings come to roughly 8-10K tokens.
My intuition is that this is manageable for current MCP clients, but I'm curious about your perspective. A few questions:
- Do you know if there's a recommended upper bound for MCP tool count? I haven't found guidance on this in the spec.
- For full Kubeflow coverage (Trainer + Katib + Model Registry + Pipelines), we might hit ~40-50 tools. Would lazy loading (only expose tools for installed components) be a reasonable mitigation?
- The other approach is policy-based filtering — a `readonly` user sees only ~7 discovery tools, not all 50. Does this feel like the right direction?
- And from the discussion in the thread above: should we plan separate MCP servers for each Kubeflow component? I think it's good to keep everything under one roof as a unified kubeflow-mcp, similar to kubeflow-sdk, which wraps all SDK clients in a single install — but I'd be happy to get more discussion or input on this aspect 💭
Happy to add a "Scalability Considerations" section if you think it's worth calling out explicitly.
Added "Tool Scalability" section with detailed analysis:
Token estimates:
- Phase 1 (trainer only): 16 tools, ~5.5K tokens
- Full stack (trainer + optimizer + hub): 30 tools, ~11K tokens
Scalability strategies (context reduction):
- Modular client loading (`--clients trainer`): 40-70%
- Persona filtering (`--persona data-scientist`): 50-70%
- Combined: 70-85%
Research shows LLM accuracy degrades beyond 20-25 tools:
- With modular loading + personas, we stay within the optimal range.
- Also tracking the MCP Lazy Tool Hydration proposal, which could reduce overhead by 90%+ via deferred schema loading.
- Added mcp-tef validation in each implementation phase to ensure tool descriptions remain distinguishable as we scale.
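A minimal sketch of how the startup-time filtering could work — the tool/persona mapping below is illustrative, not the proposal's actual inventory:

```python
# Illustrative tool metadata; real tools/personas come from the KEP tables.
TOOLS = {
    "list_jobs":   {"client": "trainer",   "personas": {"readonly", "data-scientist", "ml-engineer"}},
    "fine_tune":   {"client": "trainer",   "personas": {"data-scientist", "ml-engineer"}},
    "tune_params": {"client": "optimizer", "personas": {"ml-engineer"}},
}

def select_tools(clients: set, persona: str) -> list:
    """Apply the --clients and --persona filters once, at server startup,
    before registering tools with the MCP framework."""
    return sorted(
        name for name, meta in TOOLS.items()
        if meta["client"] in clients and persona in meta["personas"]
    )

visible = select_tools({"trainer"}, "data-scientist")
```

Because the filter runs once at startup, the client never pays context cost for tools outside its scope — no per-request pruning needed.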
Please review whether this approach seems correct.
cc: @astefanutti @andreyvelich
I think, the long term approach would be to build specialized agent personas to reduce context.
For example, if I am only interested in PyTorch fine-tuning, I don't need Spark tools.
Exactly what I have added in the proposal! It includes:
| Persona | Tools | Use Case |
|---|---|---|
| readonly | 7 | Monitoring |
| data-scientist | 12 | Fine-tuning |
| ml-engineer | 16 | Full training |
Combined with --clients trainer (only load trainer tools), a PyTorch fine-tuning user never sees Spark tools. The 70-85% reduction keeps us in the optimal 10-15 tool range per research.
Great discussion! I'd like to offer a different angle on the tool-count concern.
Rather than hard-capping the number of tools through personas or modular loading, I think we should be careful not to artificially constrain the API surface of kubeflow-mcp, especially as the SDK grows and new components get added. Limiting tools solves today's problem but creates friction tomorrow.
Instead, we can delegate the context-optimization problem to the MCP layer itself. There's an open-source project called mcp-optimizer that tackles exactly this: it dynamically prunes and prioritizes tools based on the current task/query, so the LLM only sees the relevant subset at inference time, without us having to bake filtering logic into the server.
I think there are two paths worth considering:
- Integrate mcp-optimizer-style logic directly into kubeflow-mcp — so the server itself handles intelligent tool surfacing.
- Provide official documentation/guidance on how to use mcp-optimizer (or similar middleware) alongside kubeflow-mcp for users who hit context limits.
This way, the server stays complete and unopinionated, and the optimization lives at the right layer. Happy to explore either direction.
That's interesting, thanks for sharing @Sanskarzz!
What are the main benefits of delegating tool selection to an mcp-optimizer, instead of managing access through multiple sub-agents that statically define Skills with access to MCP tools via the allowed-tools section?
Great point @Sanskarzz! I agree we shouldn't artificially constrain the API surface.
The way I see it, we have two complementary strategies:
- Static filtering (--clients, --persona) - deterministic, predictable, zero overhead. Good for: "I'm a data scientist, I only need training tools." Admin sets it once at deployment.
- Dynamic pruning (mcp-optimizer style) - intelligent, query-aware. Good for: "I have 50 tools loaded but this query only needs 5." Adds latency but handles the long tail.
For Phase 1-4, static filtering keeps us in the 10-16 tool range which works well. For Phase 5+ (33+ tools), I think you're right that dynamic approaches become necessary.
I'd suggest:
- Phase 1-4: Ship with --clients + --persona (already in proposal)
- Phase 5: Add --mode dynamic option that enables semantic tool discovery (the Speakeasy pattern I mentioned)
- Document mcp-optimizer as a recommended middleware for users who want external optimization
This way the server stays complete and unopinionated (as you said), and optimization can happen at whichever layer makes sense for the deployment.
Does this approach work? Happy to add explicit guidance on mcp-optimizer integration to the docs.
…ar architecture

Address PR kubeflow#937 review feedback:
- Security section with Istio/impersonation for multi-tenant
- Tool scalability via --clients flag and persona filtering
- Dedicated training tools for granular permissions
- 6-phase implementation with mcp-tef validation
- Mellea and AGNTCY Identity integrations

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
…Interface Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
…trainer-specific estimation Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Hi folks 👋 I have added some more aspects to the proposal.
I still have some bits to work through, but I'd appreciate any early feedback on the approach I've taken. I'm hoping I can raise something soon so folks can start taking a look, even if it's only a draft 🤞 cc @andreyvelich @astefanutti @franciscojavierarceo @jaiakash
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
…ry in phase 3 Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
astefanutti
left a comment
I left some small comments, otherwise looks good.
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Thanks a lot @astefanutti for reviewing the PR, I have addressed the comments ✅ |
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Thanks @andreyvelich @thesuperzapper ! I've added the Ownership section as requested ✅
Thanks for the guidance @thesuperzapper! I've addressed both points:
Happy to create this if you'd prefer that path before proceeding with the technical details in this KEP. |
…rename Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@thesuperzapper Following your suggestion, I've created a separate repo-creation KEP: 938: Create |
@thesuperzapper @abhijeet-dhumal Can we just include the ownership information in KEP-936?
@andreyvelich Hey, I have already added the ownership section and related details in this PR (KEP-936) ✅: https://github.com/kubeflow/community/pull/937/changes#diff-6088d3fe5c7209fb9d22f384545afe417304387ddf7a0c0cb7d962b5d704f746R28
andreyvelich
left a comment
Thanks @abhijeet-dhumal, overall lgtm, let's continue the design discussion in the future meetings, but we can at least unblock the initial work.
/lgtm
/assign @chasecadet @thesuperzapper @franciscojavierarceo @juliusvonkohout
| @mcp.tool() | ||
| def fine_tune( |
I think, we should consolidate all Create TrainJob calls to a single MCP tool which might allow agents to use CustomTrainer or BuiltinTrainer. That will reduce context.
Not a blocker, but we can start with some simple flows and iterate.
cc @kubeflow/kubeflow-sdk-team
Thanks @andreyvelich! The separation was intentional!
A unified tool would need a complex union schema with conditional required fields, which LLMs struggle to navigate correctly.
I have already included this concern in the design decisions section and described the reasoning in detail: https://github.com/kubeflow/community/pull/937/changes#diff-6088d3fe5c7209fb9d22f384545afe417304387ddf7a0c0cb7d962b5d704f746R353
@andreyvelich @abhijeet-dhumal |
Hey @abhijeet-dhumal @andreyvelich, I have been working on implementing the MCP server, given my experience with MCP servers — here is a link to my repo. I'd like to hear your feedback.
We’ve reached quorum from KSC to move this forward, thanks to @abhijeet-dhumal for driving this! 🎉 /lgtm |
Sanskarzz
left a comment
Thanks for working through this proposal so thoroughly @abhijeet-dhumal — really impressive.
I've read through the full proposal, DESIGN.md, and the prior discussion, and I have a few specific comments below that I think are worth addressing before or during implementation.
| spec = importlib.util.spec_from_file_location("training_module", temp_path) | ||
| module = importlib.util.module_from_spec(spec) | ||
| spec.loader.exec_module(module) | ||
| train_func = module.train |
One thing worth clarifying in the security model here: spec.loader.exec_module(module) actually executes any module-level code in func_code on the MCP server process itself, not just inside the K8s pod. The pod only receives the extracted train function later. So if func_code contains something like:
```python
x = open("/etc/passwd").read()  # runs on MCP server at load time!

def train(**kwargs):
    pass
```
…the open() call runs before any K8s pod is created.
The Security Note says "The func_code is executed within K8s pods, not on the MCP server host" — this is accurate for the train() function body, but not for module-level code. I think it's worth adding a clarifying sentence like:
"Module-level statements in `func_code` execute on the MCP server host at load time. Only the `train()` function body is serialized and sent to the K8s worker pod."
The AST checks catch a lot of the obvious cases, but as a defense-in-depth note this distinction matters — especially for a shared in-cluster/gateway deployment.
You're right! exec_module() does execute module-level code on the MCP server before the train function is extracted and sent to K8s.
The AST checks (Phase 1) catch obvious cases like open(), subprocess, etc., but as you note, defense-in-depth matters for shared gateway deployments.
I'll update the security note during implementation to clarify:
- Module-level statements execute on MCP server at load time
- Only the train() function body is serialised to K8s worker pods
Thanks for the thorough review!
@Sanskarzz I have added a security note accordingly!
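For anyone implementing this, a rough sketch of the module-level check discussed above — the banned-name list is an illustrative assumption, not the KEP's actual policy:

```python
import ast

# Defense-in-depth sketch: flag module-level calls in user-supplied
# func_code BEFORE exec_module() runs it on the MCP server host.
# The banned-name set below is illustrative only.
BANNED = {"open", "exec", "eval", "__import__"}

def module_level_violations(func_code: str) -> list:
    """Return names of banned calls appearing outside function bodies."""
    tree = ast.parse(func_code)
    violations = []
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue  # function bodies run later, inside the K8s pod
        for call in ast.walk(node):
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                if call.func.id in BANNED:
                    violations.append(call.func.id)
    return violations

bad = module_level_violations(
    "x = open('/etc/passwd').read()\ndef train(**kwargs):\n    pass\n"
)
```

Note this only inspects names, so it is a screen, not a sandbox — aliased or attribute-based calls would need stricter analysis or process isolation.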
| try: | ||
| client = TrainerClient() | ||
| trainer = BuiltinTrainer( | ||
| config=TorchTuneConfig( | ||
| epochs=epochs, batch_size=batch_size, num_nodes=num_nodes, | ||
| resources_per_node=resources_per_node, | ||
| peft_config=LoraConfig(lora_rank=16, lora_alpha=32) if peft_method == "lora" else None, | ||
| ) | ||
| ) | ||
| initializer = Initializer( | ||
| model=HuggingFaceModelInitializer(storage_uri=f"hf://{model}"), | ||
| dataset=HuggingFaceDatasetInitializer(storage_uri=f"hf://{dataset}"), | ||
| ) | ||
| job_name = client.train(runtime=runtime, trainer=trainer, initializer=initializer) | ||
| return {"success": True, "job_id": job_name, "trainer_type": "BuiltinTrainer"} | ||
| except ApiException as e: | ||
| return {"success": False, "error": f"Kubernetes API error: {e.reason}", "status_code": e.status} | ||
| except Exception as e: | ||
| return {"success": False, "error": str(e)} | ||
Small but potentially important gap: each tool does client = TrainerClient() inline with no arguments, which defaults to KubernetesBackendConfig() and re-reads kubeconfig every call.
For the multi-user in-cluster scenario (where auth.py's create_k8s_client_for_user() creates an impersonated kubernetes.client.ApiClient), there's currently no shown path to pass that into TrainerClient. The SDK's constructor signature accepts:
```python
TrainerClient(backend_config: KubernetesBackendConfig | ...)
```
Could we add a design note showing how the impersonated client gets wired in? Something like:
```python
# in each tool, after auth:
backend_config = KubernetesBackendConfig(
    client=create_k8s_client_for_user(user_identity),
    namespace=user_identity["namespace"],
)
client = TrainerClient(backend_config=backend_config)
```
Without this, the per-user namespace isolation and impersonation described in the auth section doesn't actually connect to the tool execution path.
Valid point - here the pseudocode shows TrainerClient() with no args, which doesn't connect to the impersonation path in auth.py. The implementation will wire this properly 👍
I'll add this pattern to the implementation docs. The KEP pseudocode is intentionally simplified for readability.
| --- | ||
| ## Resource Estimation Algorithm | ||
| ```python | ||
| def estimate_resources( | ||
| model: str, | ||
| peft_method: str, | ||
| batch_size: int = 4, | ||
| sequence_length: int = 2048, | ||
| quantization: str = "bf16", # "fp32", "bf16", "fp16", "int8", "int4" | ||
| num_nodes: int = 1, | ||
| ) -> dict: |
The get_model_info(model) call hits HuggingFace Hub API synchronously in the tool hot path. A few concerns:
- No timeout — if HF Hub is slow/unreachable, this blocks indefinitely.
- No caching — the same model is queried repeatedly on every `estimate_resources()` call.
- Air-gapped environments — enterprise Kubeflow deployments often have no external internet access. The tool would always return `"confidence": "low"` and be essentially useless.
The user_provided_params approach was already discussed in the thread (great suggestion from @MansiSingh17, agreed to by @abhijeet-dhumal for Phase 2), but it's not yet reflected in the algorithm code. Could we update the pseudocode snippet to show:
```python
# Prefer user-provided params for private/air-gapped models
if user_provided_params:
    param_count = user_provided_params.get("param_count")
    ...
else:
    model_info = get_model_info(model, timeout=10)  # explicit timeout
```
Even a TODO comment would help set expectations for implementors.
When HF Hub is unreachable, we return "confidence": "low" with a message suggesting user_provided_params. This is already the intent - I'll make sure the implementation pseudocode reflects it.
For enterprise deployments, we could also support a local model registry endpoint as an alternative to HF Hub.
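A rough sketch of the fallback order described above — the helper name, fetcher interface, and confidence labels are placeholders, not the actual implementation:

```python
# Sketch of the metadata-resolution fallback; get_model_info is replaced
# here by an injectable `fetch` callable so the logic stays testable.
def resolve_param_count(model: str, user_provided_params=None, fetch=None):
    """Prefer user-supplied metadata; fall back to a timeout-bounded
    HF Hub lookup; degrade to low confidence when both are unavailable."""
    if user_provided_params and "param_count" in user_provided_params:
        return user_provided_params["param_count"], "high"
    try:
        if fetch is None:
            raise ConnectionError("no fetcher configured")
        info = fetch(model, timeout=10)  # explicit timeout for the Hub call
        return info["param_count"], "medium"
    except Exception:
        # Air-gapped or HF Hub unreachable: suggest user_provided_params.
        return None, "low"

count, confidence = resolve_param_count(
    "meta-llama/Llama-3-8B",
    user_provided_params={"param_count": 8_000_000_000},
)
```

The injectable fetcher also makes the local-registry alternative a drop-in: the enterprise deployment just supplies a different `fetch`.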
| ### `wait_for_training` Timeout | ||
| ```python | ||
| @mcp.tool() | ||
| def wait_for_training( | ||
| job_id: str, | ||
| timeout: int = 3600, # Default: 1 hour | ||
| poll_interval: int = 30, | ||
| ) -> dict: | ||
| """Returns current status on timeout (doesn't cancel job).""" | ||
| ``` | ||
The default timeout=3600 (1 hour) will be problematic for the Streamable HTTP transport. Most reverse proxies (nginx default: 60s, AWS ALB: 60s, GCP: 30 min) will terminate idle connections well before 1 hour. The SDK's own wait_for_job_status() defaults to 600s.
One option: keep the long timeout for stdio mode (Claude Desktop/Cursor) but document that for HTTP deployments the recommended pattern is polling via get_training_job() at intervals, rather than holding a long-lived connection open. Could add a transport note here:
"For Streamable HTTP deployments, prefer polling `get_training_job()` at intervals over `wait_for_training()` to avoid proxy timeout issues."
Good point on transport differences. The 1-hour default makes sense for stdio (Claude Desktop, Cursor) but breaks behind reverse proxies.
That said, the KEP pseudocode is intentionally simplified for readability.
We will document the recommended pattern for each transport mode 👍
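As a sketch of the polling pattern for HTTP transports — the client interface and default values here are stand-ins, not the SDK's real signature:

```python
import time

# Each get_job() call is a short request/response, so no single connection
# ever approaches a reverse proxy's idle timeout. The interval is tiny here
# only so the sketch runs fast; a real deployment would poll every ~30s.
def poll_until_done(get_job, job_id: str, timeout: float = 60.0,
                    poll_interval: float = 0.01) -> dict:
    """Poll job status; return the last status on timeout instead of raising."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_job(job_id)
        if status["phase"] in {"Succeeded", "Failed"} or time.monotonic() >= deadline:
            return status
        time.sleep(poll_interval)

# Usage with a fake client that succeeds on the third poll:
phases = iter(["Running", "Running", "Succeeded"])
result = poll_until_done(lambda _id: {"phase": next(phases)}, "job-123")
```

The stdio transports (Claude Desktop, Cursor) can keep the long-lived `wait_for_training()` call; only HTTP deployments need this polling shape.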
@andreyvelich @thesuperzapper May I request an update here?
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
andreyvelich
left a comment
/lgtm
/approve
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Resolves #936
Summary
Introduces KEP-936: Kubeflow MCP Server - an AI-powered interface enabling LLM agents to interact with Kubeflow through the Model Context Protocol (MCP).
Key Features
- `fine_tune()`, `run_custom_training()`, `run_container_training()` mapping to SDK trainer types
- `--clients trainer,optimizer,hub` for selective tool loading
- Personas: `readonly`, `data-scientist`, `ml-engineer`, `platform-admin`
- Wraps `TrainerClient`, `OptimizerClient`, `ModelRegistryClient`
Implementation Plan
Related
/cc @andreyvelich @astefanutti @franciscojavierarceo @juliusvonkohout
kubeflow-mcp-demo.mp4