F104: MCP server migration to the official go-sdk #365

@pocky

Scope

In Scope

  • Replace the custom stdio MCP server (pkg/mcpserver, ~1270 lines) with the server from the official SDK github.com/modelcontextprotocol/go-sdk (v1.6.x)
  • Create new internal/infrastructure/mcp/ package wrapping *mcp.Server with provider registration, dedup, and stdio transport
  • Migrate tool schema mapping, handler wrapping with panic isolation, and result conversion (text-only) to the new package
  • Rewrite the server portion of mcp_serve.go to use the SDK while preserving all discovery, plugin bootstrap, signal handling, and exit codes
  • Delete pkg/mcpserver/ after migration verification
  • Update .go-arch-lint.yml and documentation/ADR references

Out of Scope

  • Part 2 — MCP client (consuming external MCP servers): separate project
  • MCP-over-HTTP / streamable transport: not retained
  • HTTP/OpenAI-compatible path (in-process tools.Router): unchanged
  • pkg/acpserver migration: stays custom until F105
  • Image/structured content support in resultToMCP: deferred to F108 Axis C

Deferred

| Item | Rationale | Follow-up |
| --- | --- | --- |
| resultToMCP switch c.Type for image/structured content | Keep blast radius minimal; text-only is sufficient for parity | F108 Axis C |
| MCP client (external server consumption) | Different problem space, separate project scope | future |
| MCP-over-HTTP / streamable transport | Not in current architecture, no demand | future |

User Stories

US1: Equivalent MCP server behavior on the official SDK (P1 - Must Have)

As an AWF maintainer,
I want the MCP stdio server backed by the official go-sdk instead of custom code,
So that protocol conformance, maintenance burden, and upstream parity improve at equivalent observable behavior.

Why this priority: This is the entire feature — without the SDK swap, none of the downstream benefits (F108 Axis C extension, reduced maintenance) are unlocked. The migration must be drop-in equivalent or it is not viable.

Acceptance Scenarios:

  1. Given an agent (Claude/Gemini) connected via mcp_proxy in a workflow, When the agent lists tools, Then the same builtins + plugin tools appear as before the migration
  2. Given the migrated server is running, When an agent invokes a tool, Then the call routes through the SDK to provider.CallTool and returns the same text content as the legacy server
  3. Given a tool handler panics during execution, When the SDK invokes the wrapped handler, Then the panic is recovered and surfaced as CallToolResult{IsError: true} with error == nil

Independent Test: Run a real workflow with mcp_proxy against Claude or Gemini, list and invoke builtins + plugin tools, compare results against the pre-migration baseline.
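
The panic-isolation scenario above can be sketched without the SDK. This is a minimal illustration, not the adapter itself: callResult is a hypothetical stand-in for the SDK's CallToolResult, toolFunc is an assumed shape for provider.CallTool, and reporting ordinary tool errors in-band is an assumption the spec does not state.

```go
package main

import (
	"context"
	"fmt"
)

// callResult is a hypothetical stand-in for the SDK's CallToolResult;
// only the fields this sketch needs are modeled.
type callResult struct {
	Text    string
	IsError bool
}

// toolFunc is an assumed shape for a provider's CallTool entry point.
type toolFunc func(ctx context.Context, args map[string]any) (string, error)

// handlerFor wraps fn so that a panic becomes an error result while the
// Go error handed back to the caller stays nil.
func handlerFor(name string, fn toolFunc) func(context.Context, map[string]any) (*callResult, error) {
	return func(ctx context.Context, args map[string]any) (res *callResult, err error) {
		defer func() {
			if r := recover(); r != nil {
				res = &callResult{Text: fmt.Sprintf("tool %q panicked: %v", name, r), IsError: true}
				err = nil
			}
		}()
		out, callErr := fn(ctx, args)
		if callErr != nil {
			// Assumption: ordinary tool failures are also reported in-band.
			return &callResult{Text: callErr.Error(), IsError: true}, nil
		}
		return &callResult{Text: out}, nil
	}
}

// echoTool returns its "msg" argument as text.
func echoTool(_ context.Context, args map[string]any) (string, error) {
	return fmt.Sprint(args["msg"]), nil
}

// panicTool always panics, to exercise the recover path.
func panicTool(context.Context, map[string]any) (string, error) {
	panic("kaboom")
}

func main() {
	ctx := context.Background()
	for name, fn := range map[string]toolFunc{"echo": echoTool, "boom": panicTool} {
		res, err := handlerFor(name, fn)(ctx, map[string]any{"msg": "hi"})
		fmt.Printf("%s: text=%q isError=%v err=%v\n", name, res.Text, res.IsError, err)
	}
}
```

Named return values are what let the deferred function overwrite the result after a panic; with unnamed returns the recovered value would be lost.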

US2: Rich tool schemas round-trip through the SDK (P2 - Should Have)

As an AWF tool author,
I want my tool's nested schema (with required, enums, nested objects) to survive the map → JSON → jsonschema.Schema conversion,
So that agents receive accurate input contracts and reject malformed calls upstream.

Why this priority: P2 because builtin tools work today with text-only schemas, but plugin tools with rich schemas need this to remain functional after migration.

Acceptance Scenarios:

  1. Given a tool with nested object schema containing required fields and enum constraints, When the server registers it via RegisterProvider, Then the SDK emits an equivalent *jsonschema.Schema in the tool listing
  2. Given an agent sends a call with invalid arguments per the schema, When the SDK validates, Then the validation behavior matches or improves on the legacy server's behavior

Independent Test: Unit test in mapping_test.go round-trips a fixture schema map through schemaFromMap and asserts structural equivalence.
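
A rough sketch of the map → JSON → schema round trip this test would exercise. The schema struct here is a deliberately small, hypothetical stand-in for the SDK's *jsonschema.Schema and models only the keywords named in this story; the fixture is invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// schema is a hypothetical, reduced stand-in for *jsonschema.Schema.
// Keywords not listed here are silently dropped by this sketch.
type schema struct {
	Type        string             `json:"type,omitempty"`
	Description string             `json:"description,omitempty"`
	Properties  map[string]*schema `json:"properties,omitempty"`
	Required    []string           `json:"required,omitempty"`
	Enum        []any              `json:"enum,omitempty"`
}

// schemaFromMap converts a provider's schema map into the typed form by
// marshalling it to JSON and unmarshalling the bytes again.
func schemaFromMap(m map[string]any) (*schema, error) {
	raw, err := json.Marshal(m)
	if err != nil {
		return nil, fmt.Errorf("marshal schema: %w", err)
	}
	var s schema
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, fmt.Errorf("unmarshal schema: %w", err)
	}
	return &s, nil
}

// fixture returns a nested object schema with required fields and an enum.
func fixture() map[string]any {
	return map[string]any{
		"type":     "object",
		"required": []any{"target"},
		"properties": map[string]any{
			"target": map[string]any{
				"type":     "object",
				"required": []any{"kind"},
				"properties": map[string]any{
					"kind": map[string]any{"type": "string", "enum": []any{"file", "dir"}},
				},
			},
		},
	}
}

func main() {
	s, err := schemaFromMap(fixture())
	if err != nil {
		panic(err)
	}
	kind := s.Properties["target"].Properties["kind"]
	fmt.Println(s.Required, kind.Type, kind.Enum)
}
```

Because the stand-in drops unknown keywords, a real test against the SDK type should also assert that no keyword from the fixture disappears in the conversion.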

US3: Server tests driven by the official SDK client (P3 - Nice to Have)

As an AWF contributor,
I want integration tests that drive the new server via the SDK's in-memory/pipe transport,
So that future regressions are caught against the same client surface real agents use.

Why this priority: P3 because behavior parity is what gates the migration; SDK-driven tests increase confidence but black-box agent runs already cover correctness.

Acceptance Scenarios:

  1. Given an in-memory transport pair, When an SDK client lists tools against the server, Then the response matches the registered tool set
  2. Given a registered provider with a passing tool and a panicking tool, When the client invokes each, Then the passing tool returns text content and the panicking tool returns IsError: true

Independent Test: mcp_test.go under internal/infrastructure/mcp/ runs end-to-end through the SDK client over a pipe transport.
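
The shape of such a test, sketched with the standard library only. The real test would use the SDK's client and transport; here net.Pipe plays the in-memory transport and a toy newline-delimited JSON-RPC exchange stands in for the protocol. All type and function names below are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net"
)

type rpcRequest struct {
	JSONRPC string `json:"jsonrpc"`
	ID      int    `json:"id"`
	Method  string `json:"method"`
}

type toolInfo struct {
	Name string `json:"name"`
}

type rpcResponse struct {
	JSONRPC string `json:"jsonrpc"`
	ID      int    `json:"id"`
	Result  struct {
		Tools []toolInfo `json:"tools"`
	} `json:"result"`
}

// serve answers newline-delimited requests on conn until the peer closes.
func serve(conn net.Conn, tools []string) {
	defer conn.Close()
	sc := bufio.NewScanner(conn)
	enc := json.NewEncoder(conn)
	for sc.Scan() {
		var req rpcRequest
		if err := json.Unmarshal(sc.Bytes(), &req); err != nil {
			return
		}
		resp := rpcResponse{JSONRPC: "2.0", ID: req.ID}
		if req.Method == "tools/list" {
			for _, name := range tools {
				resp.Result.Tools = append(resp.Result.Tools, toolInfo{Name: name})
			}
		}
		if err := enc.Encode(resp); err != nil {
			return
		}
	}
}

// listTools runs the toy server on one end of an in-memory pipe and
// queries it from the other end, returning the advertised tool names.
func listTools(registered []string) ([]string, error) {
	clientEnd, serverEnd := net.Pipe()
	defer clientEnd.Close()
	go serve(serverEnd, registered)

	req := rpcRequest{JSONRPC: "2.0", ID: 1, Method: "tools/list"}
	if err := json.NewEncoder(clientEnd).Encode(req); err != nil {
		return nil, err
	}
	var resp rpcResponse
	if err := json.NewDecoder(clientEnd).Decode(&resp); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(resp.Result.Tools))
	for _, t := range resp.Result.Tools {
		names = append(names, t.Name)
	}
	return names, nil
}

func main() {
	names, err := listTools([]string{"read_file", "grep"})
	fmt.Println(names, err)
}
```

The point of the pattern is that the test talks to the server only through the wire, so it exercises the same surface an agent would.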

Edge Cases

  • What happens when two providers register a tool with the same name? → Deduplicate via the s.names map; the second registration is either skipped or rejected with an error, matching whatever the legacy server does
  • How does the system handle a tool payload larger than the SDK's default StdioTransport cap? → Configure transport to 10 MiB to match legacy ~10 MiB behavior
  • What is the behavior when an unknown tool is called? → Surface error consistent with JSON-RPC -32601 (verify SDK shape)
  • What happens when a handler panics? → defer recover in handlerFor returns CallToolResult{IsError: true}, never propagates panic to the SDK runtime
  • What happens to stdout writes during operation? → Stdout stays clean (protocol channel); all logs go to stderr
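
The duplicate-name case can be illustrated with a small registry. This is a sketch under one assumption: it picks "first registration wins, later duplicates are skipped", whereas the spec leaves skip-versus-error to the legacy behavior. The tool, provider, and registry types are hypothetical.

```go
package main

import "fmt"

// tool and provider are simplified, hypothetical shapes.
type tool struct{ Name, Description string }

type provider struct {
	Name  string
	Tools []tool
}

// registry tracks registered tool names in a set, as s.names would.
type registry struct {
	names map[string]struct{}
	tools []tool
}

func newRegistry() *registry {
	return &registry{names: make(map[string]struct{})}
}

// registerProvider adds every tool whose name is not yet taken and
// returns the names it skipped because an earlier provider owns them.
func (r *registry) registerProvider(p provider) (skipped []string) {
	for _, t := range p.Tools {
		if _, dup := r.names[t.Name]; dup {
			skipped = append(skipped, t.Name)
			continue
		}
		r.names[t.Name] = struct{}{}
		r.tools = append(r.tools, t)
	}
	return skipped
}

func main() {
	r := newRegistry()
	r.registerProvider(provider{Name: "builtin", Tools: []tool{{Name: "read"}, {Name: "grep"}}})
	skipped := r.registerProvider(provider{Name: "plugin", Tools: []tool{{Name: "grep"}, {Name: "deploy"}}})
	fmt.Println(len(r.tools), skipped)
}
```

Returning the skipped names keeps the choice visible to the caller, which can log them to stderr or turn them into an error if the legacy server did.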

Requirements

Functional Requirements

  • FR-001: System MUST register tool providers with the SDK server, deduplicating tool names across providers
  • FR-002: System MUST translate provider tool definitions (name, description, schema, handler) into SDK *mcp.Tool registrations via toolToMCP and schemaFromMap
  • FR-003: System MUST wrap each provider handler with defer recover so panics surface as CallToolResult{IsError: true} with nil Go error
  • FR-004: System MUST convert provider Result content into SDK TextContent via resultToMCP (text-only in this feature)
  • FR-005: System MUST serve the MCP protocol over stdio via the SDK's StdioTransport invoked through srv.Run(ctx, &StdioTransport{})
  • FR-006: System MUST preserve all existing discovery logic from mcp_serve.go: builtins+sandbox RootDir, plugin bootstrap, resolveOperationProvider, signal handling, and exit codes
  • FR-007: System MUST pass the real awf binary version to the SDK server constructor instead of the legacy hardcoded "0.1.0"
  • FR-008: System MUST delete pkg/mcpserver/ after migration with no remaining importers
  • FR-009: System MUST update .go-arch-lint.yml to register the new internal/infrastructure/mcp package and its dependency rules
  • FR-010: Users MUST be able to run an existing workflow with mcp_proxy against a real agent (Claude/Gemini) and see equivalent tool listing and invocation behavior
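
FR-004 can be pictured as a plain loop. The sketch below assumes that "text-only" means every content item is emitted as text without inspecting its type, which is how the deferred switch c.Type is read here; all four types are hypothetical stand-ins for the provider and SDK shapes.

```go
package main

import "fmt"

// content and result are assumed provider-side shapes.
type content struct{ Type, Text string }

type result struct {
	Content []content
	IsError bool
}

// textContent and callToolResult stand in for the SDK's TextContent
// and CallToolResult.
type textContent struct{ Text string }

type callToolResult struct {
	Content []textContent
	IsError bool
}

// resultToMCP copies each item's text and the error flag. It does not
// branch on c.Type; that branch is what F108 Axis C would add.
func resultToMCP(r result) callToolResult {
	out := callToolResult{
		IsError: r.IsError,
		Content: make([]textContent, 0, len(r.Content)),
	}
	for _, c := range r.Content {
		out.Content = append(out.Content, textContent{Text: c.Text})
	}
	return out
}

func main() {
	got := resultToMCP(result{Content: []content{{Type: "text", Text: "ok"}}})
	fmt.Println(len(got.Content), got.Content[0].Text, got.IsError)
}
```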

Non-Functional Requirements

  • NFR-001: Stdout MUST remain protocol-only; all server logs MUST be written to stderr
  • NFR-002: The server MUST handle payloads up to ~10 MiB without truncation (configure StdioTransport cap accordingly)
  • NFR-003: A tool handler panic MUST NOT crash the server process; isolation MUST be verified by test
  • NFR-004: Adapter package (internal/infrastructure/mcp/) MUST reach >85% test coverage
  • NFR-005: make build, make lint, make test, and make test-race MUST all pass with zero violations
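
NFR-001 and NFR-002 interact in any line-oriented reader. The sketch below is not the SDK's StdioTransport; it uses bufio.Scanner, whose default token limit is 64 KiB, to show why a cap has to be raised explicitly and how logs can be kept on stderr. scanMessages and payload are hypothetical helpers.

```go
package main

import (
	"bufio"
	"io"
	"log"
	"os"
	"strings"
)

// maxMessageBytes mirrors the ~10 MiB cap named in NFR-002.
const maxMessageBytes = 10 << 20

// scanMessages counts newline-delimited messages in r, letting a single
// message grow up to limit bytes before the scanner reports an error.
func scanMessages(r io.Reader, limit int) (int, error) {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64<<10), limit)
	n := 0
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

// payload builds one message of n bytes followed by a newline.
func payload(n int) io.Reader {
	return strings.NewReader(strings.Repeat("x", n) + "\n")
}

func main() {
	// All diagnostics go to stderr so stdout stays a clean protocol channel.
	logger := log.New(os.Stderr, "mcp: ", log.LstdFlags)
	n, err := scanMessages(payload(1<<20), maxMessageBytes)
	if err != nil {
		logger.Fatalf("scan failed: %v", err)
	}
	logger.Printf("read %d message(s) within the %d-byte cap", n, maxMessageBytes)
}
```

With the limit left at 64 KiB the same 1 MiB message fails with bufio.ErrTooLong, which is the kind of truncation NFR-002 is meant to rule out.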

Success Criteria

  • SC-001: After migration, an agent-driven workflow using mcp_proxy lists and invokes the same builtins + plugin tools as the pre-migration baseline (100% behavior parity for in-scope content types)
  • SC-002: pkg/mcpserver/ is fully removed from the repository with zero remaining importers
  • SC-003: internal/infrastructure/mcp/ adapter achieves >85% test coverage measured by make test-coverage
  • SC-004: All CI gates (make build, make lint, make test, make test-race) pass on the migration branch
  • SC-005: Real end-to-end run with Claude or Gemini against an mcp_proxy workflow completes successfully with no protocol errors

Key Entities

| Entity | Description | Key Attributes |
| --- | --- | --- |
| Server | SDK-backed MCP server wrapping *mcp.Server | srv *mcp.Server, names map[string]struct{} (dedup), RegisterProvider, ServeStdio |
| ToolMapping | Converts provider tool definitions to SDK shape | toolToMCP, schemaFromMap (map → JSON → *jsonschema.Schema) |
| HandlerAdapter | Wraps provider.CallTool for SDK invocation | CallToolParamsRaw input, defer recover, errorResult on panic |
| ResultMapping | Converts provider Result to SDK content | resultToMCP (text-only in this feature, switch c.Type in F108 Axis C) |

Assumptions

  • The github.com/modelcontextprotocol/go-sdk v1.6.x minimum Go version is compatible with the project's Go 1.25.8
  • The SDK exposes an in-memory or pipe transport suitable for end-to-end tests
  • The SDK's StdioTransport either defaults to or can be configured for a 10 MiB message cap
  • The SDK's "unknown tool" error shape is consistent with JSON-RPC -32601; if not, verification will flag a follow-up
  • Only mcp_serve.go imports pkg/mcpserver (verified in the source spec); no other callers exist
  • The threat model and architecture comments from pkg/mcpserver doc can be carried over verbatim to internal/infrastructure/mcp/doc.go
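
The "unknown tool" assumption is easy to check mechanically once a raw response is captured. A minimal sketch of such a check follows; -32601 is the JSON-RPC 2.0 "Method not found" code, while the helper names and the sample message text are hypothetical, and whether the SDK actually answers with this code is exactly what verification has to establish.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// codeMethodNotFound is the JSON-RPC 2.0 "Method not found" error code.
const codeMethodNotFound = -32601

type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type rpcErrorResponse struct {
	JSONRPC string    `json:"jsonrpc"`
	ID      int       `json:"id"`
	Error   *rpcError `json:"error,omitempty"`
}

// isMethodNotFound reports whether raw is an error response with -32601.
func isMethodNotFound(raw []byte) (bool, error) {
	var resp rpcErrorResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return false, fmt.Errorf("decode response: %w", err)
	}
	return resp.Error != nil && resp.Error.Code == codeMethodNotFound, nil
}

// unknownToolResponse builds the shape the spec assumes, for comparison.
func unknownToolResponse(id int, tool string) []byte {
	raw, _ := json.Marshal(rpcErrorResponse{
		JSONRPC: "2.0",
		ID:      id,
		Error:   &rpcError{Code: codeMethodNotFound, Message: "unknown tool: " + tool},
	})
	return raw
}

func main() {
	ok, err := isMethodNotFound(unknownToolResponse(7, "nope"))
	fmt.Println(ok, err)
}
```

If the captured response carries a different code, the check returns false and the difference can be recorded as the follow-up this assumption mentions.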

Metadata

  • Status: backlog
  • Version: v0.11.0
  • Priority: high
  • Estimation: L

Dependencies

  • Blocked by: none
  • Unblocks: F108 Axis C (resultToMCP mapping → switch c.Type for image/structured content)

Clarifications

Section populated during clarify step with resolved ambiguities.

Notes

  • Source spec: .agent/specs/2026-06-02-mcp-server-go-sdk-migration-design.md
  • Research reference: research-improvements.md §2
  • Position in overall sequence: 2 / 6 (after F103 Codex JSONL parity, before F105 ACP → coder-sdk)
  • Grouped with F105 (ACP): same pattern — SDK confined to internal/, tests driven by the SDK client; doing both back-to-back capitalizes on the pattern
  • Anticipates F108 Axis C: resultToMCP stays text-only here; F108 extends it to a switch c.Type
  • Verification items to confirm during implementation: SDK StdioTransport message size cap, "unknown tool" error shape vs -32601, availability of in-memory/pipe transport, SDK minimum Go version
  • The HTTP/OpenAI-compatible path (in-process tools.Router) is explicitly untouched
  • pkg/acpserver remains custom until F105 (the go-sdk is MCP-only)
