
feat(hosted-mcp): 32 evals for Auth0 hosted MCP server #69

Open
lena-mohmand wants to merge 19 commits into main from feat/hosted-mcp-evals

@lena-mohmand

Summary

Adds 32 end-to-end evals for the Auth0 hosted MCP server plus the framework infrastructure needed to run them.

Framework changes (new to main)

  • MCPOAuthConfig type + optional auth field on MCPHttpServerConfig — declarative OAuth config for authenticated HTTP MCP servers
  • mintMcpToken helper in eval-core — performs a client-credentials exchange per agent job (no mid-matrix token expiry)
  • calledTool / calledToolOneOf / getSuccessfulMcpCalls grader primitives in eval-graders — event-based, L4/L5 only, match mcp__<server>__<tool> trace entries
  • claude-code runner — mints and forwards Bearer token per job for any HTTP MCP server with an auth block
  • Docker sandbox — forwards MCP_TENANT_DOMAIN, MCP_CLIENT_ID, MCP_CLIENT_SECRET into the container so eval.config.js can register the authenticated server
  • eval.config.js — registers auth0-hosted-mcp server gated on the three env vars above
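As a rough illustration of the per-job token flow listed above, here is a minimal TypeScript sketch of a client-credentials exchange. The fields on `MCPOAuthConfig`, the `buildTokenRequest` and `bearerHeader` helpers, and the injected `fetchImpl` parameter are assumptions for illustration only; this PR description does not show the real signatures.

```typescript
// Hypothetical shapes: the PR names MCPOAuthConfig and mintMcpToken but does not
// show their fields or signatures, so everything below is an assumption.
interface MCPOAuthConfig {
  tokenUrl: string; // e.g. https://<tenant-domain>/oauth/token
  clientId: string;
  clientSecret: string;
  audience?: string; // assumed optional; the real audience is not stated in the PR
}

interface TokenRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Minimal fetch-like signature so the sketch can be exercised without network access.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build an OAuth 2.0 client-credentials request (RFC 6749, section 4.4)
// with a form-urlencoded body.
function buildTokenRequest(auth: MCPOAuthConfig): TokenRequest {
  const pairs: Array<[string, string]> = [
    ["grant_type", "client_credentials"],
    ["client_id", auth.clientId],
    ["client_secret", auth.clientSecret],
  ];
  if (auth.audience) pairs.push(["audience", auth.audience]);
  const body = pairs
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return {
    url: auth.tokenUrl,
    headers: { "content-type": "application/x-www-form-urlencoded" },
    body,
  };
}

// Mint a fresh access token. Calling this once per agent job avoids reusing
// a token that could expire partway through a long eval matrix.
async function mintMcpToken(auth: MCPOAuthConfig, fetchImpl: FetchLike): Promise<string> {
  const req = buildTokenRequest(auth);
  const res = await fetchImpl(req.url, { method: "POST", headers: req.headers, body: req.body });
  if (!res.ok) throw new Error(`token request failed with status ${res.status}`);
  const data = (await res.json()) as { access_token?: unknown };
  if (typeof data.access_token !== "string") throw new Error("response has no access_token");
  return data.access_token;
}

// Header a runner would forward to the authenticated HTTP MCP server.
function bearerHeader(token: string): Record<string, string> {
  return { Authorization: `Bearer ${token}` };
}
```

In this sketch a runner would call `mintMcpToken(auth, fetch)` at the start of each job and attach `bearerHeader(token)` to the MCP server entry for that job only.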

Evals added (32 total in apps/auth0-evals/src/evals/hosted-mcp/)

Covering 6 domains — applications, resource servers, application grants, actions, forms, logs:

| Domain | Evals |
| --- | --- |
| Applications | create_application, setup_native_app, setup_regular_web_app, setup_spa_with_api, setup_m2m_application, rename_application, update_application_callbacks, find_and_inspect_application, audit_tenant_applications, bulk_app_audit, complete_developer_onboarding |
| Resource Servers / APIs | create_api_and_inspect, investigate_api_scopes, add_multiple_scopes, update_api_token_settings, oidc_conformance_review |
| Actions | create_and_deploy_action, enable_action_in_flow, update_action_code, full_action_lifecycle, action_and_form_setup |
| Forms | create_progressive_profile_form, list_and_inspect_forms, update_form_fields |
| Logs | diagnose_with_logs, search_logs_by_user, debug_failed_login, get_log_details |
| Multi-domain | setup_m2m_application, diagnose_and_fix_config, troubleshoot_missing_scope, audit_m2m_grants, complete_developer_onboarding |

Eval results (claude-sonnet-4-6, agent+mcp mode, serial run)

  • A (100%): create_application, setup_native_app, setup_regular_web_app, setup_spa_with_api, audit_m2m_grants, rename_application
  • B (80–88%): majority of evals — agent completes the task, minor parameter gaps
  • C (genuine agent gaps, not grader issues):
    • create_and_deploy_action (66%) — agent does not poll auth0_get_action to confirm deployed status after calling auth0_deploy_action
    • create_api_and_inspect (75%) — agent creates the API but does not call a verify tool afterward

Notes

  • Evals that create applications must run with --workers 1 — the test tenant has an app limit that causes false C grades when workers run in parallel
  • codex and copilot runners do not yet forward MCP auth headers (existing TODO comments in those runners); these evals target --agent-type claude-code
  • Pending sync with @bharath.natarajan on which evals belong on the agent-experience board

Test plan

  • npm run build — passes (all 5 packages)
  • npm run evals -- --eval hosted_mcp_list_applications --mode agent --tools mcp --model claude-sonnet-4-6 --agent-type claude-code with MCP_* env vars set — grade A, calledTool L4 passes
  • Run without MCP_* env vars — sandbox logs warning, auth0-hosted-mcp not registered, graders fail as expected
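The env-var gating exercised by the last test-plan item can be sketched roughly as follows. This is not the actual `eval.config.js` code: the `HostedMcpServer` shape, the `hostedMcpServerFromEnv` helper, and the `mcpUrl` parameter are hypothetical, and only the server name and the three variable names come from this PR.

```typescript
// Hypothetical config shape; the real eval.config.js fields are not shown in the PR.
interface HostedMcpServer {
  name: string;
  type: "http";
  url: string;
  auth: { tokenUrl: string; clientId: string; clientSecret: string };
}

const REQUIRED_VARS = ["MCP_TENANT_DOMAIN", "MCP_CLIENT_ID", "MCP_CLIENT_SECRET"] as const;

// Return the server entry only when all three variables are set. Otherwise
// report which ones are missing so the caller can log a warning and skip
// registration.
function hostedMcpServerFromEnv(
  env: Record<string, string | undefined>,
  mcpUrl: string, // assumption: the hosted MCP endpoint URL is supplied by the caller
): { server?: HostedMcpServer; missing: string[] } {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) return { missing };
  return {
    missing: [],
    server: {
      name: "auth0-hosted-mcp",
      type: "http",
      url: mcpUrl,
      auth: {
        tokenUrl: `https://${env["MCP_TENANT_DOMAIN"] as string}/oauth/token`,
        clientId: env["MCP_CLIENT_ID"] as string,
        clientSecret: env["MCP_CLIENT_SECRET"] as string,
      },
    },
  };
}
```

With an empty environment the helper returns no server and lists all three variables as missing, which corresponds to the "not registered, graders fail as expected" case above.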

lena-mohmand and others added 19 commits June 26, 2026 15:47
Tests whether an agent correctly creates a SPA in the tenant using
the auth0_create_application MCP tool with the right parameters.

Graders:
- L2: no hallucinated tool names (create_client, create_app, register_application)
- L4: calledTool check for auth0_create_application
- L5: custom arg predicates verifying app_type=spa and name="My Web App"
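A minimal sketch of how the three grader levels above could be expressed over a list of trace events. The `ToolCallEvent` shape and all function signatures here are hypothetical; the real `calledTool` primitive in eval-graders may look different. Only the `mcp__<server>__<tool>` naming and the tool names are taken from this PR.

```typescript
// Hypothetical trace-event shape; the real event schema is not shown in the PR.
interface ToolCallEvent {
  name: string; // e.g. "mcp__auth0-hosted-mcp__auth0_create_application"
  args: Record<string, unknown>;
  isError: boolean;
}

// Trace entries are named mcp__<server>__<tool>.
function mcpToolName(server: string, tool: string): string {
  return `mcp__${server}__${tool}`;
}

// Successful calls to one tool on one server.
function successfulCalls(events: ToolCallEvent[], server: string, tool: string): ToolCallEvent[] {
  const wanted = mcpToolName(server, tool);
  return events.filter((e) => e.name === wanted && !e.isError);
}

// L2-style check: did the agent try any of the hallucinated tool names?
function usedAnyTool(events: ToolCallEvent[], tools: string[]): boolean {
  return events.some((e) => tools.some((t) => e.name === t || e.name.endsWith(`__${t}`)));
}

// L4-style check: was the tool called successfully at least once?
function calledTool(events: ToolCallEvent[], server: string, tool: string): boolean {
  return successfulCalls(events, server, tool).length > 0;
}

// L5-style check: did any successful call satisfy an argument predicate?
function calledToolWith(
  events: ToolCallEvent[],
  server: string,
  tool: string,
  predicate: (args: Record<string, unknown>) => boolean,
): boolean {
  return successfulCalls(events, server, tool).some((e) => predicate(e.args));
}
```

For this eval, the L5 predicate would be along the lines of `(a) => a["app_type"] === "spa" && a["name"] === "My Web App"`.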

Grade A (100%) on claude-opus-4-7 with --tools mcp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Multi-step eval: create API, create M2M app, grant access.
All 3 tools called in sequence with L5 arg checks.
Grades A (100%) on claude-opus-4-7.

Read-and-reason eval: query logs with filter, report failure types and counts.
Grades A (100%) on claude-opus-4-7.

Multi-step eval: create Post-Login action with roles claim, then deploy it.
L5 checks trigger ID, event.authorization.roles usage, and idToken assignment.
Grades A (100%) on claude-opus-4-7.

Tests creating a progressive profiling form with named fields (job_title, company).
L5 checks form name and presence of both field keys in nodes.
Grades A (100%) on claude-opus-4-7 with MCP.

Read-then-write eval: list apps to find by name, get existing callbacks,
update with new URL preserved alongside existing ones.
L5 checks new callback present and client_id was looked up dynamically.
Grades A (100%) on claude-opus-4-7.
…nvestigate_api_scopes

Removed hardcoded tenant domain from all hosted-mcp PROMPT.md files —
domain is injected via eval.config.js env vars and is not needed in prompts.

Added hosted_mcp_investigate_api_scopes: list resource servers, inspect
existing scopes, then update with a new write:inventory scope while
preserving read:inventory (read→reason→write chain).
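The write step of that chain has to send the existing scopes along with the new one. A minimal sketch, assuming the update call replaces the scope list wholesale and that scopes are objects with a `value` and optional `description` (the shape used by Auth0 resource servers); `ApiScope` and `withScope` are hypothetical names, not part of this PR.

```typescript
// Assumed scope shape, mirroring Auth0 resource server scopes.
interface ApiScope {
  value: string;
  description?: string;
}

// Return a new scope list containing every existing scope plus the added one.
// Adding a scope that is already present leaves the list unchanged.
function withScope(existing: ApiScope[], added: ApiScope): ApiScope[] {
  const alreadyPresent = existing.some((s) => s.value === added.value);
  return alreadyPresent ? existing.slice() : [...existing, added];
}
```

A grader for this eval could then check that the updated list contains both `read:inventory` and `write:inventory`.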
MCP_TENANT_DOMAIN, MCP_CLIENT_ID, and MCP_CLIENT_SECRET were not passed
to the container, so eval.config.js never registered auth0-hosted-mcp.
Without the server, the agent fell back to auth0-cli bash commands.

- audit-m2m-grants: replace contains() file checks with event predicates
  that inspect tool call results (no file is written in this eval)
- rename-application: same fix for the Analytics Hub confirmation check
- setup-m2m-application: accept scope as string or array (Management API
  returns space-separated string, agent may pass either format)
- complete-developer-onboarding: same scope format fix for send:notifications
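The scope-format fix amounts to normalizing both representations before comparing. A minimal sketch of that idea; `normalizeScopes` and `hasScope` are hypothetical helper names, and the real grader code may be structured differently.

```typescript
// Accept a scope value as either a space-separated string or an array of
// strings and return a clean list of scope names.
function normalizeScopes(scope: unknown): string[] {
  if (Array.isArray(scope)) {
    return scope.filter((s): s is string => typeof s === "string" && s.length > 0);
  }
  if (typeof scope === "string") {
    return scope.split(/\s+/).filter((s) => s.length > 0);
  }
  return []; // anything else (undefined, number, object) is treated as no scopes
}

// True when the wanted scope is present in either representation.
function hasScope(scope: unknown, wanted: string): boolean {
  return normalizeScopes(scope).includes(wanted);
}
```

With this, `hasScope("read:users send:notifications", "send:notifications")` and `hasScope(["send:notifications"], "send:notifications")` are both true, so the grader no longer depends on which format the agent passed.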
… graders

- Add MCPOAuthConfig type and optional auth field to MCPHttpServerConfig
- Add mintMcpToken client-credentials helper in eval-core
- Wire mintMcpToken into claude-code runner (per-job Bearer token forwarding)
- Add calledTool, calledToolOneOf, getSuccessfulMcpCalls grader primitives
- Register auth0-hosted-mcp server in eval.config.js (gated on env vars)
- Forward MCP_TENANT_DOMAIN/CLIENT_ID/CLIENT_SECRET into Docker sandbox