feat(hosted-mcp): 32 evals for Auth0 hosted MCP server by lena-mohmand · Pull Request #69 · auth0/auth0-evals

lena-mohmand · 2026-06-26T20:09:13Z

Summary

Adds 32 end-to-end evals for the Auth0 hosted MCP server plus the framework infrastructure needed to run them.

Framework changes (new to main)

- `MCPOAuthConfig` type and optional `auth` field on `MCPHttpServerConfig`: declarative OAuth config for authenticated HTTP MCP servers
- `mintMcpToken` helper in eval-core: performs a client-credentials exchange per agent job (no mid-matrix token expiry)
- `calledTool` / `calledToolOneOf` / `getSuccessfulMcpCalls` grader primitives in eval-graders: event-based, L4/L5 only, match `mcp__<server>__<tool>` trace entries
- claude-code runner: mints and forwards a Bearer token per job for any HTTP MCP server with an `auth` block
- Docker sandbox: forwards `MCP_TENANT_DOMAIN`, `MCP_CLIENT_ID`, `MCP_CLIENT_SECRET` into the container so `eval.config.js` can register the authenticated server
- `eval.config.js`: registers the `auth0-hosted-mcp` server, gated on the three env vars above
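
The per-job client-credentials exchange can be sketched roughly as below. This is a minimal sketch, not the `mintMcpToken` implementation: the `McpOAuthConfigSketch` field names, the `audience` field, and the injected `post` transport are assumptions made for illustration. Only the `/oauth/token` endpoint and the `client_credentials` grant type follow Auth0's documented flow.

```typescript
// Hypothetical config shape; the real MCPOAuthConfig fields are not shown in this PR.
interface McpOAuthConfigSketch {
  tenantDomain: string; // assumed to come from MCP_TENANT_DOMAIN
  clientId: string;     // assumed to come from MCP_CLIENT_ID
  clientSecret: string; // assumed to come from MCP_CLIENT_SECRET
  audience: string;     // assumption: identifier of the API the token is issued for
}

interface TokenRequest {
  url: string;
  body: Record<string, string>;
}

// Build the client-credentials request for the tenant's token endpoint.
function buildTokenRequest(cfg: McpOAuthConfigSketch): TokenRequest {
  return {
    url: `https://${cfg.tenantDomain}/oauth/token`,
    body: {
      grant_type: "client_credentials",
      client_id: cfg.clientId,
      client_secret: cfg.clientSecret,
      audience: cfg.audience,
    },
  };
}

// Injected transport (hypothetical) so the sketch needs no network access to test.
type PostJson = (
  url: string,
  body: Record<string, string>
) => Promise<{ access_token?: string }>;

// Mint a fresh token for one agent job and return the Authorization header value.
async function mintBearerHeader(
  cfg: McpOAuthConfigSketch,
  post: PostJson
): Promise<string> {
  const req = buildTokenRequest(cfg);
  const res = await post(req.url, req.body);
  if (!res.access_token) {
    throw new Error("token endpoint returned no access_token");
  }
  return `Bearer ${res.access_token}`;
}
```

Minting inside each job rather than once per run is what avoids a token expiring partway through a long matrix.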

Evals added (32 total in `apps/auth0-evals/src/evals/hosted-mcp/`)

Covering 6 domains — applications, resource servers, application grants, actions, forms, logs:

| Domain | Evals |
| --- | --- |
| Applications | `create_application`, `setup_native_app`, `setup_regular_web_app`, `setup_spa_with_api`, `setup_m2m_application`, `rename_application`, `update_application_callbacks`, `find_and_inspect_application`, `audit_tenant_applications`, `bulk_app_audit`, `complete_developer_onboarding` |
| Resource Servers / APIs | `create_api_and_inspect`, `investigate_api_scopes`, `add_multiple_scopes`, `update_api_token_settings`, `oidc_conformance_review` |
| Actions | `create_and_deploy_action`, `enable_action_in_flow`, `update_action_code`, `full_action_lifecycle`, `action_and_form_setup` |
| Forms | `create_progressive_profile_form`, `list_and_inspect_forms`, `update_form_fields` |
| Logs | `diagnose_with_logs`, `search_logs_by_user`, `debug_failed_login`, `get_log_details` |
| Multi-domain | `setup_m2m_application`, `diagnose_and_fix_config`, `troubleshoot_missing_scope`, `audit_m2m_grants`, `complete_developer_onboarding` |

Eval results (claude-sonnet-4-6, agent+mcp mode, serial run)

A (100%): create_application, setup_native_app, setup_regular_web_app, setup_spa_with_api, audit_m2m_grants, rename_application
B (80–88%): majority of evals — agent completes the task, minor parameter gaps
C (genuine agent gaps, not grader issues):
- create_and_deploy_action (66%) — agent does not poll auth0_get_action to confirm deployed status after calling auth0_deploy_action
- create_api_and_inspect (75%) — agent creates the API but does not call a verify tool afterward
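
One way a "verify after write" check like the two gaps above could be expressed is as an ordering predicate over the tool-call trace. This is a sketch under assumptions: the `ToolCallEvent` shape and the `calledAfter` helper are hypothetical and are not the graders shipped in this PR.

```typescript
// Hypothetical trace entry; the framework's real event schema may differ.
interface ToolCallEvent {
  name: string; // e.g. "mcp__auth0-hosted-mcp__auth0_deploy_action"
  isError: boolean;
}

// Extract the tool part of an "mcp__<server>__<tool>" name, or "" if it is not an MCP call.
function toolOf(name: string): string {
  const parts = name.split("__");
  return parts.length >= 3 && parts[0] === "mcp" ? parts.slice(2).join("__") : "";
}

// True when a successful call to `later` occurs after a successful call to `earlier`.
function calledAfter(events: ToolCallEvent[], earlier: string, later: string): boolean {
  let seenEarlier = false;
  for (const e of events) {
    if (e.isError) continue;
    const tool = toolOf(e.name);
    if (seenEarlier && tool === later) return true;
    if (tool === earlier) seenEarlier = true;
  }
  return false;
}
```

For `create_and_deploy_action`, such a predicate would be called as `calledAfter(events, "auth0_deploy_action", "auth0_get_action")`.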

Notes

- Evals that create applications must run with `--workers 1`: the test tenant has an app limit that causes false C grades when workers run in parallel.
- The codex and copilot runners do not yet forward MCP auth headers (existing TODO comments in those runners); these evals target `--agent-type claude-code`.
- Pending sync with @bharath.natarajan on which evals belong on the agent-experience board.

Test plan

- `npm run build`: passes (all 5 packages)
- `npm run evals -- --eval hosted_mcp_list_applications --mode agent --tools mcp --model claude-sonnet-4-6 --agent-type claude-code` with `MCP_*` env vars set: grade A, `calledTool` L4 passes
- Run without `MCP_*` env vars: sandbox logs a warning, `auth0-hosted-mcp` is not registered, graders fail as expected

Tests whether an agent correctly creates a SPA in the tenant using the auth0_create_application MCP tool with the right parameters.

Graders:
- L2: no hallucinated tool names (create_client, create_app, register_application)
- L4: calledTool check for auth0_create_application
- L5: custom arg predicates verifying app_type=spa and name="My Web App"

Grade A (100%) on claude-opus-4-7 with --tools mcp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
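
The L4 and L5 grader checks described in this commit (match an `mcp__<server>__<tool>` trace entry, then apply argument predicates) might look roughly like the following. This is a sketch under assumptions, not the eval-graders API: the `McpCall` shape and the `*Sketch` function names are hypothetical.

```typescript
// Hypothetical trace entry shape, assumed for illustration only.
interface McpCall {
  name: string; // e.g. "mcp__auth0-hosted-mcp__auth0_create_application"
  args: Record<string, unknown>;
  isError: boolean;
}

// Split "mcp__<server>__<tool>" into its parts; returns null for non-MCP entries.
function parseMcpName(name: string): { server: string; tool: string } | null {
  const m = /^mcp__(.+?)__(.+)$/.exec(name);
  if (!m) return null;
  const [, server = "", tool = ""] = m;
  return { server, tool };
}

// Successful calls to a given tool, regardless of which server exposed it.
function successfulCallsSketch(trace: McpCall[], tool: string): McpCall[] {
  return trace.filter((c) => {
    const parsed = parseMcpName(c.name);
    return !c.isError && parsed !== null && parsed.tool === tool;
  });
}

// L4-style check: was the tool called successfully at least once?
function calledToolSketch(trace: McpCall[], tool: string): boolean {
  return successfulCallsSketch(trace, tool).length > 0;
}

// L5-style check: did any successful call satisfy the argument predicate?
function calledToolWithSketch(
  trace: McpCall[],
  tool: string,
  predicate: (args: Record<string, unknown>) => boolean
): boolean {
  return successfulCallsSketch(trace, tool).some((c) => predicate(c.args));
}
```

For this eval, the L5 predicate would be something like `(a) => a["app_type"] === "spa" && a["name"] === "My Web App"`.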

Read-then-write eval: list apps to find by name, get existing callbacks, update with new URL preserved alongside existing ones. L5 checks new callback present and client_id was looked up dynamically. Grades A (100%) on claude-opus-4-7.
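
The "new URL preserved alongside existing ones" condition can be stated as a small predicate. This is an illustrative sketch; the function name and signature are hypothetical and do not come from the eval's grader.

```typescript
// True when the updated callback list contains the added URL and every pre-existing URL.
function preservesCallbacks(
  existing: string[],
  updated: string[],
  added: string
): boolean {
  const has = (url: string): boolean => updated.indexOf(url) !== -1;
  return has(added) && existing.every(has);
}
```

An agent that overwrites the list with only the new URL fails this check even though the update call itself succeeds, which is the failure mode a read-then-write eval is meant to catch.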

…nvestigate_api_scopes

Removed hardcoded tenant domain from all hosted-mcp PROMPT.md files; the domain is injected via eval.config.js env vars and is not needed in prompts.

Added hosted_mcp_investigate_api_scopes: list resource servers, inspect existing scopes, then update with a new write:inventory scope while preserving read:inventory (read→reason→write chain).

- audit-m2m-grants: replace contains() file checks with event predicates that inspect tool call results (no file is written in this eval)
- rename-application: same fix for the Analytics Hub confirmation check
- setup-m2m-application: accept scope as string or array (Management API returns space-separated string, agent may pass either format)
- complete-developer-onboarding: same scope format fix for send:notifications
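
The scope format fix amounts to normalizing both representations before comparing. A minimal sketch, with hypothetical helper names that are not taken from the graders in this PR:

```typescript
// Accept either a space-separated string or an array and return a clean list of scopes.
function normalizeScopes(scope: string | string[]): string[] {
  const parts = typeof scope === "string" ? scope.split(/\s+/) : scope;
  return parts.filter((s) => s.length > 0);
}

// True when the wanted scope is present in either representation.
function hasScope(scope: string | string[], wanted: string): boolean {
  return normalizeScopes(scope).indexOf(wanted) !== -1;
}
```

With this in place, a predicate such as `hasScope(args.scope, "send:notifications")` gives the same answer whether the agent passed `"read:users send:notifications"` or `["read:users", "send:notifications"]`.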

… graders

- Add MCPOAuthConfig type and optional auth field to MCPHttpServerConfig
- Add mintMcpToken client-credentials helper in eval-core
- Wire mintMcpToken into claude-code runner (per-job Bearer token forwarding)
- Add calledTool, calledToolOneOf, getSuccessfulMcpCalls grader primitives
- Register auth0-hosted-mcp server in eval.config.js (gated on env vars)
- Forward MCP_TENANT_DOMAIN/CLIENT_ID/CLIENT_SECRET into Docker sandbox

lena-mohmand and others added 19 commits June 26, 2026 15:47

feat(eval): add hosted_mcp_setup_m2m_application eval

25ee21d

Multi-step eval: create API, create M2M app, grant access. All 3 tools called in sequence with L5 arg checks. Grades A (100%) on claude-opus-4-7.

feat(eval): add hosted_mcp_diagnose_with_logs eval

75efa6f

Read-and-reason eval: query logs with filter, report failure types and counts. Grades A (100%) on claude-opus-4-7.

feat(eval): add hosted_mcp_create_and_deploy_action eval

0b98a18

Multi-step eval: create Post-Login action with roles claim, then deploy it. L5 checks trigger ID, event.authorization.roles usage, and idToken assignment. Grades A (100%) on claude-opus-4-7.

feat(eval): add hosted_mcp_create_progressive_profile_form eval

564be5b

Tests creating a progressive profiling form with named fields (job_title, company). L5 checks form name and presence of both field keys in nodes. Grades A (100%) on claude-opus-4-7 with MCP.

feat(eval): add enable_action_in_flow, audit_tenant_applications, fin…

3348dd3

…d_and_inspect_application evals

feat(eval): add debug_failed_login, update_action_code, create_api_an…

8b2bdfd

…d_inspect evals

feat(eval): add setup_spa_with_api, get_log_details, update_form_fiel…

a446765

…ds evals

feat(eval): add troubleshoot_missing_scope, list_and_inspect_forms, b…

a1d38d4

…ulk_app_audit evals

feat(eval): add full_action_lifecycle, add_multiple_scopes, oidc_conf…

610d01c

…ormance_review evals

feat(eval): add complete_developer_onboarding, update_api_token_setti…

d9b1975

…ngs, setup_native_app evals

feat(eval): add setup_regular_web_app, audit_m2m_grants, rename_appli…

d08fa6c

…cation evals

feat(eval): add diagnose_and_fix_config, search_logs_by_user, action_…

f633788

…and_form_setup evals

fix(sandbox): forward MCP credentials into Docker container

14105a8

MCP_TENANT_DOMAIN, MCP_CLIENT_ID, and MCP_CLIENT_SECRET were not passed to the container, so eval.config.js never registered auth0-hosted-mcp. Without the server, the agent fell back to auth0-cli bash commands.
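
The env-var gating this fix depends on can be sketched as follows. This is an assumption-laden illustration rather than the contents of `eval.config.js`: the `HostedMcpEntrySketch` shape and helper names are hypothetical, and the server URL is deliberately left out because it is not stated here.

```typescript
type Env = Record<string, string | undefined>;

const REQUIRED_MCP_VARS = ["MCP_TENANT_DOMAIN", "MCP_CLIENT_ID", "MCP_CLIENT_SECRET"];

// Names of required variables that are unset or empty, useful for a warning message.
function missingMcpVars(env: Env): string[] {
  return REQUIRED_MCP_VARS.filter((key) => !env[key]);
}

// Hypothetical registration entry; the real config also needs the server URL.
interface HostedMcpEntrySketch {
  name: string;
  auth: { tenantDomain: string; clientId: string; clientSecret: string };
}

// Register the server only when all three credentials are present.
function hostedMcpServers(env: Env): HostedMcpEntrySketch[] {
  const tenantDomain = env["MCP_TENANT_DOMAIN"];
  const clientId = env["MCP_CLIENT_ID"];
  const clientSecret = env["MCP_CLIENT_SECRET"];
  if (!tenantDomain || !clientId || !clientSecret) return [];
  return [{ name: "auth0-hosted-mcp", auth: { tenantDomain, clientId, clientSecret } }];
}
```

Because the gate returns an empty list when any variable is missing, a container that does not receive the three variables silently runs without the server, which matches the fallback to auth0-cli described above.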

fix(lint): remove unused calledTool import in list-and-inspect-forms …

062feee

…and update-form-fields

lena-mohmand wants to merge 19 commits into main from feat/hosted-mcp-evals
