feat(hosted-mcp): 32 evals for Auth0 hosted MCP server #69
Open
lena-mohmand wants to merge 19 commits into
Tests whether an agent correctly creates a SPA in the tenant using the auth0_create_application MCP tool with the right parameters.

Graders:
- L2: no hallucinated tool names (create_client, create_app, register_application)
- L4: calledTool check for auth0_create_application
- L5: custom arg predicates verifying app_type=spa and name="My Web App"

Grade A (100%) on claude-opus-4-7 with --tools mcp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
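The three checks listed in this commit message (no hallucinated tool names, the expected tool was called, and its arguments match) can be illustrated with a minimal sketch. The `ToolCallEvent` shape and all function names below are hypothetical and are not taken from `eval-graders`; the sketch only assumes that tool calls appear in the trace as `mcp__<server>__<tool>`, and it applies the same event-based idea to all three checks, which may differ from how the PR implements L2.

```typescript
// Hypothetical trace-event shape; the real eval-graders types may differ.
interface ToolCallEvent {
  tool: string; // e.g. "mcp__auth0-hosted-mcp__auth0_create_application"
  args: Record<string, unknown>;
}

// Strip an "mcp__<server>__" prefix so checks can compare bare tool names.
function bareToolName(tool: string): string {
  const parts = tool.split("__");
  if (parts.length >= 3 && parts[0] === "mcp") {
    return parts.slice(2).join("__");
  }
  return tool;
}

// L2-style check: none of the known hallucinated names were called.
function noHallucinatedTools(events: ToolCallEvent[], banned: string[]): boolean {
  return events.every((e) => banned.indexOf(bareToolName(e.tool)) === -1);
}

// L4-style check: the expected tool was called at least once.
function calledToolSketch(events: ToolCallEvent[], name: string): boolean {
  return events.some((e) => bareToolName(e.tool) === name);
}

// L5-style check: at least one call to the tool satisfies an argument predicate.
function calledWithArgs(
  events: ToolCallEvent[],
  name: string,
  predicate: (args: Record<string, unknown>) => boolean
): boolean {
  return events.some((e) => bareToolName(e.tool) === name && predicate(e.args));
}
```

For this eval, the L5 predicate would be along the lines of `(a) => a["app_type"] === "spa" && a["name"] === "My Web App"`.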
Multi-step eval: create API, create M2M app, grant access. All 3 tools called in sequence with L5 arg checks. Grades A (100%) on claude-opus-4-7.
Read-and-reason eval: query logs with filter, report failure types and counts. Grades A (100%) on claude-opus-4-7.
Multi-step eval: create Post-Login action with roles claim, then deploy it. L5 checks trigger ID, event.authorization.roles usage, and idToken assignment. Grades A (100%) on claude-opus-4-7.
Tests creating a progressive profiling form with named fields (job_title, company). L5 checks form name and presence of both field keys in nodes. Grades A (100%) on claude-opus-4-7 with MCP.
Read-then-write eval: list apps to find by name, get existing callbacks, update with new URL preserved alongside existing ones. L5 checks new callback present and client_id was looked up dynamically. Grades A (100%) on claude-opus-4-7.
…nvestigate_api_scopes
Removed hardcoded tenant domain from all hosted-mcp PROMPT.md files; the domain is injected via eval.config.js env vars and is not needed in prompts.
Added hosted_mcp_investigate_api_scopes: list resource servers, inspect existing scopes, then update with a new write:inventory scope while preserving read:inventory (read→reason→write chain).
…d_and_inspect_application evals
…ulk_app_audit evals
…ormance_review evals
…ngs, setup_native_app evals
…and_form_setup evals
MCP_TENANT_DOMAIN, MCP_CLIENT_ID, and MCP_CLIENT_SECRET were not passed to the container, so eval.config.js never registered auth0-hosted-mcp. Without the server, the agent fell back to auth0-cli bash commands.
- audit-m2m-grants: replace contains() file checks with event predicates that inspect tool call results (no file is written in this eval)
- rename-application: same fix for the Analytics Hub confirmation check
- setup-m2m-application: accept scope as string or array (Management API returns space-separated string, agent may pass either format)
- complete-developer-onboarding: same scope format fix for send:notifications
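The scope-format fix described above (accepting either a space-separated string or an array) could look roughly like the following. This is a sketch under assumptions, not the grader code from the PR; `normalizeScopes` and `hasScope` are hypothetical names.

```typescript
// Normalize a scope value that may arrive as "read:a write:b" or ["read:a", "write:b"].
function normalizeScopes(scope: unknown): string[] {
  if (Array.isArray(scope)) {
    // Keep only non-empty string entries.
    return scope.filter((s): s is string => typeof s === "string" && s.length > 0);
  }
  if (typeof scope === "string") {
    // Split on any whitespace and drop empty fragments.
    return scope.split(/\s+/).filter((s) => s.length > 0);
  }
  // Anything else (undefined, numbers, objects) is treated as "no scopes".
  return [];
}

// True when the wanted scope is present, regardless of the input format.
function hasScope(scope: unknown, wanted: string): boolean {
  return normalizeScopes(scope).indexOf(wanted) !== -1;
}
```

With this shape, a predicate such as `hasScope(args.scope, "send:notifications")` gives the same answer whichever format the agent passed.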
… graders
- Add MCPOAuthConfig type and optional auth field to MCPHttpServerConfig
- Add mintMcpToken client-credentials helper in eval-core
- Wire mintMcpToken into claude-code runner (per-job Bearer token forwarding)
- Add calledTool, calledToolOneOf, getSuccessfulMcpCalls grader primitives
- Register auth0-hosted-mcp server in eval.config.js (gated on env vars)
- Forward MCP_TENANT_DOMAIN/CLIENT_ID/CLIENT_SECRET into Docker sandbox
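To make the client-credentials step concrete, here is a minimal sketch of the request such a helper might build and the header a runner might forward. It is not the `mintMcpToken` implementation: the `OAuthConfigSketch` fields, both function names, and in particular the `audience` value are assumptions. The only parts relied on as fact are Auth0's `/oauth/token` endpoint and the standard `client_credentials` grant parameters.

```typescript
// Hypothetical config shape; the PR's MCPOAuthConfig fields are not shown in this page.
interface OAuthConfigSketch {
  tenantDomain: string; // e.g. the value of MCP_TENANT_DOMAIN
  clientId: string;     // e.g. the value of MCP_CLIENT_ID
  clientSecret: string; // e.g. the value of MCP_CLIENT_SECRET
  audience: string;     // assumption: the API identifier the hosted MCP server expects
}

interface TokenRequest {
  url: string;
  body: Record<string, string>;
}

// Build the JSON body for a client-credentials exchange against the tenant's token endpoint.
function buildTokenRequest(cfg: OAuthConfigSketch): TokenRequest {
  return {
    url: "https://" + cfg.tenantDomain + "/oauth/token",
    body: {
      grant_type: "client_credentials",
      client_id: cfg.clientId,
      client_secret: cfg.clientSecret,
      audience: cfg.audience,
    },
  };
}

// Header that would be forwarded to the MCP server once a token has been minted.
function bearerHeader(accessToken: string): Record<string, string> {
  return { Authorization: "Bearer " + accessToken };
}
```

Minting one token per agent job, as the commit describes, keeps each job's token fresh for the duration of that job rather than sharing one token across a long run.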
…and update-form-fields
Summary
Adds 32 end-to-end evals for the Auth0 hosted MCP server plus the framework infrastructure needed to run them.
Framework changes (new to main)
- `MCPOAuthConfig` type + optional `auth` field on `MCPHttpServerConfig`: declarative OAuth config for authenticated HTTP MCP servers
- `mintMcpToken` helper in `eval-core`: performs a client-credentials exchange per agent job (no mid-matrix token expiry)
- `calledTool` / `calledToolOneOf` / `getSuccessfulMcpCalls` grader primitives in `eval-graders`: event-based, L4/L5 only, match `mcp__<server>__<tool>` trace entries
- … `auth` block
- … `MCP_TENANT_DOMAIN`, `MCP_CLIENT_ID`, `MCP_CLIENT_SECRET` into the container so `eval.config.js` can register the authenticated server
- `eval.config.js`: registers `auth0-hosted-mcp` server gated on the three env vars above

Evals added (32 total in `apps/auth0-evals/src/evals/hosted-mcp/`)
Covering 6 domains (applications, resource servers, application grants, actions, forms, logs):
- `create_application`, `setup_native_app`, `setup_regular_web_app`, `setup_spa_with_api`, `setup_m2m_application`, `rename_application`, `update_application_callbacks`, `find_and_inspect_application`, `audit_tenant_applications`, `bulk_app_audit`, `complete_developer_onboarding`
- `create_api_and_inspect`, `investigate_api_scopes`, `add_multiple_scopes`, `update_api_token_settings`, `oidc_conformance_review`
- `create_and_deploy_action`, `enable_action_in_flow`, `update_action_code`, `full_action_lifecycle`, `action_and_form_setup`
- `create_progressive_profile_form`, `list_and_inspect_forms`, `update_form_fields`
- `diagnose_with_logs`, `search_logs_by_user`, `debug_failed_login`, `get_log_details`
- `setup_m2m_application`, `diagnose_and_fix_config`, `troubleshoot_missing_scope`, `audit_m2m_grants`, `complete_developer_onboarding`

Eval results (claude-sonnet-4-6, agent+mcp mode, serial run)
- `create_application`, `setup_native_app`, `setup_regular_web_app`, `setup_spa_with_api`, `audit_m2m_grants`, `rename_application`
- `create_and_deploy_action` (66%): agent does not poll `auth0_get_action` to confirm `deployed` status after calling `auth0_deploy_action`
- `create_api_and_inspect` (75%): agent creates the API but does not call a verify tool afterward

Notes
- `--workers 1`: the test tenant has an app limit that causes false C grades when workers run in parallel
- `codex` and `copilot` runners do not yet forward MCP auth headers (existing TODO comments in those runners); these evals target `--agent-type claude-code`

Test plan
- `npm run build`: passes (all 5 packages)
- `npm run evals -- --eval hosted_mcp_list_applications --mode agent --tools mcp --model claude-sonnet-4-6 --agent-type claude-code` with `MCP_*` env vars set: grade A, `calledTool` L4 passes
- Without `MCP_*` env vars: sandbox logs warning, `auth0-hosted-mcp` not registered, graders fail as expected
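The env-var gating exercised by the last test-plan item can be sketched as follows. This is an illustration under assumptions rather than the contents of `eval.config.js`: the `ServerEntrySketch` shape and both function names are hypothetical, and the environment is passed in as a plain object so the example stays self-contained.

```typescript
type Env = Record<string, string | undefined>;

// The three variables the PR description says gate registration.
const REQUIRED_VARS: string[] = ["MCP_TENANT_DOMAIN", "MCP_CLIENT_ID", "MCP_CLIENT_SECRET"];

// Hypothetical server entry; the real MCPHttpServerConfig has more fields.
interface ServerEntrySketch {
  name: string;
  tenantDomain: string;
}

// Names of required variables that are unset or empty.
function missingMcpVars(env: Env): string[] {
  return REQUIRED_VARS.filter((name) => !env[name]);
}

// Register the hosted MCP server only when every required variable is present;
// otherwise emit a warning and register nothing.
function hostedMcpServers(env: Env, warn: (msg: string) => void): ServerEntrySketch[] {
  const missing = missingMcpVars(env);
  if (missing.length > 0) {
    warn("auth0-hosted-mcp not registered; missing: " + missing.join(", "));
    return [];
  }
  return [{ name: "auth0-hosted-mcp", tenantDomain: env["MCP_TENANT_DOMAIN"] as string }];
}
```

In a Node-based config one would typically pass `process.env` and `console.warn`; failing closed like this is what makes the graders fail visibly when the variables are absent instead of silently passing on a fallback path.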