feat(mcp-readability): add MCP style-guide compliance eval#459
feat(mcp-readability): add MCP style-guide compliance eval#459akangsha7 wants to merge 2 commits into
Conversation
Add an `mcp_readability` orchestrator to EvalBench that checks MCP endpoints against the Data Cloud MCP style guide and reports compliance metrics (P0/P1/P2 counts, tool count, token budget usage, score). - Tool fetching uses the official `mcp` SDK over Streamable HTTP (`streamable_http_client` + `ClientSession.list_tools`) with URL sanitization and ADC auth; a `file` source supports offline runs. - Tools are rendered to man-page markup (resolves $ref/$defs, flattens nesting, renders enum/default) which feeds the LLM scorer. - Scorer uses Vertex Gemini JSON mode, anchors severity to style-guide priority annotations, supports per-endpoint waivers, evaluates from an LLM-agent-consumption perspective, and runs a dedicated consistency second pass against prior feedback when available. - Results write to a compliance CSV and, when `reporting.bigquery` is configured, append to the existing eval BigQuery table. - Endpoint list mirrors monitored_endpoints.textproto (24 endpoints). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
| generic ``generate`` for non-Gemini models. | ||
| """ | ||
| client = getattr(self.model, "client", None) | ||
| caller = getattr(self.model, "_call_generate_content", None) |
There was a problem hiding this comment.
lets not do this, either add a flag or a new signature to skip the sanitizer instead of jamming into the internals
| INTERNAL_ERROR = 4 | ||
|
|
||
|
|
||
| def _coerce(enum_cls, value, default): |
There was a problem hiding this comment.
this seems unecessary, we control the config don't we ?
| url = self.sanitize_url(raw_url) | ||
| headers = self._auth_headers(source) | ||
| try: | ||
| return anyio.run( |
There was a problem hiding this comment.
anyio.run(...) creates a fresh event loop. That's fine today because the orchestrator drives this from its concurrent.futures.ThreadPoolExecutor (each worker thread has no running loop). But if this generator is ever called from an existing async context — e.g., the agent-flow async refactors that have been landing — it will raise RuntimeError: cannot be called from running event loop, and the orchestrator will record a FETCH_ERROR row with a confusing message.
Two options:
- Detect an existing loop and use
anyio.from_thread.run(...)in that case, so the generator works in either context. - Or add an explicit comment here that this generator is sync-only by design and will break if invoked from async code, so the next caller doesn't get surprised.
| row.update( | ||
| { | ||
| "check_status": CheckStatus.SUCCESS.name, | ||
| "p0_issues": int(feedback.get("p0_issues", 0)), |
There was a problem hiding this comment.
If the LLM returns a non-numeric value here (e.g. "p0_issues": "five" or null in a way that survives _parse), int(...) raises ValueError / TypeError. The outer try does catch it — but it lands in the generic except Exception block below as INTERNAL_ERROR rather than the more accurate ANALYSIS_ERROR (the model's output is the actual problem). The scorer already exposes _safe_int for exactly this; lift it (or re-define alongside) and use it for all four fields:
from scorers.mcp_style_compliance import _safe_int
...
"p0_issues": _safe_int(feedback.get("p0_issues", 0)),
"p1_issues": _safe_int(feedback.get("p1_issues", 0)),
"p2_issues": _safe_int(feedback.get("p2_issues", 0)),
"compliance_score": _safe_int(feedback.get("compliance_score", 0)),Keeps a malformed model response on the ANALYSIS_ERROR path with full row metrics still attempted, rather than dropping to INTERNAL_ERROR with no compliance fields.
|
|
||
| # The abstract base requires generate_internal; the orchestrator calls | ||
| # fetch_tools directly, but we keep this so the class is a valid generator. | ||
| def generate_internal(self, prompt): |
There was a problem hiding this comment.
This override is a soft-fail shim — "" for any non-dict input — that mostly exists to satisfy the abstract base. The orchestrator never calls it (it uses fetch_tools directly).
Problem: if a future caller plumbs this generator into the standard generation path and passes a string prompt, they get back an empty string with no log and no error — the failure looks like the generator succeeded with no content. That's a debugging nightmare.
Two cleaner options:
- Drop the override and let the base class's abstract method raise.
- Or fail loud:
def generate_internal(self, prompt): raise NotImplementedError( "McpToolsGenerator does not support text generation; " "call fetch_tools(endpoint, defaults) directly." )
The second option is friendliest because the error message tells the next caller exactly what to do.
|
|
||
| # Shared model registry for get_generator (lock + cache), as used by | ||
| # the standard orchestrators. | ||
| self.global_models = { |
There was a problem hiding this comment.
This is a fresh per-orchestrator global_models dict. As a result, get_generator(self.global_models, ...) always misses the registry and instantiates a new model client (Gemini, etc.) per orchestrator. If multiple orchestrators run in the same process (a future suite-config scenario, or someone wiring eval_service to construct several in sequence), they each re-build the same model rather than reusing one.
The other orchestrators accept global_models as a constructor argument so callers can share it. Suggest matching that pattern here:
def __init__(self, config, db_configs, setup_config, report_progress=False, global_models=None):
super().__init__(config, db_configs, setup_config, report_progress)
...
self.global_models = global_models or {
"lock": threading.Lock(),
"registered_models": {},
}Low blast radius (kwarg default preserves current behavior) and keeps the door open to model reuse. Minor — small N of endpoints today, but it's the kind of thing that ossifies into "why does this re-init?" later.
|
PTAL failed actions and CLA |
| config_df = None | ||
| results_df = None | ||
| scores_df = None | ||
| summary_scores_df = None |
There was a problem hiding this comment.
can we have the results needed managed within those data frames and reuse the reporter to handle the result upload?
| # Return Nones so the standard NL2SQL reporting path is skipped. | ||
| return (self.job_id, self.run_time, None, None, None) | ||
|
|
||
| def _write_bigquery(self, bq_config): |
There was a problem hiding this comment.
I would recommend to use reporter to handle the result upload rather than the orchestrator doing the work -- this customized path technically works, but introduces additional burden to maintain as it deviates from the normal evalbench flow.
There was a problem hiding this comment.
IIUC, this refers to the offline file-based approach; I think we should stick to raw json as input, otherwise it remains unclear how this file can be realistically constructed by a developer and aligned with the man page expected by the eval.
| return "" | ||
|
|
||
| @staticmethod | ||
| def _load_previous_feedback(csv_path) -> dict: |
There was a problem hiding this comment.
This seems to be relying on local files and I assume we are planning to extend this to read from BQ? My recommendation would be to do this as a follow-up when we get the full pipeline working e2e (results displayed in dashboard). In other words the initial version by default does not have previous feedback, wdyt?
Add an
mcp_readabilityorchestrator to EvalBench that checks MCP endpoints against the Data Cloud MCP style guide and reports compliance metrics (P0/P1/P2 counts, tool count, token budget usage, score).mcpSDK over Streamable HTTP (streamable_http_client+ClientSession.list_tools) with URL sanitization and ADC auth; afilesource supports offline runs.reporting.bigqueryis configured, append to the existing eval BigQuery table.