feat(mcp-readability): add MCP style-guide compliance eval by akangsha7 · Pull Request #459 · GoogleCloudPlatform/evalbench

akangsha7 · 2026-06-25T01:19:17Z

Add an mcp_readability orchestrator to EvalBench that checks MCP endpoints against the Data Cloud MCP style guide and reports compliance metrics (P0/P1/P2 counts, tool count, token budget usage, score).

Tool fetching uses the official mcp SDK over Streamable HTTP (streamable_http_client + ClientSession.list_tools) with URL sanitization and ADC auth; a file source supports offline runs.
Tools are rendered to man-page markup (resolves $ref/$defs, flattens nesting, renders enum/default) which feeds the LLM scorer.
Scorer uses Vertex Gemini JSON mode, anchors severity to style-guide priority annotations, supports per-endpoint waivers, evaluates from an LLM-agent-consumption perspective, and runs a dedicated consistency second pass against prior feedback when available.
Results write to a compliance CSV and, when reporting.bigquery is configured, append to the existing eval BigQuery table.
Endpoint list mirrors monitored_endpoints.textproto (24 endpoints).

Add an `mcp_readability` orchestrator to EvalBench that checks MCP endpoints against the Data Cloud MCP style guide and reports compliance metrics (P0/P1/P2 counts, tool count, token budget usage, score). - Tool fetching uses the official `mcp` SDK over Streamable HTTP (`streamable_http_client` + `ClientSession.list_tools`) with URL sanitization and ADC auth; a `file` source supports offline runs. - Tools are rendered to man-page markup (resolves $ref/$defs, flattens nesting, renders enum/default) which feeds the LLM scorer. - Scorer uses Vertex Gemini JSON mode, anchors severity to style-guide priority annotations, supports per-endpoint waivers, evaluates from an LLM-agent-consumption perspective, and runs a dedicated consistency second pass against prior feedback when available. - Results write to a compliance CSV and, when `reporting.bigquery` is configured, append to the existing eval BigQuery table. - Endpoint list mirrors monitored_endpoints.textproto (24 endpoints). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

google-cla · 2026-06-25T01:19:39Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

IsmailMehdi · 2026-06-25T03:25:58Z

+        generic ``generate`` for non-Gemini models.
+        """
+        client = getattr(self.model, "client", None)
+        caller = getattr(self.model, "_call_generate_content", None)


lets not do this, either add a flag or a new signature to skip the sanitizer instead of jamming into the internals

IsmailMehdi · 2026-06-25T03:28:50Z

+    INTERNAL_ERROR = 4
+
+
+def _coerce(enum_cls, value, default):


this seems unecessary, we control the config don't we ?

IsmailMehdi · 2026-06-25T03:30:24Z

+        url = self.sanitize_url(raw_url)
+        headers = self._auth_headers(source)
+        try:
+            return anyio.run(


anyio.run(...) creates a fresh event loop. That's fine today because the orchestrator drives this from its concurrent.futures.ThreadPoolExecutor (each worker thread has no running loop). But if this generator is ever called from an existing async context — e.g., the agent-flow async refactors that have been landing — it will raise RuntimeError: cannot be called from running event loop, and the orchestrator will record a FETCH_ERROR row with a confusing message.

Two options:

Detect an existing loop and use anyio.from_thread.run(...) in that case, so the generator works in either context.

Or add an explicit comment here that this generator is sync-only by design and will break if invoked from async code, so the next caller doesn't get surprised.

IsmailMehdi · 2026-06-25T03:30:24Z

+            row.update(
+                {
+                    "check_status": CheckStatus.SUCCESS.name,
+                    "p0_issues": int(feedback.get("p0_issues", 0)),


If the LLM returns a non-numeric value here (e.g. "p0_issues": "five" or null in a way that survives _parse), int(...) raises ValueError / TypeError. The outer try does catch it — but it lands in the generic except Exception block below as INTERNAL_ERROR rather than the more accurate ANALYSIS_ERROR (the model's output is the actual problem). The scorer already exposes _safe_int for exactly this; lift it (or re-define alongside) and use it for all four fields:

from scorers.mcp_style_compliance import _safe_int ... "p0_issues": _safe_int(feedback.get("p0_issues", 0)), "p1_issues": _safe_int(feedback.get("p1_issues", 0)), "p2_issues": _safe_int(feedback.get("p2_issues", 0)), "compliance_score": _safe_int(feedback.get("compliance_score", 0)),

Keeps a malformed model response on the ANALYSIS_ERROR path with full row metrics still attempted, rather than dropping to INTERNAL_ERROR with no compliance fields.

IsmailMehdi · 2026-06-25T03:34:42Z

+
+    # The abstract base requires generate_internal; the orchestrator calls
+    # fetch_tools directly, but we keep this so the class is a valid generator.
+    def generate_internal(self, prompt):


This override is a soft-fail shim — "" for any non-dict input — that mostly exists to satisfy the abstract base. The orchestrator never calls it (it uses fetch_tools directly).

Problem: if a future caller plumbs this generator into the standard generation path and passes a string prompt, they get back an empty string with no log and no error — the failure looks like the generator succeeded with no content. That's a debugging nightmare.

Two cleaner options:

Drop the override and let the base class's abstract method raise.

Or fail loud:
def generate_internal(self, prompt): raise NotImplementedError( "McpToolsGenerator does not support text generation; " "call fetch_tools(endpoint, defaults) directly." )

The second option is friendliest because the error message tells the next caller exactly what to do.

IsmailMehdi · 2026-06-25T03:34:42Z

+
+        # Shared model registry for get_generator (lock + cache), as used by
+        # the standard orchestrators.
+        self.global_models = {


This is a fresh per-orchestrator global_models dict. As a result, get_generator(self.global_models, ...) always misses the registry and instantiates a new model client (Gemini, etc.) per orchestrator. If multiple orchestrators run in the same process (a future suite-config scenario, or someone wiring eval_service to construct several in sequence), they each re-build the same model rather than reusing one.

The other orchestrators accept global_models as a constructor argument so callers can share it. Suggest matching that pattern here:

def __init__(self, config, db_configs, setup_config, report_progress=False, global_models=None): super().__init__(config, db_configs, setup_config, report_progress) ... self.global_models = global_models or { "lock": threading.Lock(), "registered_models": {}, }

Low blast radius (kwarg default preserves current behavior) and keeps the door open to model reuse. Minor — small N of endpoints today, but it's the kind of thing that ossifies into "why does this re-init?" later.

IsmailMehdi · 2026-06-25T03:35:20Z

PTAL failed actions and CLA

helloeve · 2026-06-25T14:50:15Z

+            config_df = None
+            results_df = None
+            scores_df = None
+            summary_scores_df = None


can we have the results needed managed within those data frames and reuse the reporter to handle the result upload?

helloeve · 2026-06-25T14:52:08Z

+        # Return Nones so the standard NL2SQL reporting path is skipped.
+        return (self.job_id, self.run_time, None, None, None)
+
+    def _write_bigquery(self, bq_config):


I would recommend to use reporter to handle the result upload rather than the orchestrator doing the work -- this customized path technically works, but introduces additional burden to maintain as it deviates from the normal evalbench flow.

helloeve · 2026-06-25T14:59:21Z

IIUC, this refers to the offline file-based approach; I think we should stick to raw json as input, otherwise it remains unclear how this file can be realistically constructed by a developer and aligned with the man page expected by the eval.

helloeve · 2026-06-25T15:07:54Z

+            return ""
+
+    @staticmethod
+    def _load_previous_feedback(csv_path) -> dict:


This seems to be relying on local files and I assume we are planning to extend this to read from BQ? My recommendation would be to do this as a follow-up when we get the full pipeline working e2e (results displayed in dashboard). In other words the initial version by default does not have previous feedback, wdyt?

akangsha7 requested a review from IsmailMehdi as a code owner June 25, 2026 01:19

IsmailMehdi reviewed Jun 25, 2026

View reviewed changes

Merge branch 'main' into feat/mcp-readability-eval

b705b55

IsmailMehdi requested changes Jun 25, 2026

View reviewed changes

helloeve reviewed Jun 25, 2026

View reviewed changes

akangsha7 closed this Jun 25, 2026

akangsha7 reopened this Jun 25, 2026

akangsha7 marked this pull request as draft June 25, 2026 16:44

akangsha7 closed this Jun 25, 2026

Uh oh!

Conversation

akangsha7 commented Jun 25, 2026

Uh oh!

google-cla Bot commented Jun 25, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

IsmailMehdi commented Jun 25, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants