Skip to content

feat(mcp-readability): add MCP style-guide compliance eval#459

Closed
akangsha7 wants to merge 2 commits into
GoogleCloudPlatform:mainfrom
akangsha7:feat/mcp-readability-eval
Closed

feat(mcp-readability): add MCP style-guide compliance eval#459
akangsha7 wants to merge 2 commits into
GoogleCloudPlatform:mainfrom
akangsha7:feat/mcp-readability-eval

Conversation

@akangsha7

Copy link
Copy Markdown

Add an mcp_readability orchestrator to EvalBench that checks MCP endpoints against the Data Cloud MCP style guide and reports compliance metrics (P0/P1/P2 counts, tool count, token budget usage, score).

  • Tool fetching uses the official mcp SDK over Streamable HTTP (streamable_http_client + ClientSession.list_tools) with URL sanitization and ADC auth; a file source supports offline runs.
  • Tools are rendered to man-page markup (resolves $ref/$defs, flattens nesting, renders enum/default) which feeds the LLM scorer.
  • Scorer uses Vertex Gemini JSON mode, anchors severity to style-guide priority annotations, supports per-endpoint waivers, evaluates from an LLM-agent-consumption perspective, and runs a dedicated consistency second pass against prior feedback when available.
  • Results write to a compliance CSV and, when reporting.bigquery is configured, append to the existing eval BigQuery table.
  • Endpoint list mirrors monitored_endpoints.textproto (24 endpoints).

Add an `mcp_readability` orchestrator to EvalBench that checks MCP
endpoints against the Data Cloud MCP style guide and reports compliance
metrics (P0/P1/P2 counts, tool count, token budget usage, score).

- Tool fetching uses the official `mcp` SDK over Streamable HTTP
  (`streamable_http_client` + `ClientSession.list_tools`) with URL
  sanitization and ADC auth; a `file` source supports offline runs.
- Tools are rendered to man-page markup (resolves $ref/$defs, flattens
  nesting, renders enum/default) which feeds the LLM scorer.
- Scorer uses Vertex Gemini JSON mode, anchors severity to style-guide
  priority annotations, supports per-endpoint waivers, evaluates from an
  LLM-agent-consumption perspective, and runs a dedicated consistency
  second pass against prior feedback when available.
- Results write to a compliance CSV and, when `reporting.bigquery` is
  configured, append to the existing eval BigQuery table.
- Endpoint list mirrors monitored_endpoints.textproto (24 endpoints).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@akangsha7 akangsha7 requested a review from IsmailMehdi as a code owner June 25, 2026 01:19
@google-cla

google-cla Bot commented Jun 25, 2026

Copy link
Copy Markdown

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

generic ``generate`` for non-Gemini models.
"""
client = getattr(self.model, "client", None)
caller = getattr(self.model, "_call_generate_content", None)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets not do this, either add a flag or a new signature to skip the sanitizer instead of jamming into the internals

INTERNAL_ERROR = 4


def _coerce(enum_cls, value, default):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems unecessary, we control the config don't we ?

url = self.sanitize_url(raw_url)
headers = self._auth_headers(source)
try:
return anyio.run(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyio.run(...) creates a fresh event loop. That's fine today because the orchestrator drives this from its concurrent.futures.ThreadPoolExecutor (each worker thread has no running loop). But if this generator is ever called from an existing async context — e.g., the agent-flow async refactors that have been landing — it will raise RuntimeError: cannot be called from running event loop, and the orchestrator will record a FETCH_ERROR row with a confusing message.

Two options:

  • Detect an existing loop and use anyio.from_thread.run(...) in that case, so the generator works in either context.
  • Or add an explicit comment here that this generator is sync-only by design and will break if invoked from async code, so the next caller doesn't get surprised.

row.update(
{
"check_status": CheckStatus.SUCCESS.name,
"p0_issues": int(feedback.get("p0_issues", 0)),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the LLM returns a non-numeric value here (e.g. "p0_issues": "five" or null in a way that survives _parse), int(...) raises ValueError / TypeError. The outer try does catch it — but it lands in the generic except Exception block below as INTERNAL_ERROR rather than the more accurate ANALYSIS_ERROR (the model's output is the actual problem). The scorer already exposes _safe_int for exactly this; lift it (or re-define alongside) and use it for all four fields:

from scorers.mcp_style_compliance import _safe_int
...
"p0_issues": _safe_int(feedback.get("p0_issues", 0)),
"p1_issues": _safe_int(feedback.get("p1_issues", 0)),
"p2_issues": _safe_int(feedback.get("p2_issues", 0)),
"compliance_score": _safe_int(feedback.get("compliance_score", 0)),

Keeps a malformed model response on the ANALYSIS_ERROR path with full row metrics still attempted, rather than dropping to INTERNAL_ERROR with no compliance fields.


# The abstract base requires generate_internal; the orchestrator calls
# fetch_tools directly, but we keep this so the class is a valid generator.
def generate_internal(self, prompt):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This override is a soft-fail shim — "" for any non-dict input — that mostly exists to satisfy the abstract base. The orchestrator never calls it (it uses fetch_tools directly).

Problem: if a future caller plumbs this generator into the standard generation path and passes a string prompt, they get back an empty string with no log and no error — the failure looks like the generator succeeded with no content. That's a debugging nightmare.

Two cleaner options:

  • Drop the override and let the base class's abstract method raise.
  • Or fail loud:
    def generate_internal(self, prompt):
        raise NotImplementedError(
            "McpToolsGenerator does not support text generation; "
            "call fetch_tools(endpoint, defaults) directly."
        )

The second option is friendliest because the error message tells the next caller exactly what to do.


# Shared model registry for get_generator (lock + cache), as used by
# the standard orchestrators.
self.global_models = {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fresh per-orchestrator global_models dict. As a result, get_generator(self.global_models, ...) always misses the registry and instantiates a new model client (Gemini, etc.) per orchestrator. If multiple orchestrators run in the same process (a future suite-config scenario, or someone wiring eval_service to construct several in sequence), they each re-build the same model rather than reusing one.

The other orchestrators accept global_models as a constructor argument so callers can share it. Suggest matching that pattern here:

def __init__(self, config, db_configs, setup_config, report_progress=False, global_models=None):
    super().__init__(config, db_configs, setup_config, report_progress)
    ...
    self.global_models = global_models or {
        "lock": threading.Lock(),
        "registered_models": {},
    }

Low blast radius (kwarg default preserves current behavior) and keeps the door open to model reuse. Minor — small N of endpoints today, but it's the kind of thing that ossifies into "why does this re-init?" later.

@IsmailMehdi

Copy link
Copy Markdown
Collaborator

PTAL failed actions and CLA

Comment thread datasets/mcp_readability/style_guide.md
Comment thread evalbench/evalbench.py
config_df = None
results_df = None
scores_df = None
summary_scores_df = None

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have the results needed managed within those data frames and reuse the reporter to handle the result upload?

# Return Nones so the standard NL2SQL reporting path is skipped.
return (self.job_id, self.run_time, None, None, None)

def _write_bigquery(self, bq_config):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to use reporter to handle the result upload rather than the orchestrator doing the work -- this customized path technically works, but introduces additional burden to maintain as it deviates from the normal evalbench flow.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this refers to the offline file-based approach; I think we should stick to raw json as input, otherwise it remains unclear how this file can be realistically constructed by a developer and aligned with the man page expected by the eval.

return ""

@staticmethod
def _load_previous_feedback(csv_path) -> dict:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be relying on local files and I assume we are planning to extend this to read from BQ? My recommendation would be to do this as a follow-up when we get the full pipeline working e2e (results displayed in dashboard). In other words the initial version by default does not have previous feedback, wdyt?

@akangsha7 akangsha7 closed this Jun 25, 2026
@akangsha7 akangsha7 reopened this Jun 25, 2026
@akangsha7 akangsha7 marked this pull request as draft June 25, 2026 16:44
@akangsha7 akangsha7 closed this Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants