[Evaluator][Skill Eval] Support configurable Switchyard backends for skill-eval agent runs #151

@puhuim

Request

Platform QA would like to run Evaluator skill-eval against different Switchyard backend profiles, so we can compare skill-eval behavior under backend routing strategies instead of only evaluating a single direct model.

Desired profiles:

  • default model passthrough
  • Switchyard random routing, for example GPT 5.4 / DeepSeek V4
  • Switchyard dynamic routing, for example GPT 5.4 / DeepSeek V4 with DeepSeek V4 as classifier

What works today

Switchyard / IGW routing itself appears functional. A Switchyard VirtualModel with translate middleware can accept Claude Code style Anthropic /v1/messages requests and route them to OpenAI-compatible backend models successfully.

Current blocker

When running astra-skill-eval evaluate with Harbor and claude-code, the run fails before any skill-eval trial starts: the Harbor agent/model preflight rejects the Platform Switchyard VirtualModel model id.

Observed failure shape:

Error: model not available for claude-code:
default/<switchyard-virtual-model-name>

Available claude-code models for this key:
  aws/anthropic/bedrock-claude-opus-4-7
  aws/anthropic/bedrock-claude-opus-4-6
  ...

Source analysis

From astra-skill-eval / Harbor source analysis:

  • runner.py resolves ANTHROPIC_MODEL as the claude-code model.
  • model_catalog.py validates the selected model against the NVIDIA /models catalog.
  • For claude-code, the compatibility filter only accepts model ids containing anthropic or claude.
  • A Platform Switchyard VM model id such as default/skill-eval-swy-random-... is not in the public NVIDIA catalog and is rejected before Claude Code calls the configured ANTHROPIC_BASE_URL.

The Harbor Claude Code adapter itself appears capable of using a custom ANTHROPIC_BASE_URL together with a custom ANTHROPIC_MODEL. The blocker is the preflight catalog validation path, which runs before the adapter is ever invoked.
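
To make the failure path concrete, here is a minimal sketch of the preflight behavior described above. It is reconstructed from the bullet points, not copied from model_catalog.py: the function names are hypothetical, the case-sensitive substring match is an assumption, and the Switchyard model id is a placeholder.

```python
# Illustrative reconstruction only; names are hypothetical, not Harbor's API.

def is_claude_code_compatible(model_id: str) -> bool:
    # Compatibility filter as described: accept ids containing
    # "anthropic" or "claude" (case-sensitive match is an assumption).
    return "anthropic" in model_id or "claude" in model_id


def preflight_claude_code_model(model_id: str, catalog_ids: list[str]) -> None:
    # Only catalog entries that pass the filter count as available.
    available = [m for m in catalog_ids if is_claude_code_compatible(m)]
    if model_id not in available:
        listing = "\n".join(f"  {m}" for m in available)
        raise ValueError(
            f"model not available for claude-code:\n{model_id}\n\n"
            f"Available claude-code models for this key:\n{listing}"
        )


if __name__ == "__main__":
    catalog = [
        "aws/anthropic/bedrock-claude-opus-4-7",
        "aws/anthropic/bedrock-claude-opus-4-6",
    ]
    # A catalog model passes the check.
    preflight_claude_code_model("aws/anthropic/bedrock-claude-opus-4-7", catalog)
    # A placeholder Switchyard VM id is rejected before any request is sent.
    try:
        preflight_claude_code_model("default/example-switchyard-vm", catalog)
    except ValueError as err:
        print(err)
```

Under these assumptions the VirtualModel id fails the membership test regardless of what ANTHROPIC_BASE_URL points to, which matches the observed failure shape.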

Expected behavior

Platform Evaluator skill-eval should support configuring Switchyard backend routing for agent eval runs.

When ANTHROPIC_BASE_URL or another custom gateway base URL points to Platform IGW, astra-skill-eval / Harbor should either:

  • validate model availability against the custom gateway /models endpoint, or
  • allow bypassing public NVIDIA catalog validation for custom gateway models, or
  • provide an explicit config flag for custom agent model preflight.
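
One possible shape for such a preflight is sketched below; it combines the gateway /models check with an explicit bypass flag. Everything here is an assumption for illustration: the function and parameter names (`preflight_agent_model`, `skip_catalog_check`, `fetch_models`) are hypothetical, and the gateway is assumed to serve an OpenAI-style `{"data": [{"id": ...}]}` body at `<base_url>/models`.

```python
import json
import urllib.request
from typing import Callable, Iterable, Optional


def fetch_gateway_models(base_url: str) -> list[str]:
    # Assumption: the gateway answers GET <base_url>/models with an
    # OpenAI-style payload of the form {"data": [{"id": "..."}]}.
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload.get("data", [])]


def preflight_agent_model(
    model_id: str,
    public_catalog: Iterable[str],
    custom_base_url: Optional[str] = None,
    skip_catalog_check: bool = False,
    fetch_models: Callable[[str], list[str]] = fetch_gateway_models,
) -> str:
    """Return the name of the path that accepted the model, or raise ValueError."""
    if skip_catalog_check:
        # Explicit opt-out: trust the configured model id as given.
        return "skipped"
    if custom_base_url:
        # Custom gateway configured: validate against its own model list.
        if model_id in fetch_models(custom_base_url):
            return "gateway"
        raise ValueError(f"model not served by gateway {custom_base_url}: {model_id}")
    # No custom gateway: fall back to the public catalog check.
    if model_id in set(public_catalog):
        return "public-catalog"
    raise ValueError(f"model not available in public catalog: {model_id}")
```

Passing the fetcher as a parameter keeps the check testable without network access, and it leaves room for a gateway whose model listing differs from the assumed response shape.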

Impact

QA cannot run end-to-end Evaluator skill-eval coverage through Switchyard backends, even though the gateway route itself works. This blocks testing skill-eval behavior across backend routing modes and makes it hard to compare default passthrough vs random vs dynamic Switchyard routing.

Suggested acceptance criteria

  • Evaluator skill-eval can run with a configured Switchyard VirtualModel as the claude-code agent model.
  • The run performs with-skill and without-skill baseline trials normally.
  • The resulting artifacts show the selected backend profile, the agent model, and routing stats/distribution.
  • Custom gateway models are not rejected solely because they are absent from the public NVIDIA model catalog.
