Skip to content

feat(orchestrator): per-Agent LLM model selection#387

Merged
ConnysCode merged 4 commits into
byte5ai:mainfrom
sneumannb5:feat/define-llm-model
Jul 2, 2026
Merged

feat(orchestrator): per-Agent LLM model selection#387
ConnysCode merged 4 commits into
byte5ai:mainfrom
sneumannb5:feat/define-llm-model

Conversation

@sneumannb5

Copy link
Copy Markdown
Contributor

What

Per-Agent orchestrator LLM model selection (Closes #296). Each Agent can pin its own model (and per-turn triage routing) via the Admin builder instead of one
global ORCHESTRATOR_MODEL. Model picker is populated from the @omadia/llm-provider registry.

Why

Cost/latency differ per Agent — a high-volume support Agent can run on a cheap model while a reasoning-heavy one stays on the frontier model. Turns an
operator runtime choice into a per-instance setting instead of a deploy-time env decision. The pluggable-provider work only pays off once instances can
actually pick different models.

Test plan

  • cd middleware && npm run typecheck -w @omadia/orchestrator
  • cd middleware && npx tsc --noEmit in middleware and web-ui (clean)
  • new/affected tests pass: configStoreModelRoutingValidation, agentBuilderAgentNode, agentGraphStoreSubAgentModelValidation, orchestratorRegistry, subAgentModelResolution, agentBuilderSubAgentTools
  • manual: set per-Agent model in Admin builder → confirm effectiveModel badge + registry agent rebuilt log, turn runs without 404

Note: full middleware suite has 5 pre-existing failures in durableRecall/recallProbe — bisected to 1550268 (#385), date-relative test rot
(fixed 2026-06-01 data vs now()), unrelated to this PR.

Risk / blast radius

  • No schema change — reuses existing agents.model_routing (JSONB) + agent_subagents.model.
  • No new env var. ORCHESTRATOR_MODEL kept as the fallback tier → existing deployments unchanged.
  • Resolution behaviour change (core): the orchestrator, per-turn router, and in-process sub-agents send model raw to a single provider adapter.
    Per-Agent refs are now resolved to the active provider's bare modelId before build (resolveModelIdForProvider): registry-known same-provider →
    modelId; cross-provider → dropped to default (out of scope, would 404 on the wrong adapter); registry-unknown → passed through raw (preserves
    pre-resolution behaviour, e.g. the default classifier).
  • Public API (additive): GET /v1/builder/models now returns model_id (bare vendor id) alongside id.
  • Hot-applies via the existing per-Agent rebuild — no restart.

Naming-decisions still pending

@ConnysCode ConnysCode left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Request changes

One blocker and two majors before merge. The core resolution design (3-tier overlay, hot-apply, the effectiveModel surface) is sound, and I verified the no-schema-change / no-env-var claims hold — findings below are scoped, all traced against the cloned head.

Must change before merge

🔴 BLOCKER — per-Agent model selection is broken on claude-cli (subscription) deployments. The claude-cli branch at applyDiff.ts:178 skips ref→modelId resolution and passes the ref through raw. But the new picker writes provider-qualified ids: on a keyless CLI deployment /v1/builder/models offers only claude-cli:opus-cli / sonnet-cli / haiku-cli (registered via listModels(), builtinLlmProviders.ts:136-159), validateModelRef accepts them, so model_routing.main = 'claude-cli:opus-cli' persists. At build the ref stays raw → buildOrchestrator.ts:335 runs config.model.replace(/-cli$/, '')'claude-cli:opus'CliChatAgent passes it verbatim as claude -p --model claude-cli:opus (cliChatAgent.ts:608), an invalid model, so every turn for that Agent fails. The CLI expects the bare alias opus. On a keyless deployment there's no working pick, so the headline feature can't be used. Traced end-to-end; fix inline.

🟠 MAJOR — a pinned sub-agent model can never be cleared back to "inherit". updateSubAgent uses model = COALESCE($4, model) with $4 = patch.model ?? null (agentGraphStore.ts:359/366), so the documented null = "clear" is a no-op — COALESCE(NULL, model) keeps the old value. The operator can't revert a sub-agent from a pinned model to inherit-parent. Fix inline.

🟠 MAJOR — a cross-provider main-model pick is accepted at write time, then silently dropped at runtime. The picker lists every connected provider's models, not just the orchestrator's active provider, and validateModelRef (configStore.ts:343) resolves against the full registry with no defaultProvider — so e.g. openai:gpt-5.5 under an Anthropic orchestrator passes validation and persists. At build, resolveModelIdForProvider returns undefined for the cross-provider id and model falls back to the platform default (applyDiff.ts:188). The picker shows GPT-5.5, the Agent runs on Opus, and the only signal is the effectiveModel badge — no error at write time. (deps.provider is a single global provider — confirmed via OrchestratorRegistrybuildForAgent.) #296 scoped cross-provider routing out, but the picker actively offers a choice that silently won't run. Please scope the per-Agent / sub-agent model choice to the active provider: filter the offered list, or resolve main with { defaultProvider: <active provider> } at write time and reject (or warn on) a cross-provider pick.

Scope — sub-agent model selection (your open question)

#296 explicitly lists "Sub-agent model selection (SUB_AGENT_MODEL)" as out of scope, and this PR ships the sub-agent picker + validateModelRef guard + resolveSubAgentModel. There's no schema cost (it reuses the existing agent_subagents.model column — no migration), so keeping it is defensible — but the sub-agent half carries the MAJOR clear-bug above. Recommendation: if you keep it here, fix the updateSubAgent clear path first; if you'd rather land #296 clean, split the sub-agent picker into its own PR (the clear-bug travels with it). Either is fine — just don't merge the half-working clear.

What I verified holds (no action needed)

  • Hot-reload is race-free. setModelRoutingawait reload() swaps this.active synchronously before the response reads effectiveModel, so acceptance #1/#4 deliver (agentBuilder.ts:194-196, registry/index.ts:212-223).
  • No schema change / no new env var. agents.model_routing + agent_subagents.model both pre-exist (migration 0003); no .sql added; the three DEFAULT_ORCHESTRATOR_MODEL copies (agentRuntime.ts, plugin.ts, config.ts) all agree on claude-opus-4-8.
  • assembleGraph / agentNode blast radius is clean — every call site passes registry, and l.registry is wired in production (index.ts:2009).
  • web-ui ↔ server shapes agree (effectiveModel, model_id, aliases); the additive /v1/builder/models change is behind requireAuth and has no OpenAPI spec to drift.

PR description — factcheck

Claim Reality
"registry-known same-provider → modelId; cross-provider → default; registry-unknown → raw" True for the Anthropic/OpenAI path. Incomplete for claude-cli — those refs are passed raw without resolution → the BLOCKER.
"cross-provider → dropped to default (out of scope, would 404)" True at build — but the write succeeds with no operator signal, and the picker offered the pick.
"No schema change — reuses agents.model_routing + agent_subagents.model" ✅ Verified — no migration added.
"No new env var" ✅ Verified.
"Hot-applies via the existing per-Agent rebuild — no restart" ✅ Verified — race-free on the happy path.
"typecheck clean / listed tests pass" Not re-executed here. Note the new tests cover anthropic + openai but not claude-cli — the path that breaks.

Minimum to merge

  1. Resolve claude-cli refs to the bare modelId before buildOrchestrator strips -cli (BLOCKER).
  2. Let updateSubAgent clear model back to NULL (MAJOR) — or split the sub-agent picker out.
  3. Scope the model picker / write-validation to the orchestrator's active provider so a cross-provider pick can't silently no-op (MAJOR).
Nits (non-blocking)
  • applyDiff.ts:188 — the platform default runtime.model is sent raw (never resolved). Safe today only because the orchestrator_model install field is an enum of bare ids; a defensive resolveModelIdForProvider(runtime.model, …) ?? runtime.model would harden it if that field ever becomes free-text or a registry picker.
  • InspectorPanel.tsxuseModelCatalog() refetches /v1/builder/models on every node selection (once per editor mount). Consider lifting/caching the catalog.
  • agentBuilder.ts reload() — the bare catch {} swallows reload failures, after which the response can echo a stale effectiveModel until the next NOTIFY. Low impact.

(Reviewed against head 9a6a2f9.)

Comment thread middleware/packages/harness-orchestrator/src/registry/applyDiff.ts Outdated
Comment thread middleware/packages/harness-orchestrator/src/registry/agentRuntime.ts Outdated

@ConnysCode ConnysCode left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Approve

All three items from the last review are addressed and hold up against the cloned head — verified by running the affected suites (65/65 green, orchestrator tsc --noEmit clean). Nothing blocking remains; a couple of non-blocking cleanups below.

Already changed since the last review

  • 🔴→✅ BLOCKER (claude-cli) fixed. applyDiff.ts no longer special-cases the CLI to pass the ref raw — resolveOverlay now resolves every provider via resolveModelIdForProvider(ref, activeProvider) ?? ref?.trim(), so a picker-stored claude-cli:opus-cli resolves to opus-cli before the downstream -cli strip → opus, the alias the CLI expects. The sub-agent CLI path gets the same treatment via the new resolveCliSubAgentModel. Verified: orchestratorRegistry.test.tsBLOCKER (claude-cli): a provider-qualified CLI pick resolves to the bare CLI modelId and subAgentModelResolution.test.ts CLI cases pass.
  • 🟠→✅ MAJOR (sub-agent model can't be cleared) fixed. updateSubAgent drops COALESCE($4, model) for model = CASE WHEN $10 THEN $4 ELSE model END with an explicit modelProvided = patch.model !== undefined guard, so null/'' clears to inherit while undefined keeps. createSubAgent normalises ''NULL. Verified: agentGraphStoreSubAgentModelValidation.test.ts clear-path cases pass.
  • 🟠→✅ MAJOR (cross-provider pick silently dropped) fixed. validateModelRef / validateModelRoutingShape now take activeProvider and reject a ref that resolves to a different provider at write time; the /v1/builder/models picker and the sub-agent/model-routing writes are all scoped to the orchestrator's live llm_provider (wired through getActiveProvider / activeProvider in index.ts). A cross-provider pick errors instead of persisting-then-vanishing. Verified: the cross-provider reject cases in configStoreModelRoutingValidation.test.ts + agentGraphStoreSubAgentModelValidation.test.ts pass.
  • ⚪→✅ NIT (classifier default) fixed. DEFAULT_CLASSIFIER_MODEL is now the registry-served dated id claude-haiku-4-5-20251001, so the code's own default agrees with its write-validation.

Verification

# node 22.22.3, deps built (llm-provider, memory, orchestrator, …)
configStoreModelRoutingValidation      ok
agentGraphStoreSubAgentModelValidation ok
subAgentModelResolution                ok
agentBuilderModelRouting               ok
agentBuilderAgentNode / SubAgentTools  ok
orchestratorRegistry                   16/16 ok  (incl. both #296 BLOCKER cases + FIX2)
npm run typecheck -w @omadia/orchestrator  → EXIT 0 (clean)

PR description — factcheck

Claim Reality
"CLI refs resolve to the bare modelId before the -cli strip" ✅ Verified — resolveOverlay + resolveCliSubAgentModel; CLI test green.
"cross-provider pick rejected at write time (not silently dropped)" ✅ Verified — provider-scoped validateModelRef; reject tests green.
"No schema change / no new env var" ✅ Still holds — no migration, llm_provider reuses the existing orchestrator config key (same reader as hostProviderId).
"typecheck clean / listed tests pass" ✅ Re-executed here — orchestrator typecheck EXIT 0, all listed suites pass.

Non-blocking follow-ups

Nits
  • middleware/src/index.tsorchestratorActiveProviderId() is now the third copy of the installedRegistry.get('@omadia/orchestrator')?.config?.['llm_provider'] read (alongside hostProviderId at ~714 and the sub-agent hydrate reader at ~1608). Consider extracting one shared orchestratorProviderId() helper so a future key/default change lands in one place.
  • middleware/src/routes/builder.ts /models — since deps.activeProvider?.() always returns a string in production (defaults to anthropic), the connectedProviders fallback branch is effectively dead outside tests. Fine as a safety net, just noting it no longer gates the picker.
  • orchestratorRegistry.test.ts FIX2 now pins effectiveModel = 'openai:gpt-5.5' (raw passthrough → fails loud) for a residual/stale cross-provider ref. Correct given write-validation blocks fresh cross-provider picks and the picker is new in this PR (so no pre-existing persisted cross-provider data can exist); just flagging that the build path intentionally no longer drops-to-default for that hypothetical case.

Scope: the sub-agent picker (out of scope per #296) now works end-to-end incl. the clear path, so keeping it here is fine — your call with the team on whether to split. Approving.

(Reviewed against head 7c67e50.)

@ConnysCode ConnysCode enabled auto-merge July 2, 2026 08:11
@ConnysCode ConnysCode merged commit 00ae49a into byte5ai:main Jul 2, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow defining the LLM model per orchestrator instance

2 participants