feat(orchestrator): per-Agent LLM model selection by sneumannb5 · Pull Request #387 · byte5ai/omadia

sneumannb5 · 2026-06-29T11:29:36Z

What

Per-Agent orchestrator LLM model selection (Closes #296). Each Agent can pin its own model (and per-turn triage routing) via the Admin builder instead of one
global ORCHESTRATOR_MODEL. Model picker is populated from the @omadia/llm-provider registry.

Why

Cost/latency differ per Agent — a high-volume support Agent can run on a cheap model while a reasoning-heavy one stays on the frontier model. Turns an
operator runtime choice into a per-instance setting instead of a deploy-time env decision. The pluggable-provider work only pays off once instances can
actually pick different models.

Test plan

cd middleware && npm run typecheck -w @omadia/orchestrator
cd middleware && npx tsc --noEmit in middleware and web-ui (clean)
new/affected tests pass: configStoreModelRoutingValidation, agentBuilderAgentNode, agentGraphStoreSubAgentModelValidation, orchestratorRegistry, subAgentModelResolution, agentBuilderSubAgentTools
manual: set per-Agent model in Admin builder → confirm effectiveModel badge + registry agent rebuilt log, turn runs without 404

Note: full middleware suite has 5 pre-existing failures in durableRecall/recallProbe — bisected to 1550268 (#385), date-relative test rot
(fixed 2026-06-01 data vs now()), unrelated to this PR.

Risk / blast radius

No schema change — reuses existing agents.model_routing (JSONB) + agent_subagents.model.
No new env var. ORCHESTRATOR_MODEL kept as the fallback tier → existing deployments unchanged.
Resolution behaviour change (core): the orchestrator, per-turn router, and in-process sub-agents send model raw to a single provider adapter.
Per-Agent refs are now resolved to the active provider's bare modelId before build (resolveModelIdForProvider): registry-known same-provider →
modelId; cross-provider → dropped to default (out of scope, would 404 on the wrong adapter); registry-unknown → passed through raw (preserves
pre-resolution behaviour, e.g. the default classifier).
Public API (additive): GET /v1/builder/models now returns model_id (bare vendor id) alongside id.
Hot-applies via the existing per-Agent rebuild — no restart.

Naming-decisions still pending

Sub-agent model selection (SUB_AGENT_MODEL) is listed out-of-scope in Allow defining the LLM model per orchestrator instance #296 but this PR adds the sub-agent picker + validation. Confirm keep, or split out.

ConnysCode

Review: Request changes

One blocker and two majors before merge. The core resolution design (3-tier overlay, hot-apply, the effectiveModel surface) is sound, and I verified the no-schema-change / no-env-var claims hold — findings below are scoped, all traced against the cloned head.

Must change before merge

🔴 BLOCKER — per-Agent model selection is broken on claude-cli (subscription) deployments. The claude-cli branch at applyDiff.ts:178 skips ref→modelId resolution and passes the ref through raw. But the new picker writes provider-qualified ids: on a keyless CLI deployment /v1/builder/models offers only claude-cli:opus-cli / sonnet-cli / haiku-cli (registered via listModels(), builtinLlmProviders.ts:136-159), validateModelRef accepts them, so model_routing.main = 'claude-cli:opus-cli' persists. At build the ref stays raw → buildOrchestrator.ts:335 runs config.model.replace(/-cli$/, '') → 'claude-cli:opus' → CliChatAgent passes it verbatim as claude -p --model claude-cli:opus (cliChatAgent.ts:608), an invalid model, so every turn for that Agent fails. The CLI expects the bare alias opus. On a keyless deployment there's no working pick, so the headline feature can't be used. Traced end-to-end; fix inline.

🟠 MAJOR — a pinned sub-agent model can never be cleared back to "inherit". updateSubAgent uses model = COALESCE($4, model) with $4 = patch.model ?? null (agentGraphStore.ts:359/366), so the documented null = "clear" is a no-op — COALESCE(NULL, model) keeps the old value. The operator can't revert a sub-agent from a pinned model to inherit-parent. Fix inline.

🟠 MAJOR — a cross-provider main-model pick is accepted at write time, then silently dropped at runtime. The picker lists every connected provider's models, not just the orchestrator's active provider, and validateModelRef (configStore.ts:343) resolves against the full registry with no defaultProvider — so e.g. openai:gpt-5.5 under an Anthropic orchestrator passes validation and persists. At build, resolveModelIdForProvider returns undefined for the cross-provider id and model falls back to the platform default (applyDiff.ts:188). The picker shows GPT-5.5, the Agent runs on Opus, and the only signal is the effectiveModel badge — no error at write time. (deps.provider is a single global provider — confirmed via OrchestratorRegistry → buildForAgent.) #296 scoped cross-provider routing out, but the picker actively offers a choice that silently won't run. Please scope the per-Agent / sub-agent model choice to the active provider: filter the offered list, or resolve main with { defaultProvider: <active provider> } at write time and reject (or warn on) a cross-provider pick.

Scope — sub-agent model selection (your open question)

#296 explicitly lists "Sub-agent model selection (SUB_AGENT_MODEL)" as out of scope, and this PR ships the sub-agent picker + validateModelRef guard + resolveSubAgentModel. There's no schema cost (it reuses the existing agent_subagents.model column — no migration), so keeping it is defensible — but the sub-agent half carries the MAJOR clear-bug above. Recommendation: if you keep it here, fix the updateSubAgent clear path first; if you'd rather land #296 clean, split the sub-agent picker into its own PR (the clear-bug travels with it). Either is fine — just don't merge the half-working clear.

What I verified holds (no action needed)

Hot-reload is race-free. setModelRouting → await reload() swaps this.active synchronously before the response reads effectiveModel, so acceptance #1/#4 deliver (agentBuilder.ts:194-196, registry/index.ts:212-223).
No schema change / no new env var. agents.model_routing + agent_subagents.model both pre-exist (migration 0003); no .sql added; the three DEFAULT_ORCHESTRATOR_MODEL copies (agentRuntime.ts, plugin.ts, config.ts) all agree on claude-opus-4-8.
assembleGraph / agentNode blast radius is clean — every call site passes registry, and l.registry is wired in production (index.ts:2009).
web-ui ↔ server shapes agree (effectiveModel, model_id, aliases); the additive /v1/builder/models change is behind requireAuth and has no OpenAPI spec to drift.

PR description — factcheck

Claim	Reality
"registry-known same-provider → modelId; cross-provider → default; registry-unknown → raw"	True for the Anthropic/OpenAI path. Incomplete for `claude-cli` — those refs are passed raw without resolution → the BLOCKER.
"cross-provider → dropped to default (out of scope, would 404)"	True at build — but the write succeeds with no operator signal, and the picker offered the pick.
"No schema change — reuses `agents.model_routing` + `agent_subagents.model`"	✅ Verified — no migration added.
"No new env var"	✅ Verified.
"Hot-applies via the existing per-Agent rebuild — no restart"	✅ Verified — race-free on the happy path.
"typecheck clean / listed tests pass"	Not re-executed here. Note the new tests cover `anthropic` + `openai` but not `claude-cli` — the path that breaks.

Minimum to merge

Resolve claude-cli refs to the bare modelId before buildOrchestrator strips -cli (BLOCKER).
Let updateSubAgent clear model back to NULL (MAJOR) — or split the sub-agent picker out.
Scope the model picker / write-validation to the orchestrator's active provider so a cross-provider pick can't silently no-op (MAJOR).

Nits (non-blocking)

applyDiff.ts:188 — the platform default runtime.model is sent raw (never resolved). Safe today only because the orchestrator_model install field is an enum of bare ids; a defensive resolveModelIdForProvider(runtime.model, …) ?? runtime.model would harden it if that field ever becomes free-text or a registry picker.
InspectorPanel.tsx — useModelCatalog() refetches /v1/builder/models on every node selection (once per editor mount). Consider lifting/caching the catalog.
agentBuilder.ts reload() — the bare catch {} swallows reload failures, after which the response can echo a stale effectiveModel until the next NOTIFY. Low impact.

(Reviewed against head 9a6a2f9.)

ConnysCode

Review: Approve

All three items from the last review are addressed and hold up against the cloned head — verified by running the affected suites (65/65 green, orchestrator tsc --noEmit clean). Nothing blocking remains; a couple of non-blocking cleanups below.

Already changed since the last review

🔴→✅ BLOCKER (claude-cli) fixed. applyDiff.ts no longer special-cases the CLI to pass the ref raw — resolveOverlay now resolves every provider via resolveModelIdForProvider(ref, activeProvider) ?? ref?.trim(), so a picker-stored claude-cli:opus-cli resolves to opus-cli before the downstream -cli strip → opus, the alias the CLI expects. The sub-agent CLI path gets the same treatment via the new resolveCliSubAgentModel. Verified: orchestratorRegistry.test.ts → BLOCKER (claude-cli): a provider-qualified CLI pick resolves to the bare CLI modelId and subAgentModelResolution.test.ts CLI cases pass.
🟠→✅ MAJOR (sub-agent model can't be cleared) fixed. updateSubAgent drops COALESCE($4, model) for model = CASE WHEN $10 THEN $4 ELSE model END with an explicit modelProvided = patch.model !== undefined guard, so null/'' clears to inherit while undefined keeps. createSubAgent normalises ''→NULL. Verified: agentGraphStoreSubAgentModelValidation.test.ts clear-path cases pass.
🟠→✅ MAJOR (cross-provider pick silently dropped) fixed. validateModelRef / validateModelRoutingShape now take activeProvider and reject a ref that resolves to a different provider at write time; the /v1/builder/models picker and the sub-agent/model-routing writes are all scoped to the orchestrator's live llm_provider (wired through getActiveProvider / activeProvider in index.ts). A cross-provider pick errors instead of persisting-then-vanishing. Verified: the cross-provider reject cases in configStoreModelRoutingValidation.test.ts + agentGraphStoreSubAgentModelValidation.test.ts pass.
⚪→✅ NIT (classifier default) fixed. DEFAULT_CLASSIFIER_MODEL is now the registry-served dated id claude-haiku-4-5-20251001, so the code's own default agrees with its write-validation.

Verification

# node 22.22.3, deps built (llm-provider, memory, orchestrator, …)
configStoreModelRoutingValidation      ok
agentGraphStoreSubAgentModelValidation ok
subAgentModelResolution                ok
agentBuilderModelRouting               ok
agentBuilderAgentNode / SubAgentTools  ok
orchestratorRegistry                   16/16 ok  (incl. both #296 BLOCKER cases + FIX2)
npm run typecheck -w @omadia/orchestrator  → EXIT 0 (clean)

PR description — factcheck

Claim	Reality
"CLI refs resolve to the bare modelId before the `-cli` strip"	✅ Verified — `resolveOverlay` + `resolveCliSubAgentModel`; CLI test green.
"cross-provider pick rejected at write time (not silently dropped)"	✅ Verified — provider-scoped `validateModelRef`; reject tests green.
"No schema change / no new env var"	✅ Still holds — no migration, `llm_provider` reuses the existing orchestrator config key (same reader as `hostProviderId`).
"typecheck clean / listed tests pass"	✅ Re-executed here — orchestrator typecheck EXIT 0, all listed suites pass.

Non-blocking follow-ups

Nits

middleware/src/index.ts — orchestratorActiveProviderId() is now the third copy of the installedRegistry.get('@omadia/orchestrator')?.config?.['llm_provider'] read (alongside hostProviderId at ~714 and the sub-agent hydrate reader at ~1608). Consider extracting one shared orchestratorProviderId() helper so a future key/default change lands in one place.
middleware/src/routes/builder.ts /models — since deps.activeProvider?.() always returns a string in production (defaults to anthropic), the connectedProviders fallback branch is effectively dead outside tests. Fine as a safety net, just noting it no longer gates the picker.
orchestratorRegistry.test.ts FIX2 now pins effectiveModel = 'openai:gpt-5.5' (raw passthrough → fails loud) for a residual/stale cross-provider ref. Correct given write-validation blocks fresh cross-provider picks and the picker is new in this PR (so no pre-existing persisted cross-provider data can exist); just flagging that the build path intentionally no longer drops-to-default for that hypothetical case.

Scope: the sub-agent picker (out of scope per #296) now works end-to-end incl. the clear path, so keeping it here is fine — your call with the team on whether to split. Approving.

(Reviewed against head 7c67e50.)

feat(orchestrator): per-Agent LLM model selection

9a6a2f9

ConnysCode requested changes Jun 30, 2026

View reviewed changes

Comment thread middleware/packages/harness-orchestrator/src/registry/applyDiff.ts Outdated

Comment thread middleware/packages/harness-orchestrator/src/registry/agentGraphStore.ts

Comment thread middleware/packages/harness-orchestrator/src/registry/agentRuntime.ts Outdated

fix(orchestrator): address per-Agent model selection review

7c67e50

ConnysCode approved these changes Jul 2, 2026

View reviewed changes

Merge branch 'main' into feat/define-llm-model

44c5ac9

ConnysCode enabled auto-merge July 2, 2026 08:11

Merge branch 'main' into feat/define-llm-model

537713a

ConnysCode merged commit 00ae49a into byte5ai:main Jul 2, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(orchestrator): per-Agent LLM model selection#387

feat(orchestrator): per-Agent LLM model selection#387
ConnysCode merged 4 commits into
byte5ai:mainfrom
sneumannb5:feat/define-llm-model

sneumannb5 commented Jun 29, 2026

Uh oh!

ConnysCode left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ConnysCode left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

sneumannb5 commented Jun 29, 2026

What

Why

Test plan

Risk / blast radius

Naming-decisions still pending

Uh oh!

ConnysCode left a comment

Choose a reason for hiding this comment

Must change before merge

Scope — sub-agent model selection (your open question)

What I verified holds (no action needed)

PR description — factcheck

Minimum to merge

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ConnysCode left a comment

Choose a reason for hiding this comment

Already changed since the last review

Verification

PR description — factcheck

Non-blocking follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants