feat(orchestrator): per-Agent LLM model selection#387
Conversation
ConnysCode
left a comment
There was a problem hiding this comment.
Review: Request changes
One blocker and two majors before merge. The core resolution design (3-tier overlay, hot-apply, the effectiveModel surface) is sound, and I verified the no-schema-change / no-env-var claims hold — findings below are scoped, all traced against the cloned head.
Must change before merge
🔴 BLOCKER — per-Agent model selection is broken on claude-cli (subscription) deployments. The claude-cli branch at applyDiff.ts:178 skips ref→modelId resolution and passes the ref through raw. But the new picker writes provider-qualified ids: on a keyless CLI deployment /v1/builder/models offers only claude-cli:opus-cli / sonnet-cli / haiku-cli (registered via listModels(), builtinLlmProviders.ts:136-159), validateModelRef accepts them, so model_routing.main = 'claude-cli:opus-cli' persists. At build the ref stays raw → buildOrchestrator.ts:335 runs config.model.replace(/-cli$/, '') → 'claude-cli:opus' → CliChatAgent passes it verbatim as claude -p --model claude-cli:opus (cliChatAgent.ts:608), an invalid model, so every turn for that Agent fails. The CLI expects the bare alias opus. On a keyless deployment there's no working pick, so the headline feature can't be used. Traced end-to-end; fix inline.
🟠 MAJOR — a pinned sub-agent model can never be cleared back to "inherit". updateSubAgent uses model = COALESCE($4, model) with $4 = patch.model ?? null (agentGraphStore.ts:359/366), so the documented null = "clear" is a no-op — COALESCE(NULL, model) keeps the old value. The operator can't revert a sub-agent from a pinned model to inherit-parent. Fix inline.
🟠 MAJOR — a cross-provider main-model pick is accepted at write time, then silently dropped at runtime. The picker lists every connected provider's models, not just the orchestrator's active provider, and validateModelRef (configStore.ts:343) resolves against the full registry with no defaultProvider — so e.g. openai:gpt-5.5 under an Anthropic orchestrator passes validation and persists. At build, resolveModelIdForProvider returns undefined for the cross-provider id and model falls back to the platform default (applyDiff.ts:188). The picker shows GPT-5.5, the Agent runs on Opus, and the only signal is the effectiveModel badge — no error at write time. (deps.provider is a single global provider — confirmed via OrchestratorRegistry → buildForAgent.) #296 scoped cross-provider routing out, but the picker actively offers a choice that silently won't run. Please scope the per-Agent / sub-agent model choice to the active provider: filter the offered list, or resolve main with { defaultProvider: <active provider> } at write time and reject (or warn on) a cross-provider pick.
Scope — sub-agent model selection (your open question)
#296 explicitly lists "Sub-agent model selection (SUB_AGENT_MODEL)" as out of scope, and this PR ships the sub-agent picker + validateModelRef guard + resolveSubAgentModel. There's no schema cost (it reuses the existing agent_subagents.model column — no migration), so keeping it is defensible — but the sub-agent half carries the MAJOR clear-bug above. Recommendation: if you keep it here, fix the updateSubAgent clear path first; if you'd rather land #296 clean, split the sub-agent picker into its own PR (the clear-bug travels with it). Either is fine — just don't merge the half-working clear.
What I verified holds (no action needed)
- Hot-reload is race-free.
setModelRouting→await reload()swapsthis.activesynchronously before the response readseffectiveModel, so acceptance #1/#4 deliver (agentBuilder.ts:194-196,registry/index.ts:212-223). - No schema change / no new env var.
agents.model_routing+agent_subagents.modelboth pre-exist (migration0003); no.sqladded; the threeDEFAULT_ORCHESTRATOR_MODELcopies (agentRuntime.ts,plugin.ts,config.ts) all agree onclaude-opus-4-8. assembleGraph/agentNodeblast radius is clean — every call site passesregistry, andl.registryis wired in production (index.ts:2009).- web-ui ↔ server shapes agree (
effectiveModel,model_id,aliases); the additive/v1/builder/modelschange is behindrequireAuthand has no OpenAPI spec to drift.
PR description — factcheck
| Claim | Reality |
|---|---|
| "registry-known same-provider → modelId; cross-provider → default; registry-unknown → raw" | True for the Anthropic/OpenAI path. Incomplete for claude-cli — those refs are passed raw without resolution → the BLOCKER. |
| "cross-provider → dropped to default (out of scope, would 404)" | True at build — but the write succeeds with no operator signal, and the picker offered the pick. |
"No schema change — reuses agents.model_routing + agent_subagents.model" |
✅ Verified — no migration added. |
| "No new env var" | ✅ Verified. |
| "Hot-applies via the existing per-Agent rebuild — no restart" | ✅ Verified — race-free on the happy path. |
| "typecheck clean / listed tests pass" | Not re-executed here. Note the new tests cover anthropic + openai but not claude-cli — the path that breaks. |
Minimum to merge
- Resolve
claude-clirefs to the baremodelIdbeforebuildOrchestratorstrips-cli(BLOCKER). - Let
updateSubAgentclearmodelback toNULL(MAJOR) — or split the sub-agent picker out. - Scope the model picker / write-validation to the orchestrator's active provider so a cross-provider pick can't silently no-op (MAJOR).
Nits (non-blocking)
applyDiff.ts:188— the platform defaultruntime.modelis sent raw (never resolved). Safe today only because theorchestrator_modelinstall field is an enum of bare ids; a defensiveresolveModelIdForProvider(runtime.model, …) ?? runtime.modelwould harden it if that field ever becomes free-text or a registry picker.InspectorPanel.tsx—useModelCatalog()refetches/v1/builder/modelson every node selection (once per editor mount). Consider lifting/caching the catalog.agentBuilder.tsreload()— the barecatch {}swallows reload failures, after which the response can echo a staleeffectiveModeluntil the next NOTIFY. Low impact.
(Reviewed against head 9a6a2f9.)
ConnysCode
left a comment
There was a problem hiding this comment.
Review: Approve
All three items from the last review are addressed and hold up against the cloned head — verified by running the affected suites (65/65 green, orchestrator tsc --noEmit clean). Nothing blocking remains; a couple of non-blocking cleanups below.
Already changed since the last review
- 🔴→✅ BLOCKER (claude-cli) fixed.
applyDiff.tsno longer special-cases the CLI to pass the ref raw —resolveOverlaynow resolves every provider viaresolveModelIdForProvider(ref, activeProvider) ?? ref?.trim(), so a picker-storedclaude-cli:opus-cliresolves toopus-clibefore the downstream-clistrip →opus, the alias the CLI expects. The sub-agent CLI path gets the same treatment via the newresolveCliSubAgentModel. Verified:orchestratorRegistry.test.ts→BLOCKER (claude-cli): a provider-qualified CLI pick resolves to the bare CLI modelIdandsubAgentModelResolution.test.tsCLI cases pass. - 🟠→✅ MAJOR (sub-agent model can't be cleared) fixed.
updateSubAgentdropsCOALESCE($4, model)formodel = CASE WHEN $10 THEN $4 ELSE model ENDwith an explicitmodelProvided = patch.model !== undefinedguard, sonull/''clears to inherit whileundefinedkeeps.createSubAgentnormalises''→NULL. Verified:agentGraphStoreSubAgentModelValidation.test.tsclear-path cases pass. - 🟠→✅ MAJOR (cross-provider pick silently dropped) fixed.
validateModelRef/validateModelRoutingShapenow takeactiveProviderand reject a ref that resolves to a different provider at write time; the/v1/builder/modelspicker and the sub-agent/model-routing writes are all scoped to the orchestrator's livellm_provider(wired throughgetActiveProvider/activeProviderinindex.ts). A cross-provider pick errors instead of persisting-then-vanishing. Verified: the cross-provider reject cases inconfigStoreModelRoutingValidation.test.ts+agentGraphStoreSubAgentModelValidation.test.tspass. - ⚪→✅ NIT (classifier default) fixed.
DEFAULT_CLASSIFIER_MODELis now the registry-served dated idclaude-haiku-4-5-20251001, so the code's own default agrees with its write-validation.
Verification
# node 22.22.3, deps built (llm-provider, memory, orchestrator, …)
configStoreModelRoutingValidation ok
agentGraphStoreSubAgentModelValidation ok
subAgentModelResolution ok
agentBuilderModelRouting ok
agentBuilderAgentNode / SubAgentTools ok
orchestratorRegistry 16/16 ok (incl. both #296 BLOCKER cases + FIX2)
npm run typecheck -w @omadia/orchestrator → EXIT 0 (clean)
PR description — factcheck
| Claim | Reality |
|---|---|
"CLI refs resolve to the bare modelId before the -cli strip" |
✅ Verified — resolveOverlay + resolveCliSubAgentModel; CLI test green. |
| "cross-provider pick rejected at write time (not silently dropped)" | ✅ Verified — provider-scoped validateModelRef; reject tests green. |
| "No schema change / no new env var" | ✅ Still holds — no migration, llm_provider reuses the existing orchestrator config key (same reader as hostProviderId). |
| "typecheck clean / listed tests pass" | ✅ Re-executed here — orchestrator typecheck EXIT 0, all listed suites pass. |
Non-blocking follow-ups
Nits
middleware/src/index.ts—orchestratorActiveProviderId()is now the third copy of theinstalledRegistry.get('@omadia/orchestrator')?.config?.['llm_provider']read (alongsidehostProviderIdat ~714 and the sub-agent hydrate reader at ~1608). Consider extracting one sharedorchestratorProviderId()helper so a future key/default change lands in one place.middleware/src/routes/builder.ts/models— sincedeps.activeProvider?.()always returns a string in production (defaults toanthropic), theconnectedProvidersfallback branch is effectively dead outside tests. Fine as a safety net, just noting it no longer gates the picker.orchestratorRegistry.test.tsFIX2 now pinseffectiveModel = 'openai:gpt-5.5'(raw passthrough → fails loud) for a residual/stale cross-provider ref. Correct given write-validation blocks fresh cross-provider picks and the picker is new in this PR (so no pre-existing persisted cross-provider data can exist); just flagging that the build path intentionally no longer drops-to-default for that hypothetical case.
Scope: the sub-agent picker (out of scope per #296) now works end-to-end incl. the clear path, so keeping it here is fine — your call with the team on whether to split. Approving.
(Reviewed against head 7c67e50.)
What
Per-Agent orchestrator LLM model selection (Closes #296). Each Agent can pin its own model (and per-turn triage routing) via the Admin builder instead of one
global
ORCHESTRATOR_MODEL. Model picker is populated from the@omadia/llm-providerregistry.Why
Cost/latency differ per Agent — a high-volume support Agent can run on a cheap model while a reasoning-heavy one stays on the frontier model. Turns an
operator runtime choice into a per-instance setting instead of a deploy-time env decision. The pluggable-provider work only pays off once instances can
actually pick different models.
Test plan
configStoreModelRoutingValidation,agentBuilderAgentNode,agentGraphStoreSubAgentModelValidation,orchestratorRegistry,subAgentModelResolution,agentBuilderSubAgentToolseffectiveModelbadge + registryagent rebuiltlog, turn runs without 404Risk / blast radius
agents.model_routing(JSONB) +agent_subagents.model.ORCHESTRATOR_MODELkept as the fallback tier → existing deployments unchanged.modelraw to a single provider adapter.Per-Agent refs are now resolved to the active provider's bare
modelIdbefore build (resolveModelIdForProvider): registry-known same-provider →modelId; cross-provider → dropped to default (out of scope, would 404 on the wrong adapter); registry-unknown → passed through raw (preservespre-resolution behaviour, e.g. the default classifier).
GET /v1/builder/modelsnow returnsmodel_id(bare vendor id) alongsideid.Naming-decisions still pending
SUB_AGENT_MODEL) is listed out-of-scope in Allow defining the LLM model per orchestrator instance #296 but this PR adds the sub-agent picker + validation. Confirm keep, or split out.