feat(org): configurable agent-run timeouts with env-clamped ceiling #168

Open: MesoX wants to merge 1 commit into willdady:main from MesoX:feature/configurable-agent-run-timeouts

MesoX commented May 31, 2026

Summary

Adds a per-organization override of agent-run wall-clock + per-step timeouts for both chat-driven and trigger-driven runs. Defaults are unchanged; overrides only take effect when an org admin sets them in the organization settings UI. Every override is clamped server-side to a deployer-supplied environment ceiling — an admin can lower a timeout but never raise it past what the host operator allows.

Motivation

Long-running agent runs (multi-step research, slow MCP/tool calls) sometimes need more headroom than the hardcoded defaults, but the limit should stay under the deployer's control. This exposes a safe, bounded knob per organization.

Changes

Schema

  • Migration 0041 adds a nullable jsonb agent_run_settings column on organization (idempotent ADD COLUMN IF NOT EXISTS; existing rows keep null and behave exactly as before).
  • AgentRunSettings exported from @platypus/schemas as a strict zod object with four optional positive-int fields.
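
The shape described above can be sketched without zod as a small hand-rolled validator. This is only an illustration: the four field names are taken from the commit message, while `parseAgentRunSettings` and `ParseResult` are hypothetical names, and the real export in `@platypus/schemas` is a zod object rather than this function.

```typescript
// Hand-rolled stand-in for the strict zod object; not the actual schema code.
const FIELDS = [
  "chatPerRunTimeoutMs",
  "chatPerStepTimeoutMs",
  "triggerPerRunTimeoutMs",
  "triggerPerStepTimeoutMs",
] as const;

type Field = (typeof FIELDS)[number];
type AgentRunSettings = Partial<Record<Field, number>>;

type ParseResult =
  | { ok: true; value: AgentRunSettings }
  | { ok: false; error: string };

function parseAgentRunSettings(input: unknown): ParseResult {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return { ok: false, error: "expected an object" };
  }
  const raw = input as Record<string, unknown>;
  // "Strict": unknown keys are rejected rather than silently dropped.
  for (const key of Object.keys(raw)) {
    if ((FIELDS as readonly string[]).indexOf(key) === -1) {
      return { ok: false, error: `unknown field: ${key}` };
    }
  }
  const value: AgentRunSettings = {};
  for (const field of FIELDS) {
    const v = raw[field];
    if (v === undefined) continue; // every field is optional
    if (typeof v !== "number" || !Number.isInteger(v) || v <= 0) {
      return { ok: false, error: `${field} must be a positive integer` };
    }
    value[field] = v;
  }
  return { ok: true, value };
}
```

An empty object is valid here, which matches the idea that an organization may override any subset of the four timeouts.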

Backend

  • New services/agent-run-settings.ts: resolveRunTimeouts(orgId, kind) reads the org override, falls back to env, then to hardcoded defaults, and clamps to the env ceiling. Chat defaults are sourced from run-registry constants so they cannot drift.
  • PUT /organizations/:orgId (admin-only) rejects overrides above the env ceiling with 400 { error } naming the offending env vars.
  • New GET /organizations/:orgId/agent-run-settings/ceilings exposes the current ceilings for the UI.
  • routes/chat.ts and services/trigger-execution.ts resolve timeouts via the new service instead of reading env directly.
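
The resolution order in the backend bullets can be sketched as a pure function. This is a minimal sketch, not the service itself: `resolveWithCeiling`, `RunTimeouts`, and `OrgOverride` are hypothetical names, the database lookup by `orgId` is left out, and `ceiling` is assumed to be the value already resolved from env (or from the hardcoded default when env is unset).

```typescript
interface RunTimeouts {
  perRunMs: number;
  perStepMs: number;
}

// null means the organization has no override row or column value.
type OrgOverride = Partial<RunTimeouts> | null;

// Use the override when present, but never let it exceed the ceiling.
function resolveWithCeiling(override: OrgOverride, ceiling: RunTimeouts): RunTimeouts {
  const pick = (value: number | undefined, max: number): number =>
    value === undefined ? max : Math.min(value, max);
  return {
    perRunMs: pick(override?.perRunMs, ceiling.perRunMs),
    perStepMs: pick(override?.perStepMs, ceiling.perStepMs),
  };
}

// Chat defaults stated in this PR: 10 min per run, 2 min per step.
const chatCeiling: RunTimeouts = { perRunMs: 600_000, perStepMs: 120_000 };

const lowered = resolveWithCeiling({ perRunMs: 300_000 }, chatCeiling);
const clamped = resolveWithCeiling({ perRunMs: 900_000 }, chatCeiling);
```

With these inputs, `lowered.perRunMs` stays at 5 minutes, while `clamped.perRunMs` is pulled back to the 10 minute ceiling. Fields missing from the override fall through to the ceiling value.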

Frontend

  • OrganizationForm gains an "Agent run timeouts" section (edit only) with four minute-valued inputs, ceiling placeholders, client-side ceiling validation for inline feedback, and surfacing of the server error message.
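
The client-side check can be illustrated roughly as follows. `validateMinutesInput` is a hypothetical helper, not a function from the form component, and it assumes that inputs are whole minutes and that a blank input means "no override".

```typescript
const MS_PER_MINUTE = 60_000;

interface MinutesResult {
  ms?: number;    // value to save, in milliseconds
  error?: string; // inline message to show next to the input
}

function validateMinutesInput(raw: string, ceilingMs: number): MinutesResult {
  const trimmed = raw.trim();
  if (trimmed === "") return {}; // blank: leave this field unset
  const minutes = Number(trimmed);
  if (!Number.isInteger(minutes) || minutes <= 0) {
    return { error: "Enter a whole number of minutes greater than 0" };
  }
  const ms = minutes * MS_PER_MINUTE; // convert before saving
  if (ms > ceilingMs) {
    return { error: `Maximum is ${ceilingMs / MS_PER_MINUTE} min` };
  }
  return { ms };
}
```

The server still enforces the ceiling, so this check only exists to give feedback before the request is sent.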

Config

  • .env.example documents the four ceiling env vars and their defaults (RUN_PER_RUN_TIMEOUT_MS, RUN_PER_STEP_TIMEOUT_MS, TRIGGER_PER_RUN_TIMEOUT_MS, TRIGGER_PER_STEP_TIMEOUT_MS).
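
Reading one of these variables might look like the sketch below. `readTimeoutMs` is a hypothetical helper, and falling back to the default on unparseable input is an assumption; the PR only says that garbage env values are rejected. The env map is passed in rather than read from `process.env` to keep the example self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Parse a millisecond env var; fall back when it is unset, empty,
// or does not parse to a positive integer.
function readTimeoutMs(env: Env, name: string, fallbackMs: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallbackMs;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallbackMs;
}

// 600_000 ms is the 10 minute chat per-run default stated in this PR.
const chatPerRunCeilingMs = readTimeoutMs(
  { RUN_PER_RUN_TIMEOUT_MS: "900000" },
  "RUN_PER_RUN_TIMEOUT_MS",
  600_000,
);
```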

Behavior compatibility

Chat defaults (10 min run / 2 min step) are imported from run-registry's existing defaults, so a run with no env override and no org override behaves identically to before this PR.

Testing

  • services/agent-run-settings.test.ts — env defaults, env overrides, garbage-env rejection, DB-backed resolution including ceiling clamping, no-row fallback, chat-vs-trigger isolation.
  • routes/organization.test.ts — PUT path: 403 for non-admins, 400 (with error key) when over the ceiling, successful within-ceiling persist, and null clears the override.
  • Full suite: backend 795 pass, frontend 16 pass.

🤖 Generated with Claude Code

Adds a per-organization override of the agent-run wall-clock + per-step
timeouts for both chat-driven and trigger-driven runs. Defaults remain
the same; overrides only kick in when an org admin sets them in the
organization settings UI.

Schema

- Migration 0041 adds a nullable jsonb 'agent_run_settings' column on
  'organization'. Idempotent: existing rows keep the default null value
  and continue using env / hardcoded defaults.
- 'AgentRunSettings' shape is exported from '@platypus/schemas' as a
  strict zod object with four optional positive-int fields:
  chatPerRunTimeoutMs, chatPerStepTimeoutMs, triggerPerRunTimeoutMs,
  triggerPerStepTimeoutMs.

Backend

- New 'services/agent-run-settings.ts' exposes
  'resolveRunTimeouts(orgId, kind)' (chat | trigger). It reads the org
  override, falls back to env, and finally to documented hardcoded
  defaults. The result is clamped to the env-supplied ceiling so a
  misconfigured override can never exceed the deployer-allowed maximum.
  Chat defaults are sourced from run-registry's exported constants so
  the two cannot drift.
- 'PUT /organizations/:orgId' (admin-only) validates incoming overrides
  against the env ceilings and returns 400 with a single 'error' message
  (per API conventions) listing the offending env vars — admins can
  lower but never raise.
- New 'GET /organizations/:orgId/agent-run-settings/ceilings' returns
  the current chat + trigger ceilings so the UI can display them next
  to each input.
- 'routes/chat.ts' and 'services/trigger-execution.ts' now invoke
  'resolveRunTimeouts' instead of reading env directly, so the org
  override is honored by every active run.
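
The PUT-side ceiling check can be sketched as a table-driven validation. This is an illustration under assumptions: `validateAgainstCeilings` and `CeilingRule` are hypothetical names, the pairing of each override field with an env var (chat fields with `RUN_*`, trigger fields with `TRIGGER_*`) is inferred from the names, and the message wording is invented.

```typescript
interface CeilingRule {
  field: string;     // override field on the request body
  envVar: string;    // env var that supplies the ceiling
  ceilingMs: number; // resolved ceiling in milliseconds
}

type ValidationResult = { status: 200 } | { status: 400; error: string };

function validateAgainstCeilings(
  override: Record<string, number | undefined>,
  rules: CeilingRule[],
): ValidationResult {
  // One pass over the table; unset fields are never violations.
  const violations = rules
    .filter((rule) => {
      const value = override[rule.field];
      return value !== undefined && value > rule.ceilingMs;
    })
    .map((rule) => `${rule.field} exceeds ${rule.envVar} (${rule.ceilingMs / 60_000} min)`);
  // A single `error` string, not a per-field map.
  return violations.length === 0
    ? { status: 200 }
    : { status: 400, error: violations.join("; ") };
}

const rules: CeilingRule[] = [
  { field: "chatPerRunTimeoutMs", envVar: "RUN_PER_RUN_TIMEOUT_MS", ceilingMs: 600_000 },
  { field: "chatPerStepTimeoutMs", envVar: "RUN_PER_STEP_TIMEOUT_MS", ceilingMs: 120_000 },
];
```

Keeping the field-to-env mapping in one table means a new timeout only needs a new row, and the error message names the env var an operator would have to raise.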

Frontend

- 'OrganizationForm' gains an 'Agent run timeouts' section (only shown
  on edit, not create). Four minute-valued inputs map to the four
  override fields; placeholders show the current ceilings fetched from
  the new endpoint. Values are converted to milliseconds before saving,
  validated client-side against the fetched ceilings for inline
  feedback, and the server error message is surfaced on rejection.

Config

- '.env.example' documents the four ceiling env vars
  (RUN_PER_RUN_TIMEOUT_MS, RUN_PER_STEP_TIMEOUT_MS,
  TRIGGER_PER_RUN_TIMEOUT_MS, TRIGGER_PER_STEP_TIMEOUT_MS) and their
  defaults.

Tests

- 'services/agent-run-settings.test.ts' covers env-default fallback,
  env-override parsing, garbage-env rejection, and the DB-backed
  'resolveRunTimeouts' lookup including ceiling clamping, the no-row
  fallback, and chat-vs-trigger isolation.
- 'routes/organization.test.ts' covers the new PUT path: 403 for
  non-admins, 400 (with 'error' key) when an override exceeds the
  ceiling, and a successful within-ceiling persist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MesoX pushed a commit to MesoX/platypus that referenced this pull request Jun 1, 2026
Ports the reviewed version of the configurable agent-run timeout feature
(upstream PR willdady#168) onto the deploy branch. Behavior unchanged; quality
and convention fixes only.

- Error responses use the singular `error` key per API conventions
  (was a custom `{ errors }` map).
- Remove dead `clampRunTimeouts` / `__TEST_HOOKS__`; source chat defaults
  from run-registry's exported constants so they cannot drift.
- Table-driven ceiling validation; server message reports minutes.
- Frontend validates against the fetched ceilings for inline feedback and
  surfaces the server error message on rejection.
- Document the four ceiling env vars in .env.example; migration trailing
  newline.
- Add PUT /organizations/:orgId tests (403 / 400-over-ceiling /
  within-ceiling persist / null clears the override).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MesoX commented Jun 15, 2026

Hey @willdady, this would definitely help with #261. Chat compaction can take some time, even more so with locally run models on slower hardware (it is not unusual to see 5 minutes of compaction with a long context and a slow dense model). Even with a faster model we have run into timeouts. I would update this to the latest main; just let me know whether this is something you want to add to the tool itself, or whether there is another way to make these things a bit more configurable.
