fix(agents): validate template cpu/memory before container create (#1197) #1227

Open
dolho wants to merge 1 commit into dev from fix/1197-template-resource-validation
@dolho dolho commented Jun 16, 2026

Contributor

Problem

Creating an agent from a GitHub source repo whose template.yaml declares a fractional / Kubernetes-style resources block (cpu: "0.5", memory: "512Mi") aborts deep in container-create with an opaque error:

ERROR services.agent_service.crud: Failed to create agent: invalid literal for int() with base 10: '0.5'

Repo validation, MCP-key creation, subscription assignment and env setup all succeed first; the raw int(cpu) for Docker's NanoCpus (#1126/#1128) then crashes. agent_ownership rolls back but an orphaned mcp_api_keys row is left per attempt. The adjacent mem_limit has the same class of bug (a k8s memory string isn't a valid Docker mem string).
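
The crash is easy to reproduce outside the service. The snippet below is a minimal sketch, not the project's code: `nano_cpus_unguarded` is a hypothetical name that mirrors the raw `int(cpu)` cast described above, assuming whole cores are scaled to Docker's NanoCpus unit (billionths of a CPU).

```python
def nano_cpus_unguarded(cpu: str) -> int:
    # Hypothetical stand-in for the unguarded cast: whole cores -> NanoCpus.
    return int(cpu) * 1_000_000_000


# A whole-core string converts cleanly.
print(nano_cpus_unguarded("2"))  # 2000000000

# A fractional, Kubernetes-style value fails inside int(), long after the
# earlier creation steps have already succeeded.
try:
    nano_cpus_unguarded("0.5")
except ValueError as exc:
    print(exc)  # invalid literal for int() with base 10: '0.5'
```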

Fix

  • capabilities.py (stdlib-only, the anti-drift home for container spec): add normalize_cpu/normalize_memory + canonical VALID_CPU/VALID_MEMORY. They reject fractional/k8s values with an actionable message and case-fold memory (4G → 4g).
  • routers/settings.py now imports those sets instead of duplicating the lists — the admin defaults endpoint and the create paths share one source of truth.
  • crud.py (create): normalize/validate config.resources before any side effect and raise a clear HTTP 400 instead of a 500 from deep in Docker; canonical values written back for labels + limits. Roll back the agent-scoped MCP key in the failure path → no orphan rows.
  • lifecycle.py (recreate) and system_agent_service.py: same guard at their nano_cpus/mem_limit sites (all three unguarded sites from the issue).
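
The shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the contents of `VALID_CPU`/`VALID_MEMORY`, the defaults, the `InvalidResources` exception (standing in for the HTTP 400), and the list-based `key_store` are all hypothetical. Only the helper names and the ordering (validate first, then side effects, then roll back the key on failure) come from the description above.

```python
# Hypothetical allowed values and defaults; the real sets live in
# capabilities.py and are not shown in this description.
VALID_CPU = ("1", "2", "4", "8")
VALID_MEMORY = ("1g", "2g", "4g", "8g")
DEFAULT_CPU = "2"
DEFAULT_MEMORY = "4g"


def normalize_cpu(value) -> str:
    """Return a canonical whole-core CPU string, or raise ValueError."""
    text = "" if value is None else str(value).strip()
    if not text:
        return DEFAULT_CPU  # empty input falls back to the default
    if text not in VALID_CPU:
        raise ValueError(
            f"resources.cpu {text!r} must be one of {', '.join(VALID_CPU)}"
        )
    return text


def normalize_memory(value) -> str:
    """Return a canonical lower-case memory string, or raise ValueError."""
    text = "" if value is None else str(value).strip().lower()  # 4G -> 4g
    if not text:
        return DEFAULT_MEMORY
    if text not in VALID_MEMORY:
        raise ValueError(
            f"resources.memory {text!r} must be one of {', '.join(VALID_MEMORY)}"
        )
    return text


class InvalidResources(Exception):
    """Stand-in for the HTTP 400 raised by the create path."""


def create_agent(resources: dict, key_store: list) -> dict:
    # 1. Validate before any side effect, so bad input changes nothing.
    try:
        cpu = normalize_cpu(resources.get("cpu"))
        memory = normalize_memory(resources.get("memory"))
    except ValueError as exc:
        raise InvalidResources(str(exc)) from exc

    # 2. Side effect: mint the agent-scoped key (modelled as a list append).
    key_store.append("agent-scoped-key")
    try:
        # int() is safe here because cpu came from the validated set.
        return {"nano_cpus": int(cpu) * 1_000_000_000, "mem_limit": memory}
    except Exception:
        key_store.pop()  # 3. Roll back the key so no orphan row remains.
        raise
```

With these assumptions, `create_agent({"cpu": "0.5", "memory": "512Mi"}, keys)` raises `InvalidResources` and leaves `keys` empty, while `{"cpu": "2", "memory": "4G"}` yields a `mem_limit` of `"4g"`.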

Verification

tests/unit/test_resource_normalization.py — 20+ cases: valid set accepted, 0.5/512Mi/etc. rejected with "must be one of …", case-folding, empty→default fallback, int()-castability of every normalized cpu, and a drift guard. Full unit run: 34 passed (new + existing test_capability_set). py_compile clean on all five changed modules.
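
Two of the listed checks, int()-castability and the drift guard, might look roughly like the sketch below. The real test file is not reproduced here, so everything in it is assumed: the set contents, the `SETTINGS_CPU_CHOICES` name, and the guess that the drift guard asserts the settings router reuses the canonical object instead of a copy.

```python
# Hypothetical stand-in for the canonical set in capabilities.py.
VALID_CPU = ("1", "2", "4", "8")

# Hypothetical stand-in for what routers/settings.py exposes; per the
# description it imports the canonical set and does not duplicate the list.
SETTINGS_CPU_CHOICES = VALID_CPU


def test_every_valid_cpu_is_int_castable():
    # Each accepted value must survive the int() cast used for NanoCpus.
    for cpu in VALID_CPU:
        assert int(cpu) > 0


def test_settings_choices_do_not_drift():
    # Identity, not equality: a copied list would pass == today and
    # silently diverge after the next edit.
    assert SETTINGS_CPU_CHOICES is VALID_CPU
```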

Related to #1197

🤖 Generated with Claude Code

A GitHub source repo whose template.yaml declares a fractional/Kubernetes-style
resources block (cpu: "0.5", memory: "512Mi") aborted agent creation deep in
container-create with an opaque `ValueError: invalid literal for int() with
base 10: '0.5'`, after the MCP key was already minted — leaving an orphaned
mcp_api_keys row per attempt (#1126/#1128 added the unguarded int(cpu) at three
sites).

- Add normalize_cpu/normalize_memory + canonical VALID_CPU/VALID_MEMORY in
  services/agent_service/capabilities.py (stdlib-only, the anti-drift home for
  container spec). routers/settings.py now imports these instead of duplicating
  the lists, so the API and the create paths can't drift.
- crud.py create: normalize/validate config.resources BEFORE any side effect
  and raise a clear HTTP 400 on invalid input; write canonical values back so
  labels + limits use them. Roll back the agent-scoped MCP key in the failure
  path so a failed create leaves no orphan row.
- lifecycle.py recreate + system_agent_service.py: same guard at their
  nano_cpus/mem_limit sites.
- tests/unit/test_resource_normalization.py: pins the helpers (valid set,
  fractional/k8s rejection with actionable message, case-folding, default
  fallback, int()-castability, drift guard).

Related to #1197
@github-actions


⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.
