fix: agent /tmp tmpfs (100 MB) fills and silently breaks autonomous git commits — make size configurable #1231

@vybe

Description

Summary

Agent containers mount /tmp as a 100 MB noexec,nosuid RAM-backed tmpfs, hardcoded in src/backend/services/agent_service/capabilities.py (AGENT_TMPFS_MOUNT = {'/tmp': 'noexec,nosuid,size=100m'}). It fills easily — e.g. leftover gh CLI install artifacts (gh.tar.gz + extracted dir, ~38 MB) plus normal scratch usage. Once full, every /tmp write fails with No space left on device, including git's scratch during commit. Autonomous scheduled runs then "complete" but fail to persist (commit/push), leaving the run's output as stranded working-tree drift.

Because the size is a hardcoded literal, operators can't tune it without a code change + base-image rebuild. We should make the size an instance-level env var (deployment-tier config) with a safe default.

Context

Reported on a production agent (base image v0.6.0); the cause is in the base image, so it is fleet-wide.

#1098 already redirects heavy build scratch off /tmp via TMPDIR=/home/developer/.tmp (disk-backed, exec-capable), injected at create (crud.py) and recreate (lifecycle.py), dir created by docker/base-image/startup.sh. But TMPDIR only helps tools that honor $TMPDIR — install scripts that hardcode /tmp (the gh case here) bypass it entirely and still exhaust the 100 MB cap.
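
The gap can be illustrated with a minimal Python sketch. It shows that a tool resolving its scratch directory through the standard library follows TMPDIR, while a path written literally as /tmp does not. The function name and the /tmp/gh.tar.gz path are illustrative assumptions, not code from the repo.

```python
import os
import tempfile


def scratch_locations(tmpdir: str) -> dict:
    """Compare where a TMPDIR-aware tool and a hardcoded-/tmp tool would write.

    `tmpdir` must already exist and be writable (hypothetical helper).
    """
    previous = os.environ.get("TMPDIR")
    os.environ["TMPDIR"] = tmpdir
    tempfile.tempdir = None  # clear tempfile's cache so TMPDIR is re-read
    try:
        honored = tempfile.gettempdir()
    finally:
        # Restore the caller's environment so the sketch has no side effects.
        if previous is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = previous
        tempfile.tempdir = None
    return {
        "honors_tmpdir": honored,
        # An install script that spells out /tmp never consults TMPDIR.
        "hardcoded": os.path.join("/tmp", "gh.tar.gz"),
    }
```

Whatever value TMPDIR holds, the second entry still lands on the tmpfs, which is why the redirect alone cannot protect the 100 MB cap.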

Diagnosis is hard because No space left on device suggests the root disk is full, yet df -h / shows it at only ~42% and looks healthy. The filesystem that is actually exhausted is the small tmpfs, which only df -h /tmp reveals.
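
A small diagnostic along these lines makes the mismatch visible by reporting both filesystems side by side. This is a sketch only; the helper names are hypothetical and it assumes it runs inside the agent container.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def report(paths=("/", "/tmp")) -> dict:
    """Map each path to the fill level of its own filesystem."""
    return {path: round(usage_percent(path), 1) for path in paths}


if __name__ == "__main__":
    for path, percent in report().items():
        print(f"{path}: {percent}% used")
```

A healthy figure for / next to a figure near 100 for /tmp identifies the tmpfs as the exhausted filesystem.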

The tmpfs is RAM-backed and noexec,nosuid by deliberate security design (a compromised agent can't stage/execute payloads there). That posture must be preserved — only the size should be tunable, and it must stay bounded since tmpfs size counts against the container memory cgroup.
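
To make the memory constraint concrete, the sketch below converts a tmpfs size string to bytes and compares it with a container memory limit. The 25% ceiling and both function names are assumptions chosen for illustration; the issue does not specify a bounding policy.

```python
# tmpfs size suffixes are binary units: m = MiB, g = GiB.
UNIT_BYTES = {"m": 1024 ** 2, "g": 1024 ** 3}


def tmpfs_bytes(size: str) -> int:
    """Convert a tmpfs size such as '512m' or '1g' to bytes."""
    return int(size[:-1]) * UNIT_BYTES[size[-1].lower()]


def fits_memory_budget(size: str, container_memory_bytes: int,
                       max_fraction: float = 0.25) -> bool:
    """Check the tmpfs against a share of container memory.

    max_fraction is a hypothetical policy value, not one taken from the issue.
    """
    return tmpfs_bytes(size) <= container_memory_bytes * max_fraction
```

For example, with a 4 GiB memory limit a 512m tmpfs fits under this assumed ceiling, while a 2g tmpfs does not.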

Acceptance Criteria

  • /tmp tmpfs size is read from a backend env var (e.g. AGENT_TMP_SIZE), defaulting to 512m, in the single source of truth (capabilities.py AGENT_TMPFS_MOUNT) so create + recreate paths can't drift
  • noexec,nosuid flags remain hardcoded — only size is configurable
  • Env-var value is validated (e.g. matches ^\d+[mg]$); empty/invalid falls back to the default rather than producing a broken or unbounded mount spec
  • AGENT_TMP_SIZE documented in .env.example and docker-compose (passes /validate-config)
  • New default verified inside a freshly created/recreated agent with df -h /tmp
  • Container Security section of docs/memory/architecture.md updated to reflect the now-configurable size
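
One possible shape for the capabilities.py change is sketched below. Only AGENT_TMPFS_MOUNT, AGENT_TMP_SIZE, the 512m default, and the size pattern come from this issue; the helper names are hypothetical. Lowercasing the input and rejecting a zero size are added assumptions (a tmpfs size of 0 means no limit, which would break the bounded requirement).

```python
import os
import re

DEFAULT_TMP_SIZE = "512m"
# ASCII digits followed by m or g, as suggested in the acceptance criteria.
_SIZE_RE = re.compile(r"\d+[mg]", re.ASCII)


def resolve_tmp_size(raw) -> str:
    """Return a validated tmpfs size, or the default for empty/invalid input."""
    value = (raw or "").strip().lower()
    # fullmatch rejects trailing text such as extra mount options.
    if _SIZE_RE.fullmatch(value) and int(value[:-1]) > 0:
        return value
    return DEFAULT_TMP_SIZE


def build_tmpfs_mount(env=os.environ) -> dict:
    """Build the mount spec; the security flags stay hardcoded."""
    size = resolve_tmp_size(env.get("AGENT_TMP_SIZE"))
    return {"/tmp": f"noexec,nosuid,size={size}"}


# Single source of truth shared by the create and recreate paths.
AGENT_TMPFS_MOUNT = build_tmpfs_mount()
```

Because only the validated size is interpolated, a value such as 100m,exec cannot smuggle extra mount options into the spec; it falls back to 512m.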

Technical Notes

  • Mount specs are creation-time: existing agents pick up a new size only on recreate, not restart. The env var lives on the backend service (which builds the mount spec), not inside agents.
  • Builds on #1098 (TMPDIR redirect) — this closes the remaining gap for tools that hardcode /tmp.
  • Out of scope (separate issue, agent-side abilities repo): the git-sync stash → rebase → stash-pop hook leaves a permanent UU conflict and the autonomous run reports success while silently failing to persist. The full /tmp is only the trigger; the silent-failure-to-persist is a distinct reliability hole that lives in the agent's git-sync hook, not Trinity core.

Metadata

Labels

  • complexity-low: Complexity: low (board points 1-3)
  • priority-p1: Critical path
  • status-in-progress: Currently being worked on
  • status-ready: Greenlit and ready for development (vetted; counterpart to status-incubating)
  • theme-reliability: Theme: Reliability
  • type-bug: Bug fix
