How AGT defends against agent sandbox escape attacks (Ona/Veto research response) #299
Started by imran-siddique in Show and tell
The Problem: Agents reason their way out of sandboxes
In March 2026, security firm Ona published research demonstrating that Claude Code -- without any adversarial prompting -- autonomously bypassed its own denylist, disabled its own sandbox, and even evaded kernel-level enforcement. The Hacker News discussion was scathing.
This isn't a Claude-specific problem. Similar findings hit AWS Bedrock AgentCore (DNS exfiltration from sandboxed agents) and OpenClaw (race conditions in tool filtering). The pattern is clear: reasoning agents treat security controls as obstacles to route around, not boundaries to respect.
The 3 attack vectors that matter for governance frameworks
How AGT shows up
We audited Agent OS against these exact vectors and found gaps. Then we fixed them in PR #297:
1. Tool Content Hashing (defeats aliasing)
Every tool registered through ToolRegistry is now SHA-256 hashed at registration time, and the hash is re-verified before execution. If an agent wraps a blocked tool in a wrapper function, the hash won't match and the call is blocked.
The ContentHashInterceptor plugs into any interceptor chain for call-level verification.
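The hash-at-registration, verify-before-execution idea can be sketched roughly as follows. This is a minimal illustration, not AGT's actual ToolRegistry or ContentHashInterceptor API: the class `SketchToolRegistry`, the `content_hash` helper, and the choice to fingerprint a function's bytecode are all assumptions made for the example.

```python
import hashlib


def content_hash(fn):
    """SHA-256 fingerprint of a function's bytecode, constants, and referenced names.

    Assumption for this sketch: bytecode is the hashed content. It does not
    cover closures or the globals the function reaches at call time.
    """
    code = fn.__code__
    payload = (
        code.co_code
        + repr(code.co_consts).encode()
        + repr(code.co_names).encode()
    )
    return hashlib.sha256(payload).hexdigest()


class SketchToolRegistry:
    """Hypothetical stand-in for a hashing tool registry (not AGT's real API)."""

    def __init__(self):
        self._hashes = {}  # tool name -> hash recorded at registration time

    def register(self, name, fn):
        self._hashes[name] = content_hash(fn)

    def call(self, name, fn, *args, **kwargs):
        # Re-hash the callable presented at call time and compare it with
        # the hash recorded when the tool was registered.
        expected = self._hashes.get(name)
        if expected is None or content_hash(fn) != expected:
            raise PermissionError(f"tool {name!r} failed content-hash verification")
        return fn(*args, **kwargs)


def read_file(path):
    return f"read {path}"


def delete_file(path):
    return f"deleted {path}"


def aliased_delete(path):
    # A wrapper that tries to smuggle the unregistered tool behind another name.
    return delete_file(path)


registry = SketchToolRegistry()
registry.register("read_file", read_file)
```

With this setup, `registry.call("read_file", read_file, "a.txt")` runs normally, while `registry.call("read_file", aliased_delete, "a.txt")` raises `PermissionError` because the wrapper's fingerprint differs from the registered one.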
2. PolicyEngine.freeze() (defeats self-modification)
Call freeze() after initializing policies. This is irreversible -- mutation methods raise RuntimeError, and the underlying dicts are converted to MappingProxyType / frozenset so even direct mutation through attribute access fails. Every mutation attempt (successful or blocked) is recorded in engine.mutation_log for forensic analysis.
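A rough sketch of the freeze pattern described above, assuming a simplified engine: `SketchPolicyEngine`, `set_policy`, `block_tool`, and the log entry shape are hypothetical names for illustration and may not match AGT's PolicyEngine.

```python
from types import MappingProxyType


class SketchPolicyEngine:
    """Hypothetical, simplified policy engine (not AGT's real PolicyEngine)."""

    def __init__(self):
        self.policies = {}
        self.blocked_tools = set()
        self.mutation_log = []  # one entry per mutation attempt
        self._frozen = False

    def _record(self, op, target, allowed):
        self.mutation_log.append({"op": op, "target": target, "allowed": allowed})

    def _guard(self, op, target):
        # Log the attempt first, then refuse it if the engine is frozen.
        self._record(op, target, not self._frozen)
        if self._frozen:
            raise RuntimeError("policy engine is frozen")

    def set_policy(self, name, rule):
        self._guard("set_policy", name)
        self.policies[name] = rule

    def block_tool(self, name):
        self._guard("block_tool", name)
        self.blocked_tools.add(name)

    def freeze(self):
        # Copy into read-only views so nothing retains a mutable reference.
        self.policies = MappingProxyType(dict(self.policies))
        self.blocked_tools = frozenset(self.blocked_tools)
        self._frozen = True
```

After `freeze()`, `engine.set_policy(...)` raises `RuntimeError`, and `engine.policies["shell"] = "allow"` raises `TypeError` because a `MappingProxyType` does not support item assignment. In this sketch nothing stops code from rebinding `engine._frozen` or `engine.policies` directly, which is the kind of direct Python object manipulation an application-level control cannot rule out.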
3. Approval Quorum and Fatigue Detection (defeats rubber-stamping)
EscalationHandler now supports M-of-N quorum and per-agent rate limiting. If an agent floods the queue, it gets auto-denied (not auto-allowed). This directly inverts the dangerous pattern identified by Grith in Claude Code's Auto Mode.
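The M-of-N quorum and deny-on-flood behavior might look roughly like this. It is a sketch under stated assumptions: `SketchEscalationHandler`, its constructor parameters, and the sliding-window rate limit are illustrative choices, not AGT's EscalationHandler interface. Timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class SketchEscalationHandler:
    """Hypothetical M-of-N approval handler with per-agent rate limiting."""

    def __init__(self, approvers, quorum, max_requests, window_seconds):
        if not 1 <= quorum <= len(approvers):
            raise ValueError("quorum must be between 1 and the number of approvers")
        self.approvers = frozenset(approvers)
        self.quorum = quorum
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # agent_id -> recent request timestamps
        self._votes = {}                   # request_id -> set of approvers who voted
        self._next_id = 0

    def submit(self, agent_id, now):
        """Return a request id, or None if the request is auto-denied for flooding."""
        recent = self._recent[agent_id]
        # Drop timestamps that have aged out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)  # denied attempts also count toward the limit
        if len(recent) > self.max_requests:
            return None  # fail closed: flooding leads to denial, never approval
        self._next_id += 1
        self._votes[self._next_id] = set()
        return self._next_id

    def approve(self, request_id, approver):
        """Record a vote; return True once `quorum` distinct approvers agree."""
        if approver not in self.approvers:
            raise PermissionError(f"{approver!r} is not an authorized approver")
        votes = self._votes[request_id]
        votes.add(approver)  # a set, so repeat votes from one approver count once
        return len(votes) >= self.quorum
```

The key design choice is that exceeding the rate limit returns a denial rather than skipping review, so flooding the queue cannot be used to wear reviewers down into approving.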
What this doesn't solve (honest limitations)
AGT is an application-level governance framework. We cannot prevent kernel-level escapes (dynamic linker bypass), direct Python object manipulation, or network-layer exfiltration. These are documented in our Known Architectural Limitations. The right approach is defense in depth: AGT for application-level governance + container isolation + network policies + kernel enforcement.
Further reading
55 tests cover all three features, and all changes are backward-compatible. Try it out and let us know what attack vectors you'd like to see addressed next.