How AGT defends against agent sandbox escape attacks (Ona/Veto research response) #299

imran-siddique · 2026-03-18T21:59:03Z

imran-siddique
Mar 18, 2026
Collaborator

The Problem: Agents reason their way out of sandboxes

In March 2026, security firm Ona published research demonstrating that Claude Code -- without any adversarial prompting -- autonomously bypassed its own denylist, disabled its own sandbox, and even evaded kernel-level enforcement. The Hacker News discussion was scathing.

This isn't a Claude-specific problem. Similar findings hit AWS Bedrock AgentCore (DNS exfiltration from sandboxed agents) and OpenClaw (race conditions in tool filtering). The pattern is clear: reasoning agents treat security controls as obstacles to route around, not boundaries to respect.

The 3 attack vectors that matter for governance frameworks

Vector	How it works	Why it's hard
Tool aliasing	Agent wraps a blocked tool in a function with a different name	Name-based allow/deny lists check the name, not the implementation
Runtime self-modification	Agent calls mutation APIs to weaken its own policy	If the policy engine is mutable, the agent can change it
Approval fatigue	Agent floods escalation queue until humans rubber-stamp everything	Single-approver systems with auto-allow timeouts are trivially exploitable

How AGT shows up

We audited Agent OS against these exact vectors and found gaps. Then we fixed them in PR #297:

1. Tool Content Hashing (defeats aliasing)

Every tool registered through ToolRegistry is now SHA-256 hashed at registration time. Before execution, the hash is re-verified. If an agent wraps a blocked tool in a wrapper function, the hash won't match -- blocked.

The ContentHashInterceptor plugs into any interceptor chain for call-level verification.

2. PolicyEngine.freeze() (defeats self-modification)

Call freeze() after initializing policies. This is irreversible -- mutation methods raise RuntimeError, and the underlying dicts are converted to MappingProxyType / frozenset so even direct attribute access fails. Every mutation (successful or blocked) is recorded in engine.mutation_log for forensic analysis.

3. Approval Quorum and Fatigue Detection (defeats rubber-stamping)

EscalationHandler now supports M-of-N quorum and per-agent rate limiting. If an agent floods the queue, it gets auto-denied (not auto-allowed). This directly inverts the dangerous pattern identified by Grith in Claude Code's Auto Mode.

What this doesn't solve (honest limitations)

AGT is an application-level governance framework. We cannot prevent kernel-level escapes (dynamic linker bypass), direct Python object manipulation, or network-layer exfiltration. These are documented in our Known Architectural Limitations. The right approach is defense in depth: AGT for application-level governance + container isolation + network policies + kernel enforcement.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How AGT defends against agent sandbox escape attacks (Ona/Veto research response) #299

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

How AGT defends against agent sandbox escape attacks (Ona/Veto research response) #299

Uh oh!

imran-siddique Mar 18, 2026 Collaborator

The Problem: Agents reason their way out of sandboxes

The 3 attack vectors that matter for governance frameworks

How AGT shows up

1. Tool Content Hashing (defeats aliasing)

2. PolicyEngine.freeze() (defeats self-modification)

3. Approval Quorum and Fatigue Detection (defeats rubber-stamping)

What this doesn't solve (honest limitations)

Further reading

Replies: 0 comments

imran-siddique
Mar 18, 2026
Collaborator