Merged
7 changes: 7 additions & 0 deletions benchmarks/injection/.gitignore
@@ -0,0 +1,7 @@
# LLM response cache — never committed (may be large; re-derivable from APIs).
cache/

# HuggingFace dataset cache, if a local one is created here.
.hf_cache/

# results/ IS committed (results.json + error_analysis.md are deliverables).
99 changes: 99 additions & 0 deletions benchmarks/injection/README.md
@@ -0,0 +1,99 @@
# Aegis injection-detection benchmark

A reproducible, **honest** benchmark that evaluates the Aegis four-stage content-security
pipeline (`server/content_security.py`) as a prompt-injection / memory-poisoning **detector**,
against established baselines, with full confusion-matrix metrics and a per-stage ablation.

This measures Aegis in its actual threat model: **detecting injection/poisoning in content being
written to memory**. It is *not* an LLM-jailbreak-defense benchmark. The headline numbers,
ablation, latency comparison, and limitations live in
[`docs/security/benchmark.md`](../../docs/security/benchmark.md).

## What it measures

Every system is wrapped as `predict(text) -> bool` and scored on **both** malicious and benign
corpora, reported as a full confusion matrix → **precision, recall, F1, FPR, accuracy**, plus
**median per-item latency** and **bootstrapped 95% CIs** (n=1000, seed=42).
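The scoring described above can be sketched as follows. This is a minimal illustration, not the contents of `metrics.py`: the names `Confusion`, `confusion`, `scores`, and `bootstrap_ci` are hypothetical, and the percentile bootstrap shown here is one common way to obtain a 95% CI with `n=1000` resamples and `seed=42`.

```python
import random
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary detector (positive = malicious)."""
    tp: int = 0
    fp: int = 0
    tn: int = 0
    fn: int = 0


def confusion(y_true: list[bool], y_pred: list[bool]) -> Confusion:
    """Tally the four confusion-matrix cells."""
    c = Confusion()
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            c.tp += 1
        elif truth and not pred:
            c.fn += 1
        elif not truth and pred:
            c.fp += 1
        else:
            c.tn += 1
    return c


def _ratio(num: float, den: float) -> float:
    """Return num/den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def scores(c: Confusion) -> dict[str, float]:
    """Derive precision, recall, F1, FPR, and accuracy from the counts."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "fpr": _ratio(c.fp, c.fp + c.tn),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
    }


def bootstrap_ci(y_true: list[bool], y_pred: list[bool], metric: str = "f1",
                 n: int = 1000, seed: int = 42) -> tuple[float, float]:
    """Percentile bootstrap: resample items with replacement, rescore, take 2.5/97.5%."""
    rng = random.Random(seed)
    indices = range(len(y_true))
    samples = []
    for _ in range(n):
        pick = rng.choices(indices, k=len(y_true))
        c = confusion([y_true[i] for i in pick], [y_pred[i] for i in pick])
        samples.append(scores(c)[metric])
    samples.sort()
    return samples[int(0.025 * n)], samples[min(n - 1, int(0.975 * n))]
```

Scoring malicious and benign items together is what makes FPR meaningful: a detector that flags everything gets perfect recall but an FPR of 1.0.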

**Systems:** `no_protection`, `naive_regex`, `protectai_deberta`, `llm_guard`,
`llm_judge_openai`, `llm_judge_anthropic`, `aegis_stages_1_3`, `aegis_stages_1_4_openai`,
`aegis_stages_1_4_anthropic`.
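As a rough illustration of the `predict(text) -> bool` wrapping, the two simplest baselines might look like the sketch below. The class names and the regex patterns are assumptions made for this example; the real adapters live in `systems.py` and may differ.

```python
import re
from typing import Protocol


class Detector(Protocol):
    """Common interface: True means the text is flagged as an injection."""

    def predict(self, text: str) -> bool: ...


class NoProtection:
    """Floor baseline: never flags anything."""

    def predict(self, text: str) -> bool:
        return False


class NaiveRegex:
    """Keyword baseline. The patterns are hypothetical, chosen for illustration only."""

    _PATTERNS = [
        re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
        re.compile(r"disregard .{0,40}system prompt", re.IGNORECASE),
    ]

    def predict(self, text: str) -> bool:
        return any(p.search(text) for p in self._PATTERNS)
```

Because every system exposes the same method, the orchestrator can score all of them with one loop over systems and datasets.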

**Datasets:** `deepset/prompt-injections` (direct), `InjecAgent` (indirect, 250 sampled),
`benign_public` (dolly, 750), `benign_synth` (750 templated memory entries).

## Setup

```bash
# from the repo root (aegis-memory-main/)
python -m venv .venv-bench && source .venv-bench/bin/activate # Windows: .venv-bench\Scripts\Activate.ps1
pip install -r benchmarks/injection/requirements.txt
```

`torch` and `transformers` are large downloads (CPU wheels; installation takes a few minutes). If
`llm-guard` cannot co-resolve with the pinned `transformers`/`torch`, install it in a separate
venv or skip it; the benchmark marks `llm_guard` as `not_run` and proceeds.

### API keys

`llm_judge_*` and Aegis `aegis_stages_1_4_*` call paid APIs. Keys are read from the environment
or `aegis-memory-main/.env` **only** (never hardcoded):

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

If a key is absent, that system is reported as `not_run` and the run continues. Responses are
cached under `cache/`, keyed by `(system_id, model_id, sha256(prompt))`, so re-runs with an intact
cache are **not re-billed**.
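A minimal sketch of such a cache, assuming one JSON file per key on disk. The helper names `cache_file` and `cached_call` and the file layout are hypothetical; only the key shape `(system_id, model_id, sha256(prompt))` comes from the description above.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_file(cache_dir: Path, system_id: str, model_id: str, prompt: str) -> Path:
    """Map (system_id, model_id, sha256(prompt)) to a stable file name."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    key = json.dumps([system_id, model_id, prompt_hash])
    # Hash the whole key so model ids containing "/" cannot escape cache_dir.
    name = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return cache_dir / f"{name}.json"


def cached_call(cache_dir: Path, system_id: str, model_id: str, prompt: str,
                call: Callable[[str], str]) -> str:
    """Return the cached response if present; otherwise call the API once and store it."""
    path = cache_file(cache_dir, system_id, model_id, prompt)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    response = call(prompt)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"response": response}), encoding="utf-8")
    return response
```

Including `model_id` in the key means that switching judge models triggers fresh (billed) calls instead of silently reusing another model's answers.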

## Run

```bash
# Smoke test (20 items/dataset) — validates wiring end to end:
python benchmarks/injection/run_benchmark.py --limit 20

# Full run:
python benchmarks/injection/run_benchmark.py

# Subsets:
python benchmarks/injection/run_benchmark.py --systems aegis_stages_1_3,naive_regex
python benchmarks/injection/run_benchmark.py --datasets deepset,benign_synth
```

### Expected runtime (CPU-only laptop, full corpora)

| Systems | Runtime / cost |
|---|---|
| `no_protection`, `naive_regex`, `aegis_stages_1_3` | seconds (deterministic) |
| `protectai_deberta`, `llm_guard` | a few minutes (CPU inference) |
| `llm_judge_*`, `aegis_stages_1_4_*` | API-bound; ~$1–2 total once, then cache-served |

## Outputs

- `results/results.json` — full machine-readable results: every system × dataset, confusion
matrices, P/R/F1/FPR/accuracy, latencies, bootstrap CIs, the Aegis stage ablation, dataset
revisions, model versions, seed, timestamp, cache stats.
- `results/error_analysis.md` — false negatives (missed injections, categorized) + a sample of
false positives (benign flagged).
- `cache/` — LLM response cache (git-ignored).

## Files

| File | Purpose |
|---|---|
| `datasets.py` | 4 dataset loaders, pinned revisions, graceful missing-source handling |
| `systems.py` | `predict(text)->bool` adapters, response cache, per-stage attribution |
| `metrics.py` | confusion matrix, P/R/F1/FPR/accuracy, bootstrap CIs, stage ablation |
| `run_benchmark.py` | orchestrator: loads `.env`, runs systems × datasets, writes results |
| `_paths.py` | puts `server/` + repo root on `sys.path` (mirrors `tests/conftest.py`) |

## Reproducibility notes

- All subsampling uses **seed 42**; exact counts and resolved dataset revisions are recorded in
`results.json`.
- `aegis_stages_1_4_*` forces Stage 4 on every item via `trust_level="untrusted"` so the ablation
can measure Stage 4's standalone contribution. **Production gates Stage 4 conditionally** — this
is a measurement choice, stated in `results.json["meta"]` and the writeup.
- Detection logic is **never reimplemented**: Aegis systems call the real
`ContentSecurityScanner.scan` / `.scan_async` from `server/content_security.py`.
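
The seeded subsampling in the first note can be sketched as below. This is an assumption about the general approach, not the code in `datasets.py`; the helper name `subsample` is hypothetical.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def subsample(items: Sequence[T], k: int, seed: int = 42) -> list[T]:
    """Draw k items without replacement using a private, seeded RNG."""
    if k >= len(items):
        return list(items)
    # A dedicated Random instance keeps the draw independent of global RNG state.
    return random.Random(seed).sample(items, k)
```

The draw is only reproducible if the source items arrive in the same order, which is one reason to record the resolved dataset revisions alongside the seed.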
9 changes: 9 additions & 0 deletions benchmarks/injection/__init__.py
@@ -0,0 +1,9 @@
"""Research-grade prompt-injection detection benchmark for Aegis Memory.

Evaluates the Aegis four-stage content-security pipeline
(``server/content_security.py``) as a prompt-injection / memory-poisoning
detector, against established baselines, with full confusion-matrix metrics
and a per-stage ablation.

See ``README.md`` for how to reproduce.
"""
27 changes: 27 additions & 0 deletions benchmarks/injection/_paths.py
@@ -0,0 +1,27 @@
"""Import-path bootstrap for the injection benchmark.

The Aegis server modules use *bare* imports (``from content_security import
...``) and expect ``<repo>/server`` on ``sys.path`` (see ``tests/conftest.py``).
The ``aegis_memory`` package lives at the repo root. Importing this module
makes both importable without installing the server, so the benchmark can call
the real ``ContentSecurityScanner`` rather than reimplementing detection logic.
"""

from __future__ import annotations

import sys
from pathlib import Path

# benchmarks/injection/_paths.py -> repo root is two parents up.
REPO_ROOT = Path(__file__).resolve().parents[2]
SERVER_DIR = REPO_ROOT / "server"


def ensure_paths() -> None:
    """Prepend repo root and server/ to sys.path (idempotent)."""
    for p in (str(SERVER_DIR), str(REPO_ROOT)):
        if p not in sys.path:
            sys.path.insert(0, p)


ensure_paths()