quantifylabs · quantifylabs · Jun 2, 2026 · May 31, 2026 · Jun 2, 2026 · chatgpt-codex-connector
diff --git a/.github/workflows/pip-audit.yml b/.github/workflows/pip-audit.yml
@@ -0,0 +1,35 @@
+name: pip-audit (shipped deps)
+
+# Audits ONLY the shipped-library dependency surface (server/requirements.txt and the
+# pyproject.toml core + [server] deps) against the OSV database. Benchmark/dev-only deps
+# (benchmarks/injection/requirements.txt) are intentionally NOT gated here — their residual
+# advisories are triaged and accepted in docs/security/vuln-triage.md. This job is blocking:
+# it fails the PR on any NEW vulnerability reachable by library users.
+
+on:
+  pull_request:
+    branches: [main]
+  push:
+    branches: [main]
+
+permissions:
+  contents: read
+
+jobs:
+  audit:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4.3.1
+      - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5.6.0
+        with:
+          python-version: "3.12"
+      - name: Install pip-audit
+        run: pip install pip-audit
+      - name: Audit server/requirements.txt
+        run: python -m pip_audit -r server/requirements.txt
+      - name: Audit installed shipped package (pyproject core + [server])
+        # Resolves the real shipped tree from pyproject.toml so transitive deps not pinned in
+        # server/requirements.txt are covered too. No --ignore-vuln: this surface is clean today.
+        run: |
+          pip install -e ".[server]"
+          python -m pip_audit
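Review note on the workflow above: the job relies on pip-audit's exit code, which is enough to block a PR. If a later step ever needs to report which packages tripped the gate, a small parser over the JSON report could do it. This is a minimal sketch, assuming the report shape produced by `pip-audit --format json` (a top-level `dependencies` list whose entries carry `name`, `version`, and `vulns`). The function name and the sample report are hypothetical and are not part of this PR.

```python
import json


def vulnerable_packages(report_json: str) -> dict[str, list[str]]:
    """Map each vulnerable package name to its sorted, de-duplicated advisory IDs."""
    report = json.loads(report_json)
    found: dict[str, list[str]] = {}
    for dep in report.get("dependencies", []):
        ids = sorted({vuln["id"] for vuln in dep.get("vulns", [])})
        if ids:
            found[dep["name"]] = ids
    return found


# Hand-written sample in the assumed report shape; not real audit output.
SAMPLE_REPORT = json.dumps({
    "dependencies": [
        {"name": "numpy", "version": "1.26.4", "vulns": []},
        {"name": "python-dotenv", "version": "1.0.1",
         "vulns": [{"id": "CVE-2026-28684", "fix_versions": ["1.2.2"]}]},
    ]
})

findings = vulnerable_packages(SAMPLE_REPORT)
for name, ids in findings.items():
    print(f"{name}: {', '.join(ids)}")
# A CI wrapper would exit non-zero whenever `findings` is non-empty.
```

Collecting IDs into a set before sorting also collapses the duplicate advisory IDs that the triage doc mentions.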
diff --git a/README.md b/README.md
@@ -307,6 +307,19 @@ Benchmarked on 8 vCPU / 7.6 GB RAM (Intel 13th Gen), 1000 memories, Docker Compo
 
 > Query tail latency (p95/p99) is dominated by the external OpenAI embedding call, not Aegis or PostgreSQL. Write and vote operations that skip embedding are consistently under 100ms at p50.
 
+## Security benchmark
+
+Does the [4-stage content security pipeline](#built-for-a-world-where-agents-get-compromised) actually catch prompt injection? We measured it as a detector against five baselines (DeBERTa, LLM Guard, an LLM judge, and more) on labelled injection + benign corpora — with full confusion-matrix metrics, a per-stage ablation, and an honest error analysis. **The false-positive rate is reported next to recall everywhere** — a blocker that flags everything is useless.
+
+| Aegis configuration | Recall | FPR | Median latency |
+|-----------------------------------------|------:|-----:|---------------:|
+| Stages 1–3 (deterministic, no API call) | 0.14 | 0.00 | 46 µs |
+| Stages 1–4 (+ LLM classifier) | 0.67 | 0.00 | 1.2 s |
+
+> `deepset/prompt-injections`, direct injection (N=662). The free deterministic core adds **zero** false positives here and only 1 across 1,500 benign memory snippets; the optional LLM stage trades ~1s of latency for a 4.6× recall gain. Stage 2 (PII) contributes ~0 to injection recall by design, since PII is a different threat category.
+
+→ **Full results, ablation, baselines, latency, and limitations: [`docs/security/benchmark.md`](docs/security/benchmark.md)** · reproduce with `python benchmarks/injection/run_benchmark.py`.
+
 ## Deployment
 
 ### Docker Compose
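Review note on the README benchmark section: the claim that recall must be read next to FPR can be checked with a few lines of arithmetic. The sketch below is illustrative only. The flag-everything counts are hypothetical, and the single false positive over 1,500 benign snippets is the figure quoted in the README note.

```python
def recall(tp: int, fn: int) -> float:
    """Share of true injections that the detector flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of benign inputs that the detector wrongly flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical detector that flags every input: 100 injections, 100 benign.
flag_all_recall = recall(tp=100, fn=0)
flag_all_fpr = false_positive_rate(fp=100, tn=0)
print(flag_all_recall, flag_all_fpr)  # perfect recall, but every benign input is blocked

# README figure: 1 false positive across 1,500 benign memory snippets.
benign_fpr = false_positive_rate(fp=1, tn=1499)
print(f"{benign_fpr:.4f}")  # unrounded enough to show the single false positive
print(f"{benign_fpr:.2f}")  # the two-decimal form used in the README table
```

At two decimals a single false positive in 1,500 is displayed as 0.00, which is why the raw count is worth keeping in the note.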

diff --git a/SECURITY.md b/SECURITY.md
@@ -28,3 +28,11 @@ Include: affected version, reproduction steps, and impact assessment.
 
 For deeper security architecture (4-stage content pipeline, HMAC-SHA256 integrity,
 OWASP 4-tier trust hierarchy), see [docs/guides/security.mdx](docs/guides/security.mdx).
+
+## Dependency Vulnerabilities
+
+The shipped-library dependency surface is audited against OSV in CI
+([`.github/workflows/pip-audit.yml`](.github/workflows/pip-audit.yml)) and currently reports
+**zero known vulnerabilities**. For the full triage — including the benchmark/dev-only residual
+that is documented and accepted (never shipped to PyPI users) — see
+[docs/security/vuln-triage.md](docs/security/vuln-triage.md).
diff --git a/benchmarks/injection/requirements.txt b/benchmarks/injection/requirements.txt
@@ -8,11 +8,11 @@
 
 # --- Core (always needed) ---
 numpy>=1.26,<2.0            # metrics + bootstrap resampling
-python-dotenv==1.0.1        # load OPENAI_/ANTHROPIC_ keys from aegis-memory-main/.env
+python-dotenv==1.2.2        # load OPENAI_/ANTHROPIC_ keys; >=1.2.2 clears CVE-2026-28684
 
 # --- Datasets ---
 datasets==2.19.1            # deepset/prompt-injections, databricks-dolly-15k
-huggingface-hub==0.23.4
+huggingface-hub==0.30.2     # >=0.30 required by transformers>=4.53; datasets 2.19.1 allows it
 requests>=2.31.0            # InjecAgent raw fetch (best-effort)
 
 # --- ML baseline: protectai_deberta  AND  framework baseline: llm_guard ---
@@ -21,11 +21,16 @@ requests>=2.31.0            # InjecAgent raw fetch (best-effort)
 # (deberta-v3 text-classification works across this transformers range too).
 # CPU torch wheel is large (~200MB); install takes a few minutes.
 # IMPORTANT: transformers 5.x breaks llm-guard 0.3.x (import error), and
-# llm-guard 0.3.15 requires torch>=2.4 — so cap transformers<5 and let llm-guard
-# pull a compatible torch. deberta-v3 text-classification works in this range.
+# llm-guard 0.3.15 requires torch>=2.4 and transformers>=4.43.4 — so cap
+# transformers<5 and let llm-guard pull a compatible torch. deberta-v3
+# text-classification works in this range.
+# Security floor: >=4.53.0 clears every transformers advisory that has a fix
+# below 5.x (CVE-2024-12720, CVE-2025-1194/3263/3264/3777/3933/5197/6051/6638/6921,
+# PYSEC-2024-227/228/229, PYSEC-2025-40). The remaining advisories have no <5 fix
+# and are documented as accepted benchmark-only risk in docs/security/vuln-triage.md.
 torch>=2.4
-transformers>=4.41,<5
-sentencepiece==0.2.0        # deberta-v3 tokenizer needs this
+transformers>=4.53.0,<5
+sentencepiece==0.2.1        # deberta-v3 tokenizer needs this; >=0.2.1 clears CVE-2026-1260
 llm-guard==0.3.15
 # If the resolver still cannot satisfy llm-guard on your platform, drop it and
 # rerun — the benchmark marks `llm_guard` as "not run" and proceeds.
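Review note on the security floors above: a quick way to confirm that an environment actually sits at or above the floors is to compare versions numerically rather than as strings. This is a minimal sketch that handles only plain dotted numeric versions (no pre-releases or local tags). The `FLOORS` table and helper names are hypothetical, and the version numbers are the ones quoted in this PR.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '4.53.0' into (4, 53, 0); only plain dotted numeric versions are supported."""
    return tuple(int(part) for part in version.split("."))


# Security floors taken from the requirements diff in this PR.
FLOORS = {
    "python-dotenv": "1.2.2",
    "sentencepiece": "0.2.1",
    "transformers": "4.53.0",
}


def below_floor(installed: dict[str, str]) -> list[str]:
    """Return the packages whose installed version is lower than its floor."""
    return sorted(
        name
        for name, floor in FLOORS.items()
        if name in installed and parse(installed[name]) < parse(floor)
    )


before = {"python-dotenv": "1.0.1", "sentencepiece": "0.2.0", "transformers": "4.46.3"}
after = {"python-dotenv": "1.2.2", "sentencepiece": "0.2.1", "transformers": "4.53.3"}

print(below_floor(before))
print(below_floor(after))
```

Tuple comparison matters here: as strings, `"4.9.0"` sorts after `"4.53.0"`, while as integer tuples it correctly sorts before it.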

diff --git a/docs/security/vuln-triage.md b/docs/security/vuln-triage.md
@@ -0,0 +1,112 @@
+# Dependency vulnerability triage
+
+_Audited with `pip-audit` 2.10.0 (OSV) on 2026-06-02. Ground truth for this PR; the OpenSSF
+Scorecard viewer refreshes on its own schedule after merge._
+
+## Headline
+
+**Zero known vulnerabilities in the shipped library.** Every advisory OSV reports for this repo
+lives in **benchmark-only dev tooling** (`benchmarks/injection/requirements.txt`), which is never
+installed by people who `pip install aegis-memory`. Before this PR those advisories spanned
+**3 distinct packages (28 advisory instances)**; after conservative bumps the residual is
+**1 package (9 advisories), all in `transformers`, with no fix available below the major version
+that breaks the benchmark's `llm-guard` dependency.**
+
+| Surface | Manifest | Before | After |
+|---|---|--:|--:|
+| Shipped library | `server/requirements.txt` | 0 | 0 |
+| Shipped library | `pyproject.toml` (core + `[server]`) | 0 | 0 |
+| Benchmark / dev-only | `benchmarks/injection/requirements.txt` | 3 pkgs / 28 | **1 pkg / 9** |
+
+The shipped surface was already clean thanks to the transitive security floors in
+`server/requirements.txt` (`idna>=3.15`, `pygments>=2.20.0`, `tqdm>=4.66.3`). It is now also
+gated in CI by [`.github/workflows/pip-audit.yml`](../../.github/workflows/pip-audit.yml) so a new
+shipped-dependency vulnerability fails the build.
+
+> **Note on the Scorecard count.** The public viewer has shown ~53 OSV advisories. That number
+> counts *every advisory ID* across the fuller tree Scorecard resolves — including the duplicate
+> IDs `pip-audit` also emits (e.g. `PYSEC-2024-227/228/229` were each listed twice) and the
+> `PYSEC-2025-211..218` cluster, which is **one package**, not eight. The number that actually
+> matters is **distinct shipped-dependency packages needing a fix: zero.**
+
+## Manifests scanned
+
+| Manifest | Role |
+|---|---|
+| `server/requirements.txt` | Shipped library runtime deps (PyPI install surface) |
+| `pyproject.toml` (`dependencies`, `[server]`) | Shipped library / server extra |
+| `benchmarks/injection/requirements.txt` | Benchmark-only dev tooling (transformers, torch, datasets, llm-guard, …) — not shipped |
+
+No `setup.py`, `poetry.lock`, or other lockfiles exist in the repo.
+
+## Triage table (one row per distinct package)
+
+| Package | Version (before → after) | Manifest | Advisories (grouped) | Fix available | Safe bump? | Action |
+|---|---|---|---|---|---|---|
+| `python-dotenv` | `1.0.1` → `1.2.2` | benchmark-only | CVE-2026-28684 | yes (`1.2.2`) | yes — API-compatible | **Bumped** |
+| `sentencepiece` | `0.2.0` → `0.2.1` | benchmark-only | CVE-2026-1260 | yes (`0.2.1`) | yes — patch; deberta-v3 tokenizer unaffected | **Bumped** |
+| `transformers` | `4.46.3` → `4.53.3` (floor `>=4.41,<5` → `>=4.53.0,<5`) | benchmark-only | 14 with a `<5` fix · 8 no-fix (`PYSEC-2025-211..218`) · 1 needing 5.x (`CVE-2026-1839`) | partial | bump to highest `<5`; rest unbumpable | **Bumped (partial)** + residual documented below |
+| `huggingface-hub` | `0.23.4` → `0.30.2` | benchmark-only | none (compat bump) | n/a | yes — required by `transformers>=4.53`; `datasets==2.19.1` allows it | **Bumped (to satisfy transformers)** |
+
+### transformers advisories cleared by the `>=4.53.0` floor (14)
+
+`PYSEC-2024-227`, `PYSEC-2024-228`, `PYSEC-2024-229` (4.48.0) · `PYSEC-2025-40` (4.49.0) ·
+`CVE-2024-12720` (4.48.0) · `CVE-2025-1194` (4.50.0) · `CVE-2025-3263`, `CVE-2025-3264` (4.51.0) ·
+`CVE-2025-3777`, `CVE-2025-3933` (4.52.1) · `CVE-2025-5197`, `CVE-2025-6638`, `CVE-2025-6051`,
+`CVE-2025-6921` (4.53.0).
+
+## Known unfixable / accepted residual
+
+The entire residual sits in `transformers 4.53.3`, which is **benchmark-only** dev tooling. It is
+**not reachable by library users**: `transformers` is not a dependency of `aegis-memory` or its
+`[server]` extra, and it is installed only by someone running the injection benchmark in an
+isolated venv. Risk to shipped users: **none.**
+
+| Advisory | Why it can't be bumped | Reachability |
+|---|---|---|
+| `PYSEC-2025-211` | No fixed version published in OSV (no `<5` patch) | benchmark-only |
+| `PYSEC-2025-212` | No fixed version published in OSV | benchmark-only |
+| `PYSEC-2025-213` | No fixed version published in OSV | benchmark-only |
+| `PYSEC-2025-214` | No fixed version published in OSV | benchmark-only |
+| `PYSEC-2025-215` | No fixed version published in OSV | benchmark-only |
+| `PYSEC-2025-216` | No fixed version published in OSV | benchmark-only |
+| `PYSEC-2025-217` | No fixed version published in OSV | benchmark-only |
+| `PYSEC-2025-218` | No fixed version published in OSV | benchmark-only |
+| `CVE-2026-1839` | Fix only in `5.0.0rc3`; `transformers 5.x` breaks `llm-guard 0.3.15` (the benchmark's `<5` ceiling) | benchmark-only |
+
+### Deliberate ignore list
+
+If/when `pip-audit` is run over the benchmark manifest in tooling, the residual is suppressed
+*explicitly* (a reviewed decision, not an oversight):
+
+```
+python -m pip_audit -r benchmarks/injection/requirements.txt \
+  --ignore-vuln PYSEC-2025-211 --ignore-vuln PYSEC-2025-212 \
+  --ignore-vuln PYSEC-2025-213 --ignore-vuln PYSEC-2025-214 \
+  --ignore-vuln PYSEC-2025-215 --ignore-vuln PYSEC-2025-216 \
+  --ignore-vuln PYSEC-2025-217 --ignore-vuln PYSEC-2025-218 \
+  --ignore-vuln CVE-2026-1839
+```
+
+The shipped-deps CI job (`.github/workflows/pip-audit.yml`) needs **no** ignore list — that surface
+is clean — and intentionally does **not** audit the benchmark manifest, so the accepted residual
+above never blocks a merge.
+
+## Proposal for the maintainer (not done in this PR)
+
+Every advisory OSV currently reports for this repo comes from benchmark-only tooling. To make
+attribution unambiguous, the benchmark extras could be moved into an isolated optional-dependency
+group, e.g. `[project.optional-dependencies] benchmark = [...]` in `pyproject.toml`, installed via
+`pip install "aegis-memory[benchmark]"`. This is **clarity of attribution**, not concealment:
+Scorecard may still scan any manifest in the repo. Flagged here for a maintainer decision; the
+dependency layout is intentionally **not** restructured in this PR.
+
+## Verification performed
+
+1. `python -m pip_audit -r server/requirements.txt` → `No known vulnerabilities found`.
+2. `python -m pip_audit` over the `pyproject.toml` core + `[server]` resolved tree → `No known vulnerabilities found`.
+3. `python -m pip_audit -r benchmarks/injection/requirements.txt` → 9 advisories, all the documented
+   `transformers` residual above (down from 28 across 3 packages).
+4. `python -m pytest tests/` → 493 passed, 2 skipped. The only errors are `asyncpg` connection
+   failures in tests that need a live Postgres, which CI provides via its `postgres` service;
+   they are unrelated to the dependency bumps, which touch no shipped code.
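Review note on the triage counts: the headline figures (3 packages / 28 instances before, 1 package / 9 after) can be reproduced from the advisory IDs listed in the doc. The sketch below assumes the 28 instances are the 25 distinct advisories plus the three `PYSEC-2024-227/228/229` IDs that the doc says were each listed twice. That reading is an inference from the numbers, not something the doc states outright.

```python
# transformers advisories cleared by the >=4.53.0 floor (14), as listed in the triage doc.
CLEARED = [
    "PYSEC-2024-227", "PYSEC-2024-228", "PYSEC-2024-229", "PYSEC-2025-40",
    "CVE-2024-12720", "CVE-2025-1194", "CVE-2025-3263", "CVE-2025-3264",
    "CVE-2025-3777", "CVE-2025-3933", "CVE-2025-5197", "CVE-2025-6638",
    "CVE-2025-6051", "CVE-2025-6921",
]
# Accepted residual (9): the PYSEC-2025-211..218 cluster plus CVE-2026-1839.
RESIDUAL = [f"PYSEC-2025-{n}" for n in range(211, 219)] + ["CVE-2026-1839"]
# IDs the doc says pip-audit listed twice (assumed to explain 28 vs 25).
DUPLICATED = ["PYSEC-2024-227", "PYSEC-2024-228", "PYSEC-2024-229"]

before = [
    ("python-dotenv", "CVE-2026-28684"),
    ("sentencepiece", "CVE-2026-1260"),
] + [("transformers", adv) for adv in CLEARED + RESIDUAL + DUPLICATED]
after = [("transformers", adv) for adv in RESIDUAL]


def summarize(rows: list[tuple[str, str]]) -> tuple[int, int, int]:
    """Return (distinct packages, advisory instances, distinct advisories)."""
    packages = {pkg for pkg, _ in rows}
    return len(packages), len(rows), len(set(rows))


print(summarize(before))
print(summarize(after))
```

Reporting all three numbers side by side makes it easier to see why a raw advisory count overstates the number of packages that need a fix.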