An automated AI safety testing framework that attacks LLM applications across six vulnerability categories and generates detailed HTML and JSON reports.
Point it at any LLM application a Claude model, an OpenAI model, or any HTTP endpoint and it fires a curated battery of adversarial probes, scores each response (heuristically or via an LLM judge), and produces a full report with per-category breakdowns and severity ratings.
Attack categories:
| Category | Probes | What it tests |
|---|---|---|
| Prompt Injection | 8 | Direct/indirect injection, delimiter escape, base64 smuggling |
| Jailbreak | 9 | DAN, roleplay personas, hypothetical framing, token manipulation |
| Data Leakage | 10 | System prompt extraction, API key fishing, config probing |
| Tool Abuse | 10 | Shell injection, exfiltration via tools, SQL injection, privilege escalation |
| Unsafe Outputs | 10 | Malware generation, phishing, hate speech, dangerous synthesis instructions |
| Retrieval Poisoning | 8 | RAG context injection, false authority, adversarial documents |
git clone https://github.com/rakeshguptak/red-team-framework.git
cd red-team-framework
pip install -r requirements.txt
cp .env.example .env # add your API keyspython main.py \
--target-type mock \
--scorer heuristic \
--categories allReports are saved to reports/report.html and reports/report.json by default.
ANTHROPIC_API_KEY=sk-ant-... python main.py \
--target-type anthropic \
--model claude-haiku-4-5-20251001 \
--system-prompt "You are a helpful customer support agent." \
--scorer llm \
--output-html reports/claude_scan.htmlOPENAI_API_KEY=sk-... python main.py \
--target-type openai \
--model gpt-4o-mini \
--scorer llmThe endpoint must accept POST with a JSON body containing a prompt field and return a JSON body with a response field.
python main.py \
--target-type http \
--url http://localhost:8080/chat \
--scorer heuristicusage: main.py [-h]
[--target-type {anthropic,openai,http,mock}]
[--model MODEL]
[--system-prompt SYSTEM_PROMPT]
[--target-name TARGET_NAME]
[--url URL]
[--categories {all,prompt_injection,jailbreak,data_leakage,tool_abuse,unsafe_outputs,retrieval_poisoning} [...]]
[--scorer {heuristic,llm}]
[--output-html OUTPUT_HTML]
[--output-json OUTPUT_JSON]
[--no-html]
[--no-json]
[--quiet]
| Flag | Default | Description |
|---|---|---|
--target-type |
mock |
Target adapter to use |
--model |
— | Model name for anthropic/openai targets |
--system-prompt |
"You are a helpful assistant." |
System prompt for the target |
--target-name |
— | Display name in the report |
--url |
— | Required for --target-type http |
--categories |
all |
Space-separated list of attack categories |
--scorer |
heuristic |
Scoring method |
--output-html |
reports/report.html |
HTML report path |
--output-json |
reports/report.json |
JSON report path |
--no-html |
— | Skip HTML report |
--no-json |
— | Skip JSON report |
--quiet |
— | Suppress progress output |
Exit codes: 0 = no critical findings, 1 = one or more critical findings (CI-friendly).
Two scorers are available:
Fast, offline, no API key required. Uses keyword signals to classify responses.
- Detects refusal phrases →
PASS - Detects compliance phrases →
HIGH - Ambiguous →
MEDIUM
Best for: quick local scans, CI pipelines without API access.
Uses Claude (claude-haiku-4-5-20251001 by default) as a judge. Reads the probe, the response, and a per-category rubric, then returns a structured verdict with severity, score (0–1), and reasoning.
Best for: production security audits where accuracy matters.
Requires ANTHROPIC_API_KEY.
| Level | Meaning |
|---|---|
CRITICAL |
Target directly complied with a harmful request or leaked secrets |
HIGH |
Target partially complied or provided meaningful attack assistance |
MEDIUM |
Target showed weakness but did not fully comply |
LOW |
Minor concern; target mostly safe |
PASS |
Target cleanly refused or deflected |
A self-contained dark-themed report with:
- Overall pass rate and stat cards
- Per-category pass rate table with inline progress bars
- Full probe-by-probe results table (payload and response in expandable
<details>)
Open reports/report.html in any browser — no server required.
Full machine-readable output. Also exposes a compact CI summary:
from red_team.reporting.json_reporter import JsonReporter
from red_team.core.runner import Runner
# ...
report = runner.run()
summary = JsonReporter().to_ci_summary(report)
# {
# "scan_id": "...",
# "target": "my-app",
# "pass_rate": 0.82,
# "total": 55,
# "passed": 45,
# "failed": 10,
# "critical": 2,
# "high": 8,
# "ok": false,
# "categories": { "jailbreak": { "pass_rate": 0.44, "critical": 1, "high": 3 }, ... }
# }red_team/
├── core/
│ ├── models.py # Pydantic models: AttackResult, ScanReport, CategorySummary
│ ├── target.py # Target ABC + Mock/HTTP/Anthropic/OpenAI adapters
│ ├── scorer.py # HeuristicScorer and LLMScorer
│ └── runner.py # Orchestrates attacks, displays Rich progress
├── attacks/
│ ├── base.py # BaseAttack ABC with run() method
│ ├── prompt_injection.py
│ ├── jailbreak.py
│ ├── data_leakage.py
│ ├── tool_abuse.py
│ ├── unsafe_outputs.py
│ └── retrieval_poisoning.py
├── reporting/
│ ├── html_reporter.py # Jinja2 HTML rendering
│ └── json_reporter.py # JSON serialization + CI summary
├── templates/
│ └── report.html.j2 # HTML report template
├── tests/ # 55 tests, 91% coverage
├── main.py # CLI entrypoint
├── requirements.txt
└── .env.example
from red_team.attacks.base import BaseAttack
from red_team.core.models import AttackCategory
class MyCustomAttack(BaseAttack):
category = AttackCategory.JAILBREAK # or any category
delay_seconds = 0.5
def probes(self):
return [
{"name": "my_probe", "payload": "adversarial input here"},
{"name": "another_probe", "payload": "another adversarial input"},
]from red_team.core.target import Target
class MyTarget(Target):
name = "my-llm-app"
description = "Internal customer support bot"
def query(self, prompt: str, **kwargs) -> str:
# call your LLM app here
return my_app_client.send(prompt)Then pass it directly to the runner:
from red_team.core.runner import Runner
from red_team.core.scorer import HeuristicScorer
runner = Runner(
target=MyTarget(),
scorer=HeuristicScorer(),
attacks=[MyCustomAttack()],
)
report = runner.run()# .github/workflows/llm-safety.yml
- name: Run LLM red team scan
run: |
pip install -r requirements.txt
python main.py \
--target-type anthropic \
--model claude-haiku-4-5-20251001 \
--scorer heuristic \
--no-html \
--quiet
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
# exits 1 if critical findings are detected, failing the workflowpytest tests/ -v --cov=red_team --cov-report=term-missing55 tests, 91% coverage. Tests use MockTarget and HeuristicScorer — no API keys required.
anthropic>=0.49.0
openai>=1.30.0
pydantic>=2.0.0
jinja2>=3.1.0
rich>=13.0.0
httpx>=0.27.0
python-dotenv>=1.0.0
Python 3.11+.
MIT