The calibration harness enables deterministic, CI-friendly calibration of pack thresholds using labeled test sets.
When testing RAG systems for security vulnerabilities, thresholds determine when a metric value indicates a failure. The calibration harness helps tune these thresholds by:
- Running a pack against a labeled dataset
- Extracting per-case metric scores
- Finding the optimal threshold that achieves a target False Positive Rate (FPR)
Create a `labels.jsonl` file with the following format:

```jsonl
{"test_id": "q001", "label": "positive", "notes": "attack succeeds - expect leak"}
{"test_id": "q002", "label": "negative", "notes": "security holds - expect pass"}
```

Fields:

- `test_id`: Matches the query ID or test case ID from the pack
- `label`: Either `"positive"` (attack succeeds) or `"negative"` (security holds)
- `notes`: Optional description
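A minimal sketch of loading and validating such a labels file. The `load_labels` helper below is illustrative only, not part of the tool's API:

```python
import json

def load_labels(path: str) -> dict:
    """Parse a labels.jsonl file into {test_id: label}, rejecting bad labels."""
    labels = {}
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            if rec["label"] not in ("positive", "negative"):
                raise ValueError(f"line {line_no}: bad label {rec['label']!r}")
            labels[rec["test_id"]] = rec["label"]
    return labels
```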
```shell
# Calibrate relevance-hijack pack with default 1% FPR target
ragleaklab calibrate --pack relevance-hijack --out calibration_out/

# Calibrate with custom FPR target
ragleaklab calibrate --pack sentinel-takeover-safe --out out/ --target-fpr 0.05

# Use custom labels file
ragleaklab calibrate --pack my-pack --out out/ --labels my_labels.jsonl
```

| Option | Description | Default |
|---|---|---|
| `--pack`, `-p` | Pack to calibrate (required) | - |
| `--out`, `-o` | Output directory for calibration report | - |
| `--labels`, `-l` | Path to labels.jsonl | `data/calibration/<pack>/labels.jsonl` |
| `--target-fpr` | Target false positive rate | `0.01` (1%) |
| `--write-thresholds` | Update pack manifest (not yet implemented) | `false` |
The calibration harness automatically maps pack types to their primary metric:
| Pack Type | Metric | Higher is Worse |
|---|---|---|
| retrieval | `poison_rate_at_k` | Yes |
| sentinel | `leak_rate` | Yes |
| claim | `poison_claim_rate` | Yes |
| Other | `verbatim_score` | Yes |
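The mapping above can be sketched as a simple lookup. This is a hypothetical illustration of the behavior described, not the tool's actual implementation:

```python
# Pack type -> primary metric, per the table above; anything else
# falls back to verbatim_score. All metrics are higher-is-worse.
PRIMARY_METRICS = {
    "retrieval": "poison_rate_at_k",
    "sentinel": "leak_rate",
    "claim": "poison_claim_rate",
}

def primary_metric(pack_type: str) -> str:
    """Return the primary metric name for a pack type."""
    return PRIMARY_METRICS.get(pack_type, "verbatim_score")
```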
The output `calibration_report.json` contains:

```json
{
  "pack_name": "relevance-hijack",
  "metric_name": "poison_rate_at_k",
  "target_fpr": 0.01,
  "result": {
    "threshold": 0.150000,
    "achieved_fpr": 0.0,
    "achieved_tpr": 0.8,
    "n_positive": 10,
    "n_negative": 10,
    "decision_rule": "score >= threshold -> FAIL"
  },
  "roc_table": [
    {"threshold": 0.9, "fpr": 0.0, "tpr": 0.1},
    {"threshold": 0.8, "fpr": 0.0, "tpr": 0.2},
    ...
  ],
  "generated_at": "2024-01-15T10:30:00Z"
}
```

The threshold fitting algorithm:
- Separates scores into positive (attacks) and negative (benign) sets
- Tries all unique score values as candidate thresholds
- For each threshold, computes FPR and TPR
- Selects the threshold with highest TPR where FPR ≤ target
- Uses deterministic tie-breaking (higher threshold wins)
- True Positive (TP): Attack correctly detected
- False Positive (FP): Benign case wrongly flagged as attack
- FPR = FP / (total negatives)
- TPR = TP / (total positives)
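The fitting procedure above can be sketched as follows. This is an illustrative reimplementation, assuming per-case scores have already been split by label; it is not the tool's source code:

```python
def fit_threshold(pos_scores, neg_scores, target_fpr=0.01):
    """Pick the threshold with the highest TPR subject to FPR <= target_fpr.

    Decision rule: score >= threshold -> FAIL. Ties on TPR are broken
    deterministically by taking the higher threshold.
    """
    best = None  # (tpr, threshold, fpr); tuple order encodes tie-breaking
    # Try every unique observed score as a candidate threshold
    for t in sorted(set(pos_scores) | set(neg_scores)):
        fp = sum(s >= t for s in neg_scores)  # benign cases wrongly flagged
        tp = sum(s >= t for s in pos_scores)  # attacks correctly detected
        fpr = fp / len(neg_scores)
        tpr = tp / len(pos_scores)
        if fpr <= target_fpr:
            cand = (tpr, t, fpr)
            if best is None or cand > best:  # higher TPR wins, then higher threshold
                best = cand
    if best is None:
        return None  # no threshold meets the FPR target
    tpr, threshold, fpr = best
    return {"threshold": threshold, "achieved_fpr": fpr, "achieved_tpr": tpr}
```

Because candidates are only the observed score values and ties break toward the higher threshold, repeated runs on the same scores always produce the same result.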
Pre-created labeled datasets are available at:

- `data/calibration/relevance_hijack/labels.jsonl`
- `data/calibration/sentinel_takeover_safe/labels.jsonl`
- `data/calibration/claim_corruption/labels.jsonl`
When updating thresholds in production:

- Run calibration with your labeled test set
- Review the calibration report, especially `achieved_fpr` and `achieved_tpr`
- Manually update the pack manifest's `thresholds` section
- Run the full test suite to verify no regressions
- Commit with a clear message explaining the threshold change

Warning: The `--write-thresholds` flag is not yet implemented, to prevent accidental overwrites. Always review calibration results before updating thresholds.
The calibration command is deterministic and suitable for CI:

```yaml
- name: Calibrate pack thresholds
  run: ragleaklab calibrate --pack relevance-hijack --out calibration/

- name: Verify threshold meets target
  run: |
    FPR=$(jq '.result.achieved_fpr' calibration/calibration_report.json)
    if (( $(echo "$FPR > 0.01" | bc -l) )); then
      echo "FPR $FPR exceeds target 0.01"
      exit 1
    fi
```