RAGLeakLab includes tools for generating synthetic test data with deterministic, seeded random values.

```bash
# Generate a synthetic corpus with 50 documents and 3 claims each
uv run python -m ragleaklab assets build \
  --out data/generated/test_v1 \
  --seed 1337 \
  --docs 50 \
  --claims-per-doc 3
```

| Option | Default | Description |
|---|---|---|
| `--out`, `-o` | (required) | Output directory |
| `--seed`, `-s` | 42 | Random seed for determinism |
| `--docs`, `-d` | 10 | Number of documents |
| `--claims-per-doc`, `-c` | 3 | Claims per document |
| `--no-pii` | false | Exclude PII-type claims (EMAIL, PHONE) |

The generator creates:

```
data/generated/test_v1/
├── doc_0000.txt      # Document with embedded claims
├── doc_0001.txt
├── ...
├── claims.jsonl      # All claims in JSONL format
└── manifest.json     # Generation parameters and hash
```

| Type | Example |
|---|---|
| EMAIL | "Contact email is john.doe@example.com" |
| PHONE | "Phone number is 555-123-4567" |
| ACCOUNT_ID | "Account ID is ACC-847291" |
| SECRET_CODEWORD | "Secret codeword is ALPHA-BRAVO-CHARLIE" |
| INTERNAL_PROJECT | "Project codename is Phoenix-Dragon" |

Same seed always produces identical output:

```bash
# These two runs produce identical claims.jsonl
ragleaklab assets build --out run1 --seed 1337
ragleaklab assets build --out run2 --seed 1337

# Verify
diff run1/claims.jsonl run2/claims.jsonl  # No differences
```

The `manifest.json` records generation parameters:

```json
{
  "generated_at": "2024-01-15T10:30:00",
  "seed": 1337,
  "n_docs": 50,
  "claims_per_doc": 3,
  "include_pii": true,
  "total_claims": 150,
  "corpus_hash": "a1b2c3d4e5f6..."
}
```

Use `corpus_hash` to verify reproducibility across runs.

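
Comparing whole manifests is likely to report a spurious difference, since `generated_at` presumably changes from run to run; comparing `corpus_hash` alone avoids that. The sketch below assumes the `manifest.json` fields shown above and that `claims.jsonl` holds one claim per non-empty line. The helper names are illustrative and not part of RAGLeakLab.

```python
import json
from pathlib import Path


def load_manifest(run_dir: Path) -> dict:
    """Read manifest.json from a generated corpus directory."""
    return json.loads((run_dir / "manifest.json").read_text(encoding="utf-8"))


def count_claims(run_dir: Path) -> int:
    """Count non-empty lines in claims.jsonl (assumed: one claim per line)."""
    with (run_dir / "claims.jsonl").open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def same_corpus(run_a: Path, run_b: Path) -> bool:
    """Compare only corpus_hash, ignoring fields such as generated_at."""
    return load_manifest(run_a)["corpus_hash"] == load_manifest(run_b)["corpus_hash"]


if __name__ == "__main__":
    run1, run2 = Path("run1"), Path("run2")
    if run1.is_dir() and run2.is_dir():
        print("identical corpus:", same_corpus(run1, run2))
        print("claims on disk:", count_claims(run1))
        print("claims in manifest:", load_manifest(run1)["total_claims"])
```

A check like this fits naturally into a CI step that regenerates the corpus and fails when the hash drifts.
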
- CI Testing: Generate consistent test data for regression checks
- Benchmarking: Compare RAG pipelines with identical inputs
- Development: Quick setup of test corpora without real data
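
The consistency these use cases depend on comes from seeding. As a rough illustration, the sketch below draws claim values from a private `random.Random(seed)` instance so that nothing else in the process can disturb the sequence. The function name, word list, and value ranges are hypothetical and do not reflect RAGLeakLab's actual generator; only the claim wording follows the examples in the claim types table.

```python
import random

# Hypothetical word list; the real generator's vocabulary is not documented here.
CODEWORDS = ["ALPHA", "BRAVO", "CHARLIE", "DELTA", "ECHO"]


def make_claims(seed: int, count: int) -> list[tuple[str, str]]:
    """Return (claim_type, text) pairs drawn from a private, seeded RNG."""
    rng = random.Random(seed)  # isolated from the global random state
    claims = []
    for _ in range(count):
        if rng.random() < 0.5:
            number = rng.randrange(100000, 1000000)  # six-digit account number
            claims.append(("ACCOUNT_ID", f"Account ID is ACC-{number}"))
        else:
            words = "-".join(rng.sample(CODEWORDS, 3))
            claims.append(("SECRET_CODEWORD", f"Secret codeword is {words}"))
    return claims


# The same seed reproduces the same claims in the same order.
assert make_claims(1337, 5) == make_claims(1337, 5)
```

Using a dedicated `Random` instance rather than `random.seed()` keeps the output stable even if other code in the same process also draws random numbers.
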
RAGLeakLab uses YAML manifests to version and validate assets. Each asset type has a corresponding manifest format.
Corpus manifests are located in each corpus directory (e.g., `data/corpus_public/corpus.yaml`):

```yaml
name: corpus_public        # Unique identifier
version: "1.0.0"           # Semantic version
seed: 1337                 # Optional: random seed if generated
doc_count: 50              # Number of documents
claims_supported:          # Claim types present (empty if none)
  - EMAIL
  - PHONE
  - ACCOUNT_ID
hash: "a1b2c3d4..."        # SHA-256 of file tree
```

Attack manifests are located in attack directories (e.g., `data/attacks/attacks.yaml`):

```yaml
name: attacks_default      # Unique identifier
version: "1.0.0"           # Semantic version
threat_coverage:           # Threat types covered
  - canary
  - verbatim
case_count: 20             # Total test cases
hash: "e5f6a7b8..."        # SHA-256 of file tree
```

Pack manifests are located alongside pack files (e.g., `src/ragleaklab/packs/v1/canary-basic.pack.yaml`):

```yaml
name: canary-basic                   # Pack name
version: "1.0.0"                     # Semantic version
corpus_ref: null                     # Optional corpus reference (name@version)
attacks_ref: "canary-basic@1.0.0"    # Required attacks reference
thresholds_ref: null                 # Optional thresholds config
expected_report_fields:              # Required fields in report
  - aggregates.canary
```

- Semantic Versioning: All assets use `MAJOR.MINOR.PATCH` format
- Hash Integrity: SHA-256 hash computed over file tree (excluding manifests, `out/`, `__pycache__/`)
- Immutability: Once published, a version's hash must not change
- Backward Compatibility: MINOR/PATCH updates must not break existing tests

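
The versioning rules above can be sketched as a small check. This is a minimal illustration, not RAGLeakLab's own validation: the function names are hypothetical, and treating a MAJOR bump as the only place where breaking changes may appear is the usual semantic versioning reading, assumed here.

```python
import re

# Strict MAJOR.MINOR.PATCH with no leading zeros and no pre-release suffix.
SEMVER_RE = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


def parse_version(version: str) -> tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into integers, or raise ValueError."""
    match = SEMVER_RE.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    return major, minor, patch


def is_non_breaking_update(old: str, new: str) -> bool:
    """True when `new` is a later release with the same MAJOR component."""
    old_v, new_v = parse_version(old), parse_version(new)
    return new_v[0] == old_v[0] and new_v > old_v


print(is_non_breaking_update("1.0.0", "1.2.0"))  # True
print(is_non_breaking_update("1.4.2", "2.0.0"))  # False
```

Comparing the parsed tuples rather than the raw strings avoids ordering mistakes such as `"1.10.0" < "1.9.0"`.
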
Use the assets module to validate manifests:

```python
from pathlib import Path

from ragleaklab.assets import load_corpus_manifest, validate_hash

manifest = load_corpus_manifest(Path("data/corpus_public/corpus.yaml"))
assert validate_hash(manifest, Path("data/corpus_public"))
```

Hashes are computed deterministically:

- Walk directory recursively
- Exclude: `__pycache__/`, `*.pyc`, `.git/`, `out/`, manifest files
- Sort files by relative path
- For each file: hash(relative_path + contents)
- Combine into final SHA-256

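
The steps above can be sketched with the standard library as follows. This is an illustration of the procedure only: the exact byte layout, the way per-file hashes are combined, and the set of manifest file names are assumptions, so the result will not necessarily match the hash the library computes.

```python
import hashlib
from pathlib import Path

EXCLUDED_DIRS = {"__pycache__", ".git", "out"}
# Assumed manifest names, taken from the example paths in this document.
MANIFEST_NAMES = {"corpus.yaml", "attacks.yaml"}
MANIFEST_SUFFIX = ".pack.yaml"


def is_excluded(rel: Path) -> bool:
    """Apply the exclusion rules to a path relative to the tree root."""
    if any(part in EXCLUDED_DIRS for part in rel.parts[:-1]):
        return True
    name = rel.name
    return name.endswith(".pyc") or name in MANIFEST_NAMES or name.endswith(MANIFEST_SUFFIX)


def sketch_tree_hash(root: Path) -> str:
    """Hash each kept file as (relative path + contents), in sorted path order."""
    rel_paths = sorted(
        (p.relative_to(root) for p in root.rglob("*") if p.is_file()),
        key=lambda rel: rel.as_posix(),  # POSIX form keeps ordering OS-independent
    )
    combined = hashlib.sha256()
    for rel in rel_paths:
        if is_excluded(rel):
            continue
        file_hash = hashlib.sha256()
        file_hash.update(rel.as_posix().encode("utf-8"))
        file_hash.update((root / rel).read_bytes())
        combined.update(file_hash.digest())  # assumed combination step
    return combined.hexdigest()
```

Sorting before hashing is what makes the result independent of filesystem enumeration order, and excluding manifests lets a manifest store the hash of its own tree without changing it.
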
To compute a tree hash directly, call `compute_tree_hash`:

```python
from pathlib import Path

from ragleaklab.assets.hash import compute_tree_hash

h = compute_tree_hash(Path("data/corpus_public"))
print(h)  # 64-character hex string
```

Validate all manifests in a directory:

```bash
# Validate all assets from project root
ragleaklab assets validate --path .

# Validate specific directory
ragleaklab assets validate --path data/

# Strict mode (treat warnings as errors)
ragleaklab assets validate --path . --strict
```

This checks:

- Schema validity against Pydantic models
- Hash integrity (computed vs. stored)
- Reference resolution in pack manifests
- Report field validity
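
Of these checks, reference resolution is the easiest to picture in code. The sketch below assumes that `corpus_ref` and `attacks_ref` use the `name@version` form from the pack manifest example and that a `null` optional reference is simply skipped. The function names and the shape of the `known` index are hypothetical, not the validator's actual interface.

```python
def parse_ref(ref: str) -> tuple[str, str]:
    """Split 'name@version' into its two parts, or raise ValueError."""
    name, sep, version = ref.rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected name@version, got {ref!r}")
    return name, version


def unresolved_refs(pack: dict, known: set[tuple[str, str]]) -> list[str]:
    """Return human-readable problems for references that do not resolve."""
    problems = []
    for field, required in (("corpus_ref", False), ("attacks_ref", True)):
        ref = pack.get(field)
        if ref is None:
            if required:
                problems.append(f"{field} is required")
            continue
        if parse_ref(ref) not in known:
            problems.append(f"{field}: {ref} not found")
    return problems


pack = {
    "name": "canary-basic",
    "version": "1.0.0",
    "corpus_ref": None,
    "attacks_ref": "canary-basic@1.0.0",
}
print(unresolved_refs(pack, {("canary-basic", "1.0.0")}))  # []
print(unresolved_refs(pack, set()))  # ['attacks_ref: canary-basic@1.0.0 not found']
```
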
Analyze attack coverage to find gaps in the test suite:

```bash
ragleaklab attacks coverage --attacks data/attacks/ --out coverage.json
```

Console output shows:

- Case counts per threat type
- Case counts per strategy
- Matrix of threat × strategy combinations
- Missing expected combinations (if manifest defines expectations)

The JSON report written via `--out` looks like this:

```json
{
  "total_cases": 30,
  "threats": {"canary": 15, "verbatim": 10, "membership": 5},
  "strategies": {"direct_ask": 12, "roleplay": 8, "indirect_ask": 10},
  "matrix": {
    "canary": {"direct_ask": 5, "roleplay": 5, "indirect_ask": 5},
    "verbatim": {"direct_ask": 4, "roleplay": 3, "indirect_ask": 3},
    "membership": {"direct_ask": 3, "indirect_ask": 2}
  },
  "tags": {"regression": 10, "priority": 5},
  "missing_combos": [
    {"threat": "membership", "strategy": "roleplay"}
  ]
}
```

Add `strategy_coverage` to the `attacks.yaml` manifest:

```yaml
name: attacks_default
version: "1.0.0"
threat_coverage:
  - canary
  - verbatim
strategy_coverage:
  - direct_ask
  - roleplay
  - indirect_ask
```

Missing combos are detected only when both `threat_coverage` and `strategy_coverage` are defined.
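
One way to picture the detection is as the cross product of expected threats and strategies minus the combinations that have at least one case. The sketch below assumes each attack case carries `threat` and `strategy` keys, mirroring the keys in `missing_combos`; the function name and case layout are illustrative rather than the tool's actual implementation.

```python
from collections import Counter
from itertools import product


def coverage_report(cases, expected_threats, expected_strategies):
    """Build per-threat, per-strategy, and matrix counts plus missing combos."""
    matrix = {}
    for case in cases:
        row = matrix.setdefault(case["threat"], {})
        row[case["strategy"]] = row.get(case["strategy"], 0) + 1
    # Every expected (threat, strategy) pair with no matching case is a gap.
    missing = [
        {"threat": threat, "strategy": strategy}
        for threat, strategy in product(expected_threats, expected_strategies)
        if matrix.get(threat, {}).get(strategy, 0) == 0
    ]
    return {
        "total_cases": len(cases),
        "threats": dict(Counter(case["threat"] for case in cases)),
        "strategies": dict(Counter(case["strategy"] for case in cases)),
        "matrix": matrix,
        "missing_combos": missing,
    }


cases = [
    {"threat": "canary", "strategy": "direct_ask"},
    {"threat": "canary", "strategy": "roleplay"},
    {"threat": "verbatim", "strategy": "direct_ask"},
]
report = coverage_report(cases, ["canary", "verbatim"], ["direct_ask", "roleplay"])
print(report["missing_combos"])  # [{'threat': 'verbatim', 'strategy': 'roleplay'}]
```

If either expectation list is empty, the cross product is empty as well, which matches the rule that gaps are reported only when both lists are defined.
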