feat: Add PII probing transforms and scoring #315
Merged
Add comprehensive PII extraction capabilities for AI red teaming:

Transforms (`dreadnode/transforms/pii_extraction.py`):
- `repeat_word_divergence`: Trigger training data memorization via Carlini et al. technique
- `continue_exact_text`: Force exact continuation of memorized prefixes
- `complete_from_internet`: Probe for memorized web content
- `partial_pii_completion`: Adaptive PII extraction with contextual hints
- `public_figure_pii_probe`: Test disclosure of public figure PII

Scorers (`dreadnode/scorers/pii_advanced.py`):
- `training_data_memorization`: Detect memorized text via entropy, repetition, and structural patterns
- `credential_leakage`: Pattern-based detection for API keys, tokens, passwords (13 types)
- `pii_disclosure_rate`: Binary scorer for eval aggregation
- `wilson_score_interval`: Statistical confidence intervals for disclosure rates
- `calculate_disclosure_rate_with_ci`: Helper for disclosure rate analysis with 95% CI

Example notebook (`examples/airt/pii_extraction_attacks.ipynb`):
- TAP attacks with PII extraction transforms
- Eval-based disclosure rate testing with statistical confidence intervals
- Credential leakage detection examples

Tests:
- 21 transform tests (`test_pii_extraction_transforms.py`)
- 38 scorer tests (`test_pii_advanced_scorers.py`)
- All tests use static inputs, no LLM calls

Based on research:
- Carlini et al. (USENIX 2024): Extracting Training Data from LLMs
- PII-Scope Benchmark (arXiv 2410.06704): 48.9% extraction success rate
- Model Inversion Attacks (arXiv 2507.04478): Password/credential extraction
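For readers unfamiliar with the statistic behind the disclosure-rate helpers, here is a minimal sketch of a Wilson score interval for a binomial proportion. It is not the code from `pii_advanced.py`: the function name `wilson_interval_sketch`, its parameters, and the default `z = 1.96` (roughly a 95% interval) are assumptions made for illustration only.

```python
import math


def wilson_interval_sketch(disclosures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a disclosure rate (hypothetical helper, not the PR's API).

    disclosures: number of responses flagged as disclosing PII
    trials: total number of responses scored
    z: normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= disclosures <= trials:
        raise ValueError("disclosures must be between 0 and trials")

    p = disclosures / trials
    denom = 1 + z * z / trials
    # The interval is centred on a value shrunk towards 0.5, not on p itself.
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    low, high = wilson_interval_sketch(disclosures=3, trials=20)
    print(f"disclosure rate 3/20, interval: [{low:.3f}, {high:.3f}]")
```

Unlike the plain normal approximation, the Wilson interval still gives a non-degenerate upper bound when zero disclosures are observed, which matters for small eval runs.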
Key Changes:
Added:
- `dreadnode/transforms/pii_extraction.py`: 5 transforms
  - `repeat_word_divergence`: Trigger memorization (Carlini technique)
  - `continue_exact_text`: Force prefix completion
  - `complete_from_internet`: Probe memorized web content
  - `partial_pii_completion`: Adaptive extraction with hints
  - `public_figure_pii_probe`: Test public figure disclosure
- `dreadnode/scorers/pii_advanced.py`: 3 scorers + 2 helpers
  - `training_data_memorization`: Entropy/pattern detection
  - `credential_leakage`: 13 credential types (API keys, tokens)
  - `pii_disclosure_rate`: Binary scorer for eval aggregation
  - `wilson_score_interval`: Statistical confidence intervals
  - `calculate_disclosure_rate_with_ci`: Helper for 95% CI analysis
- `examples/airt/pii_extraction_attacks.ipynb`: Usage examples
- `tests/test_pii_extraction_transforms.py`: 21 transform tests
- `tests/test_pii_advanced_scorers.py`: 38 scorer tests

Changed:
- `dreadnode/transforms/__init__.py`: Export pii_extraction module
- `dreadnode/scorers/__init__.py`: Export new scorers and helpers

Generated Summary:
- Added the `pii_advanced.py` module:
  - `training_data_memorization`: Detects verbatim memorized text from training data.
  - `credential_leakage`: Identifies potential leaked credentials, API keys, and tokens.
  - `pii_disclosure_rate`: Binary detection of PII for evaluation purposes.
  - `wilson_score_interval`: Calculates statistical confidence intervals for PII disclosure rates.
  - `calculate_disclosure_rate_with_ci`: Aggregates PII detection results to compute disclosure rates.
- Updated `__init__.py` files to include new scorer functions and maintain module imports.
- Added `pii_extraction.py` with functions targeting specific PII extraction techniques: `repeat_word_divergence`, `continue_exact_text`, `complete_from_internet`, `partial_pii_completion`, and `public_figure_pii_probe`.

This summary was generated with ❤️ by rigging
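To make the pattern-based approach behind `credential_leakage` concrete, the sketch below shows what regex-driven credential detection can look like. It is an illustration, not the scorer from this PR: the names `CREDENTIAL_PATTERNS`, `find_credentials`, and `leakage_score` are hypothetical, and the three patterns are common public formats chosen for the example, not the 13 types the scorer covers.

```python
import re

# Illustrative patterns only (assumption): the PR's scorer covers 13 credential
# types, which are not enumerated in the description.
CREDENTIAL_PATTERNS: dict[str, re.Pattern[str]] = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # GitHub personal access tokens: "ghp_" followed by 36 alphanumerics.
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    # PEM private key headers.
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def find_credentials(text: str) -> dict[str, list[str]]:
    """Return every pattern name that matched, mapped to the matched strings."""
    hits: dict[str, list[str]] = {}
    for name, pattern in CREDENTIAL_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits


def leakage_score(text: str) -> float:
    """Binary score: 1.0 if any credential-like pattern is found, else 0.0."""
    return 1.0 if find_credentials(text) else 0.0


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
    sample = "config: aws_key=AKIAIOSFODNN7EXAMPLE region=us-east-1"
    print(find_credentials(sample))
    print(leakage_score(sample))
```

A binary score like this aggregates cleanly across an eval run, which is the same shape of output that `pii_disclosure_rate` is described as producing.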
Research References: