Explore: safety against extraction (people probing the model to scrape personal info) #44

The main risk with a model that imitates a real person isn't just bad generations; it's that someone can *talk to it to extract things*. An adversary (or a curious friend) can interrogate the deployed doppelganger to surface personal details, private facts, contacts, or other sensitive data it absorbed during training, even when the original chats were redacted. Conversational extraction happens at inference time, so it sidesteps ingest-time redaction entirely.

Directions:

- Output-side PII-leak detection on generations (reuse `ingest/redactor.py`) as a last line of defense.
- Refusal of extraction-style probing: requests for the subject's identifiers, whereabouts, relationships, or credentials, or open-ended "what do you know about X" questions.
- Jailbreak / prompt-injection resistance around those refusals.
- Optionally, rate/anomaly signals on suspicious questioning.

All of this complements ingest-time redaction rather than replacing it.
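
As a rough illustration of the first two directions above, here is a minimal sketch of a wrapper that refuses obvious extraction probes and scrubs PII from generations before they are returned. Everything in it is an assumption made for illustration: the names `guard`, `looks_like_probe`, and `scrub_output`, the regexes, and the refusal text are hypothetical, and the patterns are stand-ins for whatever detectors `ingest/redactor.py` actually provides (its API is not shown in this issue). Keyword matching like this is easy to evade, so it would be a starting point rather than the jailbreak-resistant layer.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in patterns. A real version would presumably reuse the
# detectors in ingest/redactor.py instead of redefining them here.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

# Hypothetical cues for extraction-style probing (illustrative, not exhaustive).
PROBE_CUES = re.compile(
    r"\b(home address|phone number|password|where (do|does) \w+ live|"
    r"what do you know about)\b",
    re.IGNORECASE,
)

REFUSAL = "Sorry, I can't share personal details."


@dataclass
class GuardResult:
    text: str          # what is actually returned to the user
    refused: bool      # True if the prompt was treated as a probe
    leaks: list[str]   # labels of PII types scrubbed from the generation


def looks_like_probe(prompt: str) -> bool:
    """Return True if the prompt matches any extraction-style cue."""
    return PROBE_CUES.search(prompt) is not None


def scrub_output(generation: str) -> tuple[str, list[str]]:
    """Replace matched PII with placeholders and report which types were found."""
    leaks = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(generation):
            leaks.append(label)
            generation = pattern.sub(f"[{label} removed]", generation)
    return generation, leaks


def guard(prompt: str, generate) -> GuardResult:
    """Refuse probes up front; otherwise generate and scrub the output."""
    if looks_like_probe(prompt):
        return GuardResult(REFUSAL, refused=True, leaks=[])
    text, leaks = scrub_output(generate(prompt))
    return GuardResult(text, refused=False, leaks=leaks)
```

Here `generate` is any callable that maps a prompt to the model's raw text, so the wrapper can be exercised with a stub. The `refused` and `leaks` fields could also feed the optional rate/anomaly signals, for example by counting refusals or scrubbed leaks per session.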

Part of the exploratory roadmap.
