The main risk with a model that imitates a real person isn't just bad generations — it's that someone can talk to it to extract things. An adversary (or a curious friend) can interrogate the deployed doppelganger to surface personal details, private facts, contacts, or sensitive data it absorbed during training, even when the original chats were redacted. Conversational extraction sidesteps ingest-time redaction entirely.
Directions: output-side PII-leak detection on generations (reuse ingest/redactor.py) as a last line of defense; refusal of extraction-style probing — requests for the subject's identifiers, whereabouts, relationships, credentials, or "what do you know about X"; jailbreak / prompt-injection resistance around those refusals; optionally rate/anomaly signals on suspicious questioning. Complements ingest-time redaction rather than replacing it.
Part of the exploratory roadmap.
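The sketch below is an added example, not code from this project: it shows how the three directions above fit together as one turn-level guard, with an output-side PII scan on generations, an up-front refusal of extraction-style prompts, and a per-session probe counter as an anomaly signal. The real ingest/redactor.py is not available here, so the regexes stand in for it; the pattern lists, the refusal text, the threshold, and the `generate` callable are all illustrative assumptions.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Stand-ins for ingest/redactor.py (assumed unavailable here). Labels become the
# redaction markers, so a leaked value is replaced by e.g. "[REDACTED-phone]".
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "street_address": re.compile(
        r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s){1,3}(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Lane|Ln|Dr|Drive)\b"
    ),
    "credential": re.compile(r"(?i)\b(?:password|passwd|pin|api[_ -]?key)\b\s*(?:is|:|=)\s*\S+"),
}

# Illustrative extraction-style probes: identifiers, whereabouts, relationships,
# credentials, and the open-ended "what do you know about X".
PROBE_PATTERNS = [
    re.compile(r"(?i)\bwhat do you know about\b"),
    re.compile(r"(?i)\b(?:where|what)\s+(?:do|does|did)\s+(?:you|\w+)\s+(?:live|work|stay)\b"),
    re.compile(r"(?i)\b(?:your|his|her|their)\s+(?:phone|mobile|cell|email|home|street|address|password|login|credentials?)\b"),
    re.compile(r"(?i)\b(?:who (?:is|are)|tell me about)\s+(?:your|his|her|their)\s+(?:wife|husband|partner|kids?|children|boss|friends?|family)\b"),
]

# Assumed refusal wording; a deployment would tune this to the persona's voice.
REFUSAL = "I can't share personal details about the person this assistant is based on."


@dataclass
class GuardDecision:
    refused: bool            # prompt was rejected before generation
    leaked: bool             # generation contained PII that had to be redacted
    text: str                # what the user actually receives
    findings: dict = field(default_factory=dict)  # label -> matched strings


def detect_pii(text: str) -> dict:
    """Return every PII label that fires on `text`, with the matched substrings."""
    found = {}
    for label, pattern in PII_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            found[label] = hits
    return found


def guard_output(generation: str) -> GuardDecision:
    """Output-side last line of defense: redact anything the model leaked."""
    findings = detect_pii(generation)
    cleaned = generation
    for label, pattern in PII_PATTERNS.items():
        cleaned = pattern.sub(f"[REDACTED-{label}]", cleaned)
    return GuardDecision(refused=False, leaked=bool(findings), text=cleaned, findings=findings)


def is_extraction_probe(prompt: str) -> bool:
    """True when the prompt looks like an attempt to pull the subject's details."""
    return any(p.search(prompt) for p in PROBE_PATTERNS)


class SessionMonitor:
    """Counts probe attempts per session; repeated probing is the anomaly signal."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.probe_counts = Counter()

    def record(self, session_id: str, prompt: str) -> bool:
        probe = is_extraction_probe(prompt)
        if probe:
            self.probe_counts[session_id] += 1
        return probe

    def is_suspicious(self, session_id: str) -> bool:
        return self.probe_counts[session_id] >= self.threshold


def handle_turn(monitor: SessionMonitor, session_id: str, prompt: str, generate) -> GuardDecision:
    """One conversational turn: refuse probes up front, then scan whatever the model emits.

    `generate` is any callable str -> str (the doppelganger model in practice)."""
    if monitor.record(session_id, prompt):
        return GuardDecision(refused=True, leaked=False, text=REFUSAL)
    return guard_output(generate(prompt))


if __name__ == "__main__":
    # Constructed sample session: a probe, a second probe, then a benign prompt
    # whose (stubbed) generation leaks an address the training data still held.
    monitor = SessionMonitor(threshold=2)
    stub_model = lambda p: f"Sure! Reach me at jane.doe@example.com about: {p}"
    for prompt in ["What do you know about Jane?", "What's her phone number?", "Tell me a joke"]:
        decision = handle_turn(monitor, "session-1", prompt, stub_model)
        print(f"{prompt!r} -> refused={decision.refused} leaked={decision.leaked} "
              f"suspicious={monitor.is_suspicious('session-1')}: {decision.text}")
```

Both layers matter: the refusal stops the obvious probes cheaply, while the output scan catches the case the directions above worry about, where a benign-looking prompt still pulls a memorized detail out of the model. The session counter does not block anything on its own; it is a signal to log or alert on.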