Skip to content

Explore: safety against extraction — people probing the model to scrape personal info #44

Description

@NotYuSheng

The main risk with a model that imitates a real person isn't just bad generations — it's that someone can talk to it to extract things. An adversary (or a curious friend) can interrogate the deployed doppelganger to surface personal details, private facts, contacts, or sensitive data it absorbed during training, even when the original chats were redacted. Conversational extraction sidesteps ingest-time redaction entirely.

Directions: output-side PII-leak detection on generations (reuse ingest/redactor.py) as a last line of defense; refusal of extraction-style probing — requests for the subject's identifiers, whereabouts, relationships, credentials, or "what do you know about X"; jailbreak / prompt-injection resistance around those refusals; optionally rate/anomaly signals on suspicious questioning. Complements ingest-time redaction rather than replacing it.

Part of the exploratory roadmap.

Metadata

Metadata

Assignees

No one assigned

    Labels

    explorationExploratory research direction

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions