1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
/venv
*.json
*.jsonl
*.csv
/saves
/LLaMA-Factory
/merged
4 changes: 4 additions & 0 deletions README.md
@@ -82,6 +82,8 @@ Doppelganger/
└── result.json ← place your export here
```

> **Note:** Telegram exports unzip to a dated folder like `DataExport_2025-07-09/result.json`. Move (or copy) that `result.json` to `data/result.json` — `setup.sh` looks for it there. Alternatively, point `python -m ingest` at the file directly with `--input path/to/result.json`.

**2. Clone and run setup**

The setup scripts create a virtual environment, install pinned dependencies (LLaMA-Factory **0.9.4**), create your `.env`, and process the export into `data/chat_sharegpt.json`.
@@ -123,6 +125,8 @@ source venv/bin/activate
llamafactory-cli train configs/train_lora.yaml
```

> **Note:** The tracked `train_lora.yaml` defaults to the small **Qwen1.5-1.8B-Chat** so this step runs fast as an end-to-end smoke test. It's a real model but too small to convincingly mimic your writing — for real results, switch to a larger base (e.g. Qwen2.5-14B-Instruct) via a local override. See [Fine-Tune Your Model](#fine-tune-your-model-lora).

## Usage

`python -m ingest` turns a raw export into a training-ready dataset. Useful flags:
187 changes: 187 additions & 0 deletions docs/data-pipeline.md
@@ -0,0 +1,187 @@
# The ingestion pipeline (source-agnostic)

This document describes every transformation Doppelganger applies to turn a raw
chat export into a training-ready dataset. The pipeline is **source-agnostic**:
a per-platform *adapter* normalizes the export into a common message stream, and
**every stage after that is shared across all sources**.

The only source-specific step is stage 1 (the adapter). For how a particular
platform's export is parsed, see the per-source docs:

- [Telegram](sources/telegram.md) — supported today
- WhatsApp, Discord, … — planned; each drops in under [`docs/sources/`](sources/)

Entry point: [`python -m ingest`](../ingest/__main__.py) →
[`ingest/cli.py:main`](../ingest/cli.py). End-to-end flow:

```
export (platform-specific)
   │  (1) ADAPTER PARSE          ingest/adapters/<source>.py   ← source-specific
NormalizedMessage stream         ← common interface; everything below is shared
   │  (2) BUILD SAMPLES          ingest/core.py
   │        a. split by silence gap
   │        b. stitch reply-linked splits
   │        c. assemble + merge turns
role/text conversation samples
   │  (3) SENSITIVE-DATA SCAN    ingest/redactor.py (+ optional LLM)
   │  (4) REDACTION APPLY        off | replace | drop
   │  (5) LLM QUALITY AUDIT      ingest/validator.py (optional)
   │  (6) SHAREGPT FORMAT        ingest/sharegpt.py
data/chat_sharegpt.json          → LLaMA-Factory SFT
```

---

## 1. Adapter parse — `ingest/adapters/<source>.py`

Each platform has one adapter that reads its native export and emits a common
**`NormalizedMessage`** stream ([`ingest/message.py`](../ingest/message.py)),
decoupling every downstream stage from any platform's specific schema. Whatever
the source, the adapter is responsible for:

- **identifying which sender is you** (tagging each message `sender_is_self`),
- **filtering non-messages** (system/service events, empty entries),
- **producing plain text** for each message, and
- **preserving reply + sender metadata** (`reply_to_id`, `sender_id`) for the
sessionizing and group-chat stages below.

Output fields: `chat_id, timestamp, sender_id, sender_is_self, text,
message_id, reply_to_id`.

> Adding a new platform means writing **only** this adapter so it emits the same
> `NormalizedMessage` stream — stages 2–6 are unchanged. Document it under
> [`docs/sources/`](sources/). Telegram's specifics live in
> [sources/telegram.md](sources/telegram.md).
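
As a rough illustration of the adapter contract, the sketch below parses an invented export format into the common stream. The field names come from the list above; the field types, the `parse_example_export` function, and the input keys (`kind`, `body`, `from`, ...) are all hypothetical and do not describe the real `ingest/message.py` or any real platform.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class NormalizedMessage:
    # Field names follow the docs; the types are assumptions of this sketch.
    chat_id: str
    timestamp: float
    sender_id: str
    sender_is_self: bool
    text: str
    message_id: Optional[str] = None
    reply_to_id: Optional[str] = None


def parse_example_export(
    records: Iterable[dict], self_name: str
) -> Iterator[NormalizedMessage]:
    """Hypothetical adapter for an invented export format (a list of dicts)."""
    for rec in records:
        text = (rec.get("body") or "").strip()
        # Filter non-messages: system events and empty entries.
        if rec.get("kind") != "message" or not text:
            continue
        reply_to = rec.get("reply_to")
        yield NormalizedMessage(
            chat_id=str(rec["chat"]),
            timestamp=float(rec["ts"]),
            sender_id=str(rec["from"]),
            sender_is_self=rec["from"] == self_name,  # tag which sender is you
            text=text,
            message_id=str(rec["id"]),
            reply_to_id=str(reply_to) if reply_to is not None else None,
        )
```

Everything after this point only sees `NormalizedMessage` objects, which is why a new platform needs nothing beyond a function of this shape.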

## 2. Build conversation samples — `ingest/core.py:build_samples`

Turns the flat message stream into multi-turn conversations. Three sub-steps:

**a. Split by silence gap** (`_split_into_conversations`)
Messages in a chat are cut into separate conversations wherever there's a
silence longer than `--conversation-gap` (default **3600s / 1h**). A quiet hour
is treated as a topic boundary.
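
A minimal sketch of the gap rule, assuming messages are reduced to `(timestamp, text)` pairs for one chat. The `split_by_gap` name and the tuple shape are illustrative only; the real `_split_into_conversations` may differ.

```python
from typing import List, Tuple

# (timestamp in seconds, text): a simplified stand-in for NormalizedMessage.
Message = Tuple[float, str]


def split_by_gap(
    messages: List[Message], gap_seconds: float = 3600
) -> List[List[Message]]:
    """Start a new conversation whenever the silence exceeds gap_seconds."""
    conversations: List[List[Message]] = []
    for msg in sorted(messages, key=lambda m: m[0]):
        previous_ts = conversations[-1][-1][0] if conversations else None
        if previous_ts is not None and msg[0] - previous_ts <= gap_seconds:
            conversations[-1].append(msg)
        else:
            conversations.append([msg])
    return conversations
```

Note that a silence of exactly `gap_seconds` does not split in this sketch, since the rule above is "longer than".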

**b. Stitch reply-linked splits** (`_merge_by_reply`)
A gap-split is undone when a later message *replies to* an earlier one (via the
adapter's `reply_to_id` metadata): those conversations are unioned back together
and re-sorted. This recovers slow threads that a pure time-gap would wrongly
split. (Sources without reply metadata simply skip this — it's a no-op.)
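
One way to picture the stitching step is a union-find over the gap-split conversations. The sketch below assumes messages are `(message_id, timestamp, reply_to_id)` tuples and that `merge_by_reply` is a hypothetical name; it is not the actual `_merge_by_reply` implementation.

```python
from typing import Dict, List, Optional, Tuple

# (message_id, timestamp, reply_to_id): a simplified stand-in.
Msg = Tuple[int, float, Optional[int]]


def merge_by_reply(conversations: List[List[Msg]]) -> List[List[Msg]]:
    """Union conversations that are linked by a reply, then re-sort by time."""
    parent = list(range(len(conversations)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Which conversation does each message id currently live in?
    owner: Dict[int, int] = {
        msg[0]: idx for idx, conv in enumerate(conversations) for msg in conv
    }
    for idx, conv in enumerate(conversations):
        for _, _, reply_to in conv:
            if reply_to in owner:
                a, b = find(idx), find(owner[reply_to])
                if a != b:
                    parent[max(a, b)] = min(a, b)

    groups: Dict[int, List[Msg]] = {}
    for idx, conv in enumerate(conversations):
        groups.setdefault(find(idx), []).extend(conv)
    return [sorted(group, key=lambda m: m[1]) for group in groups.values()]
```

When no message carries a `reply_to_id`, nothing is unioned and the input comes back unchanged, which matches the no-op behaviour described above.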

**c. Assemble + merge turns** (`_assemble_turns`)
Each message becomes a turn with role `user` (other people) or `assistant`
(you). Consecutive messages from the **same role** within `--message-chain`
(default **30s**) are merged into one turn, because people often send several
quick texts that read as a single turn. Conversations with only one side are
dropped: training needs both a `user` and an `assistant` turn.

Group chats: by default the other side is collapsed into a single `user`
speaker. With **`--multi-speaker`**, each non-self sender keeps their identity
and their turns are labelled (`Bob: ...`); your own turns are never labelled.

**Output:** `Sample = List[{"role": "user"|"assistant", "text": str}]`.
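
The turn assembly can be sketched as follows for the default (single `user` speaker) case. Two details here are assumptions rather than documented behaviour: the chain gap is measured from the previous message, and merged texts are joined with a newline. `assemble_turns` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple

# (timestamp, sender_is_self, text): a simplified stand-in.
Msg = Tuple[float, bool, str]


def assemble_turns(conv: List[Msg], chain_seconds: float = 30) -> List[Dict[str, str]]:
    """Map messages to user/assistant turns and merge quick same-role chains."""
    turns: List[Dict[str, str]] = []
    last_ts: Optional[float] = None
    for ts, is_self, text in conv:
        role = "assistant" if is_self else "user"
        same_chain = (
            bool(turns)
            and turns[-1]["role"] == role
            and last_ts is not None
            and ts - last_ts <= chain_seconds
        )
        if same_chain:
            turns[-1]["text"] += "\n" + text  # newline separator is an assumption
        else:
            turns.append({"role": role, "text": text})
        last_ts = ts
    # One-sided conversations are not trainable, so return nothing for them.
    roles = {turn["role"] for turn in turns}
    return turns if roles == {"user", "assistant"} else []
```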

## 3. Sensitive-data scan — `ingest/redactor.py`

A **non-destructive** pass that finds (but does not remove) personal/secret data
so you can review it before training. See [privacy](#privacy-notes).

- **Regex detectors** ([`ingest/redaction/`](../ingest/redaction/)): emails,
phone numbers, payment cards (Luhn-checked), IP/MAC, API keys/tokens, plus
pluggable country ID packs (`--redact-locales`, default `SG`). Universal
patterns always run.
- **Writes `data/redaction_report.json`** — every finding with `conversation`,
`turn`, `role`, `category`, `detector`, `severity`, and a masked `preview`.
A summary table is printed to the terminal.
- **Optional LLM redaction** (`--llm-redact`): an OpenAI-compatible model flags
context-dependent PII (names, secrets) that regex misses. **Local-first** —
it refuses a hosted API unless `--allow-cloud-redaction` is set, so chat text
never leaves your machine by default.
- Skip with `--skip-redact-scan` (or `--no-audit` to skip scan *and* validation).
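
To show what "Luhn-checked" means for the payment-card detector, here is a standalone version of the standard Luhn checksum. It is a generic sketch, not the code in `ingest/redaction/`, and `luhn_valid` is a name chosen for illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

Running a checksum after the regex match is a cheap way to discard most random 16-digit strings (order numbers, IDs) that merely look like card numbers.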

## 4. Apply redaction — `ingest/redactor.py:apply`

Acts on the findings according to `--redact`:

| Mode | Effect |
|------|--------|
| `off` *(default)* | Scan + report only. Nothing changed. |
| `replace` | Swap each detected span for a `[CATEGORY]` placeholder. Keeps every conversation; removes the secret. |
| `drop` | Remove any conversation containing a detection. Smaller, more conservative dataset. |

`--redact` is honoured even if the scan was skipped, so the dataset can't
silently retain sensitive data you asked to remove.
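
The three modes can be sketched like this. The finding shape used here (conversation index, turn index, character offsets, category) is an assumption made for the example, as are the `apply_redaction` name and the requirement that spans do not overlap; the real `apply` may work differently.

```python
from typing import Dict, List, Tuple

Sample = List[Dict[str, str]]
# (conversation index, turn index, start offset, end offset, category).
# The character offsets are an assumption of this sketch.
Finding = Tuple[int, int, int, int, str]


def apply_redaction(
    samples: List[Sample], findings: List[Finding], mode: str = "off"
) -> List[Sample]:
    if mode == "off":
        return samples  # scan + report only, nothing changed
    if mode == "drop":
        flagged = {finding[0] for finding in findings}
        return [s for i, s in enumerate(samples) if i not in flagged]
    if mode == "replace":
        redacted = [[dict(turn) for turn in sample] for sample in samples]
        # Replace right-to-left so earlier offsets in the same turn stay valid.
        for conv, turn, start, end, category in sorted(
            findings, key=lambda f: f[2], reverse=True
        ):
            text = redacted[conv][turn]["text"]
            placeholder = f"[{category.upper()}]"
            redacted[conv][turn]["text"] = text[:start] + placeholder + text[end:]
        return redacted
    raise ValueError(f"unknown redaction mode: {mode!r}")
```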

## 5. LLM quality audit — `ingest/validator.py` (optional)

When enabled (`LLM_VALIDATE=true`, an OpenAI-compatible endpoint configured),
an LLM scores each conversation for coherence, quality, and human/assistant
pairing. It **drops weak samples** and can **split over-merged** ones into
cleaner conversations. Disable with `--skip-validation` or `--no-audit`.

## 6. ShareGPT format — `ingest/sharegpt.py:to_sharegpt`

Converts role/text samples into the exact ShareGPT shape LLaMA-Factory consumes
and writes `data/chat_sharegpt.json` (registered in
[`configs/dataset_info.json`](../configs/dataset_info.json)). Roles map
`user → human`, `assistant → gpt`.

**Crucially, each conversation is coerced into the structure LLaMA-Factory's
converter requires** (`_coerce_alternating`): it must **start with `human`**,
**strictly alternate** `human/gpt`, and **end with `gpt`** (even number of
turns). Raw chats break these rules all the time, so the converter:

- **merges consecutive same-speaker turns** (multi-speaker labels stay in the
text), so alternation holds;
- **drops a leading `gpt` turn** (the other person messaged first — very common);
- **drops a trailing `human` turn** (so the sample ends on a trainable response).

Without this, LLaMA-Factory silently discards every non-conforming conversation
at train time (logging only `Invalid role tag` / `Invalid message count`
warnings) — on one real export that quietly cut **3,527 samples down to 997**.
With it, ~3,300 of those samples survive and the dataset's reported count matches
what actually trains. See issue
[#42](https://github.com/NotYuSheng/Doppelganger/issues/42).

> During SFT, loss is computed only on your (`gpt`) turns (`train_on_prompt: false`),
> so `human` turns, including multi-speaker labels, condition the model but are
> never used as training targets.
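
The three coercion rules can be sketched in a few lines. This assumes turns are already in ShareGPT's `{"from": ..., "value": ...}` shape and that merged turns are joined with a newline; `coerce_alternating` here is an illustrative rewrite, not the repository's `_coerce_alternating`.

```python
from typing import Dict, List


def coerce_alternating(turns: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Force human/gpt alternation that starts with human and ends with gpt."""
    merged: List[Dict[str, str]] = []
    for turn in turns:
        if merged and merged[-1]["from"] == turn["from"]:
            # Merge consecutive same-speaker turns (newline join is an assumption).
            merged[-1]["value"] += "\n" + turn["value"]
        else:
            merged.append(dict(turn))
    if merged and merged[0]["from"] == "gpt":
        merged = merged[1:]   # drop a leading gpt turn
    if merged and merged[-1]["from"] == "human":
        merged = merged[:-1]  # drop a trailing human turn
    return merged
```

After merging, roles strictly alternate, so trimming at most one turn from each end is enough to guarantee an even-length `human ... gpt` sequence. A conversation that ends up empty would simply be skipped by the caller.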

### Alternate output: JSONL

`--format jsonl` writes the intermediate role/text samples
(`data/chat_dataset.jsonl`, one conversation per line) instead — useful for
inspection or custom downstream processing. It does **not** apply the ShareGPT
coercion.

---

## Useful flags (quick reference)

| Flag | Default | Stage | Purpose |
|------|---------|-------|---------|
| `--source` | `telegram` | 1 | Which adapter parses the export |
| `--input` | `./data/result.json` | 1 | Path to the raw export |
| `--self-name` | auto | 1 | Override "which sender is you" |
| `--conversation-gap` | `3600` | 2a | Silence (s) that starts a new conversation |
| `--message-chain` | `30` | 2c | Max gap (s) to merge same-sender messages into one turn |
| `--multi-speaker` | off | 2c | Keep + label individual senders in group chats |
| `--redact` | `off` | 4 | `off` / `replace` / `drop` |
| `--redact-locales` | `SG` | 3 | Country ID packs for the scan |
| `--llm-redact` | off | 3 | LLM-assisted PII detection (local-first) |
| `--skip-redact-scan` | off | 3 | Skip the sensitive-data scan |
| `--skip-validation` | off | 5 | Skip the LLM quality audit |
| `--no-audit` | off | 3+5 | Skip scan *and* validation |
| `--format` | `sharegpt` | 6 | `sharegpt` (training) or `jsonl` (intermediate) |

## Privacy notes

The scan is a **safety net, not a guarantee** — regex and LLM detection both
miss real cases and raise false positives. Before training or sharing anything:
review `data/redaction_report.json` yourself, get consent from others in group
chats, and treat the dataset, `redaction_report.json` (which contains raw
values), trained adapters, and merged checkpoints all as sensitive. They are
gitignored by default; keep them that way.
82 changes: 82 additions & 0 deletions docs/sources/telegram.md
@@ -0,0 +1,82 @@
# Source: Telegram

How Doppelganger parses a **Telegram** export into the normalized message stream
that the [shared pipeline](../data-pipeline.md) consumes. This is the only stage
that knows anything Telegram-specific; everything downstream (sessionizing,
scanning, redaction, ShareGPT formatting) is source-agnostic.

Adapter: [`ingest/adapters/telegram.py`](../../ingest/adapters/telegram.py).
Use it with `--source telegram` (the default).

## Exporting your data

In **Telegram Desktop**: `Settings > Advanced > Export Telegram Data`. Select
your chat(s), choose **JSON** format (not HTML), and export.

The export unzips to a dated folder:

```
DataExport_2025-07-09/
└── result.json ← this is the file the adapter reads
```

Point the pipeline at it one of two ways:

```bash
# a) move/copy it to the default location
cp DataExport_2025-07-09/result.json data/result.json
python -m ingest --source telegram

# b) or pass the path directly
python -m ingest --source telegram --input DataExport_2025-07-09/result.json
```

> `setup.sh` expects `data/result.json`; it will stop with a "not found" error
> until the file is there.

## What the adapter does

- **Detects who "you" are.** Read from the export's `personal_information`
(first + last name), or overridden with `--self-name "Your Name"`. If it can't
be determined, the adapter raises an error rather than guessing; pass
`--self-name` in that case.
Every message is tagged `sender_is_self` so the shared pipeline knows which
turns are yours (the ones the model learns to generate).
- **Filters non-messages.** `service` events (pins, joins, calls) and
empty/invalid entries are skipped.
- **Joins rich-text fragments.** Telegram stores formatted messages as a list of
entity objects (`text_entities`); the adapter concatenates them back into a
single plain-text string.
- **Reads reply + group metadata.** `reply_to_message_id` (used downstream to
stitch reply-linked conversations) and per-sender identity (used for
`--multi-speaker` group handling) are preserved.
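
The fragment-joining step can be sketched as below. It assumes each fragment is either a plain string or an object with a `text` key, which is how rich-text messages commonly appear in Telegram JSON exports; `flatten_text` is a hypothetical helper, not the adapter's actual function.

```python
from typing import Any, List, Union


def flatten_text(text: Union[str, List[Any]]) -> str:
    """Join Telegram rich-text fragments back into one plain-text string."""
    if isinstance(text, str):
        return text
    parts: List[str] = []
    for fragment in text:
        if isinstance(fragment, str):
            parts.append(fragment)
        elif isinstance(fragment, dict):
            # Entity objects such as {"type": "bold", "text": "..."}.
            parts.append(str(fragment.get("text", "")))
    return "".join(parts)
```

Formatting information (bold, links, mentions) is intentionally discarded here, since the shared pipeline only works with plain text.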

## Output

A flat list of `NormalizedMessage`
([`ingest/message.py`](../../ingest/message.py)):

```python
NormalizedMessage(
    chat_id,         # which chat the message belongs to
    timestamp,       # unix seconds
    sender_id,       # sender display name / id
    sender_is_self,  # True if this is you
    text,            # plain-text content
    message_id,      # for reply resolution
    reply_to_id,     # the message this replies to, if any
)
```

From here the [shared pipeline](../data-pipeline.md) takes over.

## Telegram-relevant flags

| Flag | Default | Purpose |
|------|---------|---------|
| `--source` | `telegram` | Selects this adapter |
| `--input` | `./data/result.json` | Path to the Telegram `result.json` |
| `--self-name` | auto | Override "which sender is you" when auto-detection fails or is wrong |
| `--multi-speaker` | off | In group chats, keep + label each non-self sender (`Bob: ...`) instead of collapsing the other side into one speaker |

All other flags (`--conversation-gap`, `--message-chain`, `--redact`, …) belong
to the shared pipeline — see [data-pipeline.md](../data-pipeline.md).
6 changes: 6 additions & 0 deletions doppelganger/__init__.py
@@ -0,0 +1,6 @@
"""Doppelganger orchestrator: a step-based runner over the ingest pipeline and
LLaMA-Factory training.

Run ``python -m doppelganger`` for an interactive menu, or a named subcommand
(``parse``, ``audit``, ``train``, ``merge``, ``chat``, ``auto``).
"""