Fix silent ~72% conversation drop (#42) + step-based runner #43
Merged
6 commits
0daefd3 fix(sharegpt): coerce conversations to LLaMA-Factory's alternation co… (NotYuSheng)
38ed5b3 feat(runner): step-based doppelganger CLI over the ingest + training … (NotYuSheng)
9e1347f docs: split ingestion docs (shared pipeline + per-source); minor READ… (NotYuSheng)
280ef83 feat(runner): size-based epoch advisor for `train` (NotYuSheng)
28e9b78 fix/harden-runner-cli-error-handling (NotYuSheng)
d484b80 fix/guard-config-parsing-and-temp-file-cleanup (NotYuSheng)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,7 @@ | ||
| /venv | ||
| *.json | ||
| *.jsonl | ||
| *.csv | ||
| /saves | ||
| /LLaMA-Factory | ||
| /merged | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,187 @@ | ||
| # The ingestion pipeline (source-agnostic) | ||
|
|
||
| This document describes every transformation Doppelganger applies to turn a raw | ||
| chat export into a training-ready dataset. The pipeline is **source-agnostic**: | ||
| a per-platform *adapter* normalizes the export into a common message stream, and | ||
| **every stage after that is shared across all sources**. | ||
|
|
||
| The only source-specific step is stage 1 (the adapter). For how a particular | ||
| platform's export is parsed, see the per-source docs: | ||
|
|
||
| - [Telegram](sources/telegram.md) — supported today | ||
| - WhatsApp, Discord, … — planned; each drops in under [`docs/sources/`](sources/) | ||
|
|
||
| Entry point: [`python -m ingest`](../ingest/__main__.py) → | ||
| [`ingest/cli.py:main`](../ingest/cli.py). End-to-end flow: | ||
|
|
||
| ``` | ||
| export (platform-specific) | ||
| │ (1) ADAPTER PARSE ingest/adapters/<source>.py ← source-specific | ||
| ▼ | ||
| NormalizedMessage stream ← common interface; everything below is shared | ||
| │ (2) BUILD SAMPLES ingest/core.py | ||
| │ a. split by silence gap | ||
| │ b. stitch reply-linked splits | ||
| │ c. assemble + merge turns | ||
| ▼ | ||
| role/text conversation samples | ||
| │ (3) SENSITIVE-DATA SCAN ingest/redactor.py (+ optional LLM) | ||
| │ (4) REDACTION APPLY off | replace | drop | ||
| │ (5) LLM QUALITY AUDIT ingest/validator.py (optional) | ||
| │ (6) SHAREGPT FORMAT ingest/sharegpt.py | ||
| ▼ | ||
| data/chat_sharegpt.json → LLaMA-Factory SFT | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## 1. Adapter parse — `ingest/adapters/<source>.py` | ||
|
|
||
| Each platform has one adapter that reads its native export and emits a common | ||
| **`NormalizedMessage`** stream ([`ingest/message.py`](../ingest/message.py)), | ||
| decoupling every downstream stage from any platform's specific schema. Whatever | ||
| the source, the adapter is responsible for: | ||
|
|
||
| - **identifying which sender is you** (tagging each message `sender_is_self`), | ||
| - **filtering non-messages** (system/service events, empty entries), | ||
| - **producing plain text** for each message, and | ||
| - **preserving reply + sender metadata** (`reply_to_id`, `sender_id`) for the | ||
| sessionizing and group-chat stages below. | ||
|
|
||
| Output fields: `chat_id, timestamp, sender_id, sender_is_self, text, | ||
| message_id, reply_to_id`. | ||
|
|
||
| > Adding a new platform means writing **only** this adapter so it emits the same | ||
| > `NormalizedMessage` stream — stages 2–6 are unchanged. Document it under | ||
| > [`docs/sources/`](sources/). Telegram's specifics live in | ||
| > [sources/telegram.md](sources/telegram.md). | ||
|
|
||
| ## 2. Build conversation samples — `ingest/core.py:build_samples` | ||
|
|
||
| Turns the flat message stream into multi-turn conversations. Three sub-steps: | ||
|
|
||
| **a. Split by silence gap** (`_split_into_conversations`) | ||
| Messages in a chat are cut into separate conversations wherever there's a | ||
| silence longer than `--conversation-gap` (default **3600s / 1h**). A quiet hour | ||
| is treated as a topic boundary. | ||
|
|
||
| **b. Stitch reply-linked splits** (`_merge_by_reply`) | ||
| A gap-split is undone when a later message *replies to* an earlier one (via the | ||
| adapter's `reply_to_id` metadata): those conversations are unioned back together | ||
| and re-sorted. This recovers slow threads that a pure time-gap would wrongly | ||
| split. (Sources without reply metadata simply skip this — it's a no-op.) | ||
|
|
||
| **c. Assemble + merge turns** (`_assemble_turns`) | ||
| Each message becomes a turn with role `user` (other people) or `assistant` | ||
| (you). Consecutive messages from the **same role** within `--message-chain` | ||
| (default **30s**) are merged into one turn, since people often send several quick | ||
| texts that together form one "turn". Conversations with only one side are dropped: | ||
| you need both a `user` and an `assistant` turn to train on. | ||
|
|
||
| Group chats: by default the other side is collapsed into a single `user` | ||
| speaker. With **`--multi-speaker`**, each non-self sender keeps their identity | ||
| and their turns are labelled (`Bob: ...`); your own turns are never labelled. | ||
|
|
||
| **Output:** `Sample = List[{"role": "user"|"assistant", "text": str}]`. | ||
|
|
||
| ## 3. Sensitive-data scan — `ingest/redactor.py` | ||
|
|
||
| A **non-destructive** pass that finds (but does not remove) personal/secret data | ||
| so you can review it before training. See [privacy](#privacy-notes). | ||
|
|
||
| - **Regex detectors** ([`ingest/redaction/`](../ingest/redaction/)): emails, | ||
| phone numbers, payment cards (Luhn-checked), IP/MAC, API keys/tokens, plus | ||
| pluggable country ID packs (`--redact-locales`, default `SG`). The universal | ||
| patterns run regardless of which locale packs are selected. | ||
| - **Writes `data/redaction_report.json`** — every finding with `conversation`, | ||
| `turn`, `role`, `category`, `detector`, `severity`, and a masked `preview`. | ||
| A summary table is printed to the terminal. | ||
| - **Optional LLM redaction** (`--llm-redact`): an OpenAI-compatible model flags | ||
| context-dependent PII (names, secrets) that regex misses. **Local-first** — | ||
| it refuses a hosted API unless `--allow-cloud-redaction` is set, so chat text | ||
| never leaves your machine by default. | ||
| - Skip with `--skip-redact-scan` (or `--no-audit` to skip scan *and* validation). | ||
|
|
||
| ## 4. Apply redaction — `ingest/redactor.py:apply` | ||
|
|
||
| Acts on the findings according to `--redact`: | ||
|
|
||
| | Mode | Effect | | ||
| |------|--------| | ||
| | `off` *(default)* | Scan + report only. Nothing changed. | | ||
| | `replace` | Swap each detected span for a `[CATEGORY]` placeholder. Keeps every conversation; removes the secret. | | ||
| | `drop` | Remove any conversation containing a detection. Smaller, more conservative dataset. | | ||
|
|
||
| `--redact` is honoured even if the scan was skipped, so the dataset can't | ||
| silently retain sensitive data you asked to remove. | ||
|
|
||
| ## 5. LLM quality audit — `ingest/validator.py` (optional) | ||
|
|
||
| When enabled (`LLM_VALIDATE=true` with an OpenAI-compatible endpoint configured), | ||
| an LLM scores each conversation for coherence, quality, and human/assistant | ||
| pairing. It **drops weak samples** and can **split over-merged** ones into | ||
| cleaner conversations. Disable with `--skip-validation` or `--no-audit`. | ||
|
|
||
| ## 6. ShareGPT format — `ingest/sharegpt.py:to_sharegpt` | ||
|
|
||
| Converts role/text samples into the exact ShareGPT shape LLaMA-Factory consumes | ||
| and writes `data/chat_sharegpt.json` (registered in | ||
| [`configs/dataset_info.json`](../configs/dataset_info.json)). Roles map | ||
| `user → human`, `assistant → gpt`. | ||
|
|
||
| **Crucially, each conversation is coerced into the structure LLaMA-Factory's | ||
| converter requires** (`_coerce_alternating`): it must **start with `human`**, | ||
| **strictly alternate** `human/gpt`, and **end with `gpt`** (even number of | ||
| turns). Raw chats break these rules all the time, so the converter: | ||
|
|
||
| - **merges consecutive same-speaker turns** (multi-speaker labels stay in the | ||
| text), so alternation holds; | ||
| - **drops a leading `gpt` turn** (you messaged first, which is very common); | ||
| - **drops a trailing `human` turn** (so the sample ends on a trainable response). | ||
|
|
||
| Without this, LLaMA-Factory silently discards every non-conforming conversation | ||
| at train time (logging only `Invalid role tag` / `Invalid message count` | ||
| warnings) — on one real export that quietly cut **3,527 samples down to 997**. | ||
| With it, ~3,300 of those samples survive and the dataset's reported count matches | ||
| what actually trains. See issue | ||
| [#42](https://github.com/NotYuSheng/Doppelganger/issues/42). | ||
|
|
||
| > During SFT, loss is computed only on your (`gpt`) turns (`train_on_prompt: false`), | ||
| > so `human` turns, including multi-speaker labels, condition the model but are | ||
| > never training targets themselves. | ||
|
|
||
| ### Alternate output: JSONL | ||
|
|
||
| `--format jsonl` writes the intermediate role/text samples | ||
| (`data/chat_dataset.jsonl`, one conversation per line) instead — useful for | ||
| inspection or custom downstream processing. It does **not** apply the ShareGPT | ||
| coercion. | ||
|
|
||
| --- | ||
|
|
||
| ## Useful flags (quick reference) | ||
|
|
||
| | Flag | Default | Stage | Purpose | | ||
| |------|---------|-------|---------| | ||
| | `--source` | `telegram` | 1 | Which adapter parses the export | | ||
| | `--input` | `./data/result.json` | 1 | Path to the raw export | | ||
| | `--self-name` | auto | 1 | Override "which sender is you" | | ||
| | `--conversation-gap` | `3600` | 2a | Silence (s) that starts a new conversation | | ||
| | `--message-chain` | `30` | 2c | Max gap (s) to merge same-sender messages into one turn | | ||
| | `--multi-speaker` | off | 2c | Keep + label individual senders in group chats | | ||
| | `--redact` | `off` | 4 | `off` / `replace` / `drop` | | ||
| | `--redact-locales` | `SG` | 3 | Country ID packs for the scan | | ||
| | `--llm-redact` | off | 3 | LLM-assisted PII detection (local-first) | | ||
| | `--skip-redact-scan` | off | 3 | Skip the sensitive-data scan | | ||
| | `--skip-validation` | off | 5 | Skip the LLM quality audit | | ||
| | `--no-audit` | off | 3+5 | Skip scan *and* validation | | ||
| | `--format` | `sharegpt` | 6 | `sharegpt` (training) or `jsonl` (intermediate) | | ||
|
|
||
| ## Privacy notes | ||
|
|
||
| The scan is a **safety net, not a guarantee** — regex and LLM detection both | ||
| miss real cases and raise false positives. Before training or sharing anything: | ||
| review `data/redaction_report.json` yourself, get consent from others in group | ||
| chats, and treat the dataset, `redaction_report.json` (which contains raw | ||
| values), trained adapters, and merged checkpoints all as sensitive. They are | ||
| gitignored by default; keep them that way. | ||
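Review note on section 6 of the pipeline doc above: the coercion rules can be sketched in a few lines. This is a minimal illustration rather than the PR's actual `_coerce_alternating`; the function name `coerce_alternating`, the `from`/`value` turn keys, and the newline used to join merged turns are all assumptions, and it assumes only `human` and `gpt` roles are present.

```python
from typing import Dict, List

Turn = Dict[str, str]  # assumed shape: {"from": "human" | "gpt", "value": str}


def coerce_alternating(turns: List[Turn]) -> List[Turn]:
    """Hypothetical sketch: force human/gpt alternation on one conversation."""
    merged: List[Turn] = []
    for turn in turns:
        if merged and merged[-1]["from"] == turn["from"]:
            # Same speaker twice in a row: fold into the previous turn.
            merged[-1]["value"] += "\n" + turn["value"]
        else:
            merged.append(dict(turn))  # copy so the input is not mutated

    # After merging, roles strictly alternate, so trimming at most one turn
    # from each end is enough to satisfy the start/end rules.
    if merged and merged[0]["from"] == "gpt":
        merged = merged[1:]  # must start with human
    if merged and merged[-1]["from"] == "human":
        merged = merged[:-1]  # must end with gpt
    return merged


raw = [
    {"from": "gpt", "value": "hey"},
    {"from": "human", "value": "hi"},
    {"from": "human", "value": "what's up?"},
    {"from": "gpt", "value": "not much"},
    {"from": "human", "value": "ok"},
]
result = coerce_alternating(raw)
print(result)
```

A conversation that ends up empty after trimming (for example, one containing only `gpt` turns) would need to be skipped by the caller rather than written to the dataset.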
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,82 @@ | ||
| # Source: Telegram | ||
|
|
||
| How Doppelganger parses a **Telegram** export into the normalized message stream | ||
| that the [shared pipeline](../data-pipeline.md) consumes. This is the only stage | ||
| that knows anything Telegram-specific; everything downstream (sessionizing, | ||
| scanning, redaction, ShareGPT formatting) is source-agnostic. | ||
|
|
||
| Adapter: [`ingest/adapters/telegram.py`](../../ingest/adapters/telegram.py). | ||
| Use it with `--source telegram` (the default). | ||
|
|
||
| ## Exporting your data | ||
|
|
||
| In **Telegram Desktop**: `Settings > Advanced > Export Telegram Data`. Select | ||
| your chat(s), choose **JSON** format (not HTML), and export. | ||
|
|
||
| The export unzips to a dated folder: | ||
|
|
||
| ``` | ||
| DataExport_2025-07-09/ | ||
| └── result.json ← this is the file the adapter reads | ||
| ``` | ||
|
|
||
| Point the pipeline at it one of two ways: | ||
|
|
||
| ```bash | ||
| # a) move/copy it to the default location | ||
| cp DataExport_2025-07-09/result.json data/result.json | ||
| python -m ingest --source telegram | ||
|
|
||
| # b) or pass the path directly | ||
| python -m ingest --source telegram --input DataExport_2025-07-09/result.json | ||
| ``` | ||
|
|
||
| > `setup.sh` expects `data/result.json`; it will stop with a "not found" error | ||
| > until the file is there. | ||
|
|
||
| ## What the adapter does | ||
|
|
||
| - **Detects who "you" are.** Read from the export's `personal_information` | ||
| (first + last name), or overridden with `--self-name "Your Name"`. If it can't | ||
| be determined, the adapter raises an error rather than guessing, so pass `--self-name`. | ||
| Every message is tagged `sender_is_self` so the shared pipeline knows which | ||
| turns are yours (the ones the model learns to generate). | ||
| - **Filters non-messages.** `service` events (pins, joins, calls) and | ||
| empty/invalid entries are skipped. | ||
| - **Joins rich-text fragments.** Telegram stores formatted messages as a list of | ||
| entity objects (`text_entities`); the adapter concatenates them back into a | ||
| single plain-text string. | ||
| - **Reads reply + group metadata.** `reply_to_message_id` (used downstream to | ||
| stitch reply-linked conversations) and per-sender identity (used for | ||
| `--multi-speaker` group handling) are preserved. | ||
|
|
||
| ## Output | ||
|
|
||
| A flat list of `NormalizedMessage` | ||
| ([`ingest/message.py`](../../ingest/message.py)): | ||
|
|
||
| ```python | ||
| NormalizedMessage( | ||
| chat_id, # which chat the message belongs to | ||
| timestamp, # unix seconds | ||
| sender_id, # sender display name / id | ||
| sender_is_self, # True if this is you | ||
| text, # plain-text content | ||
| message_id, # for reply resolution | ||
| reply_to_id, # the message this replies to, if any | ||
| ) | ||
| ``` | ||
|
|
||
| From here the [shared pipeline](../data-pipeline.md) takes over. | ||
|
|
||
| ## Telegram-relevant flags | ||
|
|
||
| | Flag | Default | Purpose | | ||
| |------|---------|---------| | ||
| | `--source` | `telegram` | Selects this adapter | | ||
| | `--input` | `./data/result.json` | Path to the Telegram `result.json` | | ||
| | `--self-name` | auto | Override "which sender is you" when auto-detection fails or is wrong | | ||
| | `--multi-speaker` | off | In group chats, keep + label each non-self sender (`Bob: ...`) instead of collapsing the other side into one speaker | | ||
|
|
||
| All other flags (`--conversation-gap`, `--message-chain`, `--redact`, …) belong | ||
| to the shared pipeline — see [data-pipeline.md](../data-pipeline.md). |
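Review note on the "Joins rich-text fragments" step in the Telegram doc above: a rough sketch of what that concatenation could look like. It assumes the Telegram Desktop JSON layout in which `text_entities` is a list of objects carrying a `text` key, and in which `text` may be either a plain string or a mixed list of strings and such objects. The function name `join_message_text` is hypothetical, and the real adapter may handle these cases differently.

```python
from typing import Any, Dict, List


def join_message_text(message: Dict[str, Any]) -> str:
    """Hypothetical sketch: flatten a Telegram message into plain text."""
    entities = message.get("text_entities")
    if isinstance(entities, list) and entities:
        # Each entity is assumed to look like {"type": "...", "text": "..."}.
        return "".join(
            e.get("text", "") for e in entities if isinstance(e, dict)
        )

    raw = message.get("text", "")
    if isinstance(raw, str):
        return raw

    # Fallback: `text` as a mixed list of strings and entity objects.
    parts: List[str] = []
    for part in raw:
        if isinstance(part, str):
            parts.append(part)
        elif isinstance(part, dict):
            parts.append(part.get("text", ""))
    return "".join(parts)


formatted = {
    "text_entities": [
        {"type": "plain", "text": "see "},
        {"type": "link", "text": "https://example.com"},
    ]
}
mixed = {"text": ["hi ", {"type": "bold", "text": "there"}]}
print(join_message_text(formatted))
print(join_message_text(mixed))
```

Messages that flatten to an empty string would then fall under the adapter's "filters non-messages" rule.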
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| """Doppelganger orchestrator: a step-based runner over the ingest pipeline and | ||
| LLaMA-Factory training. | ||
|
|
||
| Run ``python -m doppelganger`` for an interactive menu, or a named subcommand | ||
| (``parse``, ``audit``, ``train``, ``merge``, ``chat``, ``auto``). | ||
| """ |