Skip to content

Fix silent ~72% conversation drop (#42) + step-based runner#43

Merged
NotYuSheng merged 6 commits into
mainfrom
feat/sharegpt-fix-and-runner
Jun 25, 2026
Merged

Fix silent ~72% conversation drop (#42) + step-based runner#43
NotYuSheng merged 6 commits into
mainfrom
feat/sharegpt-fix-and-runner

Conversation

@NotYuSheng

@NotYuSheng NotYuSheng commented Jun 25, 2026

Copy link
Copy Markdown
Owner

What & why

Two related improvements to the ingestion path, plus docs.

1. Fix: ~72% of conversations were silently dropped at training time (Fixes #42)

to_sharegpt() only checked that a conversation contained both a human and a
gpt turn, but LLaMA-Factory's ShareGPT converter additionally requires each
conversation to start with human, strictly alternate, and end with
gpt
. Real chats break these rules constantly (the other person messages
first; same-role runs survive reply-stitching / --multi-speaker), so
non-conforming samples were dropped after the pipeline reported success — only
visible as Invalid role tag / Invalid message count warnings.

On one real export: 3,527 written → 997 actually trained on.

The new _coerce_alternating() normalizes each conversation instead of only
role-presence checking:

  • merge consecutive same-speaker turns (multi-speaker labels stay in the text),
  • drop a leading gpt turn,
  • drop a trailing human turn.

This salvages human,gpt,humanhuman,gpt rather than discarding it. Same
export now yields 3,324 usable samples (0 malformed), and the reported count
matches what trains.

Tests: updated the two goldens that encoded the old (droppable) shapes; added
CoerceAlternatingTest (leading-assistant strip, trailing-user strip, same-role
merge, unusable-pair drop, well-formedness invariant). 52/52 pass.

2. Feature: step-based doppelganger runner

python -m doppelganger exposes the pipeline as discrete steps:

  • no args → interactive numbered menu with ✓/✗ step status;
  • subcommandsparse, audit, train, merge, chat;
  • auto → runs parse → audit end-to-end (like the old one-shot ingest;
    --redact replace by default with a warning), --train to chain training.

parse/audit reuse ingest.*; train/merge/chat wrap llamafactory-cli.
Splitting parse and audit makes the "review before training" checkpoint a
first-class step. The existing python -m ingest one-shot is unchanged.

3. Docs

  • Split the source-agnostic pipeline (docs/data-pipeline.md) from
    source-specific parsing (docs/sources/telegram.md), matching the adapter
    architecture so WhatsApp/Discord docs slot in later.
  • README notes on the DataExport/ export-folder layout and the small default
    model. Ignore *.csv.

Notes for reviewers

  • No behavior change to python -m ingest.
  • New data/chat_dataset.jsonl (raw parse output) is gitignored alongside the
    other data/ artifacts.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a command-line menu and subcommands for parsing data, auditing/redacting datasets, training, merging, chatting, and running the full workflow.
    • Added new documentation for the data pipeline and Telegram import steps, plus clearer Quick Start notes.
    • Added support for stricter conversation formatting so exported training data matches expected chat-model requirements.
  • Bug Fixes

    • Improved handling of malformed chat turns by dropping incomplete or invalid conversations and merging repeated speaker turns.
    • Updated ignored files to prevent CSV exports from being tracked.

NotYuSheng and others added 3 commits June 26, 2026 00:33
…ntract

to_sharegpt only checked that both human and gpt roles were present, but
LLaMA-Factory additionally requires each conversation to start with human,
strictly alternate, and end with gpt. Real chats violate this constantly
(other person messages first; same-role runs survive reply-stitching /
multi-speaker), so ~72% of samples were silently dropped at train time
(3527 written -> 997 trained) with only WARNING logs.

Add _coerce_alternating: merge consecutive same-speaker turns (labels stay
in the text), strip a leading gpt turn, strip a trailing human turn. Salvages
human,gpt,human -> human,gpt instead of discarding the whole sample. On the
same export this raises usable samples from 997 to ~3324.

Fixes #42

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…pipeline

Add `python -m doppelganger`: an interactive numbered menu (no args) plus
named subcommands parse / audit / train / merge / chat, and an `auto` mode
that runs parse -> audit end-to-end (like the old one-shot ingest, redact
replace by default). parse/audit reuse ingest.*; train/merge/chat shell out
to llamafactory-cli. The parse/audit split makes "review before training" a
first-class step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ME/gitignore

Separate the source-agnostic pipeline (docs/data-pipeline.md) from
source-specific parsing (docs/sources/telegram.md), reflecting the adapter
architecture so WhatsApp/Discord docs can slot in. Add README notes on the
DataExport folder layout and the small default model. Ignore *.csv (review
exports derived from chat data).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@NotYuSheng, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 40 minutes and 25 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3435dd93-5ef7-4bb4-a93d-586d31967d60

📥 Commits

Reviewing files that changed from the base of the PR and between 9e1347f and d484b80.

📒 Files selected for processing (4)
  • docs/data-pipeline.md
  • doppelganger/__main__.py
  • doppelganger/steps.py
  • tests/test_runner.py
📝 Walkthrough

Walkthrough

Adds a doppelganger CLI runner, pipeline step orchestration, ShareGPT turn coercion, updated ingestion/training docs, and tests for the new conversation-shaping rules.

Changes

Ingest runner and ShareGPT shaping

Layer / File(s) Summary
CLI entrypoint and step menu
doppelganger/__init__.py, doppelganger/__main__.py, doppelganger/steps.py
Adds the package runner, subcommand parser, interactive menu, and shared step-status metadata for the ingest/training workflow.
Pipeline stages and usage docs
doppelganger/steps.py, docs/data-pipeline.md, docs/sources/telegram.md, README.md, .gitignore
Implements parse, audit, train, merge, chat, and auto stages; documents the pipeline and Telegram export/input shape; updates quick-start notes; and ignores CSV artifacts.
ShareGPT alternation and tests
ingest/sharegpt.py, tests/test_ingest.py, docs/data-pipeline.md
Coerces ShareGPT conversations into alternating human/gpt turns and updates tests and ShareGPT-format docs for the new acceptance rules.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Main as doppelganger.__main__.main
  participant Steps as doppelganger.steps
  participant Telegram as ingest.adapters.telegram
  participant ShareGPT as ingest.sharegpt.to_sharegpt
  participant Llama as llamafactory-cli
  User->>Main: python -m doppelganger parse / audit / train / auto
  Main->>Steps: dispatch step
  Steps->>Telegram: load Telegram export
  Telegram-->>Steps: NormalizedMessage list
  Steps->>ShareGPT: to_sharegpt(conversations)
  ShareGPT-->>Steps: ShareGPT samples
  Steps->>Llama: llamafactory-cli train / merge / chat
  Llama-->>Steps: exit code
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I hopped through chats with twitchy nose,
Trimmed the turns where oddness grows.
Human, gpt — now neat and light,
The training carrots taste just right. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.30% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly covers the ShareGPT drop fix and the added step-based runner.
Linked Issues check ✅ Passed The changes address #42 by coercing ShareGPT turns into valid alternation and adding tests for leading-assistant, non-alternating, and multi-speaker cases.
Out of Scope Changes check ✅ Passed The docs, CLI runner, README, and ignore-rule updates align with the PR objectives and do not show unrelated code changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sharegpt-fix-and-runner

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a step-based orchestrator (doppelganger) with an interactive menu and CLI subcommands to run the ingestion pipeline and LLaMA-Factory training. It also adds a conversation coercion step (_coerce_alternating) to ensure dataset compatibility with LLaMA-Factory, along with comprehensive documentation and tests. The review feedback focuses on improving the robustness of the orchestrator steps by gracefully handling missing files, uninstalled dependencies (like PyYAML), and parsing errors to prevent CLI crashes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread doppelganger/steps.py
Comment thread doppelganger/steps.py Outdated
Comment thread doppelganger/steps.py Outdated
Comment thread doppelganger/steps.py Outdated
Comment thread doppelganger/steps.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ingest/sharegpt.py (1)

63-69: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Filter unmapped roles before coercion.

Line 63 still falls back to the raw role string, but _coerce_alternating() only merges duplicates and trims the ends. A sample like user -> system -> assistant now emits human -> system -> gpt, which breaks the “strictly alternate human/gpt” contract and can reintroduce the train-time drop/count mismatch this PR is fixing.

Suggested fix
-            speaker = ROLE_MAP.get(turn.get("role", ""), turn.get("role", ""))
+            raw_role = turn.get("role", "")
+            speaker = ROLE_MAP.get(
+                raw_role,
+                raw_role if raw_role in {"human", "gpt"} else "",
+            )
             value = turn.get("text", "").strip()
             if speaker and value:
                 conversations.append({"from": speaker, "value": value})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/sharegpt.py` around lines 63 - 69, The ShareGPT conversation
normalization in ingest/sharegpt.py still lets unmapped roles pass through via
the fallback in the turn parsing logic, which can survive _coerce_alternating()
and break the strict human/gpt alternation. Update the turn processing around
ROLE_MAP and _coerce_alternating() so only mapped roles are collected into
conversations, dropping any unmapped entries before coercion and keeping the
existing alternating contract intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/data-pipeline.md`:
- Around line 96-99: The redaction report documentation is out of sync with the
actual payload shape: the `redactor` output does not include a raw `value`
field. Update the `data-pipeline.md` section describing
`data/redaction_report.json` so it matches the fields emitted by
`ingest/redactor.py` (`conversation`, `turn`, `role`, `category`, `detector`,
`severity`, and masked `preview`), and remove any mention of `value` from the
documented contract.

In `@doppelganger/__main__.py`:
- Around line 115-118: The interactive menu in the main entrypoint currently
calls input() without handling EOFError or KeyboardInterrupt, which can surface
as a traceback. Update the menu loop around the choice prompt in __main__.py to
catch these exceptions, treat them as a clean exit path, and return a non-error
status code just like the existing quit/empty-input branch. Use the interactive
menu logic near the choice handling in the main function to locate the change.

In `@doppelganger/steps.py`:
- Around line 178-192: The chat() flow is reloading train_config() instead of
using the same config source as train(), which can make it look for the wrong
adapter or model/template after a custom --config run. Update chat() to accept
and use the active training config path/value passed through the CLI, then load
that config before checking adapter_model.safetensors and building the
_llamafactory chat args, keeping its behavior aligned with train().

---

Outside diff comments:
In `@ingest/sharegpt.py`:
- Around line 63-69: The ShareGPT conversation normalization in
ingest/sharegpt.py still lets unmapped roles pass through via the fallback in
the turn parsing logic, which can survive _coerce_alternating() and break the
strict human/gpt alternation. Update the turn processing around ROLE_MAP and
_coerce_alternating() so only mapped roles are collected into conversations,
dropping any unmapped entries before coercion and keeping the existing
alternating contract intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fc17753c-0865-46ef-bffd-831c21a51781

📥 Commits

Reviewing files that changed from the base of the PR and between ae70a6f and 9e1347f.

📒 Files selected for processing (9)
  • .gitignore
  • README.md
  • docs/data-pipeline.md
  • docs/sources/telegram.md
  • doppelganger/__init__.py
  • doppelganger/__main__.py
  • doppelganger/steps.py
  • ingest/sharegpt.py
  • tests/test_ingest.py

Comment thread docs/data-pipeline.md
Comment thread doppelganger/__main__.py Outdated
Comment thread doppelganger/steps.py Outdated
NotYuSheng and others added 2 commits June 26, 2026 00:42
Recommend num_train_epochs from dataset size + effective batch, targeting a
~300-step budget but capping epochs low (and warning) on small datasets — where
hitting the budget would just memorise the same few chats. Printed in `train`
and the `audit` next-step hint; `--epochs auto|N` overrides via a temp config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address PR review:
- adapter_dir / chat: catch Exception (missing PyYAML, malformed config), not
  just OSError, so the menu and chat don't crash on a clean environment
- parse: wrap adapter.parse in try/except for a clean error on malformed
  exports or self-name detection failure
- train / merge / chat: fail fast with a clear message when the config is missing
- chat: accept --config so it aligns with the config train ran on
- menu: handle EOFError / KeyboardInterrupt as a clean exit (0 / 130)
- docs: drop the raw `value` field from the redaction_report.json contract
  (it is not emitted by ingest/redactor.py on this branch)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@NotYuSheng

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a step-based orchestrator for the ingestion pipeline and LLaMA-Factory training, including a new interactive CLI, updated documentation, and a robust conversation coercion mechanism to prevent LLaMA-Factory from silently discarding non-conforming samples. Feedback on the changes highlights several robustness improvements in doppelganger/steps.py, specifically recommending safer type conversions in effective_batch, broader exception handling in audit and train when parsing YAML files, and ensuring temporary configuration files generated during training are properly cleaned up.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread doppelganger/steps.py Outdated
Comment thread doppelganger/steps.py
Comment thread doppelganger/steps.py Outdated
Address second PR review round:
- effective_batch: coerce batch/accum defensively (missing, null, or non-int
  values fall back to 1 instead of raising)
- audit next-step hint: catch Exception (malformed config / missing PyYAML),
  not just OSError, so the hint never crashes a successful audit
- train: guard _read_yaml with try/except, and delete the temp override config
  in a finally block so --epochs runs don't leak temp files

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@NotYuSheng NotYuSheng merged commit 9956865 into main Jun 25, 2026
4 checks passed
@NotYuSheng NotYuSheng deleted the feat/sharegpt-fix-and-runner branch June 25, 2026 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant