Fix silent ~72% conversation drop (#42) + step-based runner#43
Conversation
…ntract to_sharegpt only checked that both human and gpt roles were present, but LLaMA-Factory additionally requires each conversation to start with human, strictly alternate, and end with gpt. Real chats violate this constantly (other person messages first; same-role runs survive reply-stitching / multi-speaker), so ~72% of samples were silently dropped at train time (3527 written -> 997 trained) with only WARNING logs. Add _coerce_alternating: merge consecutive same-speaker turns (labels stay in the text), strip a leading gpt turn, strip a trailing human turn. Salvages human,gpt,human -> human,gpt instead of discarding the whole sample. On the same export this raises usable samples from 997 to ~3324. Fixes #42 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…pipeline Add `python -m doppelganger`: an interactive numbered menu (no args) plus named subcommands parse / audit / train / merge / chat, and an `auto` mode that runs parse -> audit end-to-end (like the old one-shot ingest, redact replace by default). parse/audit reuse ingest.*; train/merge/chat shell out to llamafactory-cli. The parse/audit split makes "review before training" a first-class step. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ME/gitignore Separate the source-agnostic pipeline (docs/data-pipeline.md) from source-specific parsing (docs/sources/telegram.md), reflecting the adapter architecture so WhatsApp/Discord docs can slot in. Add README notes on the DataExport folder layout and the small default model. Ignore *.csv (review exports derived from chat data). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 40 minutes and 25 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds a ChangesIngest runner and ShareGPT shaping
Sequence Diagram(s)sequenceDiagram
participant User
participant Main as doppelganger.__main__.main
participant Steps as doppelganger.steps
participant Telegram as ingest.adapters.telegram
participant ShareGPT as ingest.sharegpt.to_sharegpt
participant Llama as llamafactory-cli
User->>Main: python -m doppelganger parse / audit / train / auto
Main->>Steps: dispatch step
Steps->>Telegram: load Telegram export
Telegram-->>Steps: NormalizedMessage list
Steps->>ShareGPT: to_sharegpt(conversations)
ShareGPT-->>Steps: ShareGPT samples
Steps->>Llama: llamafactory-cli train / merge / chat
Llama-->>Steps: exit code
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a step-based orchestrator (doppelganger) with an interactive menu and CLI subcommands to run the ingestion pipeline and LLaMA-Factory training. It also adds a conversation coercion step (_coerce_alternating) to ensure dataset compatibility with LLaMA-Factory, along with comprehensive documentation and tests. The review feedback focuses on improving the robustness of the orchestrator steps by gracefully handling missing files, uninstalled dependencies (like PyYAML), and parsing errors to prevent CLI crashes.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ingest/sharegpt.py (1)
63-69: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winFilter unmapped roles before coercion.
Line 63 still falls back to the raw role string, but
_coerce_alternating()only merges duplicates and trims the ends. A sample likeuser -> system -> assistantnow emitshuman -> system -> gpt, which breaks the “strictly alternate human/gpt” contract and can reintroduce the train-time drop/count mismatch this PR is fixing.Suggested fix
- speaker = ROLE_MAP.get(turn.get("role", ""), turn.get("role", "")) + raw_role = turn.get("role", "") + speaker = ROLE_MAP.get( + raw_role, + raw_role if raw_role in {"human", "gpt"} else "", + ) value = turn.get("text", "").strip() if speaker and value: conversations.append({"from": speaker, "value": value})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ingest/sharegpt.py` around lines 63 - 69, The ShareGPT conversation normalization in ingest/sharegpt.py still lets unmapped roles pass through via the fallback in the turn parsing logic, which can survive _coerce_alternating() and break the strict human/gpt alternation. Update the turn processing around ROLE_MAP and _coerce_alternating() so only mapped roles are collected into conversations, dropping any unmapped entries before coercion and keeping the existing alternating contract intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/data-pipeline.md`:
- Around line 96-99: The redaction report documentation is out of sync with the
actual payload shape: the `redactor` output does not include a raw `value`
field. Update the `data-pipeline.md` section describing
`data/redaction_report.json` so it matches the fields emitted by
`ingest/redactor.py` (`conversation`, `turn`, `role`, `category`, `detector`,
`severity`, and masked `preview`), and remove any mention of `value` from the
documented contract.
In `@doppelganger/__main__.py`:
- Around line 115-118: The interactive menu in the main entrypoint currently
calls input() without handling EOFError or KeyboardInterrupt, which can surface
as a traceback. Update the menu loop around the choice prompt in __main__.py to
catch these exceptions, treat them as a clean exit path, and return a non-error
status code just like the existing quit/empty-input branch. Use the interactive
menu logic near the choice handling in the main function to locate the change.
In `@doppelganger/steps.py`:
- Around line 178-192: The chat() flow is reloading train_config() instead of
using the same config source as train(), which can make it look for the wrong
adapter or model/template after a custom --config run. Update chat() to accept
and use the active training config path/value passed through the CLI, then load
that config before checking adapter_model.safetensors and building the
_llamafactory chat args, keeping its behavior aligned with train().
---
Outside diff comments:
In `@ingest/sharegpt.py`:
- Around line 63-69: The ShareGPT conversation normalization in
ingest/sharegpt.py still lets unmapped roles pass through via the fallback in
the turn parsing logic, which can survive _coerce_alternating() and break the
strict human/gpt alternation. Update the turn processing around ROLE_MAP and
_coerce_alternating() so only mapped roles are collected into conversations,
dropping any unmapped entries before coercion and keeping the existing
alternating contract intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: fc17753c-0865-46ef-bffd-831c21a51781
📒 Files selected for processing (9)
.gitignoreREADME.mddocs/data-pipeline.mddocs/sources/telegram.mddoppelganger/__init__.pydoppelganger/__main__.pydoppelganger/steps.pyingest/sharegpt.pytests/test_ingest.py
Recommend num_train_epochs from dataset size + effective batch, targeting a ~300-step budget but capping epochs low (and warning) on small datasets — where hitting the budget would just memorise the same few chats. Printed in `train` and the `audit` next-step hint; `--epochs auto|N` overrides via a temp config. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address PR review: - adapter_dir / chat: catch Exception (missing PyYAML, malformed config), not just OSError, so the menu and chat don't crash on a clean environment - parse: wrap adapter.parse in try/except for a clean error on malformed exports or self-name detection failure - train / merge / chat: fail fast with a clear message when the config is missing - chat: accept --config so it aligns with the config train ran on - menu: handle EOFError / KeyboardInterrupt as a clean exit (0 / 130) - docs: drop the raw `value` field from the redaction_report.json contract (it is not emitted by ingest/redactor.py on this branch) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a step-based orchestrator for the ingestion pipeline and LLaMA-Factory training, including a new interactive CLI, updated documentation, and a robust conversation coercion mechanism to prevent LLaMA-Factory from silently discarding non-conforming samples. Feedback on the changes highlights several robustness improvements in doppelganger/steps.py, specifically recommending safer type conversions in effective_batch, broader exception handling in audit and train when parsing YAML files, and ensuring temporary configuration files generated during training are properly cleaned up.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Address second PR review round: - effective_batch: coerce batch/accum defensively (missing, null, or non-int values fall back to 1 instead of raising) - audit next-step hint: catch Exception (malformed config / missing PyYAML), not just OSError, so the hint never crashes a successful audit - train: guard _read_yaml with try/except, and delete the temp override config in a finally block so --epochs runs don't leak temp files Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What & why
Two related improvements to the ingestion path, plus docs.
1. Fix: ~72% of conversations were silently dropped at training time (
Fixes #42)to_sharegpt()only checked that a conversation contained both ahumanand agptturn, but LLaMA-Factory's ShareGPT converter additionally requires eachconversation to start with
human, strictly alternate, and end withgpt. Real chats break these rules constantly (the other person messagesfirst; same-role runs survive reply-stitching /
--multi-speaker), sonon-conforming samples were dropped after the pipeline reported success — only
visible as
Invalid role tag/Invalid message countwarnings.On one real export: 3,527 written → 997 actually trained on.
The new
_coerce_alternating()normalizes each conversation instead of onlyrole-presence checking:
gptturn,humanturn.This salvages
human,gpt,human→human,gptrather than discarding it. Sameexport now yields 3,324 usable samples (0 malformed), and the reported count
matches what trains.
Tests: updated the two goldens that encoded the old (droppable) shapes; added
CoerceAlternatingTest(leading-assistant strip, trailing-user strip, same-rolemerge, unusable-pair drop, well-formedness invariant). 52/52 pass.
2. Feature: step-based
doppelgangerrunnerpython -m doppelgangerexposes the pipeline as discrete steps:parse,audit,train,merge,chat;auto→ runsparse → auditend-to-end (like the old one-shot ingest;--redact replaceby default with a warning),--trainto chain training.parse/auditreuseingest.*;train/merge/chatwrapllamafactory-cli.Splitting
parseandauditmakes the "review before training" checkpoint afirst-class step. The existing
python -m ingestone-shot is unchanged.3. Docs
docs/data-pipeline.md) fromsource-specific parsing (
docs/sources/telegram.md), matching the adapterarchitecture so WhatsApp/Discord docs slot in later.
DataExport/export-folder layout and the small defaultmodel. Ignore
*.csv.Notes for reviewers
python -m ingest.data/chat_dataset.jsonl(rawparseoutput) is gitignored alongside theother
data/artifacts.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes