feat: sensitive-data redaction, smarter grouping, OpenAI-compatible LLM, config hygiene by NotYuSheng · Pull Request #12 · NotYuSheng/Doppelganger

NotYuSheng · 2026-06-23T14:49:47Z

Summary

Adds a privacy/sanitization layer and improves conversation quality before training, plus an OpenAI-compatible LLM client and config hygiene.

🔒 Sensitive-data redaction (new)

ingest/redaction/ — locale-keyed regex detector registry (mirrors the adapter registry). Universal detectors (email, Luhn-checked cards, IP/MAC, API/private keys) always run; locale packs add country IDs. Singapore ships as the reference (NRIC with checksum, local phone, postal). Adding a country = one drop-in module.
ingest/redactor.py — non-destructive scan → data/redaction_report.json (masked previews). Opt-in --redact replace|drop. Optional LLM verbatim-span detection flows into the same report/apply step.
CLI: --redact, --redact-locales, --skip-redact-scan, --llm-redact (local-first cloud-consent guard), and --no-audit / --skip-validation off-switches.

🧵 Conversation grouping

NormalizedMessage gains message_id/reply_to_id; Telegram adapter populates them.
Reply-threading stitches gap-split conversations back together; --multi-speaker preserves and labels group-chat senders (the owner's turns are never labelled).
Validator gains a pairing axis and keep/split/drop repair of over-merged samples.

🤖 LLM client

ingest/llm.py — shared OpenAI-compatible client (OpenAI or local Ollama/vLLM/LM Studio), replacing the Anthropic SDK. Degrades gracefully if the endpoint is down (per-item catch, dataset still written).
Env vars renamed off the old DialogSmith name: LLM_VALIDATE / LLM_MODEL / LLM_API_KEY / LLM_API_BASE_URL (aligns with the TracePcap convention).

⚙️ Config & docs

train_lora.yaml: explicit train_on_prompt: false documenting loss masking (makes --multi-speaker labels safe).
*.local.yaml override pattern (gitignored) keeps personal model/hardware tweaks out of git; .env reconciled; .env.example → example.env.
README restyled to the house style; prominent Caution + Intended/Responsible Use sections.

Testing

python -m unittest discover -s tests -t . → 41 tests pass (new tests/test_redaction.py + grouping/validator cases).
Manually verified: graceful degradation with a dead LLM endpoint, --redact replace/drop, and --no-audit (zero scan/LLM calls, dataset still written).

Notes

The git remote still points at the old DialogSmith URL (GitHub redirects it). Worth updating to the canonical Doppelganger URL separately.
Default base model stays Qwen1.5-1.8B in the committed config; personal 14B setup lives in gitignored *.local.yaml.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Regex-based sensitive-data detection with locale support and optional LLM-assisted redaction
- Multi-speaker conversation support with speaker labeling
- Message reply threading across conversation gaps
- LLM quality auditor with split/drop/keep actions
- Added startup ASCII banner
Bug Fixes
- Telegram adapter now handles missing sender fields
Documentation
- Substantially updated README with refreshed setup flow, CLI flags, and LLM configuration guidance
Chores
- Switched dependencies from Anthropic to OpenAI SDK
- Removed deprecated legacy scripts
- Added demo assets and sample exports

…LM, config hygiene Adds a privacy/sanitization layer and improves conversation quality before training. Sensitive-data redaction (new): - ingest/redaction/: locale-keyed regex detector registry (universal + a Singapore pack with NRIC checksum, local phone, postal), mirroring the adapter registry so new countries are a single drop-in module. - ingest/redactor.py: non-destructive scan -> data/redaction_report.json (masked previews), opt-in --redact replace|drop, plus optional LLM verbatim-span detection. - CLI: --redact, --redact-locales, --skip-redact-scan, --llm-redact (with a local-first cloud-consent guard), and --no-audit / --skip-validation off-switches. Conversation grouping: - NormalizedMessage gains message_id/reply_to_id; Telegram adapter populates them. - core: reply-threading stitches gap-split conversations back together; --multi-speaker preserves and labels group-chat senders (the owner's turns are never labelled). - validator: adds a pairing axis and keep/split/drop repair of over-merged samples. LLM client: - ingest/llm.py: shared OpenAI-compatible client (OpenAI or local Ollama/vLLM/LM Studio), replacing the Anthropic SDK. Degrades gracefully if the endpoint is down. - Env vars renamed off the old DialogSmith name: LLM_VALIDATE / LLM_MODEL / LLM_API_KEY / LLM_API_BASE_URL. Config & docs: - train_lora.yaml: explicit train_on_prompt: false to document loss masking (makes --multi-speaker labels safe). - *.local.yaml override pattern (gitignored) keeps personal model/hardware tweaks out of git; .env reconciled to current vars; .env.example renamed to example.env. - README restyled to the project house style; prominent caution + intended/responsible use sections. Tests: adds tests/test_redaction.py and new grouping/validator cases (41 total, green). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a robust sensitive-data redaction and auditing pipeline using both regex-based detectors (with universal and Singapore-specific locales) and optional LLM-assisted scanning. It also refactors the LLM validation to use a shared OpenAI-compatible client supporting local endpoints, and enhances conversation reconstruction with reply-based threading and a multi-speaker mode. The review feedback highlights several key improvement opportunities: prioritizing longer/outermost spans during overlapping redaction to prevent sensitive data leaks, avoiding redundant speaker prefixes on merged turns in multi-speaker mode, implementing consecutive failure thresholds to gracefully handle LLM API outages, and correcting the Singapore postal code regex to match the standard S123456 format.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

The scripts/telegram_extract.py and scripts/convert_to_sharegpt.py shims only delegated to `python -m ingest`; remove them and update the Legacy Workflow note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-23T14:54:10Z

Warning

Review limit reached

@NotYuSheng, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 44 minutes and 54 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3eeff0ab-a5cc-4496-981f-f4166444df70

📥 Commits

Reviewing files that changed from the base of the PR and between 874e305 and 301ce6c.

📒 Files selected for processing (6)

demo/README.md
demo/build_final.py
demo/img2ascii.py
ingest/cli.py
ingest/validator.py
tests/test_ingest.py

📝 Walkthrough

Walkthrough

The PR replaces the Anthropic LLM client with an OpenAI-compatible wrapper (ingest/llm.py), adds a full regex+LLM sensitive-data redaction pipeline (ingest/redaction/, ingest/redactor.py), introduces reply-based conversation stitching and multi-speaker turn assembly in ingest/core.py, extends the CLI with new redaction/audit flags, renames the env template, removes deprecated shim scripts, and adds demo build tooling.

Changes

Redaction, threading, LLM migration, and CLI integration

Layer / File(s)	Summary
NormalizedMessage reply fields and Telegram extraction `ingest/message.py`, `ingest/adapters/telegram.py`	`NormalizedMessage` gains optional `message_id` and `reply_to_id` fields; `TelegramAdapter.parse` populates both from raw export data and defaults missing `"from"` fields to `"Unknown"`.
Reply-based conversation stitching and multi-speaker turn assembly `ingest/core.py`, `tests/test_ingest.py`	`_merge_by_reply` (union-find over reply chains) stitches time-split segments; `_assemble_turns` handles multi-speaker role/text construction with sender-prefixed labels; `build_samples` is updated to call both. Tests cover stitching across time gaps and labeled vs. collapsed speaker output.
Regex redaction core: Detector/Finding dataclasses and scan_text `ingest/redaction/__init__.py`	Defines immutable `Detector`/`Finding` dataclasses, the global registry with `register`/`make`/`iter_detectors`, `mask()`, `scan_text()` with optional named-group and validator support, and `luhn_valid()`; imports bundled locale modules at package load time.
Universal and Singapore locale detector registrations `ingest/redaction/universal.py`, `ingest/redaction/sg.py`	`universal.py` registers email, IPv4/IPv6, MAC, credit card (Luhn-gated), and credential/key pattern detectors. `sg.py` registers NRIC/FIN (with `nric_valid` checksum), partial NRIC, Singapore phone, and postal code detectors.
Redactor: scan, apply, LLM tier, and merge `ingest/redactor.py`	Implements `scan_samples()` for non-destructive audit reports, `write_report`/`print_summary`, `_replace_spans()` with right-to-left overlap resolution, `apply()` for `replace`/`drop` modes, `llm_scan_samples()` for verbatim-span LLM detection, and `merge_llm_findings()` to fold LLM results into regex reports.
Redaction tests `tests/test_redaction.py`	Covers universal and SG detector correctness, registry uniqueness, `scan_samples` report structure, `apply` replace/drop semantics, LLM scan span location and fallback, `merge_llm_findings` totals, and `_replace_spans` overlap/merge handling.
OpenAI-compatible LLM client module `ingest/llm.py`	Introduces `LLM_*` env-var constants, config helpers (`base_url`, `model`, `is_local`, `should_validate`), `get_client()` with lazy `openai` import and hosted-vs-local key enforcement, `endpoint_label()`, and `chat()` for single-turn completions.
Validator refactor to OpenAI-compatible auditor `ingest/validator.py`, `tests/test_ingest.py`	Replaces Anthropic-based scoring with the shared `ingest.llm` client; `_score_sample` expands to three-axis JSON scoring (coherence/quality/pairing) with `keep\|split\|drop` action; adds `_apply_split` and `_has_both_roles` helpers; `validate_samples` uses `llm.should_validate()` gating with consecutive-failure abort.
CLI pipeline: new flags and integrated redaction/validation flow `ingest/cli.py`, `ingest/banner.py`	Adds `_run_llm_redaction()` guard, new parser flags (`--multi-speaker`, `--redact`, `--redact-locales`, `--skip-redact-scan`, `--llm-redact`, `--allow-cloud-redaction`, `--no-audit`, `--skip-validation`), banner printing in `main()`, and the conditional regex scan → LLM scan → apply → validate pipeline. `banner.py` provides the ANSI startup banner with env-var suppression.
Demo assets, env template, config, deps, and legacy script removal `example.env`, `.gitignore`, `configs/train_lora.yaml`, `requirements.txt`, `setup.bat`, `setup.sh`, `demo/*`, `README.md`	Renames `.env.example` to `example.env` with `LLM_` variables; ignores `configs/.local.yaml`; sets `train_on_prompt: false`; replaces `anthropic>=0.39` with `openai>=1.0`; updates setup scripts; removes deprecated shims (`scripts/convert_to_sharegpt.py`, `scripts/telegram_extract.py`); adds `demo/build_final.py`, `demo/img2ascii.py`, `demo/mascot.txt`, `demo/sample_export.json`; substantially rewrites README.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI as ingest/cli.py
  participant core as ingest/core.py
  participant redactor as ingest/redactor.py
  participant llm as ingest/llm.py
  participant validator as ingest/validator.py

  User->>CLI: python -m ingest --source telegram --redact replace --llm-redact
  CLI->>core: build_samples(messages, multi_speaker=...)
  core-->>CLI: samples (with reply stitching)
  CLI->>redactor: scan_samples(samples, locales)
  redactor-->>CLI: regex audit report
  CLI->>redactor: llm_scan_samples(samples, client, model)
  redactor->>llm: chat(client, model, prompt)
  llm-->>redactor: verbatim findings JSON
  redactor-->>CLI: llm_findings
  CLI->>redactor: merge_llm_findings(report, llm_findings)
  redactor-->>CLI: merged report
  CLI->>redactor: apply(samples, mode="replace", llm_findings=...)
  redactor-->>CLI: redacted samples
  CLI->>validator: validate_samples(redacted_samples)
  validator->>llm: chat(client, model, scoring_prompt)
  llm-->>validator: keep/split/drop JSON
  validator-->>CLI: validated samples
  CLI-->>User: writes output + redaction_report.json

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Additional Singapore detectors: NRIC M-series checksum + UEN #15: The PR adds ingest/redaction/sg.py with nric_valid covering S/T/F/G series, and the issue proposes extending nric_valid to support M-series checksums and adding a new sg_uen detector — both target the same file with code-level additions to the existing Singapore detector module.

Poem

🐰 Hop hop, the rabbit checks every line,
Regex detectors now scan and align,
Anthropic's gone, OpenAI takes the stage,
Reply threads stitched across many a page,
With redact and split the chats look so fine —
✨ No secrets shall slip through this warren of mine!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 28.72% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the four major changes: sensitive-data redaction, smarter conversation grouping, OpenAI-compatible LLM client, and configuration improvements. It is specific, concise, and directly reflects the PR's primary objectives.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/data-sanitization-and-pipeline-hardening

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

Caution block now owns data-sensitivity + consent + law; Important block owns model-misuse. Removes the contradictory 'never consented' phrasing and the duplicate consent line. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

States the project's intent to explore pre-training, fine-tuning, and alignment, plus persona prompting (#14), more adapters (#9), NER redaction (#13), and wider locale packs (#15). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Frames the project as a learning sandbox spanning training, memory/RAG, agents, MCP, guardrails, and evaluation. Links the new exploration issues (#16-#21) alongside existing tracked work (#9, #13, #14, #15). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds multimodal, inference-time control, and interpretability tracks and links the complete exploration backlog; points to the exploration label for the rest. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Keep the README roadmap as a plain statement of exploration areas; the issue tracker holds the live backlog. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…rts, SG postal) - redactor._replace_spans: on overlap keep the outer/longer span so an inner span can't leak the rest (e.g. DOMAIN inside EMAIL) [security-high]. - core._assemble_turns: apply the speaker prefix only once per merged turn. - redactor.llm_scan_samples + validator.validate_samples: abort after 5 consecutive LLM failures instead of flooding the console. - redaction/sg.py: sg_postal also matches the common "S123456" form. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…oud suggestions) - llm.py: remove DEFAULT_MODEL; LLM_MODEL is required to enable the LLM features (clear error otherwise). Docs use the vLLM/LM Studio HF model-id convention. - example.env / README: lead with a LOCAL OpenAI-compatible server (Qwen/Qwen2.5-7B-Instruct), remove gpt-4o-mini/cloud suggestions; drop the roadmap "(today)" qualifier. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

NotYuSheng · 2026-06-23T15:36:01Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors the ingestion pipeline to be source-agnostic, introducing conversation reconstruction via reply-threading, multi-speaker labeling, and a non-destructive sensitive-data redaction system (supporting both regex and local LLM-assisted scanning). It also transitions the quality auditor to a shared OpenAI-compatible client and removes legacy scripts. The review feedback highlights three key improvements: ensuring llm_scan_samples redacts all occurrences of a sensitive span rather than just the first, adding a fallback for missing sender fields in the Telegram adapter to prevent type violations, and appending a negative lookahead to the Singapore postal code regex to avoid false positives on longer digit sequences.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

A trailing negative lookahead stops sg_postal matching the first 6 digits of a longer token (e.g. NRIC S1234567D reading as S123456). Adds a regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- ingest/banner.py: a parrot-in-a-mirror mascot (it mimics your voice; the mirror is the doppelganger) beside an ansi_shadow "Doppel/ganger" wordmark in truecolor amber, printed at CLI startup. DOPPELGANGER_NO_BANNER silences it. - README: embed demo/demo.gif (ingest + sensitive-data scan) at the top. - demo/: synthetic sample_export.json (gitignored exception) + the mascot source image and the build/convert scripts used to generate the art and GIF. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…sender - redactor.llm_scan_samples: locate EVERY non-overlapping occurrence of a flagged span (text.find loop), not just the first, so repeated names/numbers can't leak [security-high]. - telegram adapter: 'from' may be None (anonymous channel posts) -> default to 'Unknown' so sender_id stays a str and multi-speaker mode has no 'None:' prefix. - Tests for both. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

NotYuSheng · 2026-06-23T22:48:00Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a major update to the Doppelganger project, adding robust features for data privacy, sensitive-data scanning, and LLM-assisted redaction and auditing. Key additions include a regex-based sensitive-data scanner with universal and Singapore-specific detectors, an optional LLM-assisted redaction pass, and an upgraded conversation quality auditor that can split over-merged conversations. Additionally, the core pipeline is enhanced with reply-threading and multi-speaker labeling, accompanied by updated documentation, tests, and a reproducible demo. The review feedback identifies three valid issues: a bug in the greedy interval selection algorithm in ingest/redactor.py that can leave sensitive data exposed during overlapping span replacements, a potential resource leak in demo/build_final.py due to an unclosed file, and a potential TypeError in ingest/redaction/__init__.py when handling unmatched optional regex groups.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

…roup - redactor._replace_spans: merge overlapping spans and redact the full region (label = longest contributor), so a shorter span can't shadow part of a longer sensitive one [security-high]. - redaction.scan_text: skip a hit whose optional id group didn't match (m.span(id) == (-1,-1)) instead of crashing in mask(). - demo/build_final.py: read mascot.txt via a context manager. - Tests for merge behaviour and the optional-id guard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (6)

ingest/llm.py (2)
62-68: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Chain the re-raised ImportError.

Re-raising inside the except clause loses the original traceback context. Use from None (the cause is obvious here) to satisfy B904 and keep the message clean.
♻️ Proposed change
     try:
         from openai import OpenAI
-    except ImportError:
+    except ImportError:
         raise ImportError(
             "The 'openai' package is required for LLM features. "
             "Install it with: pip install openai"
-        )
+        ) from None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/llm.py` around lines 62 - 68, The re-raised ImportError in the except
clause should use `from None` to suppress the exception context chain and
satisfy the B904 linting rule. Modify the `raise ImportError` statement to
include `from None` at the end, so the exception is raised without the implicit
chaining that normally occurs when raising inside an except block.
Source: Linters/SAST tools

94-101: 🩺 Stability & Availability | 🔵 Trivial

Add explicit per-request timeout to prevent stalls on slow endpoints.

The OpenAI SDK 1.x supports per-request timeouts via client.with_options(timeout=...). Without an explicit timeout, a slow or hung endpoint can stall the batch CLI for a long time on each call. Use client.with_options(timeout=X).chat.completions.create(...) where X is a float (seconds) or httpx.Timeout object to fail fast and let the validator's consecutive-failure guard engage sooner.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/llm.py` around lines 94 - 101, The chat function makes an API call to
client.chat.completions.create() without an explicit per-request timeout, which
can cause the validator to stall indefinitely on slow or unresponsive endpoints.
Modify the client.chat.completions.create() call to use
client.with_options(timeout=X).chat.completions.create() pattern, where X is a
float value in seconds representing the timeout duration. This will ensure the
request fails fast if the endpoint is slow or hung, allowing the
consecutive-failure guard mechanism to engage sooner.
tests/test_ingest.py (1)
132-132: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Optional: prefer next() for single-element lookup.

The linter suggests using next(m for m in msgs if m.timestamp == 100) instead of the list comprehension with indexing, which is slightly more idiomatic for single-element lookups.
♻️ Suggested refactor
-            anon = [m for m in msgs if m.timestamp == 100][0]
+            anon = next(m for m in msgs if m.timestamp == 100)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_ingest.py` at line 132, Replace the list comprehension with
indexing `[0]` used to find the single message with timestamp equal to 100 in
the msgs collection with the `next()` function using a generator expression.
This is more idiomatic for single-element lookups and avoids creating an
intermediate list just to access the first element. Change the line where `anon`
is assigned to use `next(m for m in msgs if m.timestamp == 100)` instead of the
current list comprehension approach.
Source: Linters/SAST tools
ingest/core.py (1)
88-97: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider extracting union-find to a shared utility module.

The find and union helper functions implement standard union-find logic that is duplicated in ingest/redactor.py (lines 88-92 show an identical find implementation). Extracting this to a common utility module would reduce duplication and make the pattern more discoverable.
♻️ Suggested approach

Create a new ingest/utils.py module:
def union_find_merge(n: int, edges) -> list:
    """Merge items using union-find. Returns grouped indices."""
    parent = list(range(n))
    
    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    
    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    
    for a, b in edges:
        union(a, b)
    
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
Then refactor both _merge_by_reply and redactor._replace_spans to use this shared implementation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/core.py` around lines 88 - 97, The find and union functions in the
current location are duplicated in ingest/redactor.py, creating maintenance
issues. Create a new ingest/utils.py module with a shared union_find_merge
function that encapsulates the complete union-find logic, accepting the number
of items and a list of edges to merge, and returning the grouped indices. Then
refactor both the find and union function definitions in ingest/core.py and the
duplicate find implementation in ingest/redactor.py to use this new shared
utility function instead of defining their own versions.
ingest/redaction/sg.py (1)
40-40: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add strict=True to zip() for defensive programming.

While the current code guarantees matching lengths (7 digits and 7 weights), adding strict=True provides a runtime assertion that fails fast if future refactoring breaks this invariant.
🛡️ Proposed fix
-    total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS))
+    total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS, strict=True))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/redaction/sg.py` at line 40, Add the `strict=True` parameter to the
`zip()` function call in the total variable calculation where digits are being
zipped with _WEIGHTS. This ensures that if the lengths of the two iterables
don't match, a ValueError will be raised immediately rather than silently
truncating, providing defensive programming against future refactoring that
might inadvertently break this invariant.
Source: Linters/SAST tools
ingest/redaction/universal.py (1)
48-48: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Use HYPHEN-MINUS instead of EN DASH in the comment.

The comment contains an EN DASH (–, U+2013) instead of a HYPHEN-MINUS (-, U+002D). While functionally harmless in a comment, using the standard ASCII hyphen improves consistency and avoids potential copy-paste issues.
📝 Proposed fix
-# Broad 13–19 digit run (optionally space/dash grouped); Luhn rejects the noise.
+# Broad 13-19 digit run (optionally space/dash grouped); Luhn rejects the noise.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/redaction/universal.py` at line 48, The comment containing "Broad
13–19 digit run" uses an EN DASH character (U+2013) between the numbers instead
of a standard HYPHEN-MINUS (U+002D). Replace the EN DASH with a regular
hyphen-minus character to improve consistency and avoid potential copy-paste
issues when the standard ASCII hyphen is expected.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@demo/build_final.py`:
- Around line 70-71: Add check=True parameter to both subprocess.run calls (at
the first occurrence and the second one around lines 91-93) to make subprocess
raise an exception when commands return non-zero exit codes, ensuring the script
fails fast on errors. This will prevent the script from continuing to produce
artifacts when ingestion or rendering fails. If you need to capture and report
stderr on failure, wrap the subprocess.run calls in a try-except block to handle
the CalledProcessError exception.
- Line 15: Replace the hardcoded AGG variable assignment with a dynamic lookup
that first checks the AGG environment variable and falls back to searching the
system PATH using shutil.which, then raise an informative error if the
executable is not found. Additionally, locate the subprocess calls around lines
91-93 that handle GIF rendering and modify them to include check=True parameter
and remove DEVNULL output suppression, allowing subprocess errors to be properly
raised and surfaced to the caller rather than silently failing.

In `@demo/img2ascii.py`:
- Around line 35-37: In the main execution block where to_ascii is called, add a
validation guard before accessing args[0] to check that the args list is not
empty. If args is empty, print a usage message that describes the required PATH
argument and optional width parameter, then call sys.exit with a non-zero exit
code (e.g., 1). This prevents the IndexError from being raised and provides
helpful guidance to the user when the script is invoked without the required
PATH argument.
- Line 10: The import statement `from PIL import Image` requires the Pillow
package but this dependency is not declared in requirements.txt or the project's
configuration file (setup.py, pyproject.toml, etc.). Add Pillow as a dependency
to the appropriate project configuration file to ensure that fresh installations
can properly resolve the required package.

In `@ingest/cli.py`:
- Around line 210-230: The redactor.apply() call is currently nested inside the
`if not skip_scan:` block, which means redaction is silently skipped when either
`args.no_audit` or `args.skip_redact_scan` is true, even if the user explicitly
requested redaction via `--redact` flag. Decouple the redaction application from
the scan logic by either moving the redactor.apply() call and related redaction
output logic outside the `if not skip_scan:` block (so redaction is always
applied when requested regardless of scan status), or add validation logic early
in the function to warn or error when args.redact is not "off" while skip_scan
is true. Ensure that --llm-redact is also handled correctly and not silently
dropped.

In `@ingest/validator.py`:
- Around line 155-170: The split repair feature is unreachable because the
low-score drop condition is evaluated first. In the validator logic around the
low variable and the subsequent conditional branches, reorder the if-elif-else
chain to check r["action"] == "split" before the combined condition r["action"]
== "drop" or low. This ensures samples marked for splitting are processed before
being dropped due to low coherence, quality, or pairing scores, allowing the
split repair to function for its intended use case of over-merged samples.

---

Nitpick comments:
In `@ingest/core.py`:
- Around line 88-97: The find and union functions in the current location are
duplicated in ingest/redactor.py, creating maintenance issues. Create a new
ingest/utils.py module with a shared union_find_merge function that encapsulates
the complete union-find logic, accepting the number of items and a list of edges
to merge, and returning the grouped indices. Then refactor both the find and
union function definitions in ingest/core.py and the duplicate find
implementation in ingest/redactor.py to use this new shared utility function
instead of defining their own versions.

In `@ingest/llm.py`:
- Around line 62-68: The re-raised ImportError in the except clause should use
`from None` to suppress the exception context chain and satisfy the B904 linting
rule. Modify the `raise ImportError` statement to include `from None` at the
end, so the exception is raised without the implicit chaining that normally
occurs when raising inside an except block.
- Around line 94-101: The chat function makes an API call to
client.chat.completions.create() without an explicit per-request timeout, which
can cause the validator to stall indefinitely on slow or unresponsive endpoints.
Modify the client.chat.completions.create() call to use
client.with_options(timeout=X).chat.completions.create() pattern, where X is a
float value in seconds representing the timeout duration. This will ensure the
request fails fast if the endpoint is slow or hung, allowing the
consecutive-failure guard mechanism to engage sooner.

In `@ingest/redaction/sg.py`:
- Line 40: Add the `strict=True` parameter to the `zip()` function call in the
total variable calculation where digits are being zipped with _WEIGHTS. This
ensures that if the lengths of the two iterables don't match, a ValueError will
be raised immediately rather than silently truncating, providing defensive
programming against future refactoring that might inadvertently break this
invariant.

In `@ingest/redaction/universal.py`:
- Line 48: The comment containing "Broad 13–19 digit run" uses an EN DASH
character (U+2013) between the numbers instead of a standard HYPHEN-MINUS
(U+002D). Replace the EN DASH with a regular hyphen-minus character to improve
consistency and avoid potential copy-paste issues when the standard ASCII hyphen
is expected.

In `@tests/test_ingest.py`:
- Line 132: Replace the list comprehension with indexing `[0]` used to find the
single message with timestamp equal to 100 in the msgs collection with the
`next()` function using a generator expression. This is more idiomatic for
single-element lookups and avoids creating an intermediate list just to access
the first element. Change the line where `anon` is assigned to use `next(m for m
in msgs if m.timestamp == 100)` instead of the current list comprehension
approach.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 1417d538-60cb-4e8d-b169-5add8b41e5b8

📥 Commits

Reviewing files that changed from the base of the PR and between ef7cd61 and 874e305.

⛔ Files ignored due to path filters (2)

demo/demo.gif is excluded by !**/*.gif
demo/parrot-mirror.jpg is excluded by !**/*.jpg

📒 Files selected for processing (27)

.env.example
.gitignore
README.md
configs/train_lora.yaml
demo/build_final.py
demo/img2ascii.py
demo/mascot.txt
demo/sample_export.json
example.env
ingest/adapters/telegram.py
ingest/banner.py
ingest/cli.py
ingest/core.py
ingest/llm.py
ingest/message.py
ingest/redaction/__init__.py
ingest/redaction/sg.py
ingest/redaction/universal.py
ingest/redactor.py
ingest/validator.py
requirements.txt
scripts/convert_to_sharegpt.py
scripts/telegram_extract.py
setup.bat
setup.sh
tests/test_ingest.py
tests/test_redaction.py

💤 Files with no reviewable changes (3)

.env.example
scripts/convert_to_sharegpt.py
scripts/telegram_extract.py

…en demo scripts - cli: decouple redaction application from the scan/report — --redact is now applied even with --skip-redact-scan/--no-audit, so chat data isn't silently left unredacted [security]. - validator: evaluate the 'split' action before the low-score drop, so over-merged samples are repaired instead of discarded. - demo scripts: configurable AGG path ($AGG), check=True on subprocess, usage guard in img2ascii, and a demo/README documenting the dev-only deps (pillow, pyfiglet, agg). - Tests for the two behaviours. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread ingest/redactor.py

Comment thread ingest/core.py

Comment thread ingest/redactor.py

Comment thread ingest/validator.py

Comment thread ingest/redaction/sg.py

chore: remove deprecated script shims; fix remote name in docs

a9367c7

The scripts/telegram_extract.py and scripts/convert_to_sharegpt.py shims only delegated to `python -m ingest`; remove them and update the Legacy Workflow note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

NotYuSheng and others added 8 commits June 23, 2026 22:56

docs: expand Roadmap to full exploration backlog (#16-#38)

4c207ea

Adds multimodal, inference-time control, and interpretability tracks and links the complete exploration backlog; points to the exploration label for the rest. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs: state roadmap techniques without issue links

23b12c5

Keep the README roadmap as a plain statement of exploration areas; the issue tracker holds the live backlog. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs: add local LLM (LM Studio) setup + model suggestions by hardware

bf75479

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread ingest/redactor.py Outdated

Comment thread ingest/adapters/telegram.py Outdated

Comment thread ingest/redaction/sg.py

NotYuSheng and others added 3 commits June 23, 2026 23:45

fix: SG postal detector no longer matches the leading digits of an NRIC

b395054

A trailing negative lookahead stops sg_postal matching the first 6 digits of a longer token (e.g. NRIC S1234567D reading as S123456). Adds a regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread ingest/redactor.py Outdated

Comment thread demo/build_final.py Outdated

Comment thread ingest/redaction/__init__.py Outdated

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread demo/build_final.py Outdated

Comment thread demo/build_final.py Outdated

Comment thread demo/img2ascii.py

Comment thread demo/img2ascii.py

Comment thread ingest/cli.py

Comment thread ingest/validator.py

NotYuSheng merged commit ab217fc into main Jun 23, 2026
4 checks passed

NotYuSheng deleted the feat/data-sanitization-and-pipeline-hardening branch June 23, 2026 23:15

Conversation

NotYuSheng commented Jun 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

🔒 Sensitive-data redaction (new)

🧵 Conversation grouping

🤖 LLM client

⚙️ Config & docs

Testing

Notes

Summary by CodeRabbit

Release Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Poem

❌ Failed checks (1 warning)

Uh oh!

NotYuSheng commented Jun 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NotYuSheng commented Jun 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NotYuSheng commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading