Skip to content

feat: sensitive-data redaction, smarter grouping, OpenAI-compatible LLM, config hygiene#12

Merged
NotYuSheng merged 15 commits into
mainfrom
feat/data-sanitization-and-pipeline-hardening
Jun 23, 2026
Merged

feat: sensitive-data redaction, smarter grouping, OpenAI-compatible LLM, config hygiene#12
NotYuSheng merged 15 commits into
mainfrom
feat/data-sanitization-and-pipeline-hardening

Conversation

@NotYuSheng

@NotYuSheng NotYuSheng commented Jun 23, 2026

Copy link
Copy Markdown
Owner

Summary

Adds a privacy/sanitization layer and improves conversation quality before training, plus an OpenAI-compatible LLM client and config hygiene.

🔒 Sensitive-data redaction (new)

  • ingest/redaction/ — locale-keyed regex detector registry (mirrors the adapter registry). Universal detectors (email, Luhn-checked cards, IP/MAC, API/private keys) always run; locale packs add country IDs. Singapore ships as the reference (NRIC with checksum, local phone, postal). Adding a country = one drop-in module.
  • ingest/redactor.py — non-destructive scan → data/redaction_report.json (masked previews). Opt-in --redact replace|drop. Optional LLM verbatim-span detection flows into the same report/apply step.
  • CLI: --redact, --redact-locales, --skip-redact-scan, --llm-redact (local-first cloud-consent guard), and --no-audit / --skip-validation off-switches.

🧵 Conversation grouping

  • NormalizedMessage gains message_id/reply_to_id; Telegram adapter populates them.
  • Reply-threading stitches gap-split conversations back together; --multi-speaker preserves and labels group-chat senders (the owner's turns are never labelled).
  • Validator gains a pairing axis and keep/split/drop repair of over-merged samples.

🤖 LLM client

  • ingest/llm.py — shared OpenAI-compatible client (OpenAI or local Ollama/vLLM/LM Studio), replacing the Anthropic SDK. Degrades gracefully if the endpoint is down (per-item catch, dataset still written).
  • Env vars renamed off the old DialogSmith name: LLM_VALIDATE / LLM_MODEL / LLM_API_KEY / LLM_API_BASE_URL (aligns with the TracePcap convention).

⚙️ Config & docs

  • train_lora.yaml: explicit train_on_prompt: false documenting loss masking (makes --multi-speaker labels safe).
  • *.local.yaml override pattern (gitignored) keeps personal model/hardware tweaks out of git; .env reconciled; .env.exampleexample.env.
  • README restyled to the house style; prominent Caution + Intended/Responsible Use sections.

Testing

  • python -m unittest discover -s tests -t .41 tests pass (new tests/test_redaction.py + grouping/validator cases).
  • Manually verified: graceful degradation with a dead LLM endpoint, --redact replace/drop, and --no-audit (zero scan/LLM calls, dataset still written).

Notes

  • The git remote still points at the old DialogSmith URL (GitHub redirects it). Worth updating to the canonical Doppelganger URL separately.
  • Default base model stays Qwen1.5-1.8B in the committed config; personal 14B setup lives in gitignored *.local.yaml.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Regex-based sensitive-data detection with locale support and optional LLM-assisted redaction
    • Multi-speaker conversation support with speaker labeling
    • Message reply threading across conversation gaps
    • LLM quality auditor with split/drop/keep actions
    • Added startup ASCII banner
  • Bug Fixes

    • Telegram adapter now handles missing sender fields
  • Documentation

    • Substantially updated README with refreshed setup flow, CLI flags, and LLM configuration guidance
  • Chores

    • Switched dependencies from Anthropic to OpenAI SDK
    • Removed deprecated legacy scripts
    • Added demo assets and sample exports

…LM, config hygiene

Adds a privacy/sanitization layer and improves conversation quality before training.

Sensitive-data redaction (new):
- ingest/redaction/: locale-keyed regex detector registry (universal + a Singapore
  pack with NRIC checksum, local phone, postal), mirroring the adapter registry so
  new countries are a single drop-in module.
- ingest/redactor.py: non-destructive scan -> data/redaction_report.json (masked
  previews), opt-in --redact replace|drop, plus optional LLM verbatim-span detection.
- CLI: --redact, --redact-locales, --skip-redact-scan, --llm-redact (with a
  local-first cloud-consent guard), and --no-audit / --skip-validation off-switches.

Conversation grouping:
- NormalizedMessage gains message_id/reply_to_id; Telegram adapter populates them.
- core: reply-threading stitches gap-split conversations back together; --multi-speaker
  preserves and labels group-chat senders (the owner's turns are never labelled).
- validator: adds a pairing axis and keep/split/drop repair of over-merged samples.

LLM client:
- ingest/llm.py: shared OpenAI-compatible client (OpenAI or local Ollama/vLLM/LM
  Studio), replacing the Anthropic SDK. Degrades gracefully if the endpoint is down.
- Env vars renamed off the old DialogSmith name:
  LLM_VALIDATE / LLM_MODEL / LLM_API_KEY / LLM_API_BASE_URL.

Config & docs:
- train_lora.yaml: explicit train_on_prompt: false to document loss masking (makes
  --multi-speaker labels safe).
- *.local.yaml override pattern (gitignored) keeps personal model/hardware tweaks out
  of git; .env reconciled to current vars; .env.example renamed to example.env.
- README restyled to the project house style; prominent caution + intended/responsible
  use sections.

Tests: adds tests/test_redaction.py and new grouping/validator cases (41 total, green).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust sensitive-data redaction and auditing pipeline using both regex-based detectors (with universal and Singapore-specific locales) and optional LLM-assisted scanning. It also refactors the LLM validation to use a shared OpenAI-compatible client supporting local endpoints, and enhances conversation reconstruction with reply-based threading and a multi-speaker mode. The review feedback highlights several key improvement opportunities: prioritizing longer/outermost spans during overlapping redaction to prevent sensitive data leaks, avoiding redundant speaker prefixes on merged turns in multi-speaker mode, implementing consecutive failure thresholds to gracefully handle LLM API outages, and correcting the Singapore postal code regex to match the standard S123456 format.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread ingest/redactor.py
Comment thread ingest/core.py
Comment thread ingest/redactor.py
Comment thread ingest/validator.py
Comment thread ingest/redaction/sg.py
The scripts/telegram_extract.py and scripts/convert_to_sharegpt.py shims only
delegated to `python -m ingest`; remove them and update the Legacy Workflow note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@NotYuSheng, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 44 minutes and 54 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3eeff0ab-a5cc-4496-981f-f4166444df70

📥 Commits

Reviewing files that changed from the base of the PR and between 874e305 and 301ce6c.

📒 Files selected for processing (6)
  • demo/README.md
  • demo/build_final.py
  • demo/img2ascii.py
  • ingest/cli.py
  • ingest/validator.py
  • tests/test_ingest.py
📝 Walkthrough

Walkthrough

The PR replaces the Anthropic LLM client with an OpenAI-compatible wrapper (ingest/llm.py), adds a full regex+LLM sensitive-data redaction pipeline (ingest/redaction/, ingest/redactor.py), introduces reply-based conversation stitching and multi-speaker turn assembly in ingest/core.py, extends the CLI with new redaction/audit flags, renames the env template, removes deprecated shim scripts, and adds demo build tooling.

Changes

Redaction, threading, LLM migration, and CLI integration

Layer / File(s) Summary
NormalizedMessage reply fields and Telegram extraction
ingest/message.py, ingest/adapters/telegram.py
NormalizedMessage gains optional message_id and reply_to_id fields; TelegramAdapter.parse populates both from raw export data and defaults missing "from" fields to "Unknown".
Reply-based conversation stitching and multi-speaker turn assembly
ingest/core.py, tests/test_ingest.py
_merge_by_reply (union-find over reply chains) stitches time-split segments; _assemble_turns handles multi-speaker role/text construction with sender-prefixed labels; build_samples is updated to call both. Tests cover stitching across time gaps and labeled vs. collapsed speaker output.
Regex redaction core: Detector/Finding dataclasses and scan_text
ingest/redaction/__init__.py
Defines immutable Detector/Finding dataclasses, the global registry with register/make/iter_detectors, mask(), scan_text() with optional named-group and validator support, and luhn_valid(); imports bundled locale modules at package load time.
Universal and Singapore locale detector registrations
ingest/redaction/universal.py, ingest/redaction/sg.py
universal.py registers email, IPv4/IPv6, MAC, credit card (Luhn-gated), and credential/key pattern detectors. sg.py registers NRIC/FIN (with nric_valid checksum), partial NRIC, Singapore phone, and postal code detectors.
Redactor: scan, apply, LLM tier, and merge
ingest/redactor.py
Implements scan_samples() for non-destructive audit reports, write_report/print_summary, _replace_spans() with right-to-left overlap resolution, apply() for replace/drop modes, llm_scan_samples() for verbatim-span LLM detection, and merge_llm_findings() to fold LLM results into regex reports.
Redaction tests
tests/test_redaction.py
Covers universal and SG detector correctness, registry uniqueness, scan_samples report structure, apply replace/drop semantics, LLM scan span location and fallback, merge_llm_findings totals, and _replace_spans overlap/merge handling.
OpenAI-compatible LLM client module
ingest/llm.py
Introduces LLM_* env-var constants, config helpers (base_url, model, is_local, should_validate), get_client() with lazy openai import and hosted-vs-local key enforcement, endpoint_label(), and chat() for single-turn completions.
Validator refactor to OpenAI-compatible auditor
ingest/validator.py, tests/test_ingest.py
Replaces Anthropic-based scoring with the shared ingest.llm client; _score_sample expands to three-axis JSON scoring (coherence/quality/pairing) with keep|split|drop action; adds _apply_split and _has_both_roles helpers; validate_samples uses llm.should_validate() gating with consecutive-failure abort.
CLI pipeline: new flags and integrated redaction/validation flow
ingest/cli.py, ingest/banner.py
Adds _run_llm_redaction() guard, new parser flags (--multi-speaker, --redact, --redact-locales, --skip-redact-scan, --llm-redact, --allow-cloud-redaction, --no-audit, --skip-validation), banner printing in main(), and the conditional regex scan → LLM scan → apply → validate pipeline. banner.py provides the ANSI startup banner with env-var suppression.
Demo assets, env template, config, deps, and legacy script removal
example.env, .gitignore, configs/train_lora.yaml, requirements.txt, setup.bat, setup.sh, demo/*, README.md
Renames .env.example to example.env with LLM_* variables; ignores configs/*.local.yaml; sets train_on_prompt: false; replaces anthropic>=0.39 with openai>=1.0; updates setup scripts; removes deprecated shims (scripts/convert_to_sharegpt.py, scripts/telegram_extract.py); adds demo/build_final.py, demo/img2ascii.py, demo/mascot.txt, demo/sample_export.json; substantially rewrites README.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI as ingest/cli.py
  participant core as ingest/core.py
  participant redactor as ingest/redactor.py
  participant llm as ingest/llm.py
  participant validator as ingest/validator.py

  User->>CLI: python -m ingest --source telegram --redact replace --llm-redact
  CLI->>core: build_samples(messages, multi_speaker=...)
  core-->>CLI: samples (with reply stitching)
  CLI->>redactor: scan_samples(samples, locales)
  redactor-->>CLI: regex audit report
  CLI->>redactor: llm_scan_samples(samples, client, model)
  redactor->>llm: chat(client, model, prompt)
  llm-->>redactor: verbatim findings JSON
  redactor-->>CLI: llm_findings
  CLI->>redactor: merge_llm_findings(report, llm_findings)
  redactor-->>CLI: merged report
  CLI->>redactor: apply(samples, mode="replace", llm_findings=...)
  redactor-->>CLI: redacted samples
  CLI->>validator: validate_samples(redacted_samples)
  validator->>llm: chat(client, model, scoring_prompt)
  llm-->>validator: keep/split/drop JSON
  validator-->>CLI: validated samples
  CLI-->>User: writes output + redaction_report.json
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

  • Additional Singapore detectors: NRIC M-series checksum + UEN #15: The PR adds ingest/redaction/sg.py with nric_valid covering S/T/F/G series, and the issue proposes extending nric_valid to support M-series checksums and adding a new sg_uen detector — both target the same file with code-level additions to the existing Singapore detector module.

Poem

🐰 Hop hop, the rabbit checks every line,
Regex detectors now scan and align,
Anthropic's gone, OpenAI takes the stage,
Reply threads stitched across many a page,
With redact and split the chats look so fine —
✨ No secrets shall slip through this warren of mine!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.72% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the four major changes: sensitive-data redaction, smarter conversation grouping, OpenAI-compatible LLM client, and configuration improvements. It is specific, concise, and directly reflects the PR's primary objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/data-sanitization-and-pipeline-hardening

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

NotYuSheng and others added 8 commits June 23, 2026 22:56
Caution block now owns data-sensitivity + consent + law; Important block owns
model-misuse. Removes the contradictory 'never consented' phrasing and the
duplicate consent line.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
States the project's intent to explore pre-training, fine-tuning, and
alignment, plus persona prompting (#14), more adapters (#9), NER redaction
(#13), and wider locale packs (#15).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Frames the project as a learning sandbox spanning training, memory/RAG, agents,
MCP, guardrails, and evaluation. Links the new exploration issues (#16-#21)
alongside existing tracked work (#9, #13, #14, #15).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds multimodal, inference-time control, and interpretability tracks and links
the complete exploration backlog; points to the exploration label for the rest.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keep the README roadmap as a plain statement of exploration areas; the issue
tracker holds the live backlog.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rts, SG postal)

- redactor._replace_spans: on overlap keep the outer/longer span so an inner
  span can't leak the rest (e.g. DOMAIN inside EMAIL) [security-high].
- core._assemble_turns: apply the speaker prefix only once per merged turn.
- redactor.llm_scan_samples + validator.validate_samples: abort after 5
  consecutive LLM failures instead of flooding the console.
- redaction/sg.py: sg_postal also matches the common "S123456" form.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…oud suggestions)

- llm.py: remove DEFAULT_MODEL; LLM_MODEL is required to enable the LLM features
  (clear error otherwise). Docs use the vLLM/LM Studio HF model-id convention.
- example.env / README: lead with a LOCAL OpenAI-compatible server
  (Qwen/Qwen2.5-7B-Instruct), remove gpt-4o-mini/cloud suggestions; drop the
  roadmap "(today)" qualifier.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@NotYuSheng

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the ingestion pipeline to be source-agnostic, introducing conversation reconstruction via reply-threading, multi-speaker labeling, and a non-destructive sensitive-data redaction system (supporting both regex and local LLM-assisted scanning). It also transitions the quality auditor to a shared OpenAI-compatible client and removes legacy scripts. The review feedback highlights three key improvements: ensuring llm_scan_samples redacts all occurrences of a sensitive span rather than just the first, adding a fallback for missing sender fields in the Telegram adapter to prevent type violations, and appending a negative lookahead to the Singapore postal code regex to avoid false positives on longer digit sequences.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread ingest/redactor.py Outdated
Comment thread ingest/adapters/telegram.py Outdated
Comment thread ingest/redaction/sg.py
NotYuSheng and others added 3 commits June 23, 2026 23:45
A trailing negative lookahead stops sg_postal matching the first 6 digits of a
longer token (e.g. NRIC S1234567D reading as S123456). Adds a regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- ingest/banner.py: a parrot-in-a-mirror mascot (it mimics your voice; the
  mirror is the doppelganger) beside an ansi_shadow "Doppel/ganger" wordmark in
  truecolor amber, printed at CLI startup. DOPPELGANGER_NO_BANNER silences it.
- README: embed demo/demo.gif (ingest + sensitive-data scan) at the top.
- demo/: synthetic sample_export.json (gitignored exception) + the mascot source
  image and the build/convert scripts used to generate the art and GIF.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sender

- redactor.llm_scan_samples: locate EVERY non-overlapping occurrence of a
  flagged span (text.find loop), not just the first, so repeated names/numbers
  can't leak [security-high].
- telegram adapter: 'from' may be None (anonymous channel posts) -> default to
  'Unknown' so sender_id stays a str and multi-speaker mode has no 'None:' prefix.
- Tests for both.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@NotYuSheng

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a major update to the Doppelganger project, adding robust features for data privacy, sensitive-data scanning, and LLM-assisted redaction and auditing. Key additions include a regex-based sensitive-data scanner with universal and Singapore-specific detectors, an optional LLM-assisted redaction pass, and an upgraded conversation quality auditor that can split over-merged conversations. Additionally, the core pipeline is enhanced with reply-threading and multi-speaker labeling, accompanied by updated documentation, tests, and a reproducible demo. The review feedback identifies three valid issues: a bug in the greedy interval selection algorithm in ingest/redactor.py that can leave sensitive data exposed during overlapping span replacements, a potential resource leak in demo/build_final.py due to an unclosed file, and a potential TypeError in ingest/redaction/__init__.py when handling unmatched optional regex groups.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread ingest/redactor.py Outdated
Comment thread demo/build_final.py Outdated
Comment thread ingest/redaction/__init__.py Outdated
…roup

- redactor._replace_spans: merge overlapping spans and redact the full region
  (label = longest contributor), so a shorter span can't shadow part of a longer
  sensitive one [security-high].
- redaction.scan_text: skip a hit whose optional id group didn't match
  (m.span(id) == (-1,-1)) instead of crashing in mask().
- demo/build_final.py: read mascot.txt via a context manager.
- Tests for merge behaviour and the optional-id guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (6)
ingest/llm.py (2)

62-68: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Chain the re-raised ImportError.

Re-raising inside the except clause loses the original traceback context. Use from None (the cause is obvious here) to satisfy B904 and keep the message clean.

♻️ Proposed change
     try:
         from openai import OpenAI
-    except ImportError:
+    except ImportError:
         raise ImportError(
             "The 'openai' package is required for LLM features. "
             "Install it with: pip install openai"
-        )
+        ) from None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/llm.py` around lines 62 - 68, The re-raised ImportError in the except
clause should use `from None` to suppress the exception context chain and
satisfy the B904 linting rule. Modify the `raise ImportError` statement to
include `from None` at the end, so the exception is raised without the implicit
chaining that normally occurs when raising inside an except block.

Source: Linters/SAST tools


94-101: 🩺 Stability & Availability | 🔵 Trivial

Add explicit per-request timeout to prevent stalls on slow endpoints.

The OpenAI SDK 1.x supports per-request timeouts via client.with_options(timeout=...). Without an explicit timeout, a slow or hung endpoint can stall the batch CLI for a long time on each call. Use client.with_options(timeout=X).chat.completions.create(...) where X is a float (seconds) or httpx.Timeout object to fail fast and let the validator's consecutive-failure guard engage sooner.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/llm.py` around lines 94 - 101, The chat function makes an API call to
client.chat.completions.create() without an explicit per-request timeout, which
can cause the validator to stall indefinitely on slow or unresponsive endpoints.
Modify the client.chat.completions.create() call to use
client.with_options(timeout=X).chat.completions.create() pattern, where X is a
float value in seconds representing the timeout duration. This will ensure the
request fails fast if the endpoint is slow or hung, allowing the
consecutive-failure guard mechanism to engage sooner.
tests/test_ingest.py (1)

132-132: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Optional: prefer next() for single-element lookup.

The linter suggests using next(m for m in msgs if m.timestamp == 100) instead of the list comprehension with indexing, which is slightly more idiomatic for single-element lookups.

♻️ Suggested refactor
-            anon = [m for m in msgs if m.timestamp == 100][0]
+            anon = next(m for m in msgs if m.timestamp == 100)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_ingest.py` at line 132, Replace the list comprehension with
indexing `[0]` used to find the single message with timestamp equal to 100 in
the msgs collection with the `next()` function using a generator expression.
This is more idiomatic for single-element lookups and avoids creating an
intermediate list just to access the first element. Change the line where `anon`
is assigned to use `next(m for m in msgs if m.timestamp == 100)` instead of the
current list comprehension approach.

Source: Linters/SAST tools

ingest/core.py (1)

88-97: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider extracting union-find to a shared utility module.

The find and union helper functions implement standard union-find logic that is duplicated in ingest/redactor.py (lines 88-92 show an identical find implementation). Extracting this to a common utility module would reduce duplication and make the pattern more discoverable.

♻️ Suggested approach

Create a new ingest/utils.py module:

def union_find_merge(n: int, edges) -> list:
    """Merge items using union-find. Returns grouped indices."""
    parent = list(range(n))
    
    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    
    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    
    for a, b in edges:
        union(a, b)
    
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

Then refactor both _merge_by_reply and redactor._replace_spans to use this shared implementation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/core.py` around lines 88 - 97, The find and union functions in the
current location are duplicated in ingest/redactor.py, creating maintenance
issues. Create a new ingest/utils.py module with a shared union_find_merge
function that encapsulates the complete union-find logic, accepting the number
of items and a list of edges to merge, and returning the grouped indices. Then
refactor both the find and union function definitions in ingest/core.py and the
duplicate find implementation in ingest/redactor.py to use this new shared
utility function instead of defining their own versions.
ingest/redaction/sg.py (1)

40-40: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add strict=True to zip() for defensive programming.

While the current code guarantees matching lengths (7 digits and 7 weights), adding strict=True provides a runtime assertion that fails fast if future refactoring breaks this invariant.

🛡️ Proposed fix
-    total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS))
+    total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS, strict=True))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/redaction/sg.py` at line 40, Add the `strict=True` parameter to the
`zip()` function call in the total variable calculation where digits are being
zipped with _WEIGHTS. This ensures that if the lengths of the two iterables
don't match, a ValueError will be raised immediately rather than silently
truncating, providing defensive programming against future refactoring that
might inadvertently break this invariant.

Source: Linters/SAST tools

ingest/redaction/universal.py (1)

48-48: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Use HYPHEN-MINUS instead of EN DASH in the comment.

The comment contains an EN DASH (, U+2013) instead of a HYPHEN-MINUS (-, U+002D). While functionally harmless in a comment, using the standard ASCII hyphen improves consistency and avoids potential copy-paste issues.

📝 Proposed fix
-# Broad 13–19 digit run (optionally space/dash grouped); Luhn rejects the noise.
+# Broad 13-19 digit run (optionally space/dash grouped); Luhn rejects the noise.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ingest/redaction/universal.py` at line 48, The comment containing "Broad
13–19 digit run" uses an EN DASH character (U+2013) between the numbers instead
of a standard HYPHEN-MINUS (U+002D). Replace the EN DASH with a regular
hyphen-minus character to improve consistency and avoid potential copy-paste
issues when the standard ASCII hyphen is expected.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@demo/build_final.py`:
- Around line 70-71: Add check=True parameter to both subprocess.run calls (at
the first occurrence and the second one around lines 91-93) to make subprocess
raise an exception when commands return non-zero exit codes, ensuring the script
fails fast on errors. This will prevent the script from continuing to produce
artifacts when ingestion or rendering fails. If you need to capture and report
stderr on failure, wrap the subprocess.run calls in a try-except block to handle
the CalledProcessError exception.
- Line 15: Replace the hardcoded AGG variable assignment with a dynamic lookup
that first checks the AGG environment variable and falls back to searching the
system PATH using shutil.which, then raise an informative error if the
executable is not found. Additionally, locate the subprocess calls around lines
91-93 that handle GIF rendering and modify them to include check=True parameter
and remove DEVNULL output suppression, allowing subprocess errors to be properly
raised and surfaced to the caller rather than silently failing.

In `@demo/img2ascii.py`:
- Around line 35-37: In the main execution block where to_ascii is called, add a
validation guard before accessing args[0] to check that the args list is not
empty. If args is empty, print a usage message that describes the required PATH
argument and optional width parameter, then call sys.exit with a non-zero exit
code (e.g., 1). This prevents the IndexError from being raised and provides
helpful guidance to the user when the script is invoked without the required
PATH argument.
- Line 10: The import statement `from PIL import Image` requires the Pillow
package but this dependency is not declared in requirements.txt or the project's
configuration file (setup.py, pyproject.toml, etc.). Add Pillow as a dependency
to the appropriate project configuration file to ensure that fresh installations
can properly resolve the required package.

In `@ingest/cli.py`:
- Around line 210-230: The redactor.apply() call is currently nested inside the
`if not skip_scan:` block, which means redaction is silently skipped when either
`args.no_audit` or `args.skip_redact_scan` is true, even if the user explicitly
requested redaction via `--redact` flag. Decouple the redaction application from
the scan logic by either moving the redactor.apply() call and related redaction
output logic outside the `if not skip_scan:` block (so redaction is always
applied when requested regardless of scan status), or add validation logic early
in the function to warn or error when args.redact is not "off" while skip_scan
is true. Ensure that --llm-redact is also handled correctly and not silently
dropped.

In `@ingest/validator.py`:
- Around line 155-170: The split repair feature is unreachable because the
low-score drop condition is evaluated first. In the validator logic around the
low variable and the subsequent conditional branches, reorder the if-elif-else
chain to check r["action"] == "split" before the combined condition r["action"]
== "drop" or low. This ensures samples marked for splitting are processed before
being dropped due to low coherence, quality, or pairing scores, allowing the
split repair to function for its intended use case of over-merged samples.

---

Nitpick comments:
In `@ingest/core.py`:
- Around line 88-97: The find and union functions in the current location are
duplicated in ingest/redactor.py, creating maintenance issues. Create a new
ingest/utils.py module with a shared union_find_merge function that encapsulates
the complete union-find logic, accepting the number of items and a list of edges
to merge, and returning the grouped indices. Then refactor both the find and
union function definitions in ingest/core.py and the duplicate find
implementation in ingest/redactor.py to use this new shared utility function
instead of defining their own versions.

In `@ingest/llm.py`:
- Around line 62-68: The re-raised ImportError in the except clause should use
`from None` to suppress the exception context chain and satisfy the B904 linting
rule. Modify the `raise ImportError` statement to include `from None` at the
end, so the exception is raised without the implicit chaining that normally
occurs when raising inside an except block.
- Around line 94-101: The chat function makes an API call to
client.chat.completions.create() without an explicit per-request timeout, which
can cause the validator to stall indefinitely on slow or unresponsive endpoints.
Modify the client.chat.completions.create() call to use
client.with_options(timeout=X).chat.completions.create() pattern, where X is a
float value in seconds representing the timeout duration. This will ensure the
request fails fast if the endpoint is slow or hung, allowing the
consecutive-failure guard mechanism to engage sooner.

In `@ingest/redaction/sg.py`:
- Line 40: Add the `strict=True` parameter to the `zip()` function call in the
total variable calculation where digits are being zipped with _WEIGHTS. This
ensures that if the lengths of the two iterables don't match, a ValueError will
be raised immediately rather than silently truncating, providing defensive
programming against future refactoring that might inadvertently break this
invariant.

In `@ingest/redaction/universal.py`:
- Line 48: The comment containing "Broad 13–19 digit run" uses an EN DASH
character (U+2013) between the numbers instead of a standard HYPHEN-MINUS
(U+002D). Replace the EN DASH with a regular hyphen-minus character to improve
consistency and avoid potential copy-paste issues when the standard ASCII hyphen
is expected.

In `@tests/test_ingest.py`:
- Line 132: Replace the list comprehension with indexing `[0]` used to find the
single message with timestamp equal to 100 in the msgs collection with the
`next()` function using a generator expression. This is more idiomatic for
single-element lookups and avoids creating an intermediate list just to access
the first element. Change the line where `anon` is assigned to use `next(m for m
in msgs if m.timestamp == 100)` instead of the current list comprehension
approach.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 1417d538-60cb-4e8d-b169-5add8b41e5b8

📥 Commits

Reviewing files that changed from the base of the PR and between ef7cd61 and 874e305.

⛔ Files ignored due to path filters (2)
  • demo/demo.gif is excluded by !**/*.gif
  • demo/parrot-mirror.jpg is excluded by !**/*.jpg
📒 Files selected for processing (27)
  • .env.example
  • .gitignore
  • README.md
  • configs/train_lora.yaml
  • demo/build_final.py
  • demo/img2ascii.py
  • demo/mascot.txt
  • demo/sample_export.json
  • example.env
  • ingest/adapters/telegram.py
  • ingest/banner.py
  • ingest/cli.py
  • ingest/core.py
  • ingest/llm.py
  • ingest/message.py
  • ingest/redaction/__init__.py
  • ingest/redaction/sg.py
  • ingest/redaction/universal.py
  • ingest/redactor.py
  • ingest/validator.py
  • requirements.txt
  • scripts/convert_to_sharegpt.py
  • scripts/telegram_extract.py
  • setup.bat
  • setup.sh
  • tests/test_ingest.py
  • tests/test_redaction.py
💤 Files with no reviewable changes (3)
  • .env.example
  • scripts/convert_to_sharegpt.py
  • scripts/telegram_extract.py

Comment thread demo/build_final.py Outdated
Comment thread demo/build_final.py Outdated
Comment thread demo/img2ascii.py
Comment thread demo/img2ascii.py
Comment thread ingest/cli.py
Comment thread ingest/validator.py
…en demo scripts

- cli: decouple redaction application from the scan/report — --redact is now
  applied even with --skip-redact-scan/--no-audit, so chat data isn't silently
  left unredacted [security].
- validator: evaluate the 'split' action before the low-score drop, so
  over-merged samples are repaired instead of discarded.
- demo scripts: configurable AGG path ($AGG), check=True on subprocess, usage
  guard in img2ascii, and a demo/README documenting the dev-only deps (pillow,
  pyfiglet, agg).
- Tests for the two behaviours.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@NotYuSheng NotYuSheng merged commit ab217fc into main Jun 23, 2026
4 checks passed
@NotYuSheng NotYuSheng deleted the feat/data-sanitization-and-pipeline-hardening branch June 23, 2026 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant