feat: sensitive-data redaction, smarter grouping, OpenAI-compatible LLM, config hygiene#12
…LM, config hygiene

Adds a privacy/sanitization layer and improves conversation quality before training.

Sensitive-data redaction (new):
- ingest/redaction/: locale-keyed regex detector registry (universal + a Singapore pack with NRIC checksum, local phone, postal), mirroring the adapter registry so new countries are a single drop-in module.
- ingest/redactor.py: non-destructive scan -> data/redaction_report.json (masked previews), opt-in --redact replace|drop, plus optional LLM verbatim-span detection.
- CLI: --redact, --redact-locales, --skip-redact-scan, --llm-redact (with a local-first cloud-consent guard), and --no-audit / --skip-validation off-switches.

Conversation grouping:
- NormalizedMessage gains message_id/reply_to_id; Telegram adapter populates them.
- core: reply-threading stitches gap-split conversations back together; --multi-speaker preserves and labels group-chat senders (the owner's turns are never labelled).
- validator: adds a pairing axis and keep/split/drop repair of over-merged samples.

LLM client:
- ingest/llm.py: shared OpenAI-compatible client (OpenAI or local Ollama/vLLM/LM Studio), replacing the Anthropic SDK. Degrades gracefully if the endpoint is down.
- Env vars renamed off the old DialogSmith name: LLM_VALIDATE / LLM_MODEL / LLM_API_KEY / LLM_API_BASE_URL.

Config & docs:
- train_lora.yaml: explicit train_on_prompt: false to document loss masking (makes --multi-speaker labels safe).
- *.local.yaml override pattern (gitignored) keeps personal model/hardware tweaks out of git; .env reconciled to current vars; .env.example renamed to example.env.
- README restyled to the project house style; prominent caution + intended/responsible use sections.

Tests: adds tests/test_redaction.py and new grouping/validator cases (41 total, green).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code Review
This pull request introduces a robust sensitive-data redaction and auditing pipeline using both regex-based detectors (with universal and Singapore-specific locales) and optional LLM-assisted scanning. It also refactors the LLM validation to use a shared OpenAI-compatible client supporting local endpoints, and enhances conversation reconstruction with reply-based threading and a multi-speaker mode. The review feedback highlights several key improvement opportunities: prioritizing longer/outermost spans during overlapping redaction to prevent sensitive data leaks, avoiding redundant speaker prefixes on merged turns in multi-speaker mode, implementing consecutive failure thresholds to gracefully handle LLM API outages, and correcting the Singapore postal code regex to match the standard S123456 format.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
The scripts/telegram_extract.py and scripts/convert_to_sharegpt.py shims only delegated to `python -m ingest`; remove them and update the Legacy Workflow note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
📝 Walkthrough
The PR replaces the Anthropic LLM client with an OpenAI-compatible wrapper (…
Changes
Redaction, threading, LLM migration, and CLI integration
Sequence Diagram(s)
sequenceDiagram
participant User
participant CLI as ingest/cli.py
participant core as ingest/core.py
participant redactor as ingest/redactor.py
participant llm as ingest/llm.py
participant validator as ingest/validator.py
User->>CLI: python -m ingest --source telegram --redact replace --llm-redact
CLI->>core: build_samples(messages, multi_speaker=...)
core-->>CLI: samples (with reply stitching)
CLI->>redactor: scan_samples(samples, locales)
redactor-->>CLI: regex audit report
CLI->>redactor: llm_scan_samples(samples, client, model)
redactor->>llm: chat(client, model, prompt)
llm-->>redactor: verbatim findings JSON
redactor-->>CLI: llm_findings
CLI->>redactor: merge_llm_findings(report, llm_findings)
redactor-->>CLI: merged report
CLI->>redactor: apply(samples, mode="replace", llm_findings=...)
redactor-->>CLI: redacted samples
CLI->>validator: validate_samples(redacted_samples)
validator->>llm: chat(client, model, scoring_prompt)
llm-->>validator: keep/split/drop JSON
validator-->>CLI: validated samples
CLI-->>User: writes output + redaction_report.json
Estimated code review effort
🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Caution block now owns data-sensitivity + consent + law; Important block owns model-misuse. Removes the contradictory 'never consented' phrasing and the duplicate consent line. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keep the README roadmap as a plain statement of exploration areas; the issue tracker holds the live backlog. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rts, SG postal)
- redactor._replace_spans: on overlap keep the outer/longer span so an inner span can't leak the rest (e.g. DOMAIN inside EMAIL) [security-high].
- core._assemble_turns: apply the speaker prefix only once per merged turn.
- redactor.llm_scan_samples + validator.validate_samples: abort after 5 consecutive LLM failures instead of flooding the console.
- redaction/sg.py: sg_postal also matches the common "S123456" form.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
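For readers following the "abort after 5 consecutive LLM failures" item, here is a minimal sketch of that kind of guard. It is not the code in `redactor.llm_scan_samples` or `validator.validate_samples`; the function name `run_with_failure_guard` and its parameters are hypothetical, and only the threshold of 5 comes from the commit message.

```python
from typing import Callable, Iterable, List, Optional

MAX_CONSECUTIVE_FAILURES = 5  # threshold named in the commit message


def run_with_failure_guard(
    items: Iterable[str],
    call: Callable[[str], str],
    max_failures: int = MAX_CONSECUTIVE_FAILURES,
) -> List[Optional[str]]:
    """Call `call` once per item; stop early after `max_failures` errors in a row.

    Hypothetical helper for illustration only.
    """
    results: List[Optional[str]] = []
    consecutive = 0
    for item in items:
        try:
            results.append(call(item))
            consecutive = 0  # any success resets the streak
        except Exception:
            results.append(None)  # record the miss, keep the item itself
            consecutive += 1
            if consecutive >= max_failures:
                break  # the endpoint is probably down; stop calling it
    return results
```

Counting consecutive failures, rather than total failures, lets a flaky endpoint keep working while a dead one is abandoned quickly.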
…oud suggestions)
- llm.py: remove DEFAULT_MODEL; LLM_MODEL is required to enable the LLM features (clear error otherwise). Docs use the vLLM/LM Studio HF model-id convention.
- example.env / README: lead with a LOCAL OpenAI-compatible server (Qwen/Qwen2.5-7B-Instruct), remove gpt-4o-mini/cloud suggestions; drop the roadmap "(today)" qualifier.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
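The "LLM_MODEL is required" behaviour can be pictured roughly as below. This is a sketch under assumptions, not the contents of `ingest/llm.py`: the function name `load_llm_config`, the fallback base URL, and the placeholder API key are all hypothetical. Only the environment variable names come from this PR.

```python
import os
from typing import Mapping, Optional


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read LLM settings from the environment; fail clearly if no model is set.

    Hypothetical helper. The default base URL and API key are assumptions
    chosen for a local OpenAI-compatible server, not values from the repo.
    """
    env = os.environ if env is None else env
    model = env.get("LLM_MODEL", "").strip()
    if not model:
        raise RuntimeError(
            "LLM_MODEL is not set. Set it to the model id served by your "
            "OpenAI-compatible endpoint to enable the LLM features."
        )
    return {
        "model": model,
        "base_url": env.get("LLM_API_BASE_URL", "http://localhost:8000/v1"),
        "api_key": env.get("LLM_API_KEY", "not-needed"),
    }
```

Dropping a built-in default model means a misconfigured run stops with an explicit message instead of silently calling a model the user never chose.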
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
/gemini review
Code Review
This pull request refactors the ingestion pipeline to be source-agnostic, introducing conversation reconstruction via reply-threading, multi-speaker labeling, and a non-destructive sensitive-data redaction system (supporting both regex and local LLM-assisted scanning). It also transitions the quality auditor to a shared OpenAI-compatible client and removes legacy scripts. The review feedback highlights three key improvements: ensuring llm_scan_samples redacts all occurrences of a sensitive span rather than just the first, adding a fallback for missing sender fields in the Telegram adapter to prevent type violations, and appending a negative lookahead to the Singapore postal code regex to avoid false positives on longer digit sequences.
A trailing negative lookahead stops sg_postal matching the first 6 digits of a longer token (e.g. NRIC S1234567D reading as S123456). Adds a regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
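To make the lookahead fix concrete, here is a small sketch of a postal-code pattern that refuses to match the first six digits of a longer token. The regex below is an illustration written for this note, not the pattern in `redaction/sg.py`; the name `SG_POSTAL` and the choice to also guard the left side with a lookbehind are assumptions.

```python
import re

# Hypothetical pattern: optional "S" prefix, exactly six digits, and no
# letter or digit directly before or after the match.
SG_POSTAL = re.compile(r"(?<![A-Za-z0-9])S?\d{6}(?![A-Za-z0-9])")


def find_postal(text: str) -> list:
    """Return every postal-code-shaped token in `text`."""
    return [m.group(0) for m in SG_POSTAL.finditer(text)]


print(find_postal("Blk 1, S123456"))     # matches the S-prefixed form
print(find_postal("NRIC S1234567D"))     # no match: token is longer than a postal code
print(find_postal("Singapore 238801."))  # matches the bare six-digit form
```

Without the trailing `(?![A-Za-z0-9])`, the second call would report `S123456`, which is the false positive the commit describes.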
- ingest/banner.py: a parrot-in-a-mirror mascot (it mimics your voice; the mirror is the doppelganger) beside an ansi_shadow "Doppel/ganger" wordmark in truecolor amber, printed at CLI startup. DOPPELGANGER_NO_BANNER silences it.
- README: embed demo/demo.gif (ingest + sensitive-data scan) at the top.
- demo/: synthetic sample_export.json (gitignored exception) + the mascot source image and the build/convert scripts used to generate the art and GIF.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sender
- redactor.llm_scan_samples: locate EVERY non-overlapping occurrence of a flagged span (text.find loop), not just the first, so repeated names/numbers can't leak [security-high].
- telegram adapter: 'from' may be None (anonymous channel posts) -> default to 'Unknown' so sender_id stays a str and multi-speaker mode has no 'None:' prefix.
- Tests for both.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
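The "text.find loop" mentioned above can be sketched as follows. This is an illustrative helper, not the code in `redactor.llm_scan_samples`; the name `find_all_spans` is hypothetical.

```python
from typing import List, Tuple


def find_all_spans(text: str, needle: str) -> List[Tuple[int, int]]:
    """Return (start, end) for every non-overlapping occurrence of `needle`."""
    spans: List[Tuple[int, int]] = []
    if not needle:
        return spans  # an empty needle would otherwise loop forever
    start = 0
    while True:
        idx = text.find(needle, start)
        if idx == -1:
            break
        spans.append((idx, idx + len(needle)))
        start = idx + len(needle)  # resume after this hit, so hits never overlap
    return spans


print(find_all_spans("call Bob, then Bob again", "Bob"))
# [(5, 8), (15, 18)]
```

A single `text.find(needle)` would return only the first span, leaving the second "Bob" in the training data.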
/gemini review
Code Review
This pull request introduces a major update to the Doppelganger project, adding robust features for data privacy, sensitive-data scanning, and LLM-assisted redaction and auditing. Key additions include a regex-based sensitive-data scanner with universal and Singapore-specific detectors, an optional LLM-assisted redaction pass, and an upgraded conversation quality auditor that can split over-merged conversations. Additionally, the core pipeline is enhanced with reply-threading and multi-speaker labeling, accompanied by updated documentation, tests, and a reproducible demo. The review feedback identifies three valid issues: a bug in the greedy interval selection algorithm in ingest/redactor.py that can leave sensitive data exposed during overlapping span replacements, a potential resource leak in demo/build_final.py due to an unclosed file, and a potential TypeError in ingest/redaction/__init__.py when handling unmatched optional regex groups.
…roup
- redactor._replace_spans: merge overlapping spans and redact the full region (label = longest contributor), so a shorter span can't shadow part of a longer sensitive one [security-high].
- redaction.scan_text: skip a hit whose optional id group didn't match (m.span(id) == (-1,-1)) instead of crashing in mask().
- demo/build_final.py: read mascot.txt via a context manager.
- Tests for merge behaviour and the optional-id guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
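The merge-then-redact behaviour described in the first item can be sketched like this. It is an assumption-laden illustration rather than the actual `redactor._replace_spans`: the names `merge_spans` and `redact_text`, the `(start, end, label)` tuple shape, and the `[LABEL]` placeholder format are all hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); end is exclusive


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping spans; the longest contributor names the region."""
    regions = []  # each entry: [start, end, label, length_of_longest_contributor]
    for start, end, label in sorted(spans):
        if regions and start < regions[-1][1]:  # overlaps the open region
            region = regions[-1]
            region[1] = max(region[1], end)
            if end - start > region[3]:
                region[2], region[3] = label, end - start
        else:
            regions.append([start, end, label, end - start])
    return [(s, e, lab) for s, e, lab, _ in regions]


def redact_text(text: str, spans: List[Span]) -> str:
    """Replace each merged region with a [LABEL] placeholder."""
    out, cursor = [], 0
    for start, end, label in merge_spans(spans):
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "mail bob@example.com now"
print(redact_text(text, [(9, 20, "DOMAIN"), (5, 20, "EMAIL")]))
# mail [EMAIL] now
```

Because the whole merged region is replaced, a partial overlap such as `(0, 6)` with `(4, 12)` redacts characters 0 through 11 rather than leaving the tail of either span exposed.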
Actionable comments posted: 6
🧹 Nitpick comments (6)
ingest/llm.py (2)
62-68: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Chain the re-raised `ImportError`.
Re-raising inside the `except` clause loses the original traceback context. Use `from None` (the cause is obvious here) to satisfy B904 and keep the message clean.
♻️ Proposed change

         try:
             from openai import OpenAI
    -    except ImportError:
    +    except ImportError:
             raise ImportError(
                 "The 'openai' package is required for LLM features. "
                 "Install it with: pip install openai"
    -        )
    +        ) from None

Source: Linters/SAST tools
94-101: 🩺 Stability & Availability | 🔵 Trivial
Add explicit per-request timeout to prevent stalls on slow endpoints.
The OpenAI SDK 1.x supports per-request timeouts via `client.with_options(timeout=...)`. Without an explicit timeout, a slow or hung endpoint can stall the batch CLI for a long time on each call. Use `client.with_options(timeout=X).chat.completions.create(...)` where X is a float (seconds) or `httpx.Timeout` object to fail fast and let the validator's consecutive-failure guard engage sooner.
tests/test_ingest.py (1)
132-132: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Optional: prefer `next()` for single-element lookup.
The linter suggests using `next(m for m in msgs if m.timestamp == 100)` instead of the list comprehension with indexing, which is slightly more idiomatic for single-element lookups.
♻️ Suggested refactor

    - anon = [m for m in msgs if m.timestamp == 100][0]
    + anon = next(m for m in msgs if m.timestamp == 100)

Source: Linters/SAST tools
ingest/core.py (1)
88-97: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Consider extracting union-find to a shared utility module.
The `find` and `union` helper functions implement standard union-find logic that is duplicated in `ingest/redactor.py` (lines 88-92 show an identical `find` implementation). Extracting this to a common utility module would reduce duplication and make the pattern more discoverable.
♻️ Suggested approach
Create a new `ingest/utils.py` module:

    def union_find_merge(n: int, edges) -> list:
        """Merge items using union-find. Returns grouped indices."""
        parent = list(range(n))

        def find(x: int) -> int:
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        def union(a: int, b: int) -> None:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)

        for a, b in edges:
            union(a, b)

        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

Then refactor both `_merge_by_reply` and `redactor._replace_spans` to use this shared implementation.
ingest/redaction/sg.py (1)
40-40: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Add `strict=True` to `zip()` for defensive programming.
While the current code guarantees matching lengths (7 digits and 7 weights), adding `strict=True` provides a runtime assertion that fails fast if future refactoring breaks this invariant.
🛡️ Proposed fix

    - total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS))
    + total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS, strict=True))

Source: Linters/SAST tools
ingest/redaction/universal.py (1)
48-48: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Use HYPHEN-MINUS instead of EN DASH in the comment.
The comment contains an EN DASH (`–`, U+2013) instead of a HYPHEN-MINUS (`-`, U+002D). While functionally harmless in a comment, using the standard ASCII hyphen improves consistency and avoids potential copy-paste issues.
📝 Proposed fix

    -# Broad 13–19 digit run (optionally space/dash grouped); Luhn rejects the noise.
    +# Broad 13-19 digit run (optionally space/dash grouped); Luhn rejects the noise.

Source: Linters/SAST tools
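The comment under review relies on a Luhn check to discard digit runs that are not card numbers. A minimal sketch of that check follows; it is the standard Luhn algorithm, but the function name `luhn_ok` and the way separators are stripped are assumptions, not the code in `ingest/redaction/universal.py`.

```python
def luhn_ok(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c in "0123456789"]
    if not 13 <= len(digits) <= 19:  # the length range named in the comment
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_ok("4111 1111 1111 1111"))  # True (a well-known test number)
print(luhn_ok("4111 1111 1111 1112"))  # False
```

Pairing a broad regex with this checksum keeps recall high while filtering out most order numbers and timestamps of the same length.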
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@demo/build_final.py`:
- Around line 70-71: Add check=True parameter to both subprocess.run calls (at
the first occurrence and the second one around lines 91-93) to make subprocess
raise an exception when commands return non-zero exit codes, ensuring the script
fails fast on errors. This will prevent the script from continuing to produce
artifacts when ingestion or rendering fails. If you need to capture and report
stderr on failure, wrap the subprocess.run calls in a try-except block to handle
the CalledProcessError exception.
- Line 15: Replace the hardcoded AGG variable assignment with a dynamic lookup
that first checks the AGG environment variable and falls back to searching the
system PATH using shutil.which, then raise an informative error if the
executable is not found. Additionally, locate the subprocess calls around lines
91-93 that handle GIF rendering and modify them to include check=True parameter
and remove DEVNULL output suppression, allowing subprocess errors to be properly
raised and surfaced to the caller rather than silently failing.
In `@demo/img2ascii.py`:
- Around line 35-37: In the main execution block where to_ascii is called, add a
validation guard before accessing args[0] to check that the args list is not
empty. If args is empty, print a usage message that describes the required PATH
argument and optional width parameter, then call sys.exit with a non-zero exit
code (e.g., 1). This prevents the IndexError from being raised and provides
helpful guidance to the user when the script is invoked without the required
PATH argument.
- Line 10: The import statement `from PIL import Image` requires the Pillow
package but this dependency is not declared in requirements.txt or the project's
configuration file (setup.py, pyproject.toml, etc.). Add Pillow as a dependency
to the appropriate project configuration file to ensure that fresh installations
can properly resolve the required package.
In `@ingest/cli.py`:
- Around line 210-230: The redactor.apply() call is currently nested inside the
`if not skip_scan:` block, which means redaction is silently skipped when either
`args.no_audit` or `args.skip_redact_scan` is true, even if the user explicitly
requested redaction via `--redact` flag. Decouple the redaction application from
the scan logic by either moving the redactor.apply() call and related redaction
output logic outside the `if not skip_scan:` block (so redaction is always
applied when requested regardless of scan status), or add validation logic early
in the function to warn or error when args.redact is not "off" while skip_scan
is true. Ensure that --llm-redact is also handled correctly and not silently
dropped.
In `@ingest/validator.py`:
- Around line 155-170: The split repair feature is unreachable because the
low-score drop condition is evaluated first. In the validator logic around the
low variable and the subsequent conditional branches, reorder the if-elif-else
chain to check r["action"] == "split" before the combined condition r["action"]
== "drop" or low. This ensures samples marked for splitting are processed before
being dropped due to low coherence, quality, or pairing scores, allowing the
split repair to function for its intended use case of over-merged samples.
---
Nitpick comments:
In `@ingest/core.py`:
- Around line 88-97: The find and union functions in the current location are
duplicated in ingest/redactor.py, creating maintenance issues. Create a new
ingest/utils.py module with a shared union_find_merge function that encapsulates
the complete union-find logic, accepting the number of items and a list of edges
to merge, and returning the grouped indices. Then refactor both the find and
union function definitions in ingest/core.py and the duplicate find
implementation in ingest/redactor.py to use this new shared utility function
instead of defining their own versions.
In `@ingest/llm.py`:
- Around line 62-68: The re-raised ImportError in the except clause should use
`from None` to suppress the exception context chain and satisfy the B904 linting
rule. Modify the `raise ImportError` statement to include `from None` at the
end, so the exception is raised without the implicit chaining that normally
occurs when raising inside an except block.
- Around line 94-101: The chat function makes an API call to
client.chat.completions.create() without an explicit per-request timeout, which
can cause the validator to stall indefinitely on slow or unresponsive endpoints.
Modify the client.chat.completions.create() call to use
client.with_options(timeout=X).chat.completions.create() pattern, where X is a
float value in seconds representing the timeout duration. This will ensure the
request fails fast if the endpoint is slow or hung, allowing the
consecutive-failure guard mechanism to engage sooner.
In `@ingest/redaction/sg.py`:
- Line 40: Add the `strict=True` parameter to the `zip()` function call in the
total variable calculation where digits are being zipped with _WEIGHTS. This
ensures that if the lengths of the two iterables don't match, a ValueError will
be raised immediately rather than silently truncating, providing defensive
programming against future refactoring that might inadvertently break this
invariant.
In `@ingest/redaction/universal.py`:
- Line 48: The comment containing "Broad 13–19 digit run" uses an EN DASH
character (U+2013) between the numbers instead of a standard HYPHEN-MINUS
(U+002D). Replace the EN DASH with a regular hyphen-minus character to improve
consistency and avoid potential copy-paste issues when the standard ASCII hyphen
is expected.
In `@tests/test_ingest.py`:
- Line 132: Replace the list comprehension with indexing `[0]` used to find the
single message with timestamp equal to 100 in the msgs collection with the
`next()` function using a generator expression. This is more idiomatic for
single-element lookups and avoids creating an intermediate list just to access
the first element. Change the line where `anon` is assigned to use `next(m for m
in msgs if m.timestamp == 100)` instead of the current list comprehension
approach.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 1417d538-60cb-4e8d-b169-5add8b41e5b8
⛔ Files ignored due to path filters (2)
- demo/demo.gif is excluded by !**/*.gif
- demo/parrot-mirror.jpg is excluded by !**/*.jpg
📒 Files selected for processing (27)
- .env.example
- .gitignore
- README.md
- configs/train_lora.yaml
- demo/build_final.py
- demo/img2ascii.py
- demo/mascot.txt
- demo/sample_export.json
- example.env
- ingest/adapters/telegram.py
- ingest/banner.py
- ingest/cli.py
- ingest/core.py
- ingest/llm.py
- ingest/message.py
- ingest/redaction/__init__.py
- ingest/redaction/sg.py
- ingest/redaction/universal.py
- ingest/redactor.py
- ingest/validator.py
- requirements.txt
- scripts/convert_to_sharegpt.py
- scripts/telegram_extract.py
- setup.bat
- setup.sh
- tests/test_ingest.py
- tests/test_redaction.py
💤 Files with no reviewable changes (3)
- .env.example
- scripts/convert_to_sharegpt.py
- scripts/telegram_extract.py
…en demo scripts
- cli: decouple redaction application from the scan/report; --redact is now applied even with --skip-redact-scan/--no-audit, so chat data isn't silently left unredacted [security].
- validator: evaluate the 'split' action before the low-score drop, so over-merged samples are repaired instead of discarded.
- demo scripts: configurable AGG path ($AGG), check=True on subprocess, usage guard in img2ascii, and a demo/README documenting the dev-only deps (pillow, pyfiglet, agg).
- Tests for the two behaviours.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
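The validator ordering fix can be illustrated with a small decision function. This is a sketch only: the function name `decide`, the score threshold of 3, and the exact field names are assumptions based on the review comment's wording ("coherence, quality, or pairing scores"), not the code in `ingest/validator.py`.

```python
def decide(result: dict, threshold: int = 3) -> str:
    """Pick keep/split/drop for one audited sample.

    `result` is assumed to carry an "action" string plus integer
    "coherence", "quality", and "pairing" scores (hypothetical shape).
    """
    low = min(result["coherence"], result["quality"], result["pairing"]) < threshold
    # Check "split" first: an over-merged sample usually scores low, so
    # testing the low-score drop first would make this branch unreachable.
    if result["action"] == "split":
        return "split"
    if result["action"] == "drop" or low:
        return "drop"
    return "keep"


over_merged = {"action": "split", "coherence": 1, "quality": 2, "pairing": 1}
print(decide(over_merged))  # split
```

With the branches in the opposite order, `over_merged` would be dropped, which is the unreachable-split problem the review flagged.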
Summary
Adds a privacy/sanitization layer and improves conversation quality before training, plus an OpenAI-compatible LLM client and config hygiene.

🔒 Sensitive-data redaction (new)
- `ingest/redaction/`: locale-keyed regex detector registry (mirrors the adapter registry). Universal detectors (email, Luhn-checked cards, IP/MAC, API/private keys) always run; locale packs add country IDs. Singapore ships as the reference (NRIC with checksum, local phone, postal). Adding a country = one drop-in module.
- `ingest/redactor.py`: non-destructive scan → `data/redaction_report.json` (masked previews). Opt-in `--redact replace|drop`. Optional LLM verbatim-span detection flows into the same report/apply step.
- … `--redact`, `--redact-locales`, `--skip-redact-scan`, `--llm-redact` (local-first cloud-consent guard), and `--no-audit` / `--skip-validation` off-switches.

🧵 Conversation grouping
- `NormalizedMessage` gains `message_id` / `reply_to_id`; Telegram adapter populates them.
- … `--multi-speaker` preserves and labels group-chat senders (the owner's turns are never labelled).

🤖 LLM client
- `ingest/llm.py`: shared OpenAI-compatible client (OpenAI or local Ollama/vLLM/LM Studio), replacing the Anthropic SDK. Degrades gracefully if the endpoint is down (per-item catch, dataset still written).
- … `LLM_VALIDATE` / `LLM_MODEL` / `LLM_API_KEY` / `LLM_API_BASE_URL` (aligns with the TracePcap convention).

⚙️ Config & docs
- `train_lora.yaml`: explicit `train_on_prompt: false` documenting loss masking (makes `--multi-speaker` labels safe).
- `*.local.yaml` override pattern (gitignored) keeps personal model/hardware tweaks out of git; `.env` reconciled; `.env.example` → `example.env`.

Testing
- `python -m unittest discover -s tests -t .` → 41 tests pass (new `tests/test_redaction.py` + grouping/validator cases).
- … `--redact replace/drop`, and `--no-audit` (zero scan/LLM calls, dataset still written).

Notes
- … `DialogSmith` URL (GitHub redirects it). Worth updating to the canonical `Doppelganger` URL separately.
- … `*.local.yaml`.

🤖 Generated with Claude Code