From d263530fb25abb7a28d19c493b9214531aa66891 Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 22:49:21 +0800
Subject: [PATCH 01/15] feat: sensitive-data redaction, smarter grouping,
OpenAI-compatible LLM, config hygiene

Adds a privacy/sanitization layer and improves conversation quality before training.

Sensitive-data redaction (new):
- ingest/redaction/: locale-keyed regex detector registry (universal + a Singapore
pack with NRIC checksum, local phone, postal), mirroring the adapter registry so
new countries are a single drop-in module.
- ingest/redactor.py: non-destructive scan -> data/redaction_report.json (masked
previews), opt-in --redact replace|drop, plus optional LLM verbatim-span detection.
- CLI: --redact, --redact-locales, --skip-redact-scan, --llm-redact (with a
local-first cloud-consent guard), and --no-audit / --skip-validation off-switches.
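For reviewers, a minimal sketch of the locale-keyed registry idea. It is not the code in ingest/redaction/: the names Detector, register, scan and nric_checksum_ok are hypothetical, and the checksum shown is the commonly documented NRIC scheme for the S/T/F/G prefixes only (newer prefixes are not handled here).

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Detector:
    category: str
    pattern: re.Pattern
    locale: Optional[str] = None  # None means "universal, always runs"
    validator: Optional[Callable[[str], bool]] = None


# Hypothetical registry: every detector module appends to this list on import.
REGISTRY: list[Detector] = []


def register(category, regex, locale=None, validator=None):
    REGISTRY.append(Detector(category, re.compile(regex), locale, validator))


def nric_checksum_ok(value: str) -> bool:
    """Check the trailing letter of an S/T/F/G-prefixed NRIC/FIN."""
    value = value.upper()
    prefix, digits, letter = value[0], value[1:8], value[8]
    total = sum(int(d) * w for d, w in zip(digits, (2, 7, 6, 5, 4, 3, 2)))
    if prefix in "TG":  # T and G series add an offset of 4
        total += 4
    table = "JZIHGFEDCBA" if prefix in "ST" else "XWUTRQPNMLK"
    return table[total % 11] == letter


# Universal detector (no locale) plus one Singapore-only detector.
register("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
register("NRIC", r"\b[STFG]\d{7}[A-Z]\b", locale="SG", validator=nric_checksum_ok)


def scan(text, locales=()):
    """Return (category, matched_text) pairs; never modifies the text."""
    active = {loc.upper() for loc in locales}
    hits = []
    for det in REGISTRY:
        if det.locale is not None and det.locale not in active:
            continue  # locale pack not selected for this run
        for m in det.pattern.finditer(text):
            if det.validator is None or det.validator(m.group()):
                hits.append((det.category, m.group()))
    return hits
```

With this sketch, scan("id S1234567D", locales=["SG"]) reports the NRIC, while a string with a wrong trailing letter such as S1234567A is rejected by the validator. That checksum step is what keeps a pattern as loose as "letter, seven digits, letter" from flooding the report.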

Conversation grouping:
- NormalizedMessage gains message_id/reply_to_id; Telegram adapter populates them.
- core: reply-threading stitches gap-split conversations back together; --multi-speaker
preserves and labels group-chat senders (the owner's turns are never labelled).
- validator: adds a pairing axis and keep/split/drop repair of over-merged samples.
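To illustrate the reply-threading step, here is a rough sketch of one way to stitch gap-split conversations back together. It is an assumption about the approach, not the code in ingest/core.py: the Msg class and group function are hypothetical, and only the message_id / reply_to_id field names come from this patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Msg:
    message_id: str
    ts: int  # timestamp in seconds
    text: str
    reply_to_id: Optional[str] = None


def group(messages, gap=3600):
    """Split on silence gaps, then merge groups connected by a reply link."""
    messages = sorted(messages, key=lambda m: m.ts)

    # Pass 1: start a new group whenever the silence exceeds `gap` seconds.
    groups = []
    for m in messages:
        if groups and m.ts - groups[-1][-1].ts <= gap:
            groups[-1].append(m)
        else:
            groups.append([m])

    # Pass 2: union-find over group indices, joined through reply links.
    parent = list(range(len(groups)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {m.message_id: gi for gi, g in enumerate(groups) for m in g}
    for gi, g in enumerate(groups):
        for m in g:
            target = owner.get(m.reply_to_id)  # None if no or unknown reply
            if target is not None:
                a, b = find(gi), find(target)
                if a != b:
                    parent[max(a, b)] = min(a, b)

    merged = {}
    for gi, g in enumerate(groups):
        merged.setdefault(find(gi), []).extend(g)
    return [sorted(g, key=lambda m: m.ts) for _, g in sorted(merged.items())]
```

A reply sent hours later lands in the same conversation as the message it answers, even though the silence gap alone would have split them. Note that this simple version can also merge non-adjacent groups around an unrelated one; whether that is desirable is a design choice the sketch does not settle.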

LLM client:
- ingest/llm.py: shared OpenAI-compatible client (OpenAI or local Ollama/vLLM/LM
Studio), replacing the Anthropic SDK. Degrades gracefully if the endpoint is down.
- Env vars renamed off the old DialogSmith name:
LLM_VALIDATE / LLM_MODEL / LLM_API_KEY / LLM_API_BASE_URL.
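As a rough illustration of the local-first guard that these variables feed, the sketch below decides whether an endpoint counts as local. It is an assumption, not the logic of llm.is_local() in this patch: the helper names and the "loopback or private address" rule are hypothetical; only the LLM_API_BASE_URL variable name is taken from the change.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(base_url):
    """True if base_url points at localhost, loopback, or a private address."""
    if not base_url:
        return False  # unset means the hosted default, so not local
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name other than localhost is treated as remote
    return addr.is_loopback or addr.is_private


def may_send_chat_text(env, allow_cloud=False):
    """Gate for the LLM redaction pass: local endpoint, or explicit consent."""
    return allow_cloud or is_local_endpoint(env.get("LLM_API_BASE_URL"))
```

Calling may_send_chat_text(os.environ) before building the client would reproduce the behaviour described for --llm-redact: a hosted API is only used when the caller opts in, which in the CLI corresponds to --allow-cloud-redaction.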

Config & docs:
- train_lora.yaml: explicit train_on_prompt: false to document loss masking (makes
--multi-speaker labels safe).
- *.local.yaml override pattern (gitignored) keeps personal model/hardware tweaks out
of git; .env reconciled to current vars; .env.example renamed to example.env.
- README restyled to the project house style; prominent caution + intended/responsible
use sections.
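To make the loss-masking point concrete, here is a small sketch of what train_on_prompt: false amounts to conceptually. It is not LLaMA-Factory's implementation: build_labels and the toy token ids are hypothetical. The value -100 is the default ignore_index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def build_labels(turns):
    """turns: list of (role, token_ids). Returns (input_ids, labels).

    Human turns stay in input_ids, so the model still reads them as context,
    but their labels are masked and the model is never trained to emit them.
    """
    input_ids, labels = [], []
    for role, ids in turns:
        input_ids.extend(ids)
        if role == "gpt":  # the owner's own turns in ShareGPT terms
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels
```

For a human turn [1, 2, 3] followed by a gpt turn [4, 5], the labels come out as [-100, -100, -100, 4, 5]. A speaker label such as "Bob: " sits inside the masked span, which is why --multi-speaker labels condition the model without being learned as output.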

Tests: adds tests/test_redaction.py and new grouping/validator cases (41 total, green).

Co-Authored-By: Claude Opus 4.8
---
.env.example | 12 --
.gitignore | 4 +
README.md | 314 +++++++++++++++++++++++-----------
configs/train_lora.yaml | 4 +
example.env | 31 ++++
ingest/adapters/telegram.py | 4 +
ingest/cli.py | 118 ++++++++++++-
ingest/core.py | 114 ++++++++++--
ingest/llm.py | 97 +++++++++++
ingest/message.py | 9 +
ingest/redaction/__init__.py | 170 ++++++++++++++++++
ingest/redaction/sg.py | 92 ++++++++++
ingest/redaction/universal.py | 76 ++++++++
ingest/redactor.py | 256 +++++++++++++++++++++++++++
ingest/validator.py | 183 +++++++++++---------
requirements.txt | 7 +-
setup.bat | 4 +-
setup.sh | 4 +-
tests/test_ingest.py | 76 +++++++-
tests/test_redaction.py | 167 ++++++++++++++++++
20 files changed, 1527 insertions(+), 215 deletions(-)
delete mode 100644 .env.example
create mode 100644 example.env
create mode 100644 ingest/llm.py
create mode 100644 ingest/redaction/__init__.py
create mode 100644 ingest/redaction/sg.py
create mode 100644 ingest/redaction/universal.py
create mode 100644 ingest/redactor.py
create mode 100644 tests/test_redaction.py
diff --git a/.env.example b/.env.example
deleted file mode 100644
index 1179511..0000000
--- a/.env.example
+++ /dev/null
@@ -1,12 +0,0 @@
-# ── LLM Validation ────────────────────────────────────────────────────────────
-# Validates extracted conversation samples for coherence and quality before
-# writing the dataset. Enabled by default when ANTHROPIC_API_KEY is set.
-# Set to false to skip validation entirely (faster, no API calls).
-DIALOGSMITH_LLM_VALIDATE=true
-
-# Model used for validation scoring (defaults to claude-haiku-4-5-20251001).
-# A fast, cheap model is recommended here — the validator runs once per sample.
-DIALOGSMITH_LLM_MODEL=claude-haiku-4-5-20251001
-
-# Your Anthropic API key. Required when DIALOGSMITH_LLM_VALIDATE=true.
-ANTHROPIC_API_KEY=your_api_key_here
diff --git a/.gitignore b/.gitignore
index 542add2..2e45380 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,6 +10,10 @@ ChatExport*/
# Tracked project config (re-include despite the broad *.json rule above)
!configs/dataset_info.json
+# Personal training/export overrides — copy a tracked config to *.local.yaml and
+# edit that for your own model/hardware; it stays out of git.
+configs/*.local.yaml
+
# Python cache / bytecode
__pycache__/
*.py[cod]
diff --git a/README.md b/README.md
index 292694b..8cf44a7 100644
--- a/README.md
+++ b/README.md
@@ -1,135 +1,221 @@
-# Doppelganger – Fine-Tune Models on Your Chat History
+Doppelganger
-**Doppelganger** lets you fine-tune large language models (LLMs) like Qwen on your own chat
-conversations. Built on top of [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), it
-formats your data into the ShareGPT format for supervised fine-tuning (SFT).
+
+ Fine-tune an LLM on your own chat history to mimic how you write
+
-Ingestion is **source-agnostic**: a small adapter parses each platform's export into a normalized
-message stream, and the rest of the pipeline (sessionizing, turn-merging, optional quality
-validation, ShareGPT formatting) is shared. **Telegram** is supported today; other sources
-(WhatsApp, etc.) are planned and slot in as drop-in adapters — see [issue #9](https://github.com/NotYuSheng/Doppelganger/issues/9).
+
+ Features •
+ Quick Start •
+ Usage •
+ Fine-Tuning •
+ Privacy
+
-## Purpose
+
+
+
+
+
+
+
-Fine-tuning on chat data can capture aspects of your text style, including:
+---
-* Writing tone, vocabulary, and phrasing
-* Typical response lengths and structure
-* Repeated expressions or idioms
-* Conversational flow and habits
+Doppelganger fine-tunes large language models (like Qwen) on your own chat conversations, capturing how *you* write. Built on top of [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), it turns a raw chat export into a [ShareGPT](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md)-formatted dataset for supervised fine-tuning (SFT), then trains a LoRA adapter on it.
-However, this method **won’t replicate your deeper beliefs, private memories, or behavior outside the chat**. It reflects how you write — not necessarily how you think.
+Ingestion is **source-agnostic**: a small adapter parses each platform's export into a normalized message stream, and the rest of the pipeline (sessionizing, turn-merging, sensitive-data scanning, optional quality auditing, ShareGPT formatting) is shared. **Telegram** is supported today, with **WhatsApp**, **Discord**, and other platforms planned — each slots in as a drop-in adapter.
-For stronger emulation, consider incorporating:
+> [!CAUTION]
+> **Your chat history is sensitive data, and you are responsible for it.** A model fine-tuned on it can memorize and later reproduce personal identifiers, private conversations, credentials, and things said by other people who never consented. The built-in [sensitive-data scanning](#privacy--sensitive-data) is a **safety net, not a guarantee** — both regex and LLM detection miss real cases and raise false positives. Before training, sharing, or deploying anything, **review the dataset yourself**, obtain any consent you need, and ensure you comply with applicable privacy laws. Treat trained adapters and merged checkpoints as sensitive too — they can leak the data they were trained on.
-* Additional sources like emails or forum posts
-* Clear prompt instructions during inference
-* Domain-specific datasets (e.g., technical messages, inside jokes)
+> [!IMPORTANT]
+> **This is a for-fun, experimental project — not a production tool.** A model that imitates a real person can be misused for impersonation, deception, or social engineering, and it will happily generate convincing messages that person never actually wrote. Don't present its output as genuinely from anyone, don't train on someone else's chats without their knowledge, and don't rely on it for anything that matters. Enjoy it responsibly.
-## Warning: Risk of Sensitive Data Exposure
+Fine-tuning on your chats can capture your:
-Fine-tuning on real chat history may unintentionally encode:
+- **Writing tone, vocabulary, and phrasing**
+- **Typical response lengths and structure**
+- **Repeated expressions and idioms**
+- **Conversational flow and habits**
-* Personal identifiers (names, locations, contact info)
-* Confidential conversations
-* Sensitive or offensive content
+> **Note**: This reflects *how you write*, not how you think — it **won't** replicate your deeper beliefs, private memories, or behaviour outside the chat. For stronger emulation, add other sources (emails, forum posts), clear prompt instructions at inference, and domain-specific data (technical messages, inside jokes).
-> **Always review and sanitize your exported dataset (`result.json`) before training.**
-> You are responsible for ensuring compliance with privacy laws and personal data protection.
+## Features
-### Keeping your data out of git
+| Feature | Description |
+|---------|-------------|
+| **Source-agnostic ingestion** | One adapter per platform parses an export into a normalized message stream; the rest of the pipeline is shared. Telegram today; others drop in without touching the core. |
+| **Conversation reconstruction** | Sessionizes messages by silence gaps **and** reply links, merges consecutive turns, and (optionally) preserves per-speaker labels in group chats. |
+| **Sensitive-data scan** | Non-destructive regex scan over the built conversations — email, payment cards (checksum-validated), IP/MAC, API keys, plus pluggable country ID packs. Writes an audit report; you decide what to remove. |
+| **LLM redaction** *(optional)* | An OpenAI-compatible model flags context-dependent PII (names, secrets) regex misses, into the same report and apply step. Local-first by design. |
+| **LLM quality auditor** *(optional)* | Scores each conversation for coherence, quality, and pairing; drops weak samples and splits over-merged ones. |
+| **ShareGPT output** | Emits exactly the format LLaMA-Factory consumes for SFT, with loss masked to your own turns. |
+| **LoRA fine-tuning** | Ready-made train / export / chat configs; swap the base model in one place. |
-Your chat export and any generated datasets are ignored by `.gitignore`
-(`result.json`, `*.json`, `*.jsonl`, `DataExport*/`, `*.session`, `.env`, plus Telegram
-media/contacts such as `*.vcard`, `*.tgs`, `*.webp`, `*.ogg`/`*.oga`). Generic
-media (`.jpg`, `.mp4`, …) lives inside `DataExport*/`, which is ignored
-wholesale. As an extra safeguard, a pre-commit hook refuses to commit these
-files even if they are force-added. Enable it once per clone:
+## Quick Start
-```bash
-git config core.hooksPath hooks
-```
+### Prerequisites
-To deliberately commit a blocked file, bypass the hook with `git commit --no-verify`.
+| Software | Version | Purpose |
+|----------|---------|---------|
+| Python | 3.11–3.13 | Required by LLaMA-Factory 0.9.4 |
+| PyTorch | CUDA build for your GPU | Training (see the [install matrix](https://pytorch.org/get-started/locally/)) |
+| git | Latest | Clone + the dataset-hygiene pre-commit hook |
+| LLM server | Any OpenAI-compatible API | **Optional** — quality auditing & LLM redaction (Ollama, vLLM, LM Studio, OpenAI) |
-## Requirements
+A CUDA-capable GPU is needed for training. Ingestion (parsing → dataset) runs fine on CPU.
-* **Python 3.11–3.13** (required by LLaMA-Factory 0.9.4)
-* A CUDA-capable GPU for training, with a matching [PyTorch build](https://pytorch.org/get-started/locally/)
-* `git`
+### Installation
-## Export Your Telegram Chat
+**1. Export your Telegram chat**
-1. Open **Telegram Desktop**.
-2. Go to: `Settings > Advanced > Export Telegram Data`.
-3. Select your personal chat or group to export.
-4. Ensure **JSON** format is selected (not HTML).
-5. Place the exported `result.json` file into:
+In **Telegram Desktop**: `Settings > Advanced > Export Telegram Data`. Select your chat(s), choose **JSON** format (not HTML), and place the result here:
```
Doppelganger/
-├── data/
-│ └── result.json ← Place here
+└── data/
+ └── result.json ← place your export here
```
-## Setup
-
-The setup scripts create a virtual environment, install pinned dependencies
-(LLaMA-Factory **0.9.4**), and process your export into `data/chat_sharegpt.json`.
+**2. Clone and run setup**
-**Linux / macOS:**
+The setup scripts create a virtual environment, install pinned dependencies (LLaMA-Factory **0.9.4**), create your `.env`, and process the export into `data/chat_sharegpt.json`.
```bash
-./setup.sh
-```
-
-**Windows** (from **Command Prompt**, not PowerShell):
+git clone https://github.com/NotYuSheng/Doppelganger.git
+cd Doppelganger
-```cmd
-setup.bat
+./setup.sh # Linux / macOS
+setup.bat # Windows (from Command Prompt, not PowerShell)
```
-Prefer to do it manually? The scripts are thin wrappers around:
+
+Prefer to run it manually?
```bash
python -m venv venv
-# activate: source venv/bin/activate (Windows: venv\Scripts\activate)
+source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
python -m ingest --source telegram
```
+
-### Ingestion options
+**3. (Optional) Configure LLM features**
-`python -m ingest` turns a raw export into a dataset. Useful flags:
+Copy `example.env` to `.env` (the setup scripts do this for you) and fill it in to enable the quality auditor and LLM redaction. Local endpoints keep your chat data on your machine:
-| Flag | Default | Description |
-| --------------------- | ------------------------- | ------------------------------------------------------ |
-| `--source` | `telegram` | Chat source to parse (more planned) |
-| `--input` | `./data/result.json` | Path to the raw export |
-| `--format` | `sharegpt` | `sharegpt` (for training) or `jsonl` (intermediate) |
-| `--self-name` | auto-detected | Override which sender is "you" |
-| `--conversation-gap` | `3600` | Seconds of silence that start a new conversation |
-| `--message-chain` | `30` | Max seconds between same-sender messages to merge |
+```dotenv
+LLM_VALIDATE=true
+LLM_MODEL=gpt-4o-mini
+LLM_API_KEY=your_api_key_here
+# For a local model instead (key can be any value):
+# LLM_API_BASE_URL=http://localhost:11434/v1
+# LLM_MODEL=qwen2.5
+```
+
+**4. Fine-tune**
+
+```bash
+source venv/bin/activate
+llamafactory-cli train configs/train_lora.yaml
+```
-### Optional: LLM quality validation
+## Usage
-Each extracted conversation can be scored for coherence and quality, dropping weak samples before
-training. It is enabled automatically when `ANTHROPIC_API_KEY` is set. Copy `.env.example` to `.env`
-and fill it in (the setup scripts do this for you):
+`python -m ingest` turns a raw export into a training-ready dataset. Useful flags:
-```dotenv
-DIALOGSMITH_LLM_VALIDATE=true
-DIALOGSMITH_LLM_MODEL=claude-haiku-4-5-20251001
-ANTHROPIC_API_KEY=your_api_key_here
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--source` | `telegram` | Chat source to parse (more planned) |
+| `--input` | `./data/result.json` | Path to the raw export |
+| `--format` | `sharegpt` | `sharegpt` (for training) or `jsonl` (intermediate) |
+| `--self-name` | auto-detected | Override which sender is "you" |
+| `--conversation-gap` | `3600` | Seconds of silence that start a new conversation |
+| `--message-chain` | `30` | Max seconds between same-sender messages to merge into one turn |
+| `--multi-speaker` | off | In group chats, keep and label each sender on the human side (your turns are never labelled) |
+| `--no-audit` | off | Master off-switch: skip **all** auditing (regex scan + LLM validation) and just build the dataset |
+| `--skip-redact-scan` | off | Skip only the regex sensitive-data scan |
+| `--skip-validation` | off | Skip only the LLM quality validation |
+
+### Optional: LLM quality auditing
+
+Each extracted conversation can be scored for **coherence, quality, and pairing**, dropping or splitting weak samples before training. It uses the OpenAI-compatible API, so it works with OpenAI **or any local server** (Ollama, vLLM, LM Studio). It's enabled automatically when `LLM_API_KEY` or `LLM_API_BASE_URL` is set (configure it in `.env`, step 3 above).
+
+To turn it off, set `LLM_VALIDATE=false` in `.env` (persistent) or pass `--skip-validation` for a single run. To disable **all** auditing at once — both this and the regex scan — use `--no-audit`.
+
+## Privacy & Sensitive Data
+
+Fine-tuning on real chat history may unintentionally encode personal identifiers, confidential conversations, or sensitive content.
+
+> **Always review and sanitize your dataset before training.** You are responsible for compliance with privacy laws and personal data protection.
+
+### Automated sensitive-data scan
+
+To make that review practical, ingestion runs a **regex-based scan** over the built conversations by default. It is **non-destructive** — it only flags and warns, writing `data/redaction_report.json` (with masked previews) and printing a summary so you can decide what to do:
+
+```
+[redactor] WARNING: 3 potential sensitive item(s) detected across 2 conversations:
+ EMAIL 2 hit(s) in 2 conversation(s) [medium]
+ CARD_NUMBER 1 hit(s) in 1 conversation(s) [high]
+ API_KEY 1 hit(s) in 1 conversation(s) [high]
+```
+
+Detection works everywhere out of the box. **Universal detectors** — email, payment cards (checksum-validated), IP/MAC addresses, API keys and private keys — aren't tied to any country and always run. On top of those, optional **locale packs** add country-specific identifiers (national IDs, local phone/postal formats).
+
+Once you've reviewed the report, act on it:
+
+```bash
+python -m ingest --source telegram --redact replace # swap spans for [CATEGORY]
+python -m ingest --source telegram --redact drop # drop flagged conversations
+python -m ingest --source telegram --skip-redact-scan # opt out entirely
+```
+
+### Add coverage for your country
+
+Locale packs are built to be community-contributed: each is a single drop-in module under [`ingest/redaction/`](ingest/redaction/), needing no changes to the scanner or pipeline. Adding one is three steps:
+
+1. Copy an existing pack to a new module under `ingest/redaction/`, named after your ISO country code.
+2. Register detectors with `make(...)`, setting `locale` to that country code. Back each pattern with a checksum/validator where the identifier has one, since that precision is what keeps the report trustworthy instead of noisy.
+3. Import your module in `ingest/redaction/__init__.py`.
+
+Singapore ships as the worked reference ([`sg.py`](ingest/redaction/sg.py): national ID with checksum, local phone, postal code) — but the recipe is the same for any country, and **PRs for new locales are welcome**. Choose which packs run with `--redact-locales` (universal detectors always run regardless).
+
+### LLM-assisted redaction
+
+Regex can't catch everything (names, context-dependent secrets). With `--llm-redact`, an LLM additionally flags such spans into the **same report and the same `--redact` step** — it points at verbatim spans, never rewriting your text. To protect your data it **prefers a local endpoint**: set `LLM_API_BASE_URL` to a local OpenAI-compatible server; without one it refuses to use a hosted API unless you pass `--allow-cloud-redaction`.
+
+```bash
+LLM_API_BASE_URL=http://localhost:11434/v1 LLM_MODEL=qwen2.5 \
+ python -m ingest --source telegram --llm-redact --redact replace
```
-Set `DIALOGSMITH_LLM_VALIDATE=false` to skip validation entirely (no API calls).
+### Keeping your data out of git
+
+Your chat export and any generated datasets are ignored by `.gitignore` (`result.json`, `*.json`, `*.jsonl`, `DataExport*/`, `*.session`, `.env`, plus Telegram media/contacts such as `*.vcard`, `*.tgs`, `*.webp`, `*.ogg`/`*.oga`). Generic media (`.jpg`, `.mp4`, …) lives inside `DataExport*/`, which is ignored wholesale. As an extra safeguard, a pre-commit hook refuses to commit these files even if they are force-added. Enable it once per clone:
+
+```bash
+git config core.hooksPath hooks
+```
+
+To deliberately commit a blocked file, bypass the hook with `git commit --no-verify`.
+
+## Intended Use & Responsible Use
+
+Doppelganger is a **personal, educational project** — built for individuals to experiment with fine-tuning on **their own** chat history, for fun and learning. It is not a product, and it is **not** intended for profiling or surveilling other people, or for any commercial or deceptive use.
+
+If you use it, please:
+
+- **Use your own data.** Train on chats you're a participant in. Group chats include other people's messages — be considerate, and don't publish models trained on them.
+- **Keep it local.** Don't publish the dataset, the trained adapter, or merged checkpoints — they can leak the conversations they were trained on.
+- **Don't impersonate or deceive.** Never present generated text as something a real person actually said or wrote.
+- **Respect the law.** You are responsible for complying with the privacy and data-protection laws in your jurisdiction.
+
+In short: it's a toy for exploring how *you* write — please keep it that way.
## Fine-Tune Your Model (LoRA)
-Training is configured by [`configs/train_lora.yaml`](configs/train_lora.yaml), which defaults to
-**Qwen1.5-1.8B-Chat** and the `chat_sharegpt` dataset registered in
-[`configs/dataset_info.json`](configs/dataset_info.json). Activate your venv, then run:
+Training is configured by [`configs/train_lora.yaml`](configs/train_lora.yaml), which defaults to **Qwen1.5-1.8B-Chat** and the `chat_sharegpt` dataset registered in [`configs/dataset_info.json`](configs/dataset_info.json). Activate your venv, then run:
```bash
llamafactory-cli train configs/train_lora.yaml
@@ -139,16 +225,27 @@ llamafactory-cli train configs/train_lora.yaml
Edit `configs/train_lora.yaml`:
-| Field | Description |
-| ---------------------- | -------------------------------------------------------- |
-| `model_name_or_path` | Hugging Face model ID or local model path |
-| `template` | Prompt template type (e.g., `qwen`, `chatml`, `default`) |
-| `lora_target` | LoRA target modules (`all` works across architectures) |
-| `output_dir` | Destination to save the LoRA checkpoints |
+| Field | Description |
+|-------|-------------|
+| `model_name_or_path` | Hugging Face model ID or local model path |
+| `template` | Prompt template type (e.g. `qwen`, `chatml`, `default`) |
+| `lora_target` | LoRA target modules (`all` works across architectures) |
+| `output_dir` | Destination to save the LoRA checkpoints |
+
+For example, to use `mistralai/Mistral-7B-Instruct-v0.2`, set `model_name_or_path` accordingly and `template: chatml`. Refer to the [LLaMA-Factory model table](https://github.com/hiyouga/LLaMA-Factory#supported-models) for recommended values.
+
+> **Note**: Training masks the loss to your own (assistant) turns — `train_on_prompt: false`. That's why `--multi-speaker` labels on the human side are safe: the model reads them as context but never learns to produce them.
+
+#### Keep personal tweaks out of git
-For example, to use `mistralai/Mistral-7B-Instruct-v0.2`, set `model_name_or_path` accordingly and
-`template: chatml`. Refer to the
-[LLaMA-Factory model table](https://github.com/hiyouga/LLaMA-Factory#supported-models) for recommended values.
+The configs above are committed defaults — editing them in place shows up in `git status` and risks committing your machine-specific model/hyperparameters. To customize **without touching tracked files**, copy a config to a `*.local.yaml` name and edit that instead. Any `configs/*.local.yaml` is gitignored:
+
+```bash
+cp configs/train_lora.yaml configs/train_lora.local.yaml # edit model, batch size, etc.
+llamafactory-cli train configs/train_lora.local.yaml
+```
+
+The same works for `export_lora.local.yaml`. Your overrides stay local; the repo's defaults stay clean.
### Resume training
@@ -176,7 +273,7 @@ llamafactory-cli chat \
Update `--template` to match the one used during training.
-## Activating the environment later
+## Activating the Environment Later
After running setup once, reactivate the venv in future sessions before running any commands:
@@ -185,21 +282,30 @@ source venv/bin/activate # Linux / macOS
venv\Scripts\activate # Windows (Command Prompt)
```
-## Running the tests
+## Running the Tests
-The ingestion pipeline (parsing, sessionizing, turn-merging, ShareGPT formatting) is covered by a
-fast unit suite — no GPU, network, or API key required:
+The ingestion pipeline (parsing, sessionizing, turn-merging, reply-threading, sensitive-data detection, ShareGPT formatting) is covered by a fast unit suite — no GPU, network, or API key required:
```bash
python -m unittest discover -s tests -t .
```
-It runs in well under a second and locks in the conversion behaviour, so you can verify a change
-without running the full pipeline.
+It runs in well under a second and locks in the conversion behaviour, so you can verify a change without running the full pipeline.
+
+## Legacy Workflow
+
+The pre-refactor, Windows-only workflow (which cloned LLaMA-Factory at HEAD) is preserved at the [`v0.1.0`](https://github.com/NotYuSheng/Doppelganger/releases/tag/v0.1.0) tag. The old `scripts/telegram_extract.py` and `scripts/convert_to_sharegpt.py` still work as thin deprecated wrappers around `python -m ingest`, but will be removed in a future release.
+
+## Star History
+
+
+
+
+
+
+
+
-## Legacy workflow
+## License
-The pre-refactor, Windows-only workflow (which cloned LLaMA-Factory at HEAD) is preserved at the
-[`v0.1.0`](https://github.com/NotYuSheng/Doppelganger/releases/tag/v0.1.0) tag. The old
-`scripts/telegram_extract.py` and `scripts/convert_to_sharegpt.py` still work as thin deprecated
-wrappers around `python -m ingest`, but will be removed in a future release.
+This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.
diff --git a/configs/train_lora.yaml b/configs/train_lora.yaml
index d1499e5..ff4db98 100644
--- a/configs/train_lora.yaml
+++ b/configs/train_lora.yaml
@@ -33,6 +33,10 @@ plot_loss: true
overwrite_output_dir: true
### train
+# Loss is computed only on your (assistant/"gpt") turns; human turns are masked.
+# This is the SFT default, set explicitly here so --multi-speaker speaker labels
+# on the human side are safe — they condition the model but are never generated.
+train_on_prompt: false
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 5.0e-5
diff --git a/example.env b/example.env
new file mode 100644
index 0000000..22f142a
--- /dev/null
+++ b/example.env
@@ -0,0 +1,31 @@
+# Copy to .env and fill in. .env is gitignored — never commit your keys.
+# Every value here is OPTIONAL; with none set, ingestion still runs (the LLM
+# features just stay off).
+
+# ── Optional LLM features ─────────────────────────────────────────────────────
+# Used by the conversation quality auditor and the optional LLM redaction pass.
+# Both speak the OpenAI-compatible API, so they work with OpenAI or any local
+# server (Ollama, vLLM, LM Studio, llama.cpp). Running a LOCAL model keeps your
+# chat data on your machine — the recommended setup for private data.
+
+# Enable/disable the quality auditor. Default: enabled when LLM_API_KEY or
+# LLM_API_BASE_URL is set. Set to false to skip it entirely (no API calls).
+LLM_VALIDATE=true
+
+# Model id. For a local server use whatever it serves (e.g. qwen2.5, llama3.1).
+LLM_MODEL=gpt-4o-mini
+
+# API key. Required for hosted APIs; local servers usually accept any value.
+LLM_API_KEY=your_api_key_here
+
+# OpenAI-compatible endpoint. Set this to use a local model, e.g.
+# http://localhost:11434/v1 (Ollama)
+# http://localhost:8000/v1 (vLLM)
+# Leave unset to use OpenAI's hosted API.
+# LLM_API_BASE_URL=http://localhost:11434/v1
+
+# ── Optional: Hugging Face ────────────────────────────────────────────────────
+# Only needed to download GATED models during training (e.g. Gemma). The default
+# Qwen model in configs/train_lora.yaml is open and needs no token. Read by the
+# training stack (huggingface_hub), not by this repo's ingestion code.
+# HF_TOKEN=
diff --git a/ingest/adapters/telegram.py b/ingest/adapters/telegram.py
index dabb2b8..7d0bbae 100644
--- a/ingest/adapters/telegram.py
+++ b/ingest/adapters/telegram.py
@@ -65,6 +65,8 @@ def parse(
if not _is_valid(msg):
continue
sender = msg.get("from")
+ reply_to = msg.get("reply_to_message_id")
+ msg_id = msg.get("id")
messages.append(
NormalizedMessage(
chat_id=chat_id,
@@ -72,6 +74,8 @@ def parse(
sender_id=sender,
sender_is_self=(sender == self_name),
text=_get_text(msg),
+ message_id=str(msg_id) if msg_id is not None else None,
+ reply_to_id=str(reply_to) if reply_to is not None else None,
)
)
return messages
diff --git a/ingest/cli.py b/ingest/cli.py
index b666f5c..2cf2a5f 100644
--- a/ingest/cli.py
+++ b/ingest/cli.py
@@ -10,7 +10,9 @@
import os
import sys
-from ingest import core, sharegpt
+import os.path
+
+from ingest import core, redactor, sharegpt
from ingest.adapters import available_sources, get_adapter
from ingest.validator import validate_samples
@@ -38,6 +40,36 @@ def _load_dotenv(path: str = ".env") -> None:
os.environ[key] = value
+def _run_llm_redaction(samples, allow_cloud: bool):
+ """Run the optional LLM redaction pass, guarding against accidental cloud use.
+
+ Returns a (possibly empty) list of LLM findings. Prefers a local endpoint;
+ if none is configured and cloud use wasn't explicitly allowed, it warns and
+ skips rather than silently shipping chat data to a third party.
+ """
+ from ingest import llm
+
+ if not llm.is_local() and not allow_cloud:
+ print(
+ "[redactor] --llm-redact set but no local endpoint configured. "
+ "Refusing to send chat data to a hosted API by default. Set "
+ f"{llm.BASE_URL_ENV} to a local OpenAI-compatible server (Ollama, "
+ "vLLM, LM Studio, ...), or pass --allow-cloud-redaction to override. "
+ "Skipping LLM pass."
+ )
+ return []
+
+ try:
+ client = llm.get_client()
+ except (ImportError, EnvironmentError) as e:
+ print(f"[redactor] LLM redaction unavailable: {e}. Skipping LLM pass.")
+ return []
+
+ model = llm.model()
+ print(f"[redactor] LLM redaction scan via {model} ({llm.endpoint_label()})...")
+ return redactor.llm_scan_samples(samples, client, model)
+
+
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(
prog="python -m ingest",
@@ -85,6 +117,58 @@ def build_parser() -> argparse.ArgumentParser:
help=f"Max seconds between same-sender messages to merge into one turn "
f"(default: {core.DEFAULT_MESSAGE_CHAIN}).",
)
+ parser.add_argument(
+ "--multi-speaker",
+ action="store_true",
+ help="In group chats, keep individual senders and label each user turn "
+ "with their name (e.g. 'Bob: ...'). Your own turns are never labelled. "
+ "Default collapses the other side into one speaker.",
+ )
+ parser.add_argument(
+ "--redact",
+ choices=["off", "replace", "drop"],
+ default="off",
+ help="What to do with detected sensitive data. 'off' (default) only "
+ "scans and writes a report. 'replace' swaps spans for [CATEGORY] "
+ "placeholders; 'drop' removes conversations containing detections.",
+ )
+ parser.add_argument(
+ "--redact-locales",
+ default="SG",
+ help="Comma-separated locales for sensitive-data detection (universal "
+ "patterns always run). Default: SG.",
+ )
+ parser.add_argument(
+ "--skip-redact-scan",
+ action="store_true",
+ help="Skip the sensitive-data scan/report entirely.",
+ )
+ parser.add_argument(
+ "--llm-redact",
+ action="store_true",
+ help="Additionally use an LLM to flag context-dependent sensitive data "
+ "(names, secrets regex misses). Prefers a local endpoint: set "
+ "LLM_API_BASE_URL, or pass --allow-cloud-redaction to use a hosted API "
+ "(which sends chat text to a third party).",
+ )
+ parser.add_argument(
+ "--allow-cloud-redaction",
+ action="store_true",
+ help="Permit LLM redaction against a hosted API when no local "
+ "LLM_API_BASE_URL is configured.",
+ )
+ parser.add_argument(
+ "--no-audit",
+ action="store_true",
+ help="Master off-switch: skip ALL auditing — the regex sensitive-data "
+ "scan and the LLM quality validation. Just build the dataset.",
+ )
+ parser.add_argument(
+ "--skip-validation",
+ action="store_true",
+ help="Skip only the LLM quality validation (the regex scan still runs "
+ "unless --skip-redact-scan / --no-audit is also given).",
+ )
return parser
@@ -110,10 +194,40 @@ def main(argv=None) -> int:
messages,
conversation_gap=args.conversation_gap,
message_chain=args.message_chain,
+ multi_speaker=args.multi_speaker,
)
print(f"Extracted {len(samples)} conversation samples.")
- samples = validate_samples(samples)
+ # --no-audit is the master off-switch; the granular flags disable one half.
+ skip_scan = args.no_audit or args.skip_redact_scan
+ skip_validation = args.no_audit or args.skip_validation
+ if args.no_audit:
+ print("[audit] All auditing disabled (--no-audit) — building dataset as-is.")
+
+ if not skip_scan:
+ locales = [s.strip() for s in args.redact_locales.split(",") if s.strip()]
+ report = redactor.scan_samples(samples, locales=locales)
+
+ llm_findings = []
+ if args.llm_redact:
+ llm_findings = _run_llm_redaction(samples, args.allow_cloud_redaction)
+ redactor.merge_llm_findings(report, llm_findings)
+
+ report_path = os.path.join(os.path.dirname(output) or ".", "redaction_report.json")
+ redactor.write_report(report, report_path)
+ redactor.print_summary(report, report_path, mode=args.redact)
+ if args.redact != "off":
+ before = len(samples)
+ samples = redactor.apply(
+ samples, args.redact, locales=locales, llm_findings=llm_findings
+ )
+ print(
+ f"[redactor] Applied --redact {args.redact}: "
+ f"{before} -> {len(samples)} samples."
+ )
+
+ if not skip_validation:
+ samples = validate_samples(samples)
if args.format == "sharegpt":
written = sharegpt.write_sharegpt(samples, output)
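Review note: the precedence between `--no-audit` and the two granular switches can be checked in isolation. The sketch below is illustrative only; `build_flag_parser` and `resolve_audit_flags` are hypothetical names that do not exist in the patch, and only the three flag names are taken from it.

```python
import argparse
from typing import Tuple


def build_flag_parser() -> argparse.ArgumentParser:
    # Hypothetical parser carrying only the three audit switches from the patch.
    parser = argparse.ArgumentParser(prog="audit-flags-sketch")
    parser.add_argument("--no-audit", action="store_true")
    parser.add_argument("--skip-redact-scan", action="store_true")
    parser.add_argument("--skip-validation", action="store_true")
    return parser


def resolve_audit_flags(args: argparse.Namespace) -> Tuple[bool, bool]:
    """Return (skip_scan, skip_validation); --no-audit implies both."""
    skip_scan = args.no_audit or args.skip_redact_scan
    skip_validation = args.no_audit or args.skip_validation
    return skip_scan, skip_validation


if __name__ == "__main__":
    parser = build_flag_parser()
    for argv in ([], ["--skip-validation"], ["--no-audit"]):
        print(argv, resolve_audit_flags(parser.parse_args(argv)))
```

Keeping the resolution in one small function would make the master switch testable without running the whole ingest pipeline.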
diff --git a/ingest/core.py b/ingest/core.py
index da50061..c27b6fe 100644
--- a/ingest/core.py
+++ b/ingest/core.py
@@ -56,6 +56,61 @@ def _split_into_conversations(
return conversations
+def _merge_by_reply(
+ conversations: List[List[NormalizedMessage]],
+) -> List[List[NormalizedMessage]]:
+ """Stitch back conversations that a silence gap split but a reply connects.
+
+ A time gap is a guess at where one conversation ends. An explicit reply link
+ is ground truth: if a message replies to one in an earlier (same-chat)
+ conversation, they belong together. We union such conversations and re-sort
+ each merged group chronologically.
+
+ When no message carries reply metadata (``message_id``/``reply_to_id`` all
+ ``None``), there is nothing to union and the input is returned unchanged — so
+ sources without reply data keep the pure time-based behaviour.
+ """
+ n = len(conversations)
+ if n <= 1:
+ return conversations
+
+ id_to_conv = {
+ m.message_id: ci
+ for ci, conv in enumerate(conversations)
+ for m in conv
+ if m.message_id is not None
+ }
+ if not id_to_conv:
+ return conversations
+
+ parent = list(range(n))
+
+ def find(x: int) -> int:
+ while parent[x] != x:
+ parent[x] = parent[parent[x]]
+ x = parent[x]
+ return x
+
+ def union(a: int, b: int) -> None:
+ ra, rb = find(a), find(b)
+ if ra != rb:
+ parent[max(ra, rb)] = min(ra, rb)
+
+ for ci, conv in enumerate(conversations):
+ for m in conv:
+ target = id_to_conv.get(m.reply_to_id) if m.reply_to_id else None
+ if target is not None and target != ci:
+ union(ci, target)
+
+ groups: "Dict[int, List[NormalizedMessage]]" = {}
+ for ci in range(n):
+ groups.setdefault(find(ci), []).extend(conversations[ci])
+
+ # Order merged groups by their earliest message so output stays chronological.
+ ordered_roots = sorted(groups, key=lambda r: min(m.timestamp for m in groups[r]))
+ return [sorted(groups[r], key=lambda m: m.timestamp) for r in ordered_roots]
+
+
def _collect_turn(
conversation: List[NormalizedMessage], start_idx: int, chain_threshold: int
):
@@ -82,35 +137,70 @@ def _collect_turn(
return texts, j
+def _assemble_turns(raw_turns, multi_speaker: bool) -> Sample:
+ """Turn ``(sender_id, is_self, text)`` runs into role/text turns.
+
+ Roles: the dataset owner is ``assistant`` (this is what the doppelganger
+ learns to produce, so it is *never* labelled), everyone else is ``user``.
+
+ Default mode merges adjacent same-role runs, so in a group chat several
+ people on the "other side" collapse into one ``user`` turn. ``multi_speaker``
+ instead keeps each speaker distinct and prefixes ``user`` turns with the
+ sender (``"Bob: ..."``), only merging consecutive runs from the *same*
+ sender — preserving who-said-what as conditioning context.
+ """
+ turns: Sample = []
+ last_sender = None
+
+ for sender_id, is_self, text in raw_turns:
+ role = "assistant" if is_self else "user"
+ value = f"{sender_id}: {text}" if (multi_speaker and role == "user") else text
+
+ same_role = bool(turns) and turns[-1]["role"] == role
+ # In multi-speaker mode a user turn only merges with the previous turn
+ # when it is the same speaker; otherwise distinct speakers stay distinct.
+ mergeable = same_role and not (
+ multi_speaker and role == "user" and last_sender != sender_id
+ )
+ if mergeable:
+ turns[-1]["text"] += "\n" + value
+ else:
+ turns.append({"role": role, "text": value})
+ last_sender = sender_id
+
+ return turns
+
+
def build_samples(
messages: Iterable[NormalizedMessage],
conversation_gap: int = DEFAULT_CONVERSATION_GAP,
message_chain: int = DEFAULT_MESSAGE_CHAIN,
+ multi_speaker: bool = False,
) -> List[Sample]:
"""Turn normalized messages into multi-turn conversation samples.
- Splits each chat into conversations, merges consecutive same-sender messages
- into turns, and keeps only conversations containing at least one user turn
- and one assistant turn.
+ Splits each chat into conversations (stitching reply-linked ones back
+ together), merges consecutive same-sender messages into turns, and keeps
+ only conversations containing at least one user turn and one assistant turn.
+
+ ``multi_speaker`` preserves and labels individual senders in group chats
+ (see :func:`_assemble_turns`); the default collapses the other side.
"""
samples: List[Sample] = []
for chat_messages in _group_by_chat(messages):
- for conversation in _split_into_conversations(chat_messages, conversation_gap):
- turns: Sample = []
+ time_convs = _split_into_conversations(chat_messages, conversation_gap)
+ for conversation in _merge_by_reply(time_convs):
+ raw_turns = []
i = 0
while i < len(conversation):
texts, next_i = _collect_turn(conversation, i, message_chain)
if texts:
- role = "assistant" if conversation[i].sender_is_self else "user"
- turn_text = "\n".join(texts)
- # Merge with previous turn if same role (e.g. gap split a block).
- if turns and turns[-1]["role"] == role:
- turns[-1]["text"] += "\n" + turn_text
- else:
- turns.append({"role": role, "text": turn_text})
+ m = conversation[i]
+ raw_turns.append((m.sender_id, m.sender_is_self, "\n".join(texts)))
i = next_i
+ turns = _assemble_turns(raw_turns, multi_speaker)
roles = {t["role"] for t in turns}
if "user" in roles and "assistant" in roles:
samples.append(turns)
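Review note: the reply-stitching step in `_merge_by_reply` is a small union-find over conversation indices. The sketch below restates it in self-contained form so the behaviour is easy to try; `Msg` is a hypothetical stand-in for `NormalizedMessage` with only the fields the merge needs, and every conversation is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Msg:
    # Hypothetical stand-in for NormalizedMessage (assumption: integer timestamps).
    timestamp: int
    message_id: Optional[str] = None
    reply_to_id: Optional[str] = None


def merge_by_reply(conversations: List[List[Msg]]) -> List[List[Msg]]:
    # Map every known message id to the index of the conversation holding it.
    id_to_conv = {
        m.message_id: ci
        for ci, conv in enumerate(conversations)
        for m in conv
        if m.message_id is not None
    }
    parent = list(range(len(conversations)))

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for ci, conv in enumerate(conversations):
        for m in conv:
            if m.reply_to_id is None:
                continue
            target = id_to_conv.get(m.reply_to_id)
            if target is not None:
                ra, rb = find(ci), find(target)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)

    # Collect messages per root, then order groups and messages by time.
    groups: Dict[int, List[Msg]] = {}
    for ci, conv in enumerate(conversations):
        groups.setdefault(find(ci), []).extend(conv)
    roots = sorted(groups, key=lambda r: min(m.timestamp for m in groups[r]))
    return [sorted(groups[r], key=lambda m: m.timestamp) for r in roots]
```

Note that a reply can join two conversations that have an unrelated one between them in time; the merged group then spans that middle conversation, which stays separate.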
diff --git a/ingest/llm.py b/ingest/llm.py
new file mode 100644
index 0000000..b862acf
--- /dev/null
+++ b/ingest/llm.py
@@ -0,0 +1,97 @@
+"""Shared OpenAI-compatible LLM client.
+
+One client for every optional LLM feature (quality validation, LLM redaction).
+It speaks the OpenAI Chat Completions API, so it works against OpenAI itself
+*and* any local/self-hosted server that exposes that API — Ollama, vLLM, LM
+Studio, llama.cpp's server, LiteLLM, etc. Running a local endpoint is the
+privacy-preserving way to use these features, since your chat text never leaves
+your machine.
+
+Environment variables:
+ LLM_VALIDATE true/false. Default: enabled when LLM_API_KEY or
+ LLM_API_BASE_URL is set, disabled otherwise.
+ LLM_API_BASE_URL OpenAI-compatible base URL. Set this for a local model, e.g.
+ http://localhost:11434/v1 (Ollama) or http://localhost:8000/v1
+ (vLLM). Unset → OpenAI's hosted API.
+ LLM_MODEL Model id (default: gpt-4o-mini). For a local server use whatever
+ it serves, e.g. "qwen2.5" or "llama3.1".
+ LLM_API_KEY API key. Local servers usually accept any value; falls back to
+ OPENAI_API_KEY if unset.
+"""
+
+import os
+
+VALIDATE_ENV = "LLM_VALIDATE"
+MODEL_ENV = "LLM_MODEL"
+BASE_URL_ENV = "LLM_API_BASE_URL"
+API_KEY_ENV = "LLM_API_KEY"
+DEFAULT_MODEL = "gpt-4o-mini"
+
+
+def base_url() -> str:
+ return os.environ.get(BASE_URL_ENV, "").strip()
+
+
+def model() -> str:
+ return os.environ.get(MODEL_ENV, "").strip() or DEFAULT_MODEL
+
+
+def is_local() -> bool:
+ """True when a custom (presumably local/self-hosted) endpoint is configured."""
+ return bool(base_url())
+
+
+def _api_key() -> str:
+ return (
+ os.environ.get(API_KEY_ENV, "").strip()
+ or os.environ.get("OPENAI_API_KEY", "").strip()
+ )
+
+
+def should_validate() -> bool:
+ val = os.environ.get(VALIDATE_ENV, "").strip().lower()
+ if val == "false":
+ return False
+ if val == "true":
+ return True
+ # Default: enable when there's something to talk to.
+ return bool(_api_key() or base_url())
+
+
+def get_client():
+ """Build an OpenAI-compatible client. Raises if unusable (caller handles)."""
+ try:
+ from openai import OpenAI
+ except ImportError:
+ raise ImportError(
+ "The 'openai' package is required for LLM features. "
+ "Install it with: pip install openai"
+ )
+ url = base_url()
+ key = _api_key()
+ if not key:
+ if url:
+ key = "not-needed" # local servers ignore it, but the SDK requires a value
+ else:
+ raise EnvironmentError(
+ f"{API_KEY_ENV} is not set. Set it, point {BASE_URL_ENV} at a "
+ f"local endpoint, or set {VALIDATE_ENV}=false."
+ )
+ kwargs = {"api_key": key}
+ if url:
+ kwargs["base_url"] = url
+ return OpenAI(**kwargs)
+
+
+def endpoint_label() -> str:
+ return base_url() or "OpenAI API"
+
+
+def chat(client, model_name: str, prompt: str, max_tokens: int = 256) -> str:
+ """Single-prompt completion; returns the assistant message text."""
+ resp = client.chat.completions.create(
+ model=model_name,
+ max_tokens=max_tokens,
+ messages=[{"role": "user", "content": prompt}],
+ )
+ return (resp.choices[0].message.content or "").strip()
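Review note: the environment-variable rules in `ingest/llm.py` can be summarised as one pure function. This is a sketch for reasoning about the defaults, not part of the patch; `resolve_llm_config` is a hypothetical helper, and it takes a plain mapping instead of reading `os.environ` so it can be exercised without touching the real environment.

```python
from typing import Mapping, Optional

DEFAULT_MODEL = "gpt-4o-mini"


def resolve_llm_config(env: Mapping[str, str]) -> dict:
    """Mirror the lookup order described in the module docstring."""
    base_url = env.get("LLM_API_BASE_URL", "").strip()
    api_key = env.get("LLM_API_KEY", "").strip() or env.get("OPENAI_API_KEY", "").strip()

    flag = env.get("LLM_VALIDATE", "").strip().lower()
    if flag == "false":
        validate = False
    elif flag == "true":
        validate = True
    else:
        # Default: validate only when there is an endpoint or a key to use.
        validate = bool(api_key or base_url)

    # A custom endpoint without a key gets a dummy value; the hosted API gets none.
    effective_key: Optional[str] = api_key or ("not-needed" if base_url else None)
    return {
        "model": env.get("LLM_MODEL", "").strip() or DEFAULT_MODEL,
        "endpoint": base_url or "OpenAI API",
        "api_key": effective_key,
        "validate": validate,
    }
```

For example, setting only `LLM_API_BASE_URL=http://localhost:11434/v1` and `LLM_MODEL=qwen2.5` yields a local configuration with validation enabled and a placeholder key.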
diff --git a/ingest/message.py b/ingest/message.py
index edde67d..0db1df7 100644
--- a/ingest/message.py
+++ b/ingest/message.py
@@ -1,6 +1,7 @@
"""The normalized, source-agnostic message shape shared by the pipeline."""
from dataclasses import dataclass
+from typing import Optional
@dataclass
@@ -24,6 +25,12 @@ class NormalizedMessage:
("you"). Drives the user/assistant role assignment downstream.
text: The plain-text message body (already extracted/cleaned by the
adapter). Adapters should only emit messages with non-empty text.
+ message_id: Source-stable id for this message, used to resolve reply
+ links. ``None`` if the source has no message ids.
+ reply_to_id: ``message_id`` of the message this one replies to, or
+ ``None``. Lets the pipeline thread replies instead of relying on
+ time order alone. Adapters that lack reply data leave both ``None``,
+ and grouping falls back to its time-based behaviour.
"""
chat_id: str
@@ -31,3 +38,5 @@ class NormalizedMessage:
sender_id: str
sender_is_self: bool
text: str
+ message_id: Optional[str] = None
+ reply_to_id: Optional[str] = None
diff --git a/ingest/redaction/__init__.py b/ingest/redaction/__init__.py
new file mode 100644
index 0000000..a1f93d7
--- /dev/null
+++ b/ingest/redaction/__init__.py
@@ -0,0 +1,170 @@
+"""Regex-based sensitive-data detection.
+
+A *detector* is a named, locale-tagged regex (optionally backed by a checksum
+validator) that flags one category of sensitive data — an email, a credit card,
+a Singapore NRIC, etc. Detectors register themselves at import time via
+:func:`register`, exactly like source adapters do, so adding coverage for a new
+country is a single drop-in module under ``ingest/redaction/`` — no changes to
+the scanner or the pipeline.
+
+Detection is **non-destructive**: :func:`scan_text` and :func:`scan_samples`
+only *report* matches (as :class:`Finding` objects). Whether to redact is the
+user's decision, taken later against the audit report.
+
+Want to add your country? Copy ``sg.py``, swap in your locale's patterns +
+checksum validators, and register them. See ``CONTRIBUTING`` notes in ``sg.py``.
+"""
+
+import re
+from dataclasses import dataclass
+from typing import Callable, Dict, Iterable, List, Optional, Pattern
+
+UNIVERSAL = "universal" # locale tag for patterns that are the same worldwide
+
+
+@dataclass(frozen=True)
+class Detector:
+ """One category of sensitive data and how to recognise it.
+
+ Attributes:
+ name: Unique id, e.g. ``"sg_nric"`` or ``"email"``.
+ category: Human-facing label shown in reports, e.g. ``"NRIC"``.
+ locale: ``"universal"`` or an ISO 3166-1 alpha-2 code (``"SG"``).
+ pattern: Compiled regex. Every full match is a candidate.
+ severity: ``"low" | "medium" | "high"`` — drives the suggested action.
+ validator: Optional extra check on the matched string (e.g. Luhn,
+ NRIC checksum). A candidate is only flagged if it returns True.
+ This is what turns a noisy regex into a high-precision detector.
+ """
+
+ name: str
+ category: str
+ locale: str
+ pattern: Pattern
+ severity: str = "medium"
+ validator: Optional[Callable[[str], bool]] = None
+
+
+@dataclass(frozen=True)
+class Finding:
+ """A single detected span of sensitive data within one text."""
+
+ detector: str
+ category: str
+ locale: str
+ severity: str
+ start: int
+ end: int
+ value: str
+ preview: str # masked, safe to print/log
+
+
+_REGISTRY: "List[Detector]" = []
+
+
+def register(detector: Detector) -> Detector:
+ """Register a detector. Duplicate ``name`` is a programming error."""
+ if any(d.name == detector.name for d in _REGISTRY):
+ raise ValueError(f"Duplicate detector name: {detector.name!r}")
+ _REGISTRY.append(detector)
+ return detector
+
+
+def make(
+ name: str,
+ category: str,
+ locale: str,
+ regex: str,
+ *,
+ severity: str = "medium",
+ flags: int = 0,
+ validator: Optional[Callable[[str], bool]] = None,
+) -> Detector:
+ """Compile a regex and register it as a detector in one call."""
+ return register(
+ Detector(
+ name=name,
+ category=category,
+ locale=locale,
+ pattern=re.compile(regex, flags),
+ severity=severity,
+ validator=validator,
+ )
+ )
+
+
+def available_locales() -> "List[str]":
+ return sorted({d.locale for d in _REGISTRY})
+
+
+def iter_detectors(locales: Optional[Iterable[str]] = None) -> "List[Detector]":
+ """Detectors for the given locales. ``None`` means all.
+
+ ``UNIVERSAL`` detectors are always included — email/card/IP look the same
+ everywhere, so they run regardless of which country was selected.
+ """
+ if locales is None:
+ return list(_REGISTRY)
+ wanted = {UNIVERSAL, *locales}
+ return [d for d in _REGISTRY if d.locale in wanted]
+
+
+def mask(value: str) -> str:
+ """Mask a value for safe display in a report (keep shape, hide content)."""
+ if "@" in value: # email: keep first char + domain
+ local, _, domain = value.partition("@")
+ head = local[0] if local else ""
+ return f"{head}***@{domain}"
+ stripped = value.strip()
+ if len(stripped) <= 4:
+ return "*" * len(stripped)
+ return f"{stripped[:2]}{'*' * (len(stripped) - 3)}{stripped[-1]}"
+
+
+def scan_text(text: str, locales: Optional[Iterable[str]] = None) -> "List[Finding]":
+ """Return all sensitive-data findings in ``text`` (non-destructive)."""
+ findings: List[Finding] = []
+ for det in iter_detectors(locales):
+ # A detector may match surrounding context but expose only the sensitive
+ # span via a named ``id`` group (e.g. require "NRIC" before the number,
+ # but report just the number). Otherwise the whole match is the value.
+ report_id = "id" in det.pattern.groupindex
+ for m in det.pattern.finditer(text):
+ value = m.group("id") if report_id else m.group()
+ if det.validator and not det.validator(value):
+ continue
+ start, end = m.span("id") if report_id else m.span()
+ findings.append(
+ Finding(
+ detector=det.name,
+ category=det.category,
+ locale=det.locale,
+ severity=det.severity,
+ start=start,
+ end=end,
+ value=value,
+ preview=mask(value),
+ )
+ )
+ return findings
+
+
+def luhn_valid(number: str) -> bool:
+ """Luhn checksum — filters most non-card digit runs (phone/IDs/etc)."""
+ digits = [int(c) for c in number if c.isdigit()]
+ if len(digits) < 13 or len(digits) > 19:
+ return False
+ total = 0
+ for i, d in enumerate(reversed(digits)):
+ if i % 2 == 1:
+ d *= 2
+ if d > 9:
+ d -= 9
+ total += d
+ return total % 10 == 0
+
+
+# Importing the package registers the bundled detectors. Add a new locale module
+# here (and as a file) and its detectors light up everywhere automatically.
+from ingest.redaction import universal as _universal # noqa: E402,F401
+from ingest.redaction import sg as _sg # noqa: E402,F401
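Review note: `scan_text` lets a detector match surrounding context while reporting only a named `id` group. The sketch below shows that mechanism on its own. The pattern is a made-up example (a keyword `ref` followed by four digits), not one of the bundled detectors, and `scan` is a simplified stand-in that returns plain tuples instead of `Finding` objects.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical detector: require the keyword "ref" nearby, report only the digits.
REF_PATTERN = re.compile(r"\bref\D{0,3}(?P<id>\d{4})\b", re.IGNORECASE)


def scan(text: str, pattern: Pattern) -> List[Tuple[int, int, str]]:
    """Return (start, end, value) for each match, preferring the ``id`` group."""
    report_id = "id" in pattern.groupindex
    found = []
    for m in pattern.finditer(text):
        value = m.group("id") if report_id else m.group()
        start, end = m.span("id") if report_id else m.span()
        found.append((start, end, value))
    return found


if __name__ == "__main__":
    sample = "Block 1234, ref: 5678"
    for start, end, value in scan(sample, REF_PATTERN):
        print(start, end, value, sample[start:end])
```

The bare `1234` is ignored because no keyword precedes it, and the reported offsets cover only the digits, so a later replace step would leave the keyword in place.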
diff --git a/ingest/redaction/sg.py b/ingest/redaction/sg.py
new file mode 100644
index 0000000..7b57cdb
--- /dev/null
+++ b/ingest/redaction/sg.py
@@ -0,0 +1,92 @@
+"""Singapore (SG) sensitive-data detectors.
+
+This is the reference locale module — copy it to add your own country.
+
+A good locale detector is *precise*: a bare regex over chat text fires on
+everything, so back it with a checksum/validator wherever the identifier has one
+(see :func:`nric_valid`). High precision is what keeps the audit report
+trustworthy instead of a wall of false positives.
+
+CONTRIBUTING
+------------
+Add a country by creating ``ingest/redaction/<cc>.py`` (``cc`` = ISO 3166-1
+alpha-2, lower-case), registering detectors with :func:`ingest.redaction.make`
+and ``locale="<CC>"``, then importing it in ``ingest/redaction/__init__``.
+Open items for SG that make good first contributions:
+ * NRIC **M-series** (introduced 2022) uses a different checksum table — the
+ regex below intentionally matches only S/T/F/G so it never flags an
+ M-series number it can't verify. Add the M table + tests.
+ * UEN (business registration number) detector.
+"""
+
+import re
+
+from ingest.redaction import make
+
+# NRIC/FIN checksum tables, indexed by (weighted_sum + offset) % 11.
+_ST_SUFFIX = "JZIHGFEDCBA" # S and T: citizens and permanent residents (T for births from 2000)
+_FG_SUFFIX = "XWUTRQPNMLK" # F and G: foreign pass holders (G for issuance from 2000)
+_WEIGHTS = (2, 7, 6, 5, 4, 3, 2)
+
+
+def nric_valid(value: str) -> bool:
+ """Validate a Singapore NRIC/FIN by its check digit (S/T/F/G series)."""
+ value = value.strip().upper()
+ if len(value) != 9:
+ return False
+ prefix, digits, suffix = value[0], value[1:8], value[8]
+ if prefix not in "STFG" or not digits.isdigit():
+ return False
+ total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS))
+ if prefix in "TG": # T and G shift the weighted sum by 4
+ total += 4
+ table = _ST_SUFFIX if prefix in "ST" else _FG_SUFFIX
+ return table[total % 11] == suffix
+
+
+# Long form: full S/T/F/G + 7 digits + check letter, verified by checksum.
+# Case-insensitive so "s1234567a" typed in lower-case is still caught (the
+# validator upper-cases before checking).
+make(
+ "sg_nric",
+ "NRIC/FIN",
+ "SG",
+ r"\b[STFG]\d{7}[A-Z]\b",
+ severity="high",
+ flags=re.IGNORECASE,
+ validator=nric_valid,
+)
+
+# Short form: the last 3 digits + check letter (e.g. "123A"), the way people
+# quote "the last 4 of my IC". It has no self-contained checksum and "123A"
+# alone matches every block/unit number, so precision comes from REQUIRING an
+# NRIC/IC/FIN keyword just before it. Only the ID span (named group) is
+# reported, not the keyword.
+make(
+ "sg_nric_short",
+ "NRIC/FIN (partial)",
+ "SG",
+ r"(?:nric|fin|\bic\b)\D{0,8}?(?P(?" or "S(code)" context to keep precision up.
+make(
+ "sg_postal",
+ "POSTAL_CODE",
+ "SG",
+ r"(?:[Ss]ingapore\s+|\bS\()\d{6}\)?",
+ severity="low",
+)
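Review note: a worked example of the NRIC/FIN check-letter arithmetic used by `nric_valid`. The tables, weights and the offset of 4 for T/G are copied from the patch; `nric_check_letter` is a hypothetical helper added here only to make the computation visible, and the sample numbers are synthetic.

```python
_ST_SUFFIX = "JZIHGFEDCBA"
_FG_SUFFIX = "XWUTRQPNMLK"
_WEIGHTS = (2, 7, 6, 5, 4, 3, 2)


def nric_check_letter(prefix: str, digits: str) -> str:
    """Compute the expected check letter for an S/T/F/G prefix and 7 digits."""
    if len(prefix) != 1 or prefix not in "STFG":
        raise ValueError("prefix must be one of S, T, F, G")
    if len(digits) != 7 or not digits.isdigit():
        raise ValueError("digits must be exactly 7 decimal digits")
    total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS))
    if prefix in "TG":  # T and G shift the weighted sum by 4
        total += 4
    table = _ST_SUFFIX if prefix in "ST" else _FG_SUFFIX
    return table[total % 11]


if __name__ == "__main__":
    # 1*2 + 2*7 + 3*6 + 4*5 + 5*4 + 6*3 + 7*2 = 106; 106 % 11 = 7; table[7] = "D"
    print("S1234567" + nric_check_letter("S", "1234567"))
```

With the same digits, the T prefix adds 4 to give 110, which is 0 modulo 11 and selects the first letter of the table.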
diff --git a/ingest/redaction/universal.py b/ingest/redaction/universal.py
new file mode 100644
index 0000000..7db1444
--- /dev/null
+++ b/ingest/redaction/universal.py
@@ -0,0 +1,76 @@
+"""Locale-independent detectors: same format the world over.
+
+Email, payment cards, IP/MAC addresses, and vendor API keys don't change by
+country, so they live here and always run. Country-specific catches (national
+IDs, local phone formats, postal codes) belong in a per-locale module instead.
+"""
+
+import re
+
+from ingest.redaction import UNIVERSAL, luhn_valid, make
+
+# --- Contact / network -------------------------------------------------------
+
+make(
+ "email",
+ "EMAIL",
+ UNIVERSAL,
+ r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
+ severity="medium",
+)
+
+make(
+ "ipv4",
+ "IP_ADDRESS",
+ UNIVERSAL,
+ r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b",
+ severity="low",
+)
+
+make(
+ "ipv6",
+ "IP_ADDRESS",
+ UNIVERSAL,
+ r"\b(?:[A-Fa-f0-9]{1,4}:){2,7}[A-Fa-f0-9]{1,4}\b",
+ severity="low",
+)
+
+make(
+ "mac",
+ "MAC_ADDRESS",
+ UNIVERSAL,
+ r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b",
+ severity="low",
+)
+
+# --- Financial ---------------------------------------------------------------
+
+# Broad 13–19 digit run (optionally space/dash grouped); Luhn rejects the noise.
+make(
+ "credit_card",
+ "CARD_NUMBER",
+ UNIVERSAL,
+ r"\b(?:\d[ -]?){13,19}\b",
+ severity="high",
+ validator=luhn_valid,
+)
+
+# --- Secrets / credentials ---------------------------------------------------
+
+make("openai_key", "API_KEY", UNIVERSAL, r"\bsk-[A-Za-z0-9]{20,}\b", severity="high")
+make("aws_access_key", "API_KEY", UNIVERSAL, r"\bAKIA[0-9A-Z]{16}\b", severity="high")
+make(
+ "github_token",
+ "API_KEY",
+ UNIVERSAL,
+ r"\bgh[pousr]_[A-Za-z0-9]{36,}\b",
+ severity="high",
+)
+make(
+ "private_key_block",
+ "PRIVATE_KEY",
+ UNIVERSAL,
+ r"-----BEGIN (?:RSA |EC |OPENSSH |DSA |PGP )?PRIVATE KEY-----",
+ severity="high",
+ flags=re.IGNORECASE,
+)
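Review note: the `credit_card` detector pairs a deliberately broad regex with a Luhn check. The sketch below reproduces that pairing in a self-contained form to show how the validator discards ordinary digit runs. The regex and the checksum follow the patch; `find_cards` is a hypothetical helper, and the card number is the widely used `4111 1111 1111 1111` test value.

```python
import re
from typing import List

CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits only; separators are ignored."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> List[str]:
    """Return candidate digit runs that also pass the Luhn check."""
    return [
        m.group().strip()
        for m in CARD_CANDIDATE.finditer(text)
        if luhn_valid(m.group())
    ]


if __name__ == "__main__":
    print(find_cards("order 1234567890123456, card 4111 1111 1111 1111."))
```

Both digit runs match the regex, but only the second passes the checksum, which is the precision gain the validator is meant to provide.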
diff --git a/ingest/redactor.py b/ingest/redactor.py
new file mode 100644
index 0000000..39786fa
--- /dev/null
+++ b/ingest/redactor.py
@@ -0,0 +1,256 @@
+"""Non-destructive sensitive-data audit over conversation samples.
+
+Runs the regex detectors in :mod:`ingest.redaction` across every turn, writes an
+audit report, and prints a warning summary. By default **nothing is changed** —
+the user reviews the report and decides whether to act. Acting is opt-in via
+:func:`apply` (wired to the CLI's ``--redact`` flag):
+
+ - "replace": swap each detected span for a ``[CATEGORY]`` placeholder, keeping
+ conversational structure intact for training.
+ - "drop": discard any conversation that contains a detected item.
+
+Detection is regex-based and locale-aware (Singapore-first); see
+``ingest/redaction`` to add coverage for more countries.
+"""
+
+import json
+import re
+from collections import defaultdict
+from typing import Iterable, List, Optional
+
+from ingest import redaction
+
+DEFAULT_LOCALES = ["SG"] # universal detectors always run in addition to these
+
+
+def scan_samples(samples, locales: Optional[Iterable[str]] = None) -> dict:
+ """Scan every turn and return an audit report (no mutation)."""
+ if locales is None:
+ locales = DEFAULT_LOCALES
+
+ findings = []
+ for ci, turns in enumerate(samples):
+ for ti, turn in enumerate(turns):
+ for f in redaction.scan_text(turn.get("text", ""), locales):
+ findings.append({
+ "conversation": ci,
+ "turn": ti,
+ "role": turn.get("role"),
+ "category": f.category,
+ "detector": f.detector,
+ "severity": f.severity,
+ "preview": f.preview,
+ })
+
+ summary = {}
+ convs_per_cat = defaultdict(set)
+ for f in findings:
+ s = summary.setdefault(
+ f["category"], {"hits": 0, "conversations": 0, "severity": f["severity"]}
+ )
+ s["hits"] += 1
+ convs_per_cat[f["category"]].add(f["conversation"])
+ for cat, s in summary.items():
+ s["conversations"] = len(convs_per_cat[cat])
+
+ return {
+ "conversations_scanned": len(samples),
+ "total_findings": len(findings),
+ "locales": list(locales),
+ "summary": summary,
+ "findings": findings,
+ }
+
+
+def write_report(report: dict, path: str) -> None:
+ with open(path, "w", encoding="utf-8") as f:
+ json.dump(report, f, indent=2, ensure_ascii=False)
+
+
+def print_summary(report: dict, report_path: str, mode: str = "off") -> None:
+ n = report["total_findings"]
+ if n == 0:
+ print("[redactor] No sensitive data detected by regex scan.")
+ return
+ print(
+ f"[redactor] WARNING: {n} potential sensitive item(s) detected across "
+ f"{report['conversations_scanned']} conversations:"
+ )
+ for cat, s in sorted(report["summary"].items(), key=lambda kv: -kv[1]["hits"]):
+ print(
+ f" {cat:22s} {s['hits']:4d} hit(s) in {s['conversations']:3d} "
+ f"conversation(s) [{s['severity']}]"
+ )
+ print(f"[redactor] Full report: {report_path}")
+ if mode == "off":
+ print(
+ "[redactor] Nothing was removed. Review it, then re-run with "
+ "--redact replace (placeholder) or --redact drop (remove conversations)."
+ )
+
+
+def _replace_spans(text: str, spans) -> str:
+ """Replace ``(start, end, category)`` spans with ``[CATEGORY]`` placeholders.
+
+ When spans overlap, keeps the one that starts furthest right and drops the
+ others, then replaces right-to-left so each replacement leaves earlier offsets valid.
+ """
+ chosen = []
+ boundary = len(text) + 1 # left edge of the span accepted to our right
+ for start, end, cat in sorted(set(spans), key=lambda s: -s[0]):
+ if end <= boundary:
+ chosen.append((start, end, cat))
+ boundary = start
+ for start, end, cat in chosen: # already right-to-left
+ text = text[:start] + f"[{cat}]" + text[end:]
+ return text
+
+
+def apply(samples, mode: str, locales: Optional[Iterable[str]] = None,
+ llm_findings: Optional[List[dict]] = None) -> List:
+ """Return samples with detected data handled per ``mode``.
+
+ ``mode`` is "replace" (swap spans for ``[CATEGORY]``) or "drop" (remove any
+ conversation containing a detection). Regex spans are re-derived per turn;
+ optional ``llm_findings`` (which carry their own offsets) are applied too.
+ """
+ if locales is None:
+ locales = DEFAULT_LOCALES
+
+ llm_by_turn = defaultdict(list)
+ for f in llm_findings or []:
+ llm_by_turn[(f["conversation"], f["turn"])].append(
+ (f["start"], f["end"], f["category"])
+ )
+
+ out = []
+ for ci, turns in enumerate(samples):
+ new_turns = []
+ drop = False
+ for ti, turn in enumerate(turns):
+ text = turn.get("text", "")
+ spans = [(f.start, f.end, f.category) for f in redaction.scan_text(text, locales)]
+ spans += llm_by_turn.get((ci, ti), [])
+ if not spans:
+ new_turns.append(turn)
+ continue
+ if mode == "drop":
+ drop = True
+ break
+ replaced = dict(turn)
+ replaced["text"] = _replace_spans(text, spans)
+ new_turns.append(replaced)
+ if not drop:
+ out.append(new_turns)
+ return out
+
+
+# --- Optional LLM detector (Tier 3) ------------------------------------------
+#
+# Regex can't catch names or context-dependent secrets. When enabled, the LLM
+# reads each conversation and points at sensitive spans *verbatim* (it never
+# rewrites the text — that stays the user's decision). Findings flow into the
+# same report and the same apply() step as the regex tier. The client/endpoint
+# plumbing (incl. LLM_API_BASE_URL for local servers) is shared with the quality
+# validator via ingest.llm.
+
+_LLM_PROMPT = """You are a privacy auditor. Identify spans of SENSITIVE or
+PERSONALLY IDENTIFYING information in the conversation below: real people's
+names, contact details, addresses, financial or government IDs, credentials,
+or health/legal/financial specifics that could identify someone.
+
+Each turn is numbered "[i] ROLE: text". Do NOT rewrite anything. For each
+finding, copy the offending substring EXACTLY as it appears so it can be located.
+
+Respond with ONLY this JSON:
+{{"findings": [{{"turn": <turn index>, "text": "<exact substring>", "category": "<CATEGORY>", "severity": "low|medium|high"}}]}}
+
+Conversation:
+{conversation}"""
+
+
+def _format_conversation(turns) -> str:
+ return "\n".join(
+ f"[{i}] {t.get('role', '?').upper()}: {t.get('text', '').strip()}"
+ for i, t in enumerate(turns)
+ )
+
+
+def _llm_audit_conversation(client, model, turns) -> List[dict]:
+ from ingest import llm
+
+ prompt = _LLM_PROMPT.format(conversation=_format_conversation(turns))
+ raw = llm.chat(client, model, prompt, max_tokens=512)
+ match = re.search(r"\{.*\}", raw, re.DOTALL)
+ if not match:
+ raise ValueError(f"No JSON object in LLM response: {raw!r}")
+ return json.loads(match.group()).get("findings", [])
+
+
+def llm_scan_samples(samples, client, model) -> List[dict]:
+ """LLM pass returning verbatim-located findings (with offsets, in memory).
+
+ Each finding is verified by locating the model's span in the turn text; a
+ paraphrased span that can't be found is reported as a soft-miss and skipped
+ rather than trusting an offset we can't confirm.
+ """
+ findings = []
+ for ci, turns in enumerate(samples):
+ try:
+ raw = _llm_audit_conversation(client, model, turns)
+ except Exception as e:
+ print(f"[redactor] LLM scan failed on conversation {ci}: {e}")
+ continue
+ for rf in raw:
+ try:
+ ti = int(rf["turn"])
+ span = str(rf["text"])
+ except (KeyError, ValueError, TypeError):
+ continue
+ if not (0 <= ti < len(turns)) or not span:
+ continue
+ text = turns[ti].get("text", "")
+ idx = text.find(span)
+ if idx < 0:
+ print(f"[redactor] LLM span not found verbatim (conv {ci}, turn {ti}): {span!r}")
+ continue
+ findings.append({
+ "conversation": ci,
+ "turn": ti,
+ "role": turns[ti].get("role"),
+ "category": str(rf.get("category", "PII")),
+ "detector": "llm",
+ "severity": str(rf.get("severity", "medium")),
+ "start": idx,
+ "end": idx + len(span),
+ "preview": redaction.mask(span),
+ })
+ return findings
+
+
+def merge_llm_findings(report: dict, llm_findings: List[dict]) -> dict:
+ """Fold LLM findings into a regex report (masked previews only; no raw spans)."""
+ for f in llm_findings:
+ report["findings"].append({
+ "conversation": f["conversation"],
+ "turn": f["turn"],
+ "role": f["role"],
+ "category": f["category"],
+ "detector": "llm",
+ "severity": f["severity"],
+ "preview": f["preview"],
+ })
+ convs_per_cat = defaultdict(set)
+ for f in report["findings"]:
+ convs_per_cat[f["category"]].add(f["conversation"])
+ summary = {}
+ for f in report["findings"]:
+ s = summary.setdefault(
+ f["category"], {"hits": 0, "conversations": 0, "severity": f["severity"]}
+ )
+ s["hits"] += 1
+ for cat, s in summary.items():
+ s["conversations"] = len(convs_per_cat[cat])
+ report["summary"] = summary
+ report["total_findings"] = len(report["findings"])
+ return report
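Review note: the right-to-left replacement in `_replace_spans` is easiest to follow on a concrete string. The function below follows the logic in the patch; the sample text, the `HOST` category and the offsets are made up for illustration.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end, category)


def replace_spans(text: str, spans: Iterable[Span]) -> str:
    """Swap spans for [CATEGORY] placeholders, working from the right."""
    chosen = []
    boundary = len(text) + 1  # left edge of the span accepted to our right
    for start, end, cat in sorted(set(spans), key=lambda s: -s[0]):
        if end <= boundary:  # does not overlap anything already accepted
            chosen.append((start, end, cat))
            boundary = start
    for start, end, cat in chosen:  # already ordered right-to-left
        text = text[:start] + f"[{cat}]" + text[end:]
    return text


if __name__ == "__main__":
    text = "mail bob@x.io or call 91234567"
    email = (text.find("bob@x.io"), text.find("bob@x.io") + len("bob@x.io"), "EMAIL")
    phone = (text.find("91234567"), len(text), "PHONE")
    print(replace_spans(text, [email, phone]))
```

Because the span that starts furthest right wins an overlap, a narrower span nested at the end of a wider one displaces it. In the sketch, adding a `HOST` span over `x.io` leaves `bob@` unredacted, which may be worth a test case in `tests/test_redaction.py`.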
diff --git a/ingest/validator.py b/ingest/validator.py
index b702ddd..7a684c2 100644
--- a/ingest/validator.py
+++ b/ingest/validator.py
@@ -1,149 +1,178 @@
"""
-Optional LLM-based conversation quality validator.
+Optional LLM-based conversation auditor.
-Controlled via environment variables:
- DIALOGSMITH_LLM_VALIDATE=true/false (default: true if ANTHROPIC_API_KEY is set)
- DIALOGSMITH_LLM_MODEL=... (default: claude-haiku-4-5-20251001)
- ANTHROPIC_API_KEY=...
+Uses the shared OpenAI-compatible client (see :mod:`ingest.llm`), so it runs
+against OpenAI or any local server. Controlled by the ``LLM_*`` environment
+variables documented there (``LLM_VALIDATE``, ``LLM_API_BASE_URL``, ``LLM_MODEL``,
+``LLM_API_KEY``).
-Each conversation sample is scored on two axes:
+Each conversation sample is audited on three axes:
- coherence: does this read as a natural, continuous conversation?
- quality: is this a meaningful exchange worth training on?
+ - pairing: does each assistant turn actually respond to what came before?
-Samples that fail either check are excluded from the output.
-A summary of filtered samples is printed so the user can audit decisions.
+Because the heuristic grouper can over-merge, the auditor may also *repair* a
+sample by proposing split points rather than only keeping or dropping it:
+ - action "keep": use as-is
+ - action "split": cut after the given turn indices into independent samples
+ - action "drop": discard entirely
+
+A summary of every decision is printed so the user can audit the auditor.
"""
import json
-import os
import re
-VALIDATE_ENV = "DIALOGSMITH_LLM_VALIDATE"
-MODEL_ENV = "DIALOGSMITH_LLM_MODEL"
-DEFAULT_MODEL = "claude-haiku-4-5-20251001"
-
-COHERENCE_THRESHOLD = 0.5 # 0–1, below this the conversation is considered incoherent
-QUALITY_THRESHOLD = 0.5 # 0–1, below this the sample is considered low-quality
-
-
-def _should_validate():
-    val = os.environ.get(VALIDATE_ENV, "").strip().lower()
-    if val == "false":
-        return False
-    if val == "true":
-        return True
-    # Default: enable if API key is present
-    return bool(os.environ.get("ANTHROPIC_API_KEY", "").strip())
+from ingest import llm
-
-def _get_client():
-    try:
-        import anthropic
-    except ImportError:
-        raise ImportError(
-            "The 'anthropic' package is required for LLM validation. "
-            "Install it with: pip install anthropic"
-        )
-    api_key = os.environ.get("ANTHROPIC_API_KEY", "").strip()
-    if not api_key:
-        raise EnvironmentError(
-            "ANTHROPIC_API_KEY is not set. "
-            f"Set {VALIDATE_ENV}=false to disable validation."
-        )
-    return anthropic.Anthropic(api_key=api_key)
+COHERENCE_THRESHOLD = 0.5 # below this the conversation is considered incoherent
+QUALITY_THRESHOLD = 0.5 # below this the sample is considered low-quality
+PAIRING_THRESHOLD = 0.5 # below this the turns don't respond to each other
 def _format_conversation(turns):
+    """Number every turn so the model can reference split points by index."""
     lines = []
-    for turn in turns:
+    for i, turn in enumerate(turns):
         role = turn.get("role", "unknown").upper()
         text = turn.get("text", "").strip()
-        lines.append(f"{role}: {text}")
+        lines.append(f"[{i}] {role}: {text}")
     return "\n".join(lines)
 def _score_sample(client, model, turns):
-    """
-    Ask the LLM to score a conversation sample.
-    Returns (coherence: float, quality: float, reason: str).
+    """Ask the LLM to audit a conversation sample.
+
+    Returns a dict: coherence, quality, pairing (floats), action
+    ("keep"|"split"|"drop"), split_after (list[int]), reason (str).
     """
     conversation_text = _format_conversation(turns)
-    prompt = f"""You are evaluating a conversation sample for use in fine-tuning a language model.
+    prompt = f"""You are auditing a conversation sample for fine-tuning a language model
+to imitate the ASSISTANT speaker. The conversation was segmented by a heuristic
+that can wrongly merge unrelated exchanges, so judge it carefully.
-Rate the following conversation on two dimensions, each from 0.0 to 1.0:
+Each turn is numbered like "[i] ROLE: text".
-1. coherence: Does this read as a natural, continuous conversation where each message follows logically from the previous? (0 = completely disjointed, 1 = perfectly coherent)
-2. quality: Is this a meaningful, substantive exchange worth training on? Penalise one-word replies, pure greetings, or exchanges with no informational content. (0 = worthless, 1 = highly valuable)
+Rate from 0.0 to 1.0:
+1. coherence: does this read as one natural, continuous conversation?
+2. quality: is this a meaningful exchange worth training on? Penalise pure
+ greetings, one-word replies, and content-free chatter.
+3. pairing: does each ASSISTANT turn actually respond to the USER turn(s) before
+ it? (0 = replies are mismatched/non-sequiturs, 1 = every reply clearly fits)
-Respond with ONLY a JSON object in this exact format:
-{{"coherence": , "quality": , "reason": ""}}
+Then choose an action:
+- "keep": the sample is good as one conversation.
+- "split": it is really two or more separate conversations. Give "split_after"
+ as the list of turn indices AFTER which to cut (e.g. [3] cuts between turn 3
+ and 4).
+- "drop": it is not usable.
+
+Respond with ONLY this JSON:
+{{"coherence": <float>, "quality": <float>, "pairing": <float>,
+ "action": "keep"|"split"|"drop", "split_after": [...], "reason": "<short reason>"}}
Conversation:
{conversation_text}"""
-    response = client.messages.create(
-        model=model,
-        max_tokens=128,
-        messages=[{"role": "user", "content": prompt}],
-    )
-
-    raw = response.content[0].text.strip()
-    # The model may wrap the JSON in markdown fences or prose; extract the object.
+    raw = llm.chat(client, model, prompt, max_tokens=200)
     match = re.search(r"\{.*\}", raw, re.DOTALL)
     if not match:
         raise ValueError(f"No JSON object found in LLM response: {raw!r}")
     result = json.loads(match.group())
-    return float(result["coherence"]), float(result["quality"]), result.get("reason", "")
+    return {
+        "coherence": float(result["coherence"]),
+        "quality": float(result["quality"]),
+        "pairing": float(result.get("pairing", 1.0)),
+        "action": str(result.get("action", "keep")).lower(),
+        "split_after": [int(i) for i in result.get("split_after", []) or []],
+        "reason": result.get("reason", ""),
+    }
+
+
+def _apply_split(turns, split_after):
+    """Cut ``turns`` after each given index into independent samples."""
+    cuts = sorted({i for i in split_after if 0 <= i < len(turns) - 1})
+    if not cuts:
+        return [turns]
+    pieces, start = [], 0
+    for idx in cuts:
+        pieces.append(turns[start:idx + 1])
+        start = idx + 1
+    pieces.append(turns[start:])
+    return [p for p in pieces if p]
+
+
+def _has_both_roles(turns):
+    roles = {t.get("role") for t in turns}
+    return "user" in roles and "assistant" in roles
 def validate_samples(samples):
     """
-    Validate a list of conversation samples.
+    Audit a list of conversation samples.
     Each sample is a list of {"role": ..., "text": ...} dicts (as produced by
     ingest.core.build_samples).
-    Returns filtered list of samples that pass validation.
-    If validation is disabled or unavailable, returns all samples unchanged.
+    Returns the filtered/repaired list of samples. If validation is disabled or
+    unavailable, returns all samples unchanged.
     """
-    if not _should_validate():
+    if not llm.should_validate():
         print("[validator] LLM validation disabled — skipping.")
         return samples
     try:
-        client = _get_client()
+        client = llm.get_client()
     except (ImportError, EnvironmentError) as e:
         print(f"[validator] WARNING: {e}")
         print("[validator] Skipping LLM validation and returning all samples.")
         return samples
-    model = os.environ.get(MODEL_ENV, DEFAULT_MODEL).strip()
-    print(f"[validator] Running LLM validation with model: {model}")
+    model = llm.model()
+    print(f"[validator] Auditing with model: {model} via {llm.endpoint_label()}")
     passed = []
     filtered = []
+    split_count = 0
     for i, turns in enumerate(samples):
         try:
-            coherence, quality, reason = _score_sample(client, model, turns)
+            r = _score_sample(client, model, turns)
         except Exception as e:
             print(f"[validator] Sample {i}: scoring failed ({e}), keeping sample.")
             passed.append(turns)
             continue
-        if coherence < COHERENCE_THRESHOLD:
-            filtered.append((i, "incoherent", coherence, quality, reason))
-        elif quality < QUALITY_THRESHOLD:
-            filtered.append((i, "low-quality", coherence, quality, reason))
+        low = (
+            r["coherence"] < COHERENCE_THRESHOLD or
+            r["quality"] < QUALITY_THRESHOLD or
+            r["pairing"] < PAIRING_THRESHOLD
+        )
+        if r["action"] == "drop" or low:
+            filtered.append((i, "dropped", r))
+        elif r["action"] == "split":
+            pieces = [p for p in _apply_split(turns, r["split_after"]) if _has_both_roles(p)]
+            if pieces:
+                passed.extend(pieces)
+                split_count += 1
+            else:
+                filtered.append((i, "split-empty", r))
         else:
             passed.append(turns)
-    print(f"[validator] {len(passed)} passed, {len(filtered)} filtered out of {len(samples)} total.")
+    print(
+        f"[validator] {len(passed)} samples kept ({split_count} inputs split), "
+        f"{len(filtered)} dropped, from {len(samples)} input samples."
+    )
     if filtered:
-        print("[validator] Filtered samples:")
-        for idx, reason_type, coh, qual, reason in filtered:
-            print(f" sample {idx:4d} | {reason_type:12s} | coherence={coh:.2f} quality={qual:.2f} | {reason}")
+        print("[validator] Dropped samples:")
+        for idx, kind, r in filtered:
+            print(
+                f" sample {idx:4d} | {kind:11s} | "
+                f"coh={r['coherence']:.2f} qual={r['quality']:.2f} pair={r['pairing']:.2f} "
+                f"| {r['reason']}"
+            )
     return passed
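To make the keep/split/drop routing above concrete, here is a minimal standalone sketch. It mirrors the logic of `_apply_split`, `_has_both_roles`, and the decision branch in `validate_samples`, but the names `apply_split`, `has_both_roles`, and `route`, the single shared `threshold`, and the sample data are illustrative assumptions, not part of the patch.

```python
def apply_split(turns, split_after):
    """Cut turns after each valid index; indices at or past the last turn are ignored."""
    cuts = sorted({i for i in split_after if 0 <= i < len(turns) - 1})
    pieces, start = [], 0
    for idx in cuts:
        pieces.append(turns[start:idx + 1])
        start = idx + 1
    pieces.append(turns[start:])
    return [p for p in pieces if p]


def has_both_roles(turns):
    roles = {t.get("role") for t in turns}
    return "user" in roles and "assistant" in roles


def route(turns, audit, threshold=0.5):
    """Return the samples to keep for one audited conversation (hypothetical helper)."""
    low = any(audit[k] < threshold for k in ("coherence", "quality", "pairing"))
    if audit["action"] == "drop" or low:
        return []
    if audit["action"] == "split":
        # One-sided pieces cannot be trained on, so they are discarded.
        return [p for p in apply_split(turns, audit["split_after"]) if has_both_roles(p)]
    return [turns]


turns = [
    {"role": "user", "text": "lunch tomorrow?"},
    {"role": "assistant", "text": "sure, 12pm"},
    {"role": "user", "text": "did the build pass?"},
    {"role": "assistant", "text": "yes, all green"},
    {"role": "user", "text": "thanks"},
]
audit = {"coherence": 0.6, "quality": 0.7, "pairing": 0.8,
         "action": "split", "split_after": [1, 3]}
kept = route(turns, audit)
print([len(p) for p in kept])  # [2, 2]
```

The trailing "thanks" turn forms a user-only piece and is discarded. Note that the score check runs before the action check, so a sample with any score below its threshold is dropped even when the model proposes a split.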
diff --git a/requirements.txt b/requirements.txt
index 44d86bc..6d21e72 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,6 +6,7 @@ llamafactory==0.9.4
# (see https://pytorch.org/get-started/locally/). Installing llamafactory pulls
# a torch build, but it may not match your CUDA version.
-# Optional — only needed when LLM-based dataset validation is enabled
-# (DIALOGSMITH_LLM_VALIDATE / ANTHROPIC_API_KEY). Safe to remove otherwise.
-anthropic>=0.39
+# Optional — only needed for the LLM features (quality validation, LLM
+# redaction). Uses the OpenAI-compatible API, so it works with OpenAI or any
+# local server (Ollama, vLLM, LM Studio, ...). Safe to remove otherwise.
+openai>=1.0
diff --git a/setup.bat b/setup.bat
index 48449bf..60f78d8 100644
--- a/setup.bat
+++ b/setup.bat
@@ -19,8 +19,8 @@ if errorlevel 1 (echo Failed to install dependencies. & exit /b 1)
echo [3/4] Preparing .env...
if not exist ".env" (
- copy ".env.example" ".env" >nul
- echo Created .env from .env.example - edit it to enable optional LLM validation.
+ copy "example.env" ".env" >nul
+ echo Created .env from example.env - edit it to enable optional LLM features.
)
echo [4/4] Processing Telegram export (data\result.json -^> data\chat_sharegpt.json)...
diff --git a/setup.sh b/setup.sh
index 41ad5bd..e6afab3 100755
--- a/setup.sh
+++ b/setup.sh
@@ -20,8 +20,8 @@ echo "[2/4] Installing dependencies (this can take a while)..."
echo "[3/4] Preparing .env..."
if [ ! -f .env ]; then
- cp .env.example .env
- echo " Created .env from .env.example — edit it to enable optional LLM validation."
+ cp example.env .env
+ echo " Created .env from example.env — edit it to enable optional LLM features."
fi
echo "[4/4] Processing Telegram export (data/result.json -> data/chat_sharegpt.json)..."
diff --git a/tests/test_ingest.py b/tests/test_ingest.py
index 72b05fb..5fedbab 100644
--- a/tests/test_ingest.py
+++ b/tests/test_ingest.py
@@ -21,6 +21,7 @@
from ingest import core, sharegpt
from ingest.adapters import available_sources, get_adapter
from ingest.adapters.telegram import TelegramAdapter
+from ingest.message import NormalizedMessage
SELF = "Yu Sheng"
@@ -148,6 +149,57 @@ def test_gap_splits_conversations(self):
self.assertEqual(first_two, EXPECTED_SHAREGPT[:2])
+def _nm(chat, ts, sender, is_self, text, mid=None, reply=None):
+    return NormalizedMessage(
+        chat_id=chat, timestamp=ts, sender_id=sender, sender_is_self=is_self,
+        text=text, message_id=mid, reply_to_id=reply,
+    )
+
+
+class ReplyThreadingTest(unittest.TestCase):
+    def test_reply_stitches_gap_split_conversations(self):
+        # Two messages an hour+ apart would split into two conversations, but the
+        # second replies to the first -> they must end up in one sample.
+        msgs = [
+            _nm("c", 1000, "Alice", False, "you free this weekend?", mid="1"),
+            _nm("c", 1000 + 8000, "Yu", True, "yeah sun works", mid="2", reply="1"),
+        ]
+        samples = core.build_samples(msgs)
+        self.assertEqual(len(samples), 1)
+        self.assertEqual([t["role"] for t in samples[0]], ["user", "assistant"])
+
+    def test_no_reply_data_keeps_time_split(self):
+        # Same timing, no reply link -> still two conversations (one is one-sided
+        # and dropped), proving threading is a no-op without reply metadata.
+        msgs = [
+            _nm("c", 1000, "Alice", False, "you free this weekend?"),
+            _nm("c", 1000 + 8000, "Yu", True, "yeah sun works"),
+        ]
+        self.assertEqual(core.build_samples(msgs), [])
+
+
+class MultiSpeakerTest(unittest.TestCase):
+    def _group(self):
+        return [
+            _nm("g", 1, "Bob", False, "q1"),
+            _nm("g", 2, "Carol", False, "q2"),
+            _nm("g", 3, "Yu", True, "answer"),
+        ]
+
+    def test_default_collapses_other_side(self):
+        out = sharegpt.to_sharegpt(core.build_samples(self._group()))
+        self.assertEqual(out[0]["conversations"][0], {"from": "human", "value": "q1\nq2"})
+
+    def test_multi_speaker_labels_users_not_assistant(self):
+        out = sharegpt.to_sharegpt(core.build_samples(self._group(), multi_speaker=True))
+        convs = out[0]["conversations"]
+        # Distinct speakers stay distinct and are labelled...
+        self.assertEqual(convs[0], {"from": "human", "value": "Bob: q1"})
+        self.assertEqual(convs[1], {"from": "human", "value": "Carol: q2"})
+        # ...but the owner's (assistant) turn is never labelled.
+        self.assertEqual(convs[2], {"from": "gpt", "value": "answer"})
+
+
class ShareGptTest(unittest.TestCase):
def test_role_mapping_and_drop_one_sided(self):
samples = [
@@ -167,6 +219,28 @@ def test_jsonl_roundtrip(self):
self.assertEqual(sharegpt.load_jsonl_samples(p), samples)
+class ValidatorSplitTest(unittest.TestCase):
+    def test_apply_split_cuts_after_indices(self):
+        from ingest.validator import _apply_split
+        turns = [{"role": "user", "text": "a"}, {"role": "assistant", "text": "b"},
+                 {"role": "user", "text": "c"}, {"role": "assistant", "text": "d"}]
+        pieces = _apply_split(turns, [1])
+        self.assertEqual(len(pieces), 2)
+        self.assertEqual(pieces[0], turns[:2])
+        self.assertEqual(pieces[1], turns[2:])
+
+    def test_apply_split_ignores_out_of_range(self):
+        from ingest.validator import _apply_split
+        turns = [{"role": "user", "text": "a"}, {"role": "assistant", "text": "b"}]
+        # Index at/after the last turn is meaningless -> no split.
+        self.assertEqual(_apply_split(turns, [1, 9]), [turns])
+
+    def test_has_both_roles(self):
+        from ingest.validator import _has_both_roles
+        self.assertTrue(_has_both_roles([{"role": "user"}, {"role": "assistant"}]))
+        self.assertFalse(_has_both_roles([{"role": "user"}, {"role": "user"}]))
+
class RegistryTest(unittest.TestCase):
def test_telegram_registered(self):
self.assertIn("telegram", available_sources())
@@ -180,7 +254,7 @@ def test_unknown_source_raises(self):
 class CliTest(unittest.TestCase):
     def test_end_to_end_sharegpt(self):
         from ingest.cli import main
-        os.environ["DIALOGSMITH_LLM_VALIDATE"] = "false"  # no API calls
+        os.environ["LLM_VALIDATE"] = "false"  # no API calls
         with tempfile.TemporaryDirectory() as d:
             inp = _write_fixture(d)
             out = os.path.join(d, "chat_sharegpt.json")
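The `ReplyThreadingTest` cases above pin down the behaviour but not the mechanism. As a rough illustration only, a reply-aware sessionizer could look like the sketch below. The function `sessionize`, the dict-based message shape, and the 3600-second gap are all assumptions made for this example; the real logic lives in `ingest.core` and may differ.

```python
def sessionize(messages, gap_seconds=3600):
    """Group time-ordered messages into conversations.

    Each message is a dict with 'ts' (seconds) and optional 'id' / 'reply_to'.
    A reply joins the conversation of the message it answers, even across a gap.
    """
    sessions = []      # list of conversations, each a list of messages
    session_of = {}    # message id -> index of the conversation holding it
    prev = None        # (timestamp, conversation index) of the previous message
    for msg in messages:
        linked = session_of.get(msg.get("reply_to"))
        if linked is not None:
            idx = linked                      # explicit reply link wins
        elif prev is not None and msg["ts"] - prev[0] <= gap_seconds:
            idx = prev[1]                     # close enough in time: same conversation
        else:
            sessions.append([])               # gap exceeded: start a new conversation
            idx = len(sessions) - 1
        sessions[idx].append(msg)
        if msg.get("id") is not None:
            session_of[msg["id"]] = idx
        prev = (msg["ts"], idx)
    return sessions


threaded = [
    {"ts": 1000, "id": "1", "text": "you free this weekend?"},
    {"ts": 9000, "id": "2", "reply_to": "1", "text": "yeah sun works"},
]
unthreaded = [{"ts": m["ts"], "text": m["text"]} for m in threaded]
print(len(sessionize(threaded)), len(sessionize(unthreaded)))  # 1 2
```

With reply metadata the two messages land in one conversation; without it the 8000-second gap splits them, which matches the two test expectations.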
diff --git a/tests/test_redaction.py b/tests/test_redaction.py
new file mode 100644
index 0000000..f7415c9
--- /dev/null
+++ b/tests/test_redaction.py
@@ -0,0 +1,167 @@
+"""Unit tests for regex-based sensitive-data detection (stdlib, no network)."""
+
+import os
+import sys
+import types
+import unittest
+
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from ingest import redaction, redactor
+from ingest.redaction.sg import nric_valid
+
+
+def _categories(text, locales=None):
+    return {f.category for f in redaction.scan_text(text, locales)}
+
+
+class UniversalTest(unittest.TestCase):
+    def test_email(self):
+        finds = redaction.scan_text("ping me at john.doe@acme.co please")
+        self.assertEqual([f.category for f in finds], ["EMAIL"])
+        self.assertEqual(finds[0].preview, "j***@acme.co")  # masked, not raw
+
+    def test_credit_card_luhn(self):
+        # Valid Visa test number passes; same length with bad checksum does not.
+        self.assertIn("CARD_NUMBER", _categories("card 4111 1111 1111 1111"))
+        self.assertNotIn("CARD_NUMBER", _categories("ref 4111 1111 1111 1112"))
+
+    def test_api_keys(self):
+        self.assertIn("API_KEY", _categories("token sk-abcdefghij0123456789xyz"))
+        self.assertIn("API_KEY", _categories("AKIAIOSFODNN7EXAMPLE"))
+
+    def test_ipv4(self):
+        self.assertIn("IP_ADDRESS", _categories("server at 192.168.1.10"))
+        self.assertNotIn("IP_ADDRESS", _categories("version 999.999.1.1"))
+
+
+class SingaporeTest(unittest.TestCase):
+    def test_nric_checksum(self):
+        # S0000001I is a well-formed example; flipping the suffix must fail.
+        self.assertTrue(nric_valid("S0000001I"))
+        self.assertFalse(nric_valid("S0000001A"))
+
+    def test_nric_detected_only_when_valid(self):
+        self.assertIn("NRIC/FIN", _categories("my ic is S0000001I", ["SG"]))
+        self.assertNotIn("NRIC/FIN", _categories("code S0000001A", ["SG"]))
+
+    def test_nric_case_insensitive(self):
+        self.assertIn("NRIC/FIN", _categories("ic s0000001i", ["SG"]))
+
+    def test_nric_short_form_requires_context(self):
+        # With an NRIC/IC keyword nearby it's flagged...
+        self.assertIn("NRIC/FIN (partial)", _categories("NRIC 123A", ["SG"]))
+        self.assertIn("NRIC/FIN (partial)", _categories("my IC is 567B", ["SG"]))
+        # ...but a bare block/unit number is not.
+        self.assertNotIn("NRIC/FIN (partial)", _categories("Blk 123A Clementi", ["SG"]))
+
+    def test_nric_short_form_reports_only_the_id(self):
+        finds = [
+            f for f in redaction.scan_text("NRIC 123A", ["SG"])
+            if f.category == "NRIC/FIN (partial)"
+        ]
+        self.assertEqual(finds[0].value, "123A")  # keyword excluded from the span
+
+    def test_phone(self):
+        self.assertIn("PHONE", _categories("call 9123 4567", ["SG"]))
+        self.assertIn("PHONE", _categories("call +65 9123 4567", ["SG"]))
+
+    def test_locale_filtering(self):
+        # SG detectors don't run when only universal locale is requested.
+        self.assertNotIn("NRIC/FIN", _categories("ic S0000001I", []))
+
+
+class RegistryTest(unittest.TestCase):
+    def test_no_duplicate_names(self):
+        names = [d.name for d in redaction.iter_detectors()]
+        self.assertEqual(len(names), len(set(names)))
+
+    def test_locales_available(self):
+        self.assertIn("SG", redaction.available_locales())
+        self.assertIn("universal", redaction.available_locales())
+
+
+class RedactorStageTest(unittest.TestCase):
+    def _samples(self):
+        return [
+            [{"role": "user", "text": "email me at a@b.com"},
+             {"role": "assistant", "text": "sure thing"}],
+            [{"role": "user", "text": "nothing sensitive here"},
+             {"role": "assistant", "text": "ok"}],
+        ]
+
+    def test_scan_is_nondestructive_and_reports(self):
+        samples = self._samples()
+        report = redactor.scan_samples(samples)
+        self.assertEqual(report["total_findings"], 1)
+        self.assertIn("EMAIL", report["summary"])
+        # Original samples untouched.
+        self.assertEqual(samples[0][0]["text"], "email me at a@b.com")
+
+    def test_apply_replace_uses_placeholder(self):
+        out = redactor.apply(self._samples(), "replace")
+        self.assertEqual(out[0][0]["text"], "email me at [EMAIL]")
+        self.assertEqual(out[1][0]["text"], "nothing sensitive here")  # untouched
+
+    def test_apply_drop_removes_conversation(self):
+        out = redactor.apply(self._samples(), "drop")
+        self.assertEqual(len(out), 1)  # the one with an email is dropped
+        self.assertEqual(out[0][0]["text"], "nothing sensitive here")
+
+
+class _FakeClient:
+    """Stub OpenAI-compatible client returning a canned JSON body (no network)."""
+
+    def __init__(self, text):
+        message = types.SimpleNamespace(content=text)
+        resp = types.SimpleNamespace(choices=[types.SimpleNamespace(message=message)])
+        completions = types.SimpleNamespace(create=lambda **kw: resp)
+        self.chat = types.SimpleNamespace(completions=completions)
+
+
+class LlmRedactionTest(unittest.TestCase):
+    def _samples(self):
+        return [[{"role": "user", "text": "hi I'm Alice from Acme"},
+                 {"role": "assistant", "text": "hello"}]]
+
+    def test_verbatim_span_is_located_and_masked(self):
+        client = _FakeClient(
+            '{"findings":[{"turn":0,"text":"Alice","category":"NAME","severity":"high"}]}'
+        )
+        finds = redactor.llm_scan_samples(self._samples(), client, "model")
+        self.assertEqual(len(finds), 1)
+        self.assertEqual(finds[0]["category"], "NAME")
+        self.assertEqual(finds[0]["start"], 7)  # offset of "Alice"
+        self.assertEqual(finds[0]["end"], 12)
+        self.assertNotIn("Alice", finds[0]["preview"])  # masked
+
+    def test_unlocatable_span_is_dropped(self):
+        # Model paraphrased instead of copying -> can't verify -> skipped.
+        client = _FakeClient(
+            '{"findings":[{"turn":0,"text":"Bob","category":"NAME","severity":"high"}]}'
+        )
+        self.assertEqual(redactor.llm_scan_samples(self._samples(), client, "model"), [])
+
+    def test_merge_into_report(self):
+        report = redactor.scan_samples(self._samples())  # 0 regex findings
+        llm = [{"conversation": 0, "turn": 0, "role": "user", "category": "NAME",
+                "severity": "high", "start": 7, "end": 12, "preview": "Al**e"}]
+        redactor.merge_llm_findings(report, llm)
+        self.assertEqual(report["total_findings"], 1)
+        self.assertIn("NAME", report["summary"])
+        self.assertNotIn("value", report["findings"][0])  # no raw span persisted
+
+    def test_apply_replace_uses_llm_offsets(self):
+        llm = [{"conversation": 0, "turn": 0, "category": "NAME",
+                "start": 7, "end": 12}]
+        out = redactor.apply(self._samples(), "replace", llm_findings=llm)
+        self.assertEqual(out[0][0]["text"], "hi I'm [NAME] from Acme")
+
+    def test_replace_spans_drops_overlap(self):
+        from ingest.redactor import _replace_spans
+        # Two overlapping spans -> only the right-most is applied.
+        self.assertEqual(_replace_spans("abcdef", [(0, 3, "X"), (2, 5, "Y")]), "ab[Y]f")
+
+
+if __name__ == "__main__":
+    unittest.main()
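`test_credit_card_luhn` above relies on a Luhn checksum to separate real card numbers from look-alike digit runs. The detector's own implementation is not shown in this part of the patch, so the following is only a generic sketch of the standard Luhn check; the function name `luhn_valid` is an assumption for illustration.

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn check: separators are ignored, digits are read right to left."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    for pos, d in enumerate(reversed(digits)):
        if pos % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:         # same as summing the two digits of the product
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True
print(luhn_valid("4111 1111 1111 1112"))  # False
```

Gating a regex match on the checksum is what lets the second test string, which has the same shape but a wrong final digit, go unreported.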
From a9367c7f1b520cfe27b668b5b9acd8dc9315013e Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 22:52:47 +0800
Subject: [PATCH 02/15] chore: remove deprecated script shims; fix remote name
in docs
The scripts/telegram_extract.py and scripts/convert_to_sharegpt.py shims only
delegated to `python -m ingest`; remove them and update the Legacy Workflow note.
Co-Authored-By: Claude Opus 4.8
---
README.md | 2 +-
scripts/convert_to_sharegpt.py | 29 -----------------------------
scripts/telegram_extract.py | 28 ----------------------------
3 files changed, 1 insertion(+), 58 deletions(-)
delete mode 100644 scripts/convert_to_sharegpt.py
delete mode 100644 scripts/telegram_extract.py
diff --git a/README.md b/README.md
index 8cf44a7..b635033 100644
--- a/README.md
+++ b/README.md
@@ -294,7 +294,7 @@ It runs in well under a second and locks in the conversion behaviour, so you can
## Legacy Workflow
-The pre-refactor, Windows-only workflow (which cloned LLaMA-Factory at HEAD) is preserved at the [`v0.1.0`](https://github.com/NotYuSheng/Doppelganger/releases/tag/v0.1.0) tag. The old `scripts/telegram_extract.py` and `scripts/convert_to_sharegpt.py` still work as thin deprecated wrappers around `python -m ingest`, but will be removed in a future release.
+The pre-refactor, Windows-only workflow (which cloned LLaMA-Factory at HEAD) is preserved at the [`v0.1.0`](https://github.com/NotYuSheng/Doppelganger/releases/tag/v0.1.0) tag. The old `scripts/telegram_extract.py` and `scripts/convert_to_sharegpt.py` shims have been removed — use `python -m ingest` instead.
## Star History
diff --git a/scripts/convert_to_sharegpt.py b/scripts/convert_to_sharegpt.py
deleted file mode 100644
index 7d99723..0000000
--- a/scripts/convert_to_sharegpt.py
+++ /dev/null
@@ -1,29 +0,0 @@
-#!/usr/bin/env python3
-"""DEPRECATED shim — kept for backwards compatibility, removed in a future release.
-
-The new pipeline writes ShareGPT directly:
-
- python -m ingest --source telegram --format sharegpt
-
-This shim still converts an existing data/chat_dataset.jsonl into
-data/chat_sharegpt.json, delegating to the new ``ingest`` package.
-"""
-
-import os
-import sys
-
-sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
-from ingest import sharegpt # noqa: E402
-
-INPUT_PATH = "./data/chat_dataset.jsonl"
-OUTPUT_PATH = "./data/chat_sharegpt.json"
-
-if __name__ == "__main__":
-    sys.stderr.write(
-        "[deprecated] scripts/convert_to_sharegpt.py -> use: "
-        "python -m ingest --source telegram --format sharegpt\n"
-    )
-    samples = sharegpt.load_jsonl_samples(INPUT_PATH)
-    written = sharegpt.write_sharegpt(samples, OUTPUT_PATH)
-    print(f"Converted {written} valid conversation samples to ShareGPT format.")
diff --git a/scripts/telegram_extract.py b/scripts/telegram_extract.py
deleted file mode 100644
index 548d004..0000000
--- a/scripts/telegram_extract.py
+++ /dev/null
@@ -1,28 +0,0 @@
-#!/usr/bin/env python3
-"""DEPRECATED shim — kept for backwards compatibility, removed in a future release.
-
-Use the cross-platform CLI instead:
-
- python -m ingest --source telegram --format jsonl
-
-This shim reproduces the old behaviour (Telegram result.json ->
-data/chat_dataset.jsonl) by delegating to the new ``ingest`` package.
-"""
-
-import os
-import sys
-
-# Allow running as `python scripts/telegram_extract.py` from the repo root.
-sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
-from ingest.cli import main # noqa: E402
-
-if __name__ == "__main__":
-    sys.stderr.write(
-        "[deprecated] scripts/telegram_extract.py -> use: "
-        "python -m ingest --source telegram --format jsonl\n"
-    )
-    raise SystemExit(
-        main(["--source", "telegram", "--format", "jsonl",
-              "--output", "./data/chat_dataset.jsonl"])
-    )
From 33a5cd5b77111b7a2bb236ea002ab7a8c459cb92 Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 22:56:16 +0800
Subject: [PATCH 03/15] docs: resolve conflicting consent wording in top
warnings
Caution block now owns data-sensitivity + consent + law; Important block owns
model-misuse. Removes the contradictory 'never consented' phrasing and the
duplicate consent line.
Co-Authored-By: Claude Opus 4.8
---
README.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index b635033..d6f8f8d 100644
--- a/README.md
+++ b/README.md
@@ -27,10 +27,10 @@ Doppelganger fine-tunes large language models (like Qwen) on your own chat conve
Ingestion is **source-agnostic**: a small adapter parses each platform's export into a normalized message stream, and the rest of the pipeline (sessionizing, turn-merging, sensitive-data scanning, optional quality auditing, ShareGPT formatting) is shared. **Telegram** is supported today, with **WhatsApp**, **Discord**, and other platforms planned — each slots in as a drop-in adapter.
> [!CAUTION]
-> **Your chat history is sensitive data, and you are responsible for it.** A model fine-tuned on it can memorize and later reproduce personal identifiers, private conversations, credentials, and things said by other people who never consented. The built-in [sensitive-data scanning](#privacy--sensitive-data) is a **safety net, not a guarantee** — both regex and LLM detection miss real cases and raise false positives. Before training, sharing, or deploying anything, **review the dataset yourself**, obtain any consent you need, and ensure you comply with applicable privacy laws. Treat trained adapters and merged checkpoints as sensitive too — they can leak the data they were trained on.
+> **Your chat history is sensitive data, and you are responsible for it.** A model fine-tuned on it can memorize and later reproduce personal identifiers, private conversations, credentials, and messages written by other people in your chats. The built-in [sensitive-data scanning](#privacy--sensitive-data) is a **safety net, not a guarantee** — both regex and LLM detection miss real cases and raise false positives. Before training, sharing, or deploying anything: **review the dataset yourself**, get consent from others whose messages are included (especially in group chats), and comply with applicable privacy laws. Treat trained adapters and merged checkpoints as sensitive too — they can leak the data they were trained on.
> [!IMPORTANT]
-> **This is a for-fun, experimental project — not a production tool.** A model that imitates a real person can be misused for impersonation, deception, or social engineering, and it will happily generate convincing messages that person never actually wrote. Don't present its output as genuinely from anyone, don't train on someone else's chats without their knowledge, and don't rely on it for anything that matters. Enjoy it responsibly.
+> **This is a for-fun, experimental project — not a production tool.** A model that imitates a real person can be misused for impersonation, deception, or social engineering, and it will happily generate convincing messages that person never actually wrote. Don't present its output as genuinely from anyone, and don't rely on it for anything that matters. Enjoy it responsibly.
Fine-tuning on your chats can capture your:
From 7c143fe44f78b168a414f203701a4a4363846b10 Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 23:05:44 +0800
Subject: [PATCH 04/15] docs: add Roadmap section (training techniques +
tracked issues)
States the project's intent to explore pre-training, fine-tuning, and
alignment, plus persona prompting (#14), more adapters (#9), NER redaction
(#13), and wider locale packs (#15).
Co-Authored-By: Claude Opus 4.8
---
README.md | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/README.md b/README.md
index d6f8f8d..194315c 100644
--- a/README.md
+++ b/README.md
@@ -296,6 +296,18 @@ It runs in well under a second and locks in the conversion behaviour, so you can
The pre-refactor, Windows-only workflow (which cloned LLaMA-Factory at HEAD) is preserved at the [`v0.1.0`](https://github.com/NotYuSheng/Doppelganger/releases/tag/v0.1.0) tag. The old `scripts/telegram_extract.py` and `scripts/convert_to_sharegpt.py` shims have been removed — use `python -m ingest` instead.
+## Roadmap
+
+Capturing how someone communicates is bigger than any single method. Today Doppelganger uses **LoRA fine-tuning (SFT)** on your chats; the project intends to explore other training techniques and context sources over time:
+
+- **More training techniques** — beyond fine-tuning, e.g. continued **pre-training** on larger personal corpora and **alignment / preference tuning** (DPO) to refine behaviour.
+- **Persona prompting** — a short quiz that generates a system prompt for explicit preferences/facts, complementing the fine-tuned *style* ([#14](https://github.com/NotYuSheng/Doppelganger/issues/14)).
+- **More chat sources** — WhatsApp, Discord, and others as drop-in adapters ([#9](https://github.com/NotYuSheng/Doppelganger/issues/9)).
+- **Offline NER redaction** — name/location detection without an LLM ([#13](https://github.com/NotYuSheng/Doppelganger/issues/13)).
+- **Wider locale coverage** — more country detector packs ([#15](https://github.com/NotYuSheng/Doppelganger/issues/15) tracks the Singapore gaps).
+
+This is an experimental, for-fun project — the roadmap is exploratory, not a commitment.
+
## Star History
From 4e9cb917d21bff1c20977bd4e4a3e929093eaccb Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 23:08:35 +0800
Subject: [PATCH 05/15] docs: expand Roadmap into exploration vision across AI
techniques
Frames the project as a learning sandbox spanning training, memory/RAG, agents,
MCP, guardrails, and evaluation. Links the new exploration issues (#16-#21)
alongside existing tracked work (#9, #13, #14, #15).
Co-Authored-By: Claude Opus 4.8
---
README.md | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)
diff --git a/README.md b/README.md
index 194315c..54fae35 100644
--- a/README.md
+++ b/README.md
@@ -296,17 +296,33 @@ It runs in well under a second and locks in the conversion behaviour, so you can
The pre-refactor, Windows-only workflow (which cloned LLaMA-Factory at HEAD) is preserved at the [`v0.1.0`](https://github.com/NotYuSheng/Doppelganger/releases/tag/v0.1.0) tag. The old `scripts/telegram_extract.py` and `scripts/convert_to_sharegpt.py` shims have been removed — use `python -m ingest` instead.
-## Roadmap
+## Roadmap & Vision
-Capturing how someone communicates is bigger than any single method. Today Doppelganger uses **LoRA fine-tuning (SFT)** on your chats; the project intends to explore other training techniques and context sources over time:
+Doppelganger is as much a **learning sandbox** as a tool: the aim is to explore the *full* AI toolbox for capturing how a person communicates, and to find what actually moves the needle on *"does this sound like me?"*. Today that's LoRA fine-tuning — everything below is exploratory.
-- **More training techniques** — beyond fine-tuning, e.g. continued **pre-training** on larger personal corpora and **alignment / preference tuning** (DPO) to refine behaviour.
-- **Persona prompting** — a short quiz that generates a system prompt for explicit preferences/facts, complementing the fine-tuned *style* ([#14](https://github.com/NotYuSheng/Doppelganger/issues/14)).
-- **More chat sources** — WhatsApp, Discord, and others as drop-in adapters ([#9](https://github.com/NotYuSheng/Doppelganger/issues/9)).
-- **Offline NER redaction** — name/location detection without an LLM ([#13](https://github.com/NotYuSheng/Doppelganger/issues/13)).
-- **Wider locale coverage** — more country detector packs ([#15](https://github.com/NotYuSheng/Doppelganger/issues/15) tracks the Singapore gaps).
+**Shaping the model**
+- Training techniques — **pre-training**, **fine-tuning** (today), and **alignment / preference tuning** (DPO).
+- **Continual learning** — keep the model current as new chats arrive, without catastrophic forgetting ([#18](https://github.com/NotYuSheng/Doppelganger/issues/18)).
-This is an experimental, for-fun project — the roadmap is exploratory, not a commitment.
+**Giving it context & memory**
+- **RAG** — retrieve your past messages and memories at inference instead of baking everything into weights ([#16](https://github.com/NotYuSheng/Doppelganger/issues/16)).
+- **Persona prompting** — a quiz that generates a system prompt for explicit facts/preferences, complementing the fine-tuned *style* ([#14](https://github.com/NotYuSheng/Doppelganger/issues/14)).
+- **MCP** — expose memory/tools (or the doppelganger itself) via the Model Context Protocol ([#20](https://github.com/NotYuSheng/Doppelganger/issues/20)).
+
+**Making it act**
+- **Agentic doppelganger** — tool use and bounded actions on your behalf ([#19](https://github.com/NotYuSheng/Doppelganger/issues/19)).
+
+**Keeping it safe & honest**
+- **Guardrails** — block harmful output and sensitive-data leakage in generations ([#21](https://github.com/NotYuSheng/Doppelganger/issues/21)).
+- Sensitive-data redaction (shipped) + **offline NER** for names/locations ([#13](https://github.com/NotYuSheng/Doppelganger/issues/13)).
+
+**Knowing if it works**
+- **Evaluation** — measure style fidelity: LLM-as-judge, held-out perplexity, stylometrics, blind human A/B ([#17](https://github.com/NotYuSheng/Doppelganger/issues/17)).
+
+**More data & coverage**
+- More chat sources — WhatsApp, Discord, and others as drop-in adapters ([#9](https://github.com/NotYuSheng/Doppelganger/issues/9)); wider locale detector packs ([#15](https://github.com/NotYuSheng/Doppelganger/issues/15)).
+
+> This is an experimental, for-fun project — the roadmap is a wishlist of things to explore, not a commitment.
## Star History
From 4c207ea89e06352172302f6f63a21268cd12a087 Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 23:13:06 +0800
Subject: [PATCH 06/15] docs: expand Roadmap to full exploration backlog
(#16-#38)
Adds multimodal, inference-time control, and interpretability tracks and links
the complete exploration backlog; points to the exploration label for the rest.
Co-Authored-By: Claude Opus 4.8
---
README.md | 28 +++++++++++-----------------
1 file changed, 11 insertions(+), 17 deletions(-)
diff --git a/README.md b/README.md
index 54fae35..cc6d1f6 100644
--- a/README.md
+++ b/README.md
@@ -298,29 +298,23 @@ The pre-refactor, Windows-only workflow (which cloned LLaMA-Factory at HEAD) is
## Roadmap & Vision
-Doppelganger is as much a **learning sandbox** as a tool: the aim is to explore the *full* AI toolbox for capturing how a person communicates, and to find what actually moves the needle on *"does this sound like me?"*. Today that's LoRA fine-tuning — everything below is exploratory.
+Doppelganger is as much a **learning sandbox** as a tool: the aim is to explore the *full* AI toolbox for capturing how a person communicates, and to find what actually moves the needle on *"does this sound like me?"*. Today that's LoRA fine-tuning — everything below is exploratory. The full backlog lives under the [`exploration`](https://github.com/NotYuSheng/Doppelganger/issues?q=is%3Aissue+is%3Aopen+label%3Aexploration) label.
-**Shaping the model**
-- Training techniques — **pre-training**, **fine-tuning** (today), and **alignment / preference tuning** (DPO).
-- **Continual learning** — keep the model current as new chats arrive, without catastrophic forgetting ([#18](https://github.com/NotYuSheng/Doppelganger/issues/18)).
+**Shaping the model** — pre-training · fine-tuning (today) · alignment/DPO · continual learning ([#18](https://github.com/NotYuSheng/Doppelganger/issues/18)) · synthetic data / self-instruct ([#22](https://github.com/NotYuSheng/Doppelganger/issues/22)) · multi-LoRA personas & merging ([#23](https://github.com/NotYuSheng/Doppelganger/issues/23)) · distillation to on-device ([#24](https://github.com/NotYuSheng/Doppelganger/issues/24)) · PEFT comparison ([#25](https://github.com/NotYuSheng/Doppelganger/issues/25))
-**Giving it context & memory**
-- **RAG** — retrieve your past messages and memories at inference instead of baking everything into weights ([#16](https://github.com/NotYuSheng/Doppelganger/issues/16)).
-- **Persona prompting** — a quiz that generates a system prompt for explicit facts/preferences, complementing the fine-tuned *style* ([#14](https://github.com/NotYuSheng/Doppelganger/issues/14)).
-- **MCP** — expose memory/tools (or the doppelganger itself) via the Model Context Protocol ([#20](https://github.com/NotYuSheng/Doppelganger/issues/20)).
+**Giving it context & memory** — RAG ([#16](https://github.com/NotYuSheng/Doppelganger/issues/16)) · long-term memory + reflection ([#26](https://github.com/NotYuSheng/Doppelganger/issues/26)) · relationship/knowledge graph ([#27](https://github.com/NotYuSheng/Doppelganger/issues/27)) · style embeddings / user-conditioning ([#28](https://github.com/NotYuSheng/Doppelganger/issues/28)) · persona-prompt quiz ([#14](https://github.com/NotYuSheng/Doppelganger/issues/14)) · MCP ([#20](https://github.com/NotYuSheng/Doppelganger/issues/20))
-**Making it act**
-- **Agentic doppelganger** — tool use and bounded actions on your behalf ([#19](https://github.com/NotYuSheng/Doppelganger/issues/19)).
+**Multimodal** — voice cloning, TTS/STT ([#29](https://github.com/NotYuSheng/Doppelganger/issues/29)) · stickers / emoji / memes ([#30](https://github.com/NotYuSheng/Doppelganger/issues/30))
-**Keeping it safe & honest**
-- **Guardrails** — block harmful output and sensitive-data leakage in generations ([#21](https://github.com/NotYuSheng/Doppelganger/issues/21)).
-- Sensitive-data redaction (shipped) + **offline NER** for names/locations ([#13](https://github.com/NotYuSheng/Doppelganger/issues/13)).
+**Making it act** — agentic doppelganger ([#19](https://github.com/NotYuSheng/Doppelganger/issues/19)) · multi-agent & self-play ([#31](https://github.com/NotYuSheng/Doppelganger/issues/31)) · proactive / initiative modeling ([#32](https://github.com/NotYuSheng/Doppelganger/issues/32))
-**Knowing if it works**
-- **Evaluation** — measure style fidelity: LLM-as-judge, held-out perplexity, stylometrics, blind human A/B ([#17](https://github.com/NotYuSheng/Doppelganger/issues/17)).
+**Inference-time control** — activation steering / control vectors ([#36](https://github.com/NotYuSheng/Doppelganger/issues/36)) · prompt optimization (DSPy) ([#37](https://github.com/NotYuSheng/Doppelganger/issues/37))
-**More data & coverage**
-- More chat sources — WhatsApp, Discord, and others as drop-in adapters ([#9](https://github.com/NotYuSheng/Doppelganger/issues/9)); wider locale detector packs ([#15](https://github.com/NotYuSheng/Doppelganger/issues/15)).
+**Keeping it safe & honest** — guardrails ([#21](https://github.com/NotYuSheng/Doppelganger/issues/21)) · redaction (shipped) + offline NER ([#13](https://github.com/NotYuSheng/Doppelganger/issues/13)) · differential-privacy training ([#33](https://github.com/NotYuSheng/Doppelganger/issues/33)) · machine unlearning ([#34](https://github.com/NotYuSheng/Doppelganger/issues/34)) · memorization audits / canaries / watermarking / federated ([#35](https://github.com/NotYuSheng/Doppelganger/issues/35))
+
+**Knowing if it works** — evaluation, "does it sound like me?" ([#17](https://github.com/NotYuSheng/Doppelganger/issues/17)) · interpretability, "what did it learn about me?" ([#38](https://github.com/NotYuSheng/Doppelganger/issues/38))
+
+**More data & coverage** — more chat sources: WhatsApp, Discord, … ([#9](https://github.com/NotYuSheng/Doppelganger/issues/9)) · wider locale detector packs ([#15](https://github.com/NotYuSheng/Doppelganger/issues/15))
> This is an experimental, for-fun project — the roadmap is a wishlist of things to explore, not a commitment.
From 23b12c5c7aac74d0d1136cdbb9ea62516f200d66 Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 23:14:26 +0800
Subject: [PATCH 07/15] docs: state roadmap techniques without issue links
Keep the README roadmap as a plain statement of exploration areas; the issue
tracker holds the live backlog.
Co-Authored-By: Claude Opus 4.8
---
README.md | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/README.md b/README.md
index cc6d1f6..8208ea5 100644
--- a/README.md
+++ b/README.md
@@ -298,23 +298,23 @@ The pre-refactor, Windows-only workflow (which cloned LLaMA-Factory at HEAD) is
## Roadmap & Vision
-Doppelganger is as much a **learning sandbox** as a tool: the aim is to explore the *full* AI toolbox for capturing how a person communicates, and to find what actually moves the needle on *"does this sound like me?"*. Today that's LoRA fine-tuning — everything below is exploratory. The full backlog lives under the [`exploration`](https://github.com/NotYuSheng/Doppelganger/issues?q=is%3Aissue+is%3Aopen+label%3Aexploration) label.
+Doppelganger is as much a **learning sandbox** as a tool: the aim is to explore the *full* AI toolbox for capturing how a person communicates, and to find what actually moves the needle on *"does this sound like me?"*. Today that's LoRA fine-tuning — everything below is exploratory (see the issue tracker for the live backlog).
-**Shaping the model** — pre-training · fine-tuning (today) · alignment/DPO · continual learning ([#18](https://github.com/NotYuSheng/Doppelganger/issues/18)) · synthetic data / self-instruct ([#22](https://github.com/NotYuSheng/Doppelganger/issues/22)) · multi-LoRA personas & merging ([#23](https://github.com/NotYuSheng/Doppelganger/issues/23)) · distillation to on-device ([#24](https://github.com/NotYuSheng/Doppelganger/issues/24)) · PEFT comparison ([#25](https://github.com/NotYuSheng/Doppelganger/issues/25))
+**Shaping the model** — pre-training · fine-tuning (today) · alignment/DPO · continual learning · synthetic data / self-instruct · multi-LoRA personas & merging · distillation to on-device · PEFT comparison
-**Giving it context & memory** — RAG ([#16](https://github.com/NotYuSheng/Doppelganger/issues/16)) · long-term memory + reflection ([#26](https://github.com/NotYuSheng/Doppelganger/issues/26)) · relationship/knowledge graph ([#27](https://github.com/NotYuSheng/Doppelganger/issues/27)) · style embeddings / user-conditioning ([#28](https://github.com/NotYuSheng/Doppelganger/issues/28)) · persona-prompt quiz ([#14](https://github.com/NotYuSheng/Doppelganger/issues/14)) · MCP ([#20](https://github.com/NotYuSheng/Doppelganger/issues/20))
+**Giving it context & memory** — RAG · long-term memory + reflection · relationship/knowledge graph · style embeddings / user-conditioning · persona-prompt quiz · MCP
-**Multimodal** — voice cloning, TTS/STT ([#29](https://github.com/NotYuSheng/Doppelganger/issues/29)) · stickers / emoji / memes ([#30](https://github.com/NotYuSheng/Doppelganger/issues/30))
+**Multimodal** — voice cloning, TTS/STT · stickers / emoji / memes
-**Making it act** — agentic doppelganger ([#19](https://github.com/NotYuSheng/Doppelganger/issues/19)) · multi-agent & self-play ([#31](https://github.com/NotYuSheng/Doppelganger/issues/31)) · proactive / initiative modeling ([#32](https://github.com/NotYuSheng/Doppelganger/issues/32))
+**Making it act** — agentic doppelganger · multi-agent & self-play · proactive / initiative modeling
-**Inference-time control** — activation steering / control vectors ([#36](https://github.com/NotYuSheng/Doppelganger/issues/36)) · prompt optimization (DSPy) ([#37](https://github.com/NotYuSheng/Doppelganger/issues/37))
+**Inference-time control** — activation steering / control vectors · prompt optimization (DSPy)
-**Keeping it safe & honest** — guardrails ([#21](https://github.com/NotYuSheng/Doppelganger/issues/21)) · redaction (shipped) + offline NER ([#13](https://github.com/NotYuSheng/Doppelganger/issues/13)) · differential-privacy training ([#33](https://github.com/NotYuSheng/Doppelganger/issues/33)) · machine unlearning ([#34](https://github.com/NotYuSheng/Doppelganger/issues/34)) · memorization audits / canaries / watermarking / federated ([#35](https://github.com/NotYuSheng/Doppelganger/issues/35))
+**Keeping it safe & honest** — guardrails · redaction (shipped) + offline NER · differential-privacy training · machine unlearning · memorization audits / canaries / watermarking / federated
-**Knowing if it works** — evaluation, "does it sound like me?" ([#17](https://github.com/NotYuSheng/Doppelganger/issues/17)) · interpretability, "what did it learn about me?" ([#38](https://github.com/NotYuSheng/Doppelganger/issues/38))
+**Knowing if it works** — evaluation, "does it sound like me?" · interpretability, "what did it learn about me?"
-**More data & coverage** — more chat sources: WhatsApp, Discord, … ([#9](https://github.com/NotYuSheng/Doppelganger/issues/9)) · wider locale detector packs ([#15](https://github.com/NotYuSheng/Doppelganger/issues/15))
+**More data & coverage** — more chat sources: WhatsApp, Discord, … · wider locale detector packs
> This is an experimental, for-fun project — the roadmap is a wishlist of things to explore, not a commitment.
From e5b8d9cc16f37a4a7d3493d54c793d039df10496 Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 23:30:01 +0800
Subject: [PATCH 08/15] fix: address PR review (span overlap, speaker prefix,
LLM failure aborts, SG postal)
- redactor._replace_spans: on overlap keep the outer/longer span so an inner
span can't leak the rest (e.g. DOMAIN inside EMAIL) [security-high].
- core._assemble_turns: apply the speaker prefix only once per merged turn.
- redactor.llm_scan_samples + validator.validate_samples: abort after 5
consecutive LLM failures instead of flooding the console.
- redaction/sg.py: sg_postal also matches the common "S123456" form.
Co-Authored-By: Claude Opus 4.8
---
ingest/core.py | 5 +++--
ingest/redaction/sg.py | 4 ++--
ingest/redactor.py | 23 ++++++++++++++++-------
ingest/validator.py | 8 ++++++++
tests/test_redaction.py | 12 +++++++++---
5 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/ingest/core.py b/ingest/core.py
index c27b6fe..81c2e02 100644
--- a/ingest/core.py
+++ b/ingest/core.py
@@ -154,7 +154,6 @@ def _assemble_turns(raw_turns, multi_speaker: bool) -> Sample:
for sender_id, is_self, text in raw_turns:
role = "assistant" if is_self else "user"
- value = f"{sender_id}: {text}" if (multi_speaker and role == "user") else text
same_role = bool(turns) and turns[-1]["role"] == role
# In multi-speaker mode a user turn only merges with the previous turn
@@ -163,8 +162,10 @@ def _assemble_turns(raw_turns, multi_speaker: bool) -> Sample:
multi_speaker and role == "user" and last_sender != sender_id
)
if mergeable:
- turns[-1]["text"] += "\n" + value
+ # Continuation of the same turn — don't repeat the speaker prefix.
+ turns[-1]["text"] += "\n" + text
else:
+ value = f"{sender_id}: {text}" if (multi_speaker and role == "user") else text
turns.append({"role": role, "text": value})
last_sender = sender_id
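
Reviewer note: the merge behaviour this hunk aims for can be sketched standalone as below. This is a minimal illustration, assuming raw turns are `(sender_id, is_self, text)` tuples as in the loop above. The name `assemble_turns` and the exact merge condition are assumptions for the example, since the hunk elides part of that condition; it is not the code in `ingest/core.py`.

```python
def assemble_turns(raw_turns, multi_speaker=False):
    """Merge consecutive same-speaker messages; prefix a user turn only once."""
    turns = []
    last_sender = None
    for sender_id, is_self, text in raw_turns:
        role = "assistant" if is_self else "user"
        same_role = bool(turns) and turns[-1]["role"] == role
        # Assumed merge rule: same role, and in multi-speaker mode a user
        # turn must also come from the same sender as the previous message.
        speaker_changed = (
            multi_speaker and role == "user" and last_sender != sender_id
        )
        if same_role and not speaker_changed:
            # Continuation of the same turn: append the bare text, no prefix.
            turns[-1]["text"] += "\n" + text
        else:
            value = f"{sender_id}: {text}" if (multi_speaker and role == "user") else text
            turns.append({"role": role, "text": value})
        last_sender = sender_id
    return turns
```

With two consecutive messages from the same group-chat sender, the label appears once at the start of the merged turn instead of on every line, and the owner's (assistant) turns stay unlabelled.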
diff --git a/ingest/redaction/sg.py b/ingest/redaction/sg.py
index 7b57cdb..b78a391 100644
--- a/ingest/redaction/sg.py
+++ b/ingest/redaction/sg.py
@@ -82,11 +82,11 @@ def nric_valid(value: str) -> bool:
)
# Postal code is 6 bare digits — far too noisy alone, so require an explicit
-# "Singapore " or "S(code)" context to keep precision up.
+# "Singapore ", "S123456", or "S(123456)" context to keep precision up.
make(
"sg_postal",
"POSTAL_CODE",
"SG",
- r"(?:[Ss]ingapore\s+|\bS\()\d{6}\)?",
+ r"(?:[Ss]ingapore\s+|\bS\(?)\d{6}\)?",
severity="low",
)
diff --git a/ingest/redactor.py b/ingest/redactor.py
index 39786fa..85edec5 100644
--- a/ingest/redactor.py
+++ b/ingest/redactor.py
@@ -21,6 +21,7 @@
from ingest import redaction
DEFAULT_LOCALES = ["SG"] # universal detectors always run in addition to these
+_MAX_CONSECUTIVE_LLM_FAILURES = 5 # abort the LLM pass if the endpoint keeps failing
def scan_samples(samples, locales: Optional[Iterable[str]] = None) -> dict:
@@ -92,16 +93,18 @@ def print_summary(report: dict, report_path: str, mode: str = "off") -> None:
def _replace_spans(text: str, spans) -> str:
"""Replace ``(start, end, category)`` spans with ``[CATEGORY]`` placeholders.
- Drops overlapping spans (keeping the right-most) and replaces right-to-left
- so each replacement leaves earlier offsets valid.
+ On overlap, keep the earliest-starting span (the longer on a tie), so an inner
+ ``DOMAIN`` can't survive while its enclosing ``EMAIL`` is dropped, which would
+ expose the email username. Sort by start ascending then end descending, greedily
+ keep non-overlapping spans, and apply right-to-left so earlier offsets stay valid.
"""
chosen = []
- boundary = len(text) + 1 # left edge of the span accepted to our right
- for start, end, cat in sorted(set(spans), key=lambda s: -s[0]):
- if end <= boundary:
+ last_end = 0
+ for start, end, cat in sorted(set(spans), key=lambda s: (s[0], -s[1])):
+ if start >= last_end:
chosen.append((start, end, cat))
- boundary = start
- for start, end, cat in chosen: # already right-to-left
+ last_end = end
+ for start, end, cat in reversed(chosen):
text = text[:start] + f"[{cat}]" + text[end:]
return text
@@ -195,11 +198,17 @@ def llm_scan_samples(samples, client, model) -> List[dict]:
rather than trusting an offset we can't confirm.
"""
findings = []
+ consecutive_failures = 0
for ci, turns in enumerate(samples):
try:
raw = _llm_audit_conversation(client, model, turns)
+ consecutive_failures = 0
except Exception as e:
+ consecutive_failures += 1
print(f"[redactor] LLM scan failed on conversation {ci}: {e}")
+ if consecutive_failures >= _MAX_CONSECUTIVE_LLM_FAILURES:
+ print("[redactor] Too many consecutive LLM failures — aborting LLM scan.")
+ break
continue
for rf in raw:
try:
diff --git a/ingest/validator.py b/ingest/validator.py
index 7a684c2..103ebed 100644
--- a/ingest/validator.py
+++ b/ingest/validator.py
@@ -25,6 +25,7 @@
from ingest import llm
+_MAX_CONSECUTIVE_LLM_FAILURES = 5 # abort validation if the endpoint keeps failing
COHERENCE_THRESHOLD = 0.5 # below this the conversation is considered incoherent
QUALITY_THRESHOLD = 0.5 # below this the sample is considered low-quality
PAIRING_THRESHOLD = 0.5 # below this the turns don't respond to each other
@@ -135,13 +136,20 @@ def validate_samples(samples):
passed = []
filtered = []
split_count = 0
+ consecutive_failures = 0
for i, turns in enumerate(samples):
try:
r = _score_sample(client, model, turns)
+ consecutive_failures = 0
except Exception as e:
+ consecutive_failures += 1
print(f"[validator] Sample {i}: scoring failed ({e}), keeping sample.")
passed.append(turns)
+ if consecutive_failures >= _MAX_CONSECUTIVE_LLM_FAILURES:
+ print("[validator] Too many consecutive LLM failures — keeping remaining samples unvalidated.")
+ passed.extend(samples[i + 1:])
+ break
continue
low = (
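
Reviewer note: both the redactor and validator hunks use the same pattern, a counter of consecutive failures that resets on any success and aborts the loop once it reaches a limit. A generic sketch of that pattern follows. The helper name `run_with_failure_budget` and its return shape are hypothetical, chosen only for illustration. What to do with the remaining items after an abort is left to the caller, since the two call sites differ: the redactor stops collecting findings, while the validator keeps the remaining samples unvalidated.

```python
def run_with_failure_budget(items, call, max_consecutive_failures=5):
    """Apply `call` to each item; stop early after N failures in a row.

    Returns (results, aborted_at). `results` holds (index, value) pairs for
    the successful calls; `aborted_at` is the index at which the loop gave
    up, or None if it ran to completion.
    """
    results = []
    consecutive = 0
    for i, item in enumerate(items):
        try:
            value = call(item)
        except Exception as exc:
            consecutive += 1
            print(f"item {i} failed: {exc}")
            if consecutive >= max_consecutive_failures:
                return results, i
            continue
        consecutive = 0  # any success resets the streak
        results.append((i, value))
    return results, None
```

Because the counter resets on success, an endpoint that fails intermittently never trips the limit; only a run of back-to-back failures (typically a server that is down) does.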
diff --git a/tests/test_redaction.py b/tests/test_redaction.py
index f7415c9..7630df8 100644
--- a/tests/test_redaction.py
+++ b/tests/test_redaction.py
@@ -157,10 +157,16 @@ def test_apply_replace_uses_llm_offsets(self):
out = redactor.apply(self._samples(), "replace", llm_findings=llm)
self.assertEqual(out[0][0]["text"], "hi I'm [NAME] from Acme")
- def test_replace_spans_drops_overlap(self):
+ def test_replace_spans_prefers_outer_span(self):
from ingest.redactor import _replace_spans
- # Two overlapping spans -> only the right-most is applied.
- self.assertEqual(_replace_spans("abcdef", [(0, 3, "X"), (2, 5, "Y")]), "ab[Y]f")
+ # Partial overlap -> keep the earlier/outer span, one clean replacement.
+ self.assertEqual(_replace_spans("abcdef", [(0, 3, "X"), (2, 5, "Y")]), "[X]def")
+ # Nested: the inner span must not survive while its enclosing span is
+ # dropped (which would leave the uncovered prefix exposed).
+ self.assertEqual(
+ _replace_spans("a@b.com x", [(0, 7, "EMAIL"), (2, 7, "DOMAIN")]),
+ "[EMAIL] x",
+ )
if __name__ == "__main__":
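
Reviewer note: the tests above pin the greedy behaviour, where on a partial overlap the earlier span wins and the later one is dropped. In the `"abcdef"` case that leaves offsets 3-4, which belonged to the second span, unredacted. If that tail matters, one alternative is to merge overlapping spans into their union before replacing. The sketch below is an assumption about how that could look, not part of this patch; `replace_spans_merged` is a hypothetical name, and labelling the merged span with the first span's category is an arbitrary choice.

```python
def replace_spans_merged(text, spans):
    """Replace (start, end, category) spans, merging any that overlap."""
    merged = []
    for start, end, cat in sorted(set(spans), key=lambda s: (s[0], -s[1])):
        if merged and start < merged[-1][1]:
            # Overlap: extend the previous span so it covers both ranges.
            prev_start, prev_end, prev_cat = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_cat)
        else:
            merged.append((start, end, cat))
    for start, end, cat in reversed(merged):  # right-to-left keeps offsets valid
        text = text[:start] + f"[{cat}]" + text[end:]
    return text
```

Nested spans behave the same as in the patch (the outer span wins); only partial overlaps differ, where the union is redacted rather than just the earlier span.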
From 9619e7f909bba097b7692dbf77ddebc792c61eae Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 23:30:01 +0800
Subject: [PATCH 09/15] refactor: local-first LLM config (no hardcoded model
default, drop cloud suggestions)
- llm.py: remove DEFAULT_MODEL; LLM_MODEL is required to enable the LLM features
(clear error otherwise). Docs use the vLLM/LM Studio HF model-id convention.
- example.env / README: lead with a LOCAL OpenAI-compatible server
(Qwen/Qwen2.5-7B-Instruct), remove gpt-4o-mini/cloud suggestions; drop the
roadmap "(today)" qualifier.
Co-Authored-By: Claude Opus 4.8
---
README.md | 16 +++++++---------
example.env | 33 ++++++++++++---------------------
ingest/llm.py | 32 ++++++++++++++++++--------------
3 files changed, 37 insertions(+), 44 deletions(-)
diff --git a/README.md b/README.md
index 8208ea5..33e1067 100644
--- a/README.md
+++ b/README.md
@@ -103,15 +103,13 @@ python -m ingest --source telegram
**3. (Optional) Configure LLM features**
-Copy `example.env` to `.env` (the setup scripts do this for you) and fill it in to enable the quality auditor and LLM redaction. Local endpoints keep your chat data on your machine:
+The core pipeline needs no LLM. To *also* enable the quality auditor and LLM redaction, copy `example.env` to `.env` (the setup scripts do this) and point it at a **local** OpenAI-compatible server (vLLM, LM Studio, llama.cpp) so your chat data stays on your machine:
```dotenv
LLM_VALIDATE=true
-LLM_MODEL=gpt-4o-mini
-LLM_API_KEY=your_api_key_here
-# For a local model instead (key can be any value):
-# LLM_API_BASE_URL=http://localhost:11434/v1
-# LLM_MODEL=qwen2.5
+LLM_API_BASE_URL=http://localhost:8000/v1 # vLLM (LM Studio uses :1234/v1)
+LLM_MODEL=Qwen/Qwen2.5-7B-Instruct # the model your server serves
+LLM_API_KEY=local # local servers accept any value
```
**4. Fine-tune**
@@ -140,7 +138,7 @@ llamafactory-cli train configs/train_lora.yaml
### Optional: LLM quality auditing
-Each extracted conversation can be scored for **coherence, quality, and pairing**, dropping or splitting weak samples before training. It uses the OpenAI-compatible API, so it works with OpenAI **or any local server** (Ollama, vLLM, LM Studio). It's enabled automatically when `LLM_API_KEY` or `LLM_API_BASE_URL` is set (configure it in `.env`, step 3 above).
+Each extracted conversation can be scored for **coherence, quality, and pairing**, dropping or splitting weak samples before training. It talks to a **local** OpenAI-compatible server (vLLM, LM Studio, llama.cpp) so your chat data stays on your machine. It's enabled automatically when `LLM_API_KEY` or `LLM_API_BASE_URL` is set (configure it in `.env`, step 3 above).
To turn it off, set `LLM_VALIDATE=false` in `.env` (persistent) or pass `--skip-validation` for a single run. To disable **all** auditing at once — both this and the regex scan — use `--no-audit`.
@@ -186,7 +184,7 @@ Singapore ships as the worked reference ([`sg.py`](ingest/redaction/sg.py): nati
Regex can't catch everything (names, context-dependent secrets). With `--llm-redact`, an LLM additionally flags such spans into the **same report and the same `--redact` step** — it points at verbatim spans, never rewriting your text. To protect your data it **prefers a local endpoint**: set `LLM_API_BASE_URL` to a local OpenAI-compatible server; without one it refuses to use a hosted API unless you pass `--allow-cloud-redaction`.
```bash
-LLM_API_BASE_URL=http://localhost:11434/v1 LLM_MODEL=qwen2.5 \
+LLM_API_BASE_URL=http://localhost:8000/v1 LLM_MODEL=Qwen/Qwen2.5-7B-Instruct \
python -m ingest --source telegram --llm-redact --redact replace
```
@@ -300,7 +298,7 @@ The pre-refactor, Windows-only workflow (which cloned LLaMA-Factory at HEAD) is
Doppelganger is as much a **learning sandbox** as a tool: the aim is to explore the *full* AI toolbox for capturing how a person communicates, and to find what actually moves the needle on *"does this sound like me?"*. Today that's LoRA fine-tuning — everything below is exploratory (see the issue tracker for the live backlog).
-**Shaping the model** — pre-training · fine-tuning (today) · alignment/DPO · continual learning · synthetic data / self-instruct · multi-LoRA personas & merging · distillation to on-device · PEFT comparison
+**Shaping the model** — pre-training · fine-tuning · alignment/DPO · continual learning · synthetic data / self-instruct · multi-LoRA personas & merging · distillation to on-device · PEFT comparison
**Giving it context & memory** — RAG · long-term memory + reflection · relationship/knowledge graph · style embeddings / user-conditioning · persona-prompt quiz · MCP
diff --git a/example.env b/example.env
index 22f142a..3caa359 100644
--- a/example.env
+++ b/example.env
@@ -2,27 +2,18 @@
# Every value here is OPTIONAL; with none set, ingestion still runs (the LLM
# features just stay off).
-# ── Optional LLM features ─────────────────────────────────────────────────────
-# Used by the conversation quality auditor and the optional LLM redaction pass.
-# Both speak the OpenAI-compatible API, so they work with OpenAI or any local
-# server (Ollama, vLLM, LM Studio, llama.cpp). Running a LOCAL model keeps your
-# chat data on your machine — the recommended setup for private data.
-
-# Enable/disable the quality auditor. Default: enabled when LLM_API_KEY or
-# LLM_API_BASE_URL is set. Set to false to skip it entirely (no API calls).
-LLM_VALIDATE=true
-
-# Model id. For a local server use whatever it serves (e.g. qwen2.5, llama3.1).
-LLM_MODEL=gpt-4o-mini
-
-# API key. Required for hosted APIs; local servers usually accept any value.
-LLM_API_KEY=your_api_key_here
-
-# OpenAI-compatible endpoint. Set this to use a local model, e.g.
-# http://localhost:11434/v1 (Ollama)
-# http://localhost:8000/v1 (vLLM)
-# Leave unset to use OpenAI's hosted API.
-# LLM_API_BASE_URL=http://localhost:11434/v1
+# ── Optional LLM features (quality auditor + LLM redaction) ───────────────────
+# The CORE pipeline (parse -> dataset + regex sensitive-data scan) needs NONE of
+# this and runs with no setup. Uncomment below to ALSO enable the LLM auditor /
+# redaction.
+#
+# Run a LOCAL OpenAI-compatible server so your chat data never leaves your machine
+# (vLLM, LM Studio, llama.cpp). Serve an open model, then uncomment:
+#
+# LLM_VALIDATE=true
+# LLM_API_BASE_URL=http://localhost:8000/v1 # vLLM (LM Studio uses :1234/v1)
+# LLM_MODEL=Qwen/Qwen2.5-7B-Instruct # the model your server serves
+# LLM_API_KEY=local # local servers accept any value
# ── Optional: Hugging Face ────────────────────────────────────────────────────
# Only needed to download GATED models during training (e.g. Gemma). The default
diff --git a/ingest/llm.py b/ingest/llm.py
index b862acf..b0af03b 100644
--- a/ingest/llm.py
+++ b/ingest/llm.py
@@ -1,22 +1,21 @@
"""Shared OpenAI-compatible LLM client.
One client for every optional LLM feature (quality validation, LLM redaction).
-It speaks the OpenAI Chat Completions API, so it works against OpenAI itself
-*and* any local/self-hosted server that exposes that API — Ollama, vLLM, LM
-Studio, llama.cpp's server, LiteLLM, etc. Running a local endpoint is the
-privacy-preserving way to use these features, since your chat text never leaves
-your machine.
+It speaks the OpenAI Chat Completions API, which is the de-facto standard that
+local/self-hosted servers also expose — vLLM, LM Studio, llama.cpp's server,
+Ollama, LiteLLM, etc. For privacy, run a LOCAL endpoint so your chat text never
+leaves your machine; that is the intended setup for this project.
Environment variables:
LLM_VALIDATE true/false. Default: enabled when LLM_API_KEY or
LLM_API_BASE_URL is set, disabled otherwise.
- LLM_API_BASE_URL OpenAI-compatible base URL. Set this for a local model, e.g.
- http://localhost:11434/v1 (Ollama) or http://localhost:8000/v1
- (vLLM). Unset → OpenAI's hosted API.
- LLM_MODEL Model id (default: gpt-4o-mini). For a local server use whatever
- it serves, e.g. "qwen2.5" or "llama3.1".
- LLM_API_KEY API key. Local servers usually accept any value; falls back to
- OPENAI_API_KEY if unset.
+ LLM_API_BASE_URL Base URL of your local OpenAI-compatible server, e.g.
+ http://localhost:8000/v1 (vLLM) or http://localhost:1234/v1
+ (LM Studio).
+ LLM_MODEL Model id your server serves — required to use the LLM features
+ (no default). Use the HF repo id, as vLLM / LM Studio do
+ (e.g. "Qwen/Qwen2.5-7B-Instruct").
+ LLM_API_KEY API key. Local servers usually accept any value.
"""
import os
@@ -25,7 +24,6 @@
MODEL_ENV = "LLM_MODEL"
BASE_URL_ENV = "LLM_API_BASE_URL"
API_KEY_ENV = "LLM_API_KEY"
-DEFAULT_MODEL = "gpt-4o-mini"
def base_url() -> str:
@@ -33,7 +31,8 @@ def base_url() -> str:
def model() -> str:
- return os.environ.get(MODEL_ENV, "").strip() or DEFAULT_MODEL
+ """The configured model id, or empty string if unset (no default)."""
+ return os.environ.get(MODEL_ENV, "").strip()
def is_local() -> bool:
@@ -67,6 +66,11 @@ def get_client():
"The 'openai' package is required for LLM features. "
"Install it with: pip install openai"
)
+ if not model():
+ raise EnvironmentError(
+ f"{MODEL_ENV} is not set. Set it to the model your local server serves "
+ f"(e.g. Qwen/Qwen2.5-7B-Instruct), or set {VALIDATE_ENV}=false."
+ )
url = base_url()
key = _api_key()
if not key:
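
Reviewer note: the "no default model" rule can be exercised in isolation with a small resolver like the one below. It is a sketch under stated assumptions: `resolve_llm_settings` is a hypothetical helper, not a function in `ingest/llm.py`, and it takes an optional mapping so it can be checked without touching the real environment. Only the three variable names come from the patch.

```python
import os


def resolve_llm_settings(env=None):
    """Read LLM settings from an environment mapping; the model has no default."""
    env = os.environ if env is None else env
    model = env.get("LLM_MODEL", "").strip()
    if not model:
        # Mirrors the patch: fail with a clear message instead of silently
        # falling back to a hosted model id.
        raise EnvironmentError(
            "LLM_MODEL is not set. Set it to the model your local server "
            "serves, or set LLM_VALIDATE=false."
        )
    return {
        "base_url": env.get("LLM_API_BASE_URL", "").strip(),
        "model": model,
        "api_key": env.get("LLM_API_KEY", "").strip(),
    }
```

The returned values would typically be passed to an OpenAI-compatible client, for example `OpenAI(base_url=..., api_key=...)` from the `openai` package, with `model` supplied per request.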
From bf75479ec760b811f35019f02621ffde8a78333f Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 23:31:55 +0800
Subject: [PATCH 10/15] docs: add local LLM (LM Studio) setup + model
suggestions by hardware
Co-Authored-By: Claude Opus 4.8
---
README.md | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/README.md b/README.md
index 33e1067..ac24575 100644
--- a/README.md
+++ b/README.md
@@ -142,6 +142,32 @@ Each extracted conversation can be scored for **coherence, quality, and pairing*
To turn it off, set `LLM_VALIDATE=false` in `.env` (persistent) or pass `--skip-validation` for a single run. To disable **all** auditing at once — both this and the regex scan — use `--no-audit`.
+### Running a local LLM (recommended: LM Studio)
+
+The LLM features are designed to run against a **local** model so your chat data never leaves your machine. [LM Studio](https://lmstudio.ai) is the easiest way to get one running with a click-through UI:
+
+1. Install **LM Studio** and use its search to download a model (see the table below).
+2. Open the **Developer** tab → **Start Server**. It serves an OpenAI-compatible API at `http://localhost:1234/v1`.
+3. In `.env`, set:
+ ```dotenv
+ LLM_VALIDATE=true
+ LLM_API_BASE_URL=http://localhost:1234/v1
+ LLM_MODEL=
+ LLM_API_KEY=local
+ ```
+
+(Prefer the CLI? **vLLM** serves the same API at `http://localhost:8000/v1` with `--model Qwen/Qwen2.5-7B-Instruct`. **Ollama** also works at `http://localhost:11434/v1`.)
+
+**Which model?** The auditor/redactor just needs solid instruction-following and JSON output — a small model is plenty. Pick by your hardware (GGUF quants in LM Studio shrink the footprint):
+
+| Your hardware | Suggested model | Notes |
+|---------------|-----------------|-------|
+| CPU-only, or ≤8 GB VRAM / 16 GB RAM | **Qwen2.5-3B-Instruct** (Q4) | Fast and light; fine for scoring + PII spans |
+| 8–16 GB VRAM | **Qwen2.5-7B-Instruct** (Q4/Q5) | Recommended balance of quality and speed |
+| 24 GB+ VRAM | **Qwen2.5-14B-Instruct** | Best judgment on tricky/ambiguous cases |
+
+Tiny machine? **Qwen2.5-1.5B-Instruct** or **Llama-3.2-3B-Instruct** also work, with slightly noisier results. Any OpenAI-compatible model will do — these are just sensible starting points.
+
## Privacy & Sensitive Data
Fine-tuning on real chat history may unintentionally encode personal identifiers, confidential conversations, or sensitive content.
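
Reviewer note: to find the exact id to put in `LLM_MODEL`, one option is to ask the running server which models it serves. The sketch below assumes the server exposes the OpenAI-style model listing at `<base URL>/models` and returns a JSON body with a `data` list of objects carrying an `id`; that matches the OpenAI API shape, but verify it against your server. The helper names are hypothetical and only the standard library is used.

```python
import json
import urllib.request


def models_url(base_url):
    """Build the model-listing URL from an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload):
    """Extract model ids from a /v1/models style JSON response body."""
    return [entry["id"] for entry in json.loads(payload).get("data", [])]


def list_served_models(base_url, timeout=5):
    """Ask the server which model ids it serves (performs a network call)."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))
```

For example, `list_served_models("http://localhost:1234/v1")` against a running LM Studio server should return the ids it has loaded; if the call raises a connection error, the server is not listening on that port.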
From b395054f77a4e0311279d05fba94bc28646cf072 Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Tue, 23 Jun 2026 23:45:10 +0800
Subject: [PATCH 11/15] fix: SG postal detector no longer matches the leading
digits of an NRIC
A trailing negative lookahead stops sg_postal matching the first 6 digits of a
longer token (e.g. NRIC S1234567D reading as S123456). Adds a regression test.
Co-Authored-By: Claude Opus 4.8
---
ingest/redaction/sg.py | 6 ++++--
tests/test_redaction.py | 6 ++++++
2 files changed, 10 insertions(+), 2 deletions(-)
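
Reviewer note: the effect of the trailing lookahead can be checked in isolation. The sketch below copies the old and new patterns from this patch into a standalone script; the `hits` helper and the `OLD`/`NEW` names are illustrative only and are not part of `ingest/redaction/sg.py`, which registers the pattern through `make(...)`.

```python
import re

# Pattern before this patch: any "S" + 6 digits matches, even inside a longer token.
OLD = re.compile(r"(?:[Ss]ingapore\s+|\bS\(?)\d{6}\)?")
# Pattern after this patch: the match may not be followed by another digit or letter.
NEW = re.compile(r"(?:[Ss]ingapore\s+|\bS\(?)\d{6}\)?(?![\dA-Za-z])")


def hits(pattern, text):
    """Return every substring of `text` matched by `pattern` (hypothetical helper)."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    for sample in ("Singapore 560123", "address S(123456)", "ic S1234567D"):
        print(f"{sample!r}: old={hits(OLD, sample)} new={hits(NEW, sample)}")
```

Only the NRIC-shaped sample should differ between the two patterns: the old one reports its first six digits as a postal code, the new one reports nothing.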
diff --git a/ingest/redaction/sg.py b/ingest/redaction/sg.py
index b78a391..6b33b72 100644
--- a/ingest/redaction/sg.py
+++ b/ingest/redaction/sg.py
@@ -82,11 +82,13 @@ def nric_valid(value: str) -> bool:
)
# Postal code is 6 bare digits — far too noisy alone, so require an explicit
-# "Singapore ", "S123456", or "S(123456)" context to keep precision up.
+# "Singapore ", "S123456", or "S(123456)" context. The trailing lookahead
+# stops it matching the first 6 digits of a longer token (e.g. the NRIC
+# "S1234567D", which would otherwise read as "S123456").
make(
"sg_postal",
"POSTAL_CODE",
"SG",
- r"(?:[Ss]ingapore\s+|\bS\(?)\d{6}\)?",
+ r"(?:[Ss]ingapore\s+|\bS\(?)\d{6}\)?(?![\dA-Za-z])",
severity="low",
)
diff --git a/tests/test_redaction.py b/tests/test_redaction.py
index 7630df8..f090276 100644
--- a/tests/test_redaction.py
+++ b/tests/test_redaction.py
@@ -66,6 +66,12 @@ def test_phone(self):
self.assertIn("PHONE", _categories("call 9123 4567", ["SG"]))
self.assertIn("PHONE", _categories("call +65 9123 4567", ["SG"]))
+ def test_postal_requires_context_and_not_nric(self):
+ self.assertIn("POSTAL_CODE", _categories("Singapore 560123", ["SG"]))
+ self.assertIn("POSTAL_CODE", _categories("address S123456", ["SG"]))
+ # Must NOT fire on the leading 6 digits of an NRIC.
+ self.assertNotIn("POSTAL_CODE", _categories("ic S1234567D", ["SG"]))
+
def test_locale_filtering(self):
# SG detectors don't run when only universal locale is requested.
self.assertNotIn("NRIC/FIN", _categories("ic S0000001I", []))
From 0eba0e8a7187279a41b798c3de83797d6033ffce Mon Sep 17 00:00:00 2001
From: NotYuSheng
Date: Wed, 24 Jun 2026 00:33:05 +0800
Subject: [PATCH 12/15] feat: ASCII parrot+wordmark startup banner and demo GIF
- ingest/banner.py: a parrot-in-a-mirror mascot (it mimics your voice; the
mirror is the doppelganger) beside an ansi_shadow "Doppel/ganger" wordmark in
truecolor amber, printed at CLI startup. DOPPELGANGER_NO_BANNER silences it.
- README: embed demo/demo.gif (ingest + sensitive-data scan) at the top.
- demo/: synthetic sample_export.json (gitignored exception) + the mascot source
image and the build/convert scripts used to generate the art and GIF.
Co-Authored-By: Claude Opus 4.8
---
.gitignore | 4 ++
README.md | 4 ++
demo/build_final.py | 99 ++++++++++++++++++++++++++++++++++++++++
demo/demo.gif | Bin 0 -> 239193 bytes
demo/img2ascii.py | 37 +++++++++++++++
demo/mascot.txt | 17 +++++++
demo/parrot-mirror.jpg | Bin 0 -> 11862 bytes
demo/sample_export.json | 22 +++++++++
ingest/banner.py | 37 +++++++++++++++
ingest/cli.py | 3 ++
10 files changed, 223 insertions(+)
create mode 100644 demo/build_final.py
create mode 100644 demo/demo.gif
create mode 100644 demo/img2ascii.py
create mode 100644 demo/mascot.txt
create mode 100644 demo/parrot-mirror.jpg
create mode 100644 demo/sample_export.json
create mode 100644 ingest/banner.py
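
Reviewer note: two steps in `demo/build_final.py` are easy to miss in the diff. The wordmark is centred vertically beside the mascot, and the GIF is rendered from a synthesised asciicast v2 recording rather than a captured terminal session. The sketch below reproduces both steps without `pyfiglet` or `agg`; the function names `side_by_side` and `typed_cast` are hypothetical, and the sketch assumes the left block is at least as tall as the right one.

```python
import json


def side_by_side(left, right, gap="  "):
    """Place `right` beside `left`, vertically centred on the taller `left` block."""
    width = max(len(line) for line in left)
    top = (len(left) - len(right)) // 2  # rows of padding above `right`
    out = []
    for i, l_line in enumerate(left):
        j = i - top
        r_line = right[j] if 0 <= j < len(right) else ""
        out.append((l_line.ljust(width) + gap + r_line).rstrip())
    return out


def typed_cast(command, output_lines, width=94, height=34):
    """Return asciicast v2 text: a JSON header line, then one [time, "o", data] event per line."""
    events, t = [], 0.0

    def emit(data, dt):
        nonlocal t
        t += dt
        events.append([round(t, 3), "o", data])

    emit("$ ", 0.3)
    for ch in command:  # one event per keystroke produces the typing effect
        emit(ch, 0.026)
    emit("\r\n", 0.5)
    for line in output_lines:
        emit(line + "\r\n", 0.05)

    header = {"version": 2, "width": width, "height": height}
    return "\n".join(json.dumps(x, ensure_ascii=False) for x in [header, *events]) + "\n"


if __name__ == "__main__":
    print("\n".join(side_by_side(["aaa", "b", "cc", "d", "e"], ["XY"])))
    print(typed_cast("ls", ["ok"]), end="")
```

Writing the events by hand keeps the demo deterministic: the timing is fixed in the script, so regenerating the GIF does not depend on how fast the real command runs.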
diff --git a/.gitignore b/.gitignore
index 2e45380..f93c2fe 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,6 +10,10 @@ ChatExport*/
# Tracked project config (re-include despite the broad *.json rule above)
!configs/dataset_info.json
+# Synthetic demo input — safe to commit and needed for the reproducible demo.
+# (Generated demo outputs like demo/sample_sharegpt.json stay ignored via *.json.)
+!demo/sample_export.json
+
# Personal training/export overrides — copy a tracked config to *.local.yaml and
# edit that for your own model/hardware; it stays out of git.
configs/*.local.yaml
diff --git a/README.md b/README.md
index ac24575..17bbc15 100644
--- a/README.md
+++ b/README.md
@@ -20,6 +20,10 @@
+
+
+
+
---
Doppelganger fine-tunes large language models (like Qwen) on your own chat conversations, capturing how *you* write. Built on top of [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), it turns a raw chat export into a [ShareGPT](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md)-formatted dataset for supervised fine-tuning (SFT), then trains a LoRA adapter on it.
diff --git a/demo/build_final.py b/demo/build_final.py
new file mode 100644
index 0000000..9fe74af
--- /dev/null
+++ b/demo/build_final.py
@@ -0,0 +1,99 @@
+#!/usr/bin/env python3
+"""Build the final banner (ingest/banner.py) and demo GIF (demo/demo.gif).
+
+Layout: parrot (left) + ansi_shadow "Doppel"/"ganger" (right), amber wordmark,
+tagline centered beneath. Renders the GIF with agg's dracula theme.
+"""
+import json
+import os
+import subprocess
+
+import pyfiglet
+
+ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+PY = os.path.join(ROOT, "venv", "bin", "python")
+AGG = "/tmp/agg"
+GAP = " "
+TAG = "fine-tune an LLM to write like you"
+CMD = "python -m ingest --source telegram --input demo/sample_export.json"
+AMBER, RESET = "\x1b[1;38;2;242;176;76m", "\x1b[0m"
+
+parrot = open(os.path.join(ROOT, "demo/mascot.txt"), encoding="utf-8").read().rstrip("\n").split("\n")
+PW = max(len(l) for l in parrot)
+
+
+def _fig(t):
+    ls = [l.rstrip() for l in pyfiglet.figlet_format(t, font="ansi_shadow", width=200).rstrip("\n").split("\n")]
+    while ls and not ls[-1].strip(): ls.pop()
+    while ls and not ls[0].strip(): ls.pop(0)
+    return ls
+
+
+word = _fig("Doppel") + _fig("ganger")
+TOP = (len(parrot) - len(word)) // 2
+TOTAL_W = PW + len(GAP) + max(len(l) for l in word)
+
+
+def rows(on, off):
+    r = []
+    for i, pl in enumerate(parrot):
+        wl = word[i - TOP] if 0 <= i - TOP < len(word) else ""
+        wl = f"{on}{wl}{off}" if wl else ""
+        r.append((pl.ljust(PW) + GAP + wl).rstrip())
+    r.append("")
+    r.append(TAG.center(TOTAL_W).rstrip())  # tagline centred under the whole logo
+    return r
+
+
+def write_banner_module():
+    body = "\n".join(rows("[[AMBER]]", "[[RESET]]"))  # sentinels (token names assumed); colourised at runtime
+    mod = (
+        '"""ASCII startup banner: a parrot in a mirror (it mimics your voice; the\n'
+        'mirror is the doppelganger) beside the wordmark. The wordmark is amber via\n'
+        'truecolor ANSI. Regenerate via demo/build_final.py.\n'
+        'Set DOPPELGANGER_NO_BANNER=1 to silence it."""\n\n'
+        'import os\n\n'
+        '_AMBER = "\\x1b[1;38;2;242;176;76m"  # truecolor amber\n'
+        '_RESET = "\\x1b[0m"\n\n'
+        '_BANNER = r"""\n' + body + '\n"""\n\n\n'
+        'def print_banner() -> None:\n'
+        '    if os.environ.get("DOPPELGANGER_NO_BANNER"):\n'
+        '        return\n'
+        '    print(_BANNER.replace("[[AMBER]]", _AMBER).replace("[[RESET]]", _RESET) + "\\n")\n'
+    )
+    open(os.path.join(ROOT, "ingest/banner.py"), "w", encoding="utf-8").write(mod)
+
+
+def render_gif():
+    env = dict(os.environ, LLM_VALIDATE="false", DOPPELGANGER_NO_BANNER="1")
+    out = subprocess.run([PY] + CMD.split()[1:], cwd=ROOT, env=env, capture_output=True, text=True)
+    report = ((out.stdout or "") + (out.stderr or "")).split("\n")
+
+    events, t = [], 0.0
+    def emit(d, dt):
+        nonlocal t
+        t += dt
+        events.append([round(t, 3), "o", d])
+    emit("\x1b[32m$\x1b[0m ", 0.3)
+    for ch in CMD:
+        emit(ch, 0.026)
+    emit("\r\n", 0.5)
+    for line in rows(AMBER, RESET) + report:
+        emit(line + "\r\n", 0.05)
+    emit("\x1b[32m$\x1b[0m ", 1.6)
+
+    cast = os.path.join(ROOT, "demo/demo.cast")
+    with open(cast, "w", encoding="utf-8") as f:
+        f.write(json.dumps({"version": 2, "width": 94, "height": 34}) + "\n")
+        for ev in events:
+            f.write(json.dumps(ev, ensure_ascii=False) + "\n")
+    subprocess.run([AGG, "--font-size", "18", "--theme", "dracula", cast,
+                    os.path.join(ROOT, "demo/demo.gif")],
+                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+
+
+if __name__ == "__main__":
+    write_banner_module()
+    render_gif()
+    print("=== layout preview ===")
+    print("\n".join(rows("", "")))
diff --git a/demo/demo.gif b/demo/demo.gif
new file mode 100644
index 0000000000000000000000000000000000000000..799a2183181b1e1e39d51cd934fc54545aa42d6f
GIT binary patch
literal 239193
[base85-encoded payload for demo/demo.gif omitted]
zG~W#TBOBPV9~Dk#$(7Xx5@Og;h?1#zY1KxBN+J`MyC}YU?MoTiRAF-!dyz26}Z@)*u8+~A67x>&wt2;BV(oeyjS1Dw!
zu{SU`QjjOGgBVU{-Q`u05G`uqKpQ9vsb={Lw)k?c$jC@)VJzU9ztsCE5O<;Ty&rq+
zH^^gVHN1(XiLJW6Y$=P_!Nu*Tm2q&%MTC^*p4v^r?C|fEF}*xT{Ss2CL$4IDGXCPY
zA~JF8I5W6>-2kkNk%XN>@mB8VZ(44L869^DHGbaTw1SY!+;az4!0>L{U}eiZL)*2%
z=x*ClLdv`-dA*?(Z#%F@%6xvBtsyzzb`l{of99cEM@zizq7tR>R~%Z$YP{{fRb>t`
zlHR~wxJ6}YDG&KFv_Zf_+{=et5$3z7M}&UYCoaq49bB|Y!E@J-&%hLwE4@Y28r!ct
zQV~;os6l6YH>iVL$&60RGL`A}yFf+8x9x60!8B&?FLkGt>c{k!PQu!sjXov6N
zZd4ieUxINGK&Je?275O6|1}SPw)}sj37})1$I^gCdA^PM6K?-^TI6rN@+V0Ddo}$m
zP5$=ye~kbC9hm|gLk$}WEj%6Ti}4(6bEFh4U8K6{nV5Q5IeGh8eiQc(c2*S$5B&2=
zi-zAngXs`6-<*68;1?KTDem5`6=G=_S*3;5#dXOAAM;wPB0IXjD%50#izNN?;>3qC
zU}X%9rn7t6dsZFAX6F|{(=xldotq0m4Ks7cOC2j~w$V3N?~b4bg};NtKfO31;A|Im
znOS=h(&Fpc)6be~DIr2LjkSHAX_fKvQ#sLkCfUUID7QfpMC{X5zi-h<$hb=V6XZiO
zj4)kUb7=e{kIA&ozaZNtb7V6F>JjPpFDNFnB@zi-_t~hwC%p5}Yo@8%pH7Tpak_ku
zs4MaQ)u6>Db_6|Ex#`Th1W)K8W2I3j{XpXGW|sj%2oq<<(>{y^0eA}=n}4k{k$_Z-
zF0EW26h)7hG8A`eIFwwgF>;8*V$z!`7mD26z+yUCq}c0obpLv1KDM(0uMrAgtHqVP
zr#a564x`TgOK){x4ZW3hS4yL|C(An>Py#U}LEe33&h*?TUzlL^&ldEnQh_RRZZ2Qu
zzfCS!j{f?2f7h^Pa)>mMez|>y$RW)QQIbjcX<({bTL|)(iYOoQ^qLCSBVzfIH?|?L
zl=i}}dt9V5;Ce}HgMn!r?vG7wJi3N!wr
z>D0Rf^OC(B1zVr^m58S1ya7m*%RNfSS52%2k)jRc9jR_u9h4dF1Y*YNws1RouRG`h
zjMHqmjQ3LGSU9Ncd`X-lI!BV{7WkgRZU(fY2tjaA9sc|T(ltmjf&w0c&>j~&2d06E%RHgOF7t!D-EYK>u_bQA$
zlzd!qxz0|vY&>0;TxmS-d7#z%!bd_~6QwNK-U|A5T2vDb%_%2J20|>$fPbX5s&rtu
zF|Nsx9%F)fp-7wDjbwC4-Tql4vaExaY6#q$GXj`A)0NT{rQ}N!wGW{=5Ha_m9T%R7
zgZc1RLI(UYzsQn?65+~g9y#L9Fi6bv=x3cX)Xx&Rw~Rh8nfo`YDVYNjZN!B
zh{x)8UNbL^3H?ntgLY{_JpFEq73s`cusWIR?{>i+H`p5J9tP67>dX!pzNz^X$$A5G
zx0o8-U`EN2ZwU+)q6teJ<8`2%FV(xH<$+I^6#NuqI(KDvsSnd@JnJCwan`)Wr8CfovEDBBqwup8t>K9J
z2@wDb?EZ4SW5d2isv|kSK9UTycOCU~^a^nP&K=uj8Qw$D<+7S;8u%Msb+5S}M?zwq
zGq)CJbELb<;I|snerCZUUc6`46XqF5M(t-@{lFBUg`p4IdD~5rGVpbfhZGIA57PNH
z^qb~@)f+y?68RW3&lj7~#qa7F?7c%jpWAMdHj4KNE}wo=1G?%p-)>P$)PSle1@f;r
zO}Z56UpFQ;Q90`&hmAC>%}**RueBVSi4H|c4i+{}`5ir#N+y3Gh;(9gs3hi**>hLQ5o?rm
zZ_L%%aRSpDCwWpp&zlCRK%S#(LZZPgmnqpKSWi;%^8iM^)KPv|+C4t?-YC3Je}OU*
z>-Q%k|8xx1yI)V&7#xzrt4gSD2;6d^yY
z!GL3^!RZd0=-vQiWH<`V=>?uiXE&fN9Ol2RC~trF_aO)
z^vncs3^iPAiD0y_bbnpjhIB08M7y|-dsEl<3er0AS*D1qGT7w;$56}44bx9D<>14i
z%?)r2b)wt+bPRRVWI@i)==3C03Y40UgG*g@5cqE=o@B~N$YJ5iFF>Y%tuea8NUwq`
z3bsK5GUcT-6eKGkQ{2kFkxH+@KXUv_>d*6UMFEge07(T9Spe$bPe~5wg6B-*k2Lsu
zBK~hh!5{bo+5g|7;6EKs{heC;U9AH$;ki!#zat7(_n$AMtI(bg|VBpG6uSj%?eAx(u*;L%97BdQt5lf6MeHEq?iTVk
zS31ZEVc-XEg30(}DC7pX^2l+q6UA~AfVX`*P#7>J&9v$T>GB+5{i{7s+k51_$1JvLR@eJg2|8*r`Esy|bda$Gp*Y|J8oIXa%{?
zzUl!DKdWQ06(6hm8LhyJ;H9^sSt5*z}a
zKNd$TGeDh_qbymNPgo?Mk6<*tiIg3`t!4&MJAE&apOkP)0D95#q2KlW7|a?5oFtxi
zwVZ09#aDRtMtV=go}1#@#Q-?kEX{q=$34BLrFmoCYWneK?h$AHCa^a3p0Ly&+&!<|
zn0lI4rs@Xu>x2z%LU@oa!4Edq%LHR}M
z+!MLP#X9qmPm;7~JB+D)m?PHxVubTxh;=k2h35p3C-UAF92io*mLYyUPcBFb5zCI)#9}^>wW}kugyPe`znw5wi
zwvaA1cD80_PNv&c{8CD@(fcHluEzxhb?Z!l+mg2YB7skJv+$Ff`Uy|5!&x8R@h&Wy
zI7slmj
zPeNX0a7MlITjLjR*DAn!^P&ZVxrhabKH?0#Gh_+%C^TTBi$^w~%7$(^w)!bJ8K>ck
zMuf}k;Oe1{Lc89%^E1zjnuplz9d>b;M7vL+BRFERd?%I@s&9XL#$XS&NZgd1A+k%_
zSi?03$;g^tyIDp;R9%m6-J>=9XRc(7o4#JYoU@>j1)^6lR9)-?<#^weNYNm28aMI>
z#OLcWG0y$i->RKQ#3gMKFF-VsS6I763~9bPgnG@-U>VgVrH9Yq7%cC!XEGD0O#vLk
zoYT`dD;lS&_w>FgwUUcHv0b5G%Z<^-RaDrGB>zcJUshXO5kg54O3MQ$rkJ3h6VheK
zTE8YK1Pd2X9B+W>*hgsd`7n?b5(QJ+Pd0ih%vaMq9(zTC!InWWQNuQe1a7ZTPNp(p
z8iRoKB7fAQ!ZaLZF9)6K1LPpnpx1g*2fT=($a2*M`+iqC7Gp!FlapQ01>&AtiN72h
z*7vk*IuoKZDj8*XrR>LCVFUKPF}8?_s3_)W6`J_neq+A!2vk56RM!SQRU&|Sm-z$_
z$|QN8pYc6kF_NY)L>jX|08nPLSoxk@<$h`-;99)5TaUyed;0
z@_;C)N*9kNJC{$`b|4kLwnuTkGSdpYRk_O#U=o)!)^ojAsh_?o5Az<^#L4IB!itIX
zct!i+t)ptQp>n0$u5c*Upv(k!c{zV@5ik*-99)0R8}o`z+u&j9&1GawHIjw7ZlnLW
z+h>xV08#LiT_}6GqYte@^W3&0YB70H
z0-``Yt_iKB%!gxW4Oxi2lK{EgPfS{`#-px_O13;eWhfZE^0u2Pq&%o`bscBnwuiID
zI@lJ-E*@q(d483L`ATm>z|;fTMMcDd^CmfvUC7s1MrH!pg?fF7VhfO63>i=x)<0zz
zamklk%s_Twyet|&31kC^t2)ak<2(!2YOAK*^T6Z_h!x~DOy={?V{
zWU??*g)P)NKJ+zr&Sn#dh29p(7ddPbG?I}@oS*lWCl3VpE`l-TLFv~Rph6=$Sao@f
zJVK)pZZFPV>|mxsxFc`DLFKORCQHC7$2)(1lnwrFr2uP1MjSkj>n
zEUM$3oDV;|D2qOzL81kd(;QuI-=arL!pA{aHHZO&Wju-}KATn!6B;l5`u1X{=qJaD
z<>(oujtwO8EuJMbckZeyjNlhI$GtgQf;V9%D(KF{j+BoRc3%FzXnvRpFQL52kriQl
zDJ|mGWZ>ZxR)W~3jW>el(ql6iUmAKPYvD>>uWQpmPm}VmZ5fk=aUUC}+mmsBc!f@2
zc|&Tzyq>oeqw$#wF5q1ZF}9B@wf{(%IIiBdZxA^}hMz9q9b}yKfP_K#D@^~$57v%*
z((0QYYTCWD)ryG<{87qYrTt#PkqIDZ*hw0H!XYfn-2E(9vsZqWn^)fAh
zB#tu`j51}Z&vJHH@78pdcQZBp_-PxqATKk=LpVRzhe;@}ux9Rn04K?m27ymYq7d`s
z{bw9cH@-=2cjS32VmCQ{aqZgb*KtJE%OSM7^$2_!Z=QE|3(cw0%~0ysqp
zyZiA5$Fzl&0@F_X-t{b8jIW^LtQm3c2$Fg_48!IaWdvn;U${(4s#>tlaE~*i9|(Gs
z^sQN%+0^Vb3|bB#ru--wr0smzdQBs*!wQ8Qcqa?TvCt$qN`=j1)~4d1nyvYICFmf5
z6g;eInCoSY?I>-VmF<{}=ufLr_Ew^^F?wZ0c5vo_y0basTulqGSNyBj3u-V{p|GF6
zowhGnrk>gtVgw>F@rXT4vM;019WhFN+jw=eX!(t(bLrE;lKG;6zWWyh!O-NLFax~w
z>m6WxpwJQZLjM#SCAPe_#kDhFyVd3G0puPzmh_|Z-p4pEZrs9+6sJyJg5!+twL=M$
z+ziDI6AUGsg0Aeoui;pik%kR1S5uz`a&x{jp8UKXm-Fm^+>CwOzRnfF?*_jeE7!T%
zUaMtSgZD*%L8>WNU?FKI+-PptPH(kQYL15o`&+O
z_|Y$mZ@IBr7{WXe6$l=7%S1X}eBm_TJnb>$JtRg$bC(pnxQ~L|w^(zI35ZfJ$!*>+
z_!M25ltR-8)1n%2j`ImIkzi|AoEuFdGEo|hklK++3Ku2Ppz0%L#(S?@@v}&WTkJ;t
z=@M{M1q4a`IkaL4K`I+t4e}q`=#BmY8m7Z>y)N(IW9I~QC?W`Vy_U+y_QWQ~{k0;4
zvKX1M1cC)v6HwoEpbUN(Hqs+bTwBZ$L>L(JVo^#&OIqi;!0Bym-Vd?yBI6JoLqc2Uz!pg
zlrEZ{!cAKsW#6FaLQ|c?Vewv$CwHPi?X~0M#(L=K7<}k;VBx*9QIaQDPp`{sL`@m+
z_oYB)p-fwdZGox!UTM6zr`<}8K1bDGQo26FUQ&EENC&17rxy1=GK*OT)a&wK-~pGX
z3pU^Z7fA;?U1lJ&czVDEi9VK%M|B*yV9R^TECQh>;A4J2-~v=D-3>Km)j($P1$e;4
z@~{T{^yvW?AhW1NYkTt_@&vZuwd8kZk-+|(Sqy5{=>2}c#i2K`<|N_ug51CwcN&rS%t5nLCWR1_=GBT4DNM+`V}`)cwEz|DG{pW~^i1
z8M23vLfbVnc1lAOmF%J}#qag;q*sDItlXXr;~f+2!-S&UKyh
zJHK;o-+#Wh^ZmbWacf@ly5C>V<#Cv|`cFK`@cb8^*msFSX+?hvh3EM1(n?+CmetbA
zAV~-CWK#4G>P;=clY`r_;*b8slYh>^`~5qLSRD!YtNyY2)B#WdwD9-W4ybmlK6L;y
zK;z?IoB(L?ccNJBYXp`6W`I#T*zfk&dj{|Wpu*}K#$VI{m;sF&usII4#liBp)xD{(
zLhirw$PwvpuJ@lbX#8U&;Khq#>D4vGzuE>61Boze2nDi_qQ{<&cZrL!OSVi&h;~i$
z$=a0@VUb5G%(p3Iok~3CR2o@!%%teg{q+7^c!Yp!XliM*ztVcGneB4@s@+Y?-p&r!
z+dlVq_18S44L0<3-7{y=%?=G2j2&NK3b{5
z8k^0J2}GiEG;iGGGGd^Glg)$0DFr6VeDdgQ2iBxuvdDVVMg?016GTbBPy^=RZ=`7M
zx^>E@zQpRR`jOlFEsiT&ubkHo`6Vnl=yipWPn0yy_8;03y~O&KHWKY4dchp4d2jD!
z(jMjAH^RfS&OVM@!lnHAbJOiP4zGdTe7s+aUD#II&cnXT4aZ)wBsPD|ZwvX{_2`n>!Q`Lqf$yIk|62BO
z^NEJV>YzmXGuWW5pI-WWK6&0u|HzBQkOq}Iho3Pg|D1zgnvXRL+4;N$8)2gsoXa3o
zd2vS8`eI$9Q8v9dkNzT1R>osVG2Q~2jp`so$L=
zO)pMLX|+D>GWTm8<3v^tS5GGAz5C;OW1?2`TestP+Fq31xpZy#+Qow%Bgs;=SI{35
zFF_l{TLm`91FMA_%c2?qZ5PK9-(UOm#?Yrre3$QGfz~(ehYi`1-w!t^spNgAy|2!#
zdFMfqMnm1TjAPgQkwivOFKiC}RkM5@kIW-P!`{>(sz4_VP$9~h5q>m4EPCxoI
zV6*n>W#QwH#>QsRzBY}vh~ruhUX4O{a};KNk#w{iJ3savY3peSo2hR36}I5)@~aKoRDOE%
z%!74jPR+USKKSvzLrgCGOSAifpKk|M)^{w9gzY@_wORjt*fML)^?OUB=lQqkEadI{
z^}065JL=VsukTl4WR5#@lQxo3Zfj5V%{=X{k7m1L;7b`M?+C${I8EInI8U)SD?K|S
zU5F7kV^Ns-J~9N`1d8HeSFea<&5R@XcAZF;Gy|Wj59bEy$l1
zrKs+t7H74o*fhr5SlO#GHblDa|nHzL7-OB@sM+V(Ubq2lIp#ZM%sY!9}95eF$
zloZ!zYec@8_n2(^km~!RgrwthONl0!nlU#ewyn=XpHpQ|*F$-gLtmmiuini_ca+dL
zAFYv%waToy>#1q~uNKCRE;SR;?VD;J5*h*>HFHcFa!k0h?sO!rKQesxkYIWC(A~0)
zEfNYp1O9E??pRme^zR)pzgrmB=krqB&W%aLNN+>MnVeD(STkvSirRPH*7o5xj@+H}
z-SLIL$L$&n@7ifKa!)X*y8E8!dO!S7e7n2CWc_U$pRJ#M&%vLQd}EWM_1aVLv6AW9
zP50vc(l4w$DV<(l?(^{X9Q-s{#X4wp4*u$(8Z8!9r@|b3J#n>#5!S%%b4j!-zb#tv
z=Nh;-ww1T-SJ%KDrnXty*X90R1GiSYZu7MLV{?{5f-qkg5&({8H4IIk<&oywZM}Aj6oE}@Y{BF?js=4_5U1(O_v!9oP
zdoDg*_)QV^go`1M;F06dhJTbWVo|$Mt8WA_uK)|p;a*?8#sB^&0Q+Cz3WuBf?{5ci
z-vbK3H4iTZaL2<1|Nk?i@YntQSE~Y;|LQZr-=72i{)q7R{>=aKcE_wh`Wi>;zmF;K
zv;T2F@Xry3qx%4*Pf1JUG%M
z;ly{N@7U(84{lxmdM1Ez`gp*pAiwA+*KM3)I58|FNgT+`l1NQQIMbL&Zbkw6d{KU3
zN?LJ6S;>W}^3uw@%MBOnS+z}Vjn(ZH9W6CiTQ9MT&Yo+|zI))-&FlT$H@Z6O2A|%!
z@?dCmfcb2s=f&hhjD^s>-}y9KXGGuq&f3@&bM*5!EqX$H=+ckJv&TLA|A^oaP@2Q}
z3s57uIW}&oVD8jKhQ9dNO&r|n!@{P0mE#8&3B}(Yu*09mdvtmxy
z*L@29Sn{B7^R8wdu61K~OU%>mt5*G9ov!fCFe*WKQa%P>V|hxI_SUXit3Q@L&is+SQh4oZ
z81nY&_49vhJ00|Gsc95<{`#3uKhaAnBqx7_&zt0hm8~GT5VU4kyLh*$PS^-odrxc=
zIZ#owMZ#~UXp_uo)nX<2WN!swrNW9yLAA>>lSIvHsxSF;?t7Q;ZWya5;Wm6Xuc&6y!NE?{_vxbj%Vz#o3|g2-+KCddibh^(7-!Lr_rAi
z`~Zng%i21zF5gzv#OG(lR8wGfjay0DyOu3T#FZu
z)s${anYw*;b9udodb);dmC((BHNANxkK7~+K_`uZ*^~F@hU+VASZik;*?o&1#z8c?Mdf#>{VU1ASq1Jj&X8O9rW76-d_MSx#M`&4n^W7U=W)db-
z5L)!6_M-cZ})GeBOHE>cI8bGuy6hTK>?LpXbb9urFwlF&t31Q*PJvx22C#Z8g`guSjQLR9%=#
zDcnl^~)r~>o(w>a}nGrXc@VT;g
zGBLLe|1v3{&-~?C+u^vE$p*8%FVC$k;is5i@D@`k@5JM$QV*#1O{MKLAiPR9Gq-q^
zL3N0KmAT2g?^V`@V8V2^W}L-z4qrsvbneFNzUe&SD#Gi0b>n{=vbO*1kp2ByTD_Ct
zoc;YOf};ChO8OtKC2-2W+vloW{@p(RezyM0Tj#Hwy#MS^{-0b+M|H%1A2Kw#$*t+x
zn+=BYI1)4$%#
zN1E+?as9+kK~cLIEf$4=OVhy^2)I&WMdocfhSv5)sgmCQj`a8tX+}`mRkgn_wa0>6
z6LQlm9!>S-8XSM4v3#uj!Fh80CmzejiUCohbLvjF(~TaM(Mklv)tjp9VV%&S=a#x_
zpVqkK1oK)oSKrC=dQ^2!=j5f)7Uhwy0;{KLBM_T*hGr^Z0rd8ZoC%>#hYKz_-zk&`Yo5<&
zI{H>NC#>bD>eVyc&Vr{}7v&+*<3-A;w&POw&RYcugtvbdeq3`{B_;gISN`YM6Q2@K
zcPxLHzFD&KefZX}`F~y#k5Jr*E)w}4)0zJM`GxBNKEi+h?4ae6&Q;!-=P$``97$|=
zQv97iX3{)!I=<}9?dEsfJpum+Fq}iBLx5rIU;oBxIwsH?b^vIN#MsScs*B2$Ash};
zULlJW@_Qj`6EZI$m<=;d7y>&XBo^Z0Ac+%_75)E=1@%1W
z2bqnK1J`}CJ2EOBk~N=?ja8}~yCB>4
zfU_4w4R6_M26?lPz6g235VQ!f)opF95X=rK!H@tvGV&Def1qc`o;@BI>fw3B&i(*o
ze}DK?SBmF@yO81uaqh8kNf2&WS6`o!n!$*Og=lvOI)tc0$ee=+&!C$Xten^2TTfXr0i@N|81)KSfZjfb9&&%W%nDz002q(AnQr)X{9{_Y#_xTwJD
zbCL)=`u=+4o)zf>jlmy2&TJ5!x$|osf(>6MxBn>XmK498p5n8I3Pu=VwIOXA&biM`
z2=Yl@kB(sOs*uCKzG`tkvw}5Gk-?jfGsHa=jn9WK&5ET2a;3yE-L>QV(TMR6GAf8D
z;J`v~=voA4i(kAYzRz5dA;%NX=AiOGPokoac^!_r(4-`X=!!b5L4GVlevvS$hn6-j
zV>~8{rqgK|B5s5f{1Me|0`Jz5xRT^83hCn%Zb6p=p~2IHfzb8sJ+IQ<3Jg%3|4{5U
zVytbHc$uW9u*8bs*`%sk6r)fzV-zbg6q%3SKC-)L%`MS_ibpA)x418mUD%23TZNUv
zaFQ!J#*8G&6vD+VqCn?2)@AN>4wC*o1y9%DoUw;FXeeLd+j_3S}sV`50;9
zSs1E09_gXfKfGOuVvbnk7S5DiWO9g~;;cY~PU&><3EKI~Ali8CNP?+0DhiROO+Cv;
z24`J5K0XOMdhPp{??>4JA&chX8^v8C#03UhWxGC&NFBPQl$y(Q=jgX#C=`o#9tzqy
zO(IM2Co!WiXaj)(Ika9;zdU+~qPRwVrj{(UryFs|!WqWaAiUqX7wX^LHW5=6nPL)e
z{Mf|1XcUYOu<$@*xarW$vAyA5a;lX|WKh-H=E6DC`65J^6cyc!yXw+;6&JrUNoP9=
z`qq}5kPH0UjLh8AM^3v7M;zi4`WZr2
zMN8|n_26_!WQAjgV;Uz38@8on@_NP)F&>{t2oKjpGV>kLGvX~Jg3L{xDUq|=RO6F1
z`ek(JyjakC1~>uU?(%b1$qPB*Yi|*zGizBeK!-yvvN1mLtVe>j!cvX;t+-SBgg}jZbM2Ond-*g9OvRfL>6vfZXt*|T=&)c
zU;A{a#>&cEhDs(@zG*vpD#^{wn5R=!(7^l%_M$?WaVAkn2E!l(kc>&Hlqi~Zj6g`I
zq@@l@q8dSnL&7O~i$qanULQ+gy6CQ6jXDx9TA(*l62p#;!W?5BwA{}1%>@&+(0^8Y
zi@C6U5K%q%t-2A<1)qf0u6K+8e<@n0X*&ryq~yVep6RC`HYSR)iWw5}H!$Y35>-4q
zaW~J)@3*|kH<*h_PPQ+N9A+(}$o%q3+D^u*42B}!4I@B~5)ON8R*ItMaCcrXHFk{`
z2#wb&8^Xx((TM?WrC0&B5$RJf^PsF^?qK8v4HS9h&Q2|pbmShl8t?MuUAqUeb_bKC
zMd^_WGyV$Ck1^u-iczRq7T+;E1AAzAo2x9Qh|9CWNQoM=P0@vW<9LMd>E)Zp1pMW_
zFo>i;@%C8bz+!hHHka@SA~LvL!Dg_l>;Mak3Pwel>b|DPab3MA>c0DA8LqO@2z|KPzpgx+C*{VT
zOoVQYALrpx6Eu3am(`&h)o>)o)|va$)0ma<>U9cknqm7B?W20hh$gw$cX-#nnBcBr
zu0>yGvguH>xM&1PP+Paomb=z9(f;DbDv`|!jYrDh&UGwQ7R97Tc4S?VKdEgj8x$Q)a_Hna
zsU7`;!3AH}w_W;vjN0K}BAMSJ~=ZdYR#fxv)kfpLLl0bT)50Y(920a5{ZfpLL%
z0Zf4lA>$pu7f2VvhXMNlZGjE}-2ig|i-7`x-hi$EOo4y_Re>ylcOm$FHS8T=3rH8R
z7^oMB88ElE_ZA>5kS^dOU^K*S1GxhKS#7rmRt6vj@CDcfgarTv?gYeyOl`nHfM!S@
z2IK|I1@Z&710)5`23!U*UIk*{U_f3V(bTj|fNub6AZfs8fM7sLz;__&RWb(r1@;A^
z1#$%>1@r{e21Ew%g#cvWRLIH(n!I=K4$w0|Fn~L7HsCdoGypU}GZ6DCE&~t)q~3pU
z7f2Ks6o?a&jR9T((*da;^xuQ%Z6ICXT0l}jPoQClrUz~YPzHWAHr)w`3>**i48#mT
z3=j;k3z!QO3)t%Bei%~A6B5r=R8(H+XiqxJ1d#yQ0GbDW1ttdM1;!0L5w>r?yNjzg
zWRydka@EC)fVP0S0MmX)Pn@AgolQ;w