diff --git a/.claude/skills/claude-mem-mastery/SKILL.md b/.claude/skills/claude-mem-mastery/SKILL.md new file mode 100644 index 0000000..363fed6 --- /dev/null +++ b/.claude/skills/claude-mem-mastery/SKILL.md @@ -0,0 +1,173 @@ +--- +name: claude-mem-coded-assistant +description: > + Entry-point skill for using claude-mem to keep CLAUDE.md and MEMORY.md + in sync so Claude learns from past work and avoids repeating mistakes. +version: 1.1.0 +--- + +# Claude‑Mem Coding Skill + +## What This Skill Does + +This skill teaches Claude how to: + +- Mine **claude-mem** (via MCP) for high‑signal past work. +- Maintain a concise, high‑impact **CLAUDE.md** (~1,500 tokens). +- Maintain a curated **MEMORY.md** of lessons learned and directions, so future work is faster and less error‑prone. + +It is an **entry point**, not a full manual. Detailed workflows and examples live in separate reference files that Claude can open on demand. + +--- + +## When to Use This Skill + +Claude should activate this skill when: + +- A feature, refactor, or significant bugfix is completed. +- An infra/deployment change introduces new operational lessons. +- Starting work on an area with substantial history in claude-mem. +- Performing a daily “memory maintenance” pass on an active repo. + +--- + +## Inputs and Outputs + +### Inputs + +Claude relies on: + +- **Files** (in repo root): + - `CLAUDE.md` – main project instructions. + - `MEMORY.md` – curated lessons and directions. +- **claude-mem MCP tools** (already installed & connected): + - `search` – index‑level observation search. + - `timeline` – temporal context around observations. + - `get_observations` – full structured details. + +### Outputs + +This skill produces: + +- **Patch‑style edits** to: + - `MEMORY.md` – new or updated lessons, patterns, and playbooks. + - `CLAUDE.md` – refreshed rules while staying under ~1,500 tokens. +- No raw claude-mem transcripts are copied; only compressed, actionable guidance. + +--- + +## How Claude Should Behave + +### 1. Mine claude-mem → Update MEMORY.md + +High‑level behavior (details in `claude-mem-usage.md`): + +- Use **progressive disclosure** against claude-mem: + 1. `search` for recent `decision`, `bugfix`, `refactor`, `discovery`, `change` observations. + 2. `timeline` around promising IDs to see context. + 3. `get_observations` for a small set of high‑value IDs. +- From those, update `MEMORY.md` with: + - Architectural decisions and their impact. + - Implementation patterns and anti‑patterns. + - Debugging playbooks and DevOps lessons. + +**Constraints** + +- Prefer short bullets over long prose. +- Record *why* decisions were made and how to act next time. +- Never store secrets or credentials in `MEMORY.md`. + +For a full template and examples, Claude should open: + +- `memory-structure-reference.md` +- `claude-mem-usage.md` + +--- + +### 2. Distill MEMORY.md → Refresh CLAUDE.md (≈1,500 tokens) + +High‑level behavior: + +- Read the existing `CLAUDE.md` and approximate its size; keep the body around **1–1.5k tokens** for optimal behavior. +- Pull only **current, high‑impact** content from `MEMORY.md`: + - Still‑valid architectural directions. + - Frequently reused patterns and gotchas. + - Operational guardrails that materially affect daily work. +- Rewrite historical notes as **timeless rules**, e.g.: + - “When adding retries to DB writes, always use the shared retry helper instead of manual loops.” + +- Use links instead of inlining: + - `.clauderules/code-style.md` for style. + - `.clauderules/testing.md` for testing. + - `MEMORY.md` sections for deeper background. + +**Token Discipline** + +- If CLAUDE.md is too long: + - Merge overlapping bullets. + - Drop generic advice that doesn’t change behavior. + - Replace detailed explanations with references to supporting docs. + +**Diff‑First** + +- Propose **minimal patches**, not full rewrites: + - Update only sections that need change (e.g., “Architectural Directions”, “Patterns & Gotchas”). + - Preserve stable layout and headings. +- Always leave final acceptance to human review in Git/CI. + +For concrete layouts and example diffs, Claude should open: + +- `claude-md-layout-reference.md` +- `example-diffs.md` + +--- + +## Safety and Priority Rules + +Claude must: + +- **Always**: + - Query claude-mem before re‑solving problems already encountered in this project. + - Update `MEMORY.md` after meaningful work with concise, actionable lessons. + - Keep `CLAUDE.md` focused on rules that change how work is done, not on general LLM tips. + +- **Never**: + - Overwrite `CLAUDE.md` or `MEMORY.md` entirely; always propose small diffs. + - Paste raw claude-mem observations verbatim into either file. + - Store secrets, API keys, or sensitive infra details in these files. + +- **Conflict resolution priority**: + 1. Explicit instructions in `CLAUDE.md`. + 2. Latest curated guidance in `MEMORY.md`. + 3. Raw claude-mem observations and session summaries. + 4. Ad‑hoc reasoning in the current session. + +--- + +## Quick “How to Call Me” + +Users can invoke this skill with prompts like: + +> “Use the claude-mem coding skill to: +> 1) mine claude-mem for recent work, +> 2) update MEMORY.md with lessons, and +> 3) refresh CLAUDE.md under the ~1,500‑token budget.” + +Claude should then: + +1. Run the claude-mem `search → timeline → get_observations` flow. +2. Draft a patch for `MEMORY.md` with new lessons. +3. Draft a patch for `CLAUDE.md` derived from `MEMORY.md`. +4. Present both patches clearly for human review and commit. + +--- + +## External References + +To keep this SKILL.md lean and within best‑practice size, Claude should open these files when more detail is needed: + +- `claude-mem-usage.md` – detailed claude-mem MCP workflows, filters, and example queries. +- `memory-structure-reference.md` – full MEMORY.md templates and longer examples. +- `claude-md-layout-reference.md` – canonical CLAUDE.md section layouts and size guidance. +- `example-diffs.md` – sample before/after patches for CLAUDE.md and MEMORY.md. + diff --git a/.claude/skills/claude-mem-mastery/claude-md-layout-reference.md b/.claude/skills/claude-mem-mastery/claude-md-layout-reference.md new file mode 100644 index 0000000..ccb3384 --- /dev/null +++ b/.claude/skills/claude-mem-mastery/claude-md-layout-reference.md @@ -0,0 +1,323 @@ +# claude-md-layout-reference.md + +Guidance for how Claude should structure and maintain `CLAUDE.md` for this project so it stays small, sharp, and aligned with Anthropic’s best practices. + +This file supports the `claude-mem-coded-assistant` SKILL and works together with `MEMORY.md` and claude-mem. + +--- + +## 1. Purpose and Size Budget + +### 1.1 Role of CLAUDE.md + +`CLAUDE.md` is the **primary control file** for how Claude should work in this project. + +It should: + +- Give Claude a compact mental model of: + - What this project is. + - How to edit, test, and run it. + - Key conventions and gotchas. +- Act as the **top of the memory stack**: + - Repo‑local instructions override global ones. + - `MEMORY.md` and claude-mem feed into `CLAUDE.md`, not the other way around. + +### 1.2 Recommended Size + +Based on current guidance and field experience: + +- Hard upper bound: **~5,000 words** (beyond this, latency and quality degrade). +- Practical sweet spot for this project: + - **1–3k words** (~1–1.5k tokens) for `CLAUDE.md`. + - Enough for: + - Project overview. + - How to work in this repo. + - Current architecture rules. + - Patterns & gotchas. + - DevOps guardrails. + - Pointers to deeper docs. + +Rule of thumb: + +> “If a line in CLAUDE.md doesn’t materially change Claude’s behavior, it probably doesn’t belong here.” + +--- + +## 2. Standard Section Layout + +Claude should maintain `CLAUDE.md` using this section scaffold (with project-specific content): + +```markdown +# Project Instructions for Claude + +## 1. Project Overview + +## 2. How to Work in This Repo + +## 3. Current Architectural Directions + +## 4. Patterns & Gotchas + +## 5. DevOps & Safety Guardrails + +## 6. Using claude-mem & MEMORY.md +``` + +Optionally, for large teams or specialized workflows, additional sections like “Agent Roles” or “Subprojects/Paths” can be added, but only if they significantly affect behavior.[web:97] + +Below is what each section should contain. + +--- + +## 3. Section-by-Section Guidance + +### 3.1 Project Overview + +Purpose: + +- Give Claude a quick mental model of the project’s **intent, stack, and constraints**. + +Recommended content: + +```markdown +## 1. Project Overview + +- What this project is (1–2 bullets). +- Core tech stack (frontend, backend, data stores, infra). +- Key business or technical constraints (e.g., latency, throughput, compliance). +``` + +Example: + +```markdown +## 1. Project Overview + +- This is a modular, liquid‑cooled Bitcoin mining orchestration system with a REST + gRPC control plane. +- Backend: Go + PostgreSQL, infra via Terraform + Kubernetes on green‑energy sites. +- Hard constraints: no mainnet RPC calls from test environments, minimize downtime for active miners. +``` + + +### 3.2 How to Work in This Repo + +Purpose: + +- Define **day‑to‑day workflow expectations** (style, testing, commands). + +Recommended content: + +```markdown +## 2. How to Work in This Repo + +- Code style: pointers to `.claude/rules` or existing style docs. +- Testing: commands and expectations. +- Branching & PR workflow: brief. +- Any critical local setup (if not covered elsewhere). +``` + +Example: + +```markdown +## 2. How to Work in This Repo + +- Code style: + - Follow `.clauderules/code-style.md` for formatting and naming. + - Keep functions small and pure where practical. +- Testing: + - Run `npm test` for unit tests and `npm run test:integration` before proposing large changes. + - Do not skip tests unless user explicitly requests it. +- Git & PRs: + - Target feature branches, never commit directly to `main`. + - Keep PRs focused on a single concern. +``` + +**Important** + +- Use **links/pointers**, not full guides: + - E.g. `See .clauderules/testing.md for details` instead of duplicating test matrix. + + +### 3.3 Current Architectural Directions + +Purpose: + +- Expose **current, high‑impact architectural rules** derived from `MEMORY.md` and actual decisions.[web:115] + +Recommended content: + +```markdown +## 3. Current Architectural Directions + +- 3–7 bullets capturing current major decisions. +- Each bullet should be a forward-looking rule, not a history lesson. +- Reference source docs or MEMORY.md sections when needed. +``` + +Example: + +```markdown +## 3. Current Architectural Directions + +- All mining control operations should flow through the `ControlPlaneService` API; do not talk to miners directly from UI code. +- Use event-driven updates for miner state; polling is allowed only in diagnostics tools. +- Persist telemetry into `metrics_*` tables, not transactional tables, to keep OLTP loads stable. +- When adding new services, expose gRPC first and layer REST on top for external clients. +``` + + +### 3.4 Patterns & Gotchas + +Purpose: + +- Highlight **frequent patterns and traps** so Claude doesn’t repeat mistakes. + +Recommended content: + +```markdown +## 4. Patterns & Gotchas + +- Do / Avoid bullets for recurring implementation patterns. +- Short, specific, and tied to modules or file paths. +- Derived from MEMORY.md’s “Patterns & Anti‑Patterns” and “Debugging Playbooks”. +``` + +Example: + +```markdown +## 4. Patterns & Gotchas + +- Do: + - Use the shared `withRetry` helper for any outbound network calls. + - Capture miner IDs as UUIDs, not integers, throughout the codebase. +- Avoid: + - Writing raw SQL in handlers; always go through the repository interfaces. + - Hardcoding RPC endpoints; use configuration with clear environment separation. +- Debugging: + - If you see ECONNRESET on DB connections, check MEMORY.md → "Intermittent DB connection resets" for the playbook. +``` + + +### 3.5 DevOps & Safety Guardrails + +Purpose: + +- Make sure Claude doesn’t break prod and understands key operational constraints. + +Recommended content: + +```markdown +## 5. DevOps & Safety Guardrails + +- Critical deploy, rollback, and environment rules. +- Things Claude must never do without explicit approval. +- Pointers to runbooks or infra docs. +``` + +Example: + +```markdown +## 5. DevOps & Safety Guardrails + +- Environments: + - Local and staging are safe for schema changes; production changes require human approval. +- NEVER: + - Run destructive DB operations (`DROP`, `TRUNCATE`, bulk `DELETE`) in production without explicit user confirmation. + - Modify Terraform or Kubernetes manifests for production without a plan and review. +- Deploys: + - Use canary rollout for new miner firmware; see RUNBOOK-deploy-miners.md for commands and checks. +``` + + +### 3.6 Using claude-mem & MEMORY.md + +Purpose: + +- Teach Claude **how to use memory**, not just what the project is.[web:94] + +Recommended content: + +```markdown +## 6. Using claude-mem & MEMORY.md + +- Remind Claude to query claude-mem before re-solving past problems. +- Point to MEMORY.md as the first place to look for lessons. +- Briefly summarize the search → timeline → get_observations pattern. +``` + +Example: + +```markdown +## 6. Using claude-mem & MEMORY.md + +- Before debugging or redesigning a feature, search claude-mem for past decisions, bugfixes, and discoveries about that area. +- Use MEMORY.md as the curated index of lessons: + - Start with sections 1 (Architectural Decisions) and 2 (Patterns & Anti‑Patterns). +- When you learn something new: + - Update MEMORY.md with concise bullets, then refresh this file’s sections 3–5 if behavior needs to change. +``` + + +--- + +## 4. Progressive Disclosure & External Docs + +To keep `CLAUDE.md` lean, Claude should: + +- **Link out** instead of inlining full content: + - `.clauderules/code-style.md` + - `.clauderules/testing.md` + - `MEMORY.md` sections + - `docs/*.md`, runbooks, API specs, ADRs +- Use simple phrases like: + - “See `.clauderules/testing.md` for the full test matrix.” + - “See `mem-debugging.md` for detailed ECONNRESET playbook.” + +This lets Claude open additional context only when needed, honoring **progressive disclosure**. + +--- + +## 5. Maintenance Rules + +### 5.1 When to Update CLAUDE.md + +Claude should propose updates when: + +- A **new architectural decision** changes how future work should be done. +- A **recurring bug** leads to a stable pattern or anti‑pattern. +- DevOps/infra rules change (deploy process, environment constraints). +- `MEMORY.md` gains high‑impact entries that merit promotion into `CLAUDE.md`. + + +### 5.2 How to Update + +Claude must: + +- **Read current CLAUDE.md** and estimate its size. +- Select only **high‑signal** content from MEMORY.md and other docs. +- Convert history into **forward‑looking rules**. +- Propose **minimal diffs**, not wholesale rewrites. +- Respect the ~1–1.5k token budget for this project and avoid adding fluff. + +If `CLAUDE.md` starts to feel crowded: + +- Remove outdated sections (e.g., old stack choices no longer relevant). +- Merge overlapping bullets. +- Move deep detail into supporting docs and leave a link. + +--- + +## 6. Quick Checklist for Claude + +Before presenting changes to `CLAUDE.md`, Claude should confirm: + +- [ ] Is the file roughly within the **1–3k word** / ~1–1.5k token range? +- [ ] Does each section follow the layout in §2–3? +- [ ] Does every bullet either: +- Change how Claude behaves, or +- Call out a real gotcha or rule? +- [ ] Are detailed docs referenced, not inlined (progressive disclosure)? +- [ ] Are there no secrets, credentials, or environment-specific tokens? +- [ ] Are new rules consistent with MEMORY.md and the current codebase? + +If not, Claude should revise the draft before proposing a patch. + diff --git a/.claude/skills/claude-mem-mastery/claude-mem-usage.md b/.claude/skills/claude-mem-mastery/claude-mem-usage.md new file mode 100644 index 0000000..1f01ce3 --- /dev/null +++ b/.claude/skills/claude-mem-mastery/claude-mem-usage.md @@ -0,0 +1,334 @@ +# claude-mem-usage.md + +Guidance for Claude on how to use the claude‑mem MCP tools efficiently to learn from past work, update MEMORY.md, and improve CLAUDE.md. + +This file is a **reference** for the `claude-mem-coded-assistant` SKILL. It assumes the claude-mem MCP server is already installed, running, and connected. + +--- + +## 1. Mental Model + +claude-mem gives Claude **project memory** across sessions via MCP tools. + +- It stores: + - Observations (decisions, bugfixes, discoveries, refactors). + - Narratives, facts, concepts, and related files. +- It exposes **three core tools** that follow a 3‑layer, progressive‑disclosure workflow: + 1. `search` → fast index view (IDs, titles, types, concepts, file paths). + 2. `timeline` → chronological context around interesting IDs or queries. + 3. `get_observations` → full details for **only** the IDs you care about. + +Think of it as: + +> “Index and filter first, then fetch details for just the important parts.” + +This is ~10x more token‑efficient than pulling history directly. + +--- + +## 2. Available MCP Tools + +The exact schema may vary slightly by version, but conceptually claude-mem exposes: + +### 2.1 `search` – Index Search + +**Purpose** + +- Get a compact list of relevant observations, without loading full narratives. + +**Typical parameters** (may be named slightly differently depending on implementation): + +- `query` (string): Text query; more specific is better (e.g., `"db connection timeout"`, `"bitcoin payout scheduler"`). +- `type` (string or array): Filter by observation type, e.g.: + - `"decision"`, `"bugfix"`, `"refactor"`, `"discovery"`, `"change"`, `"gotcha"`, `"feature"`, etc.[web:51] +- `project` (string): Project name / repo key, if supported. +- `orderBy` (string): Sorting, usually `"date_desc"` (newest first) or `"date_asc"`. +- `limit` (number): Max results (start small: 5–20). + +**Returns** (index view, low token cost): + +- `id` – Observation ID. +- `type` – Classification (`decision`, `bugfix`, etc.). +- `title` / `summary`. +- `createdAt` / date. +- `concepts` / `tags`. +- `files` / `paths`. + +### 2.2 `timeline` – Chronological Context + +**Purpose** + +- Understand what was happening **before and after** an observation or around a query. + +**Typical parameters**: + +- `anchor` (number): Observation ID to center on. +- `query` (string): Alternative way to auto‑find an anchor if you don’t have an ID. +- `depth_before` (number): # items before anchor (default ~3–5, max ~20). +- `depth_after` (number): # items after anchor (default ~3–5, max ~20). +- `project` (string): Project filter. + +**Returns** + +- A chronological list of: + - Observations. + - Sessions / prompts (implementation‑dependent). +- Gives narrative arc: what led up to a discovery/bug, what happened next. + +### 2.3 `get_observations` – Full Details + +**Purpose** + +- Fetch **complete** details for a **small set** of selected IDs. + +**Typical parameters**: + +- `ids` (array, required): Observation IDs selected from `search` + `timeline`. +- `orderBy` (string): `"date_desc"` (default) or `"date_asc"`. +- `limit` (number, optional): Max observations to return. +- `project` (string, optional): Project filter. + +**Returns** + +Full observation objects (~500–1000 tokens each) including: + +- Title, subtitle. +- Narrative / description. +- Extracted facts and concepts. +- Related files / paths. +- Timestamps and other metadata. + +**Important** + +- Always **batch** IDs: `get_observations(ids=[...])` instead of one call per ID. +- Only call this for items you’ve already filtered as important. + +--- + +## 3. Core Workflow Pattern + +Claude should always use claude-mem with a **3‑step progressive disclosure** workflow: + +> **Step 1 – `search` → Step 2 – `timeline` → Step 3 – `get_observations`** + +This minimizes wasted tokens and keeps context sharp. + +### 3.1 Step 1 – Search (Index First) + +**Goal** + +- Find candidate observations relevant to the current task, **cheaply**. + +**Example strategies**: + +- When revisiting a feature: + - `query="feature-name"` + `project=""`. +- When debugging: + - `query="error message substring"` or `"db connection timeout"`. +- When looking for design decisions: + - `query="payments architecture"` + `type="decision"`. + +**Best practices**: + +1. Start with **small `limit`** (3–10), then expand if needed. +2. Filter by: + - `type` (decision/bugfix/refactor/gotcha). + - `project` (current repository). +3. Skim index fields only: + - IDs, types, titles, concepts, files. + +**What to look for** + +- Items that: + - Match current file paths or modules. + - Are marked as decisions / gotchas / trade‑offs. + - Mention current infra / services / APIs. + +### 3.2 Step 2 – Timeline (Context Around Candidates) + +**Goal** + +- Understand the **story** around promising IDs. + +**How** + +- For a shortlist of IDs from `search`: + - Call `timeline(anchor=, depth_before=3, depth_after=3, project="")`. +- Or: + - `timeline(query="keyword", depth_before=2, depth_after=2, project="")` if you don’t have an ID yet. + +**Use timeline to:** + +- See the lead‑up to a bug/discovery: + - What attempts failed? + - What context was loaded? +- See what happened after: + - Did a fix work? + - Were there follow‑up changes? + +**Outcome** + +- A smaller set of **truly relevant** IDs for `get_observations`. + +### 3.3 Step 3 – Get Observations (Details Only for Filtered IDs) + +**Goal** + +- Pull full details for **just the important observations**. + +**How** + +- After reviewing `search` + `timeline`, pick IDs that: + - Changed architecture / contracts. + - Fixed non‑trivial bugs. + - Defined important patterns or gotchas. +- Call: + - `get_observations(ids=[id1, id2, id3], orderBy="date_desc", project="")`. + +**What to extract** + +From each observation, Claude should pull: + +- Problem / context. +- Root cause and solution. +- Trade‑offs and rationale. +- Files / services / modules involved. +- Any explicit “next time do X instead of Y” guidance. + +These are then **summarized** into `MEMORY.md`, not pasted verbatim. + +--- + +## 4. Using claude-mem to Maintain MEMORY.md + +This section connects claude-mem usage to `MEMORY.md` maintenance. + +### 4.1 When to Update MEMORY.md + +Claude should propose `MEMORY.md` updates when:[web:51][web:55][web:64] + +- A significant **design or architecture decision** is made. +- A non‑trivial **bug** is diagnosed and fixed. +- A **refactor** or **infra change** alters how work should be done. +- A recurring pattern / gotcha is discovered (e.g., flaky upstream, schema pitfalls). +- Daily memory maintenance for active repos. + +### 4.2 What Goes Into MEMORY.md + +From `get_observations` results, Claude should **compress** into: + +- **Architectural decisions** + - Codable as: Date + Decision + Context + Rationale + Impact + Source IDs. +- **Implementation patterns & anti‑patterns** + - “Do” and “Avoid” bullet lists. +- **Debugging playbooks** + - Symptom → Root cause → Fix → Verify → Next time. +- **DevOps / ops rules** + - Deploy flow, rollback triggers, monitoring lessons. +- **Open questions** + - Unresolved design choices, hypotheses to test. + +Each entry should list **source IDs** (e.g., `mem:123, mem:456`) so you can re‑hydrate context later via claude-mem. + +### 4.3 What Does *Not* Belong in MEMORY.md + +- Raw observation narratives from claude-mem. +- Full stack traces or logs (unless extremely compact and reusable). +- Secrets, tokens, private keys, specific IPs, or credentials. +- One‑off trivia that won’t change future behavior. + +--- + +## 5. Using claude-mem to Improve CLAUDE.md + +Claude uses `MEMORY.md` (which is fed by claude-mem) to keep `CLAUDE.md`: + +- Small (~1–1.5k tokens). +- Focused on **rules that matter**. +- Up‑to‑date with real project experience. + +### 5.1 Flow + +1. Use claude-mem workflow (search → timeline → get_observations) when: + - Starting new work on a feature/module. + - Seeing errors that feel familiar. +2. Update `MEMORY.md` with new lessons. +3. Periodically refresh `CLAUDE.md` by: + - Reading `MEMORY.md` sections. + - Pulling only active, high‑impact rules. + - Dropping outdated or superseded instructions. + +### 5.2 When to Prefer claude-mem vs. Repo Search + +Claude should: + +- Prefer **claude-mem** when: + - Looking for **reasoning**, trade‑offs, and bug stories. + - Wanting to avoid re‑debugging the same issue. + - Searching across sessions, even if code moved.[web:78][web:88] +- Prefer **file search / code grep** when: + - You need exact definitions, signatures, or current implementations. + +--- + +## 6. Best Practices & Anti‑Patterns + +### 6.1 Best Practices + +- **Index first, details later**: + - Always start with `search`, then `timeline`, then `get_observations`. +- **Filter aggressively**: + - Use types, project, and specific queries to avoid noisy results. +- **Batch fetch**: + - Use `get_observations(ids=[...])` with multiple IDs at once. +- **Align with files**: + - Prefer observations that reference the same files/modules you are modifying. +- **Feed curated summaries into MEMORY.md**: + - Use claude-mem for depth, but keep `MEMORY.md` lean and structured. + +### 6.2 Anti‑Patterns (Avoid These) + +- Calling `get_observations` on many IDs without prior filtering. +- Using `timeline` with large depths (e.g., 20/20) by default. +- Copying observation narratives verbatim into `MEMORY.md` or `CLAUDE.md`. +- Treating claude-mem as a replacement for code search. +- Storing secrets or environment‑specific credentials anywhere in the memory system outputs. + +--- + +## 7. Example Scenarios + +### 7.1 Re‑debugging a Known Error + +1. Notice an error: `"ECONNRESET during payout job"`. +2. Call `search(query="ECONNRESET payout", type="bugfix", project="", limit=5)`. +3. For relevant IDs, call `timeline(anchor=, depth_before=3, depth_after=3, project="")`. +4. Select 1–3 IDs and call `get_observations(ids=[...])`. +5. Update `MEMORY.md` “Debugging Playbooks” with a concise recipe: + - Symptom, root cause, fix, verification, next time. +6. If this changes how devs should work, update `CLAUDE.md` “Patterns & Gotchas”. + +### 7.2 Revisiting a Feature Months Later + +1. `search(query="dark mode toggle", type=["feature","decision"], project="", orderBy="date_asc")`. +2. Use `timeline` to see the feature’s evolution. +3. `get_observations` for key milestones. +4. Summarize any critical constraints or decisions into `MEMORY.md` → "Architectural Decisions". +5. Ensure `CLAUDE.md` reflects current rules (e.g., “Dark mode state must be stored in X, not Y”). + +--- + +## 8. Quick Checklist for Claude + +When using claude-mem in this repo, Claude should: + +- [ ] Start with `search` using a precise query and types. +- [ ] Use `timeline` around promising IDs to understand context. +- [ ] Batch `get_observations` for only the most relevant IDs. +- [ ] Extract **lessons**, not transcripts. +- [ ] Update `MEMORY.md` with concise, structured entries. +- [ ] Periodically refresh `CLAUDE.md` from `MEMORY.md`, respecting the size budget. +- [ ] Never store secrets or raw logs in these files. + +If these boxes are checked, claude-mem is being used correctly and efficiently. + diff --git a/.claude/skills/claude-mem-mastery/example-diffs.md b/.claude/skills/claude-mem-mastery/example-diffs.md new file mode 100644 index 0000000..2f4653d --- /dev/null +++ b/.claude/skills/claude-mem-mastery/example-diffs.md @@ -0,0 +1,269 @@ +# example-diffs.md + +Example before/after patches for `CLAUDE.md` and `MEMORY.md` so Claude can see what “good” edits look like and propose minimal diffs instead of wholesale rewrites. + +Use these as patterns, not as literal content. + +--- + +## 1. CLAUDE.md Diff – Promote a Lesson from MEMORY.md + +### 1.1 Context + +A recurring DB connection issue has been captured in `MEMORY.md` under “Debugging Playbooks”. We now want `CLAUDE.md` to include a **forward‑looking rule** so Claude avoids re‑introducing the problem. + +`MEMORY.md` (excerpt): + +```markdown +## 3. Debugging Playbooks + +- [2026-02-18] **Issue Class:** Intermittent DB connection resets (ECONNRESET) + - Symptom: + - Jobs fail sporadically with ECONNRESET during heavy load. + - Root cause: + - Connection pool exhausted under high concurrency, with no backoff. + - Fix steps: + - Check DB pool stats; increase pool size cautiously. + - Add jittered exponential backoff to connection retries. + - Next time: + - Use the shared db client helper with backoff instead of manual loops. +``` + + +### 1.2 Before – CLAUDE.md (excerpt) + +```markdown +## 4. Patterns & Gotchas + +- Do: + - Use repository interfaces instead of ad-hoc SQL. +- Avoid: + - Writing complex business logic directly in controllers. +``` + + +### 1.3 After – CLAUDE.md (excerpt) + +```diff + ## 4. Patterns & Gotchas + + - Do: + - Use repository interfaces instead of ad-hoc SQL. ++ - Use the shared DB client helper with jittered exponential backoff for outbound DB connections. + - Avoid: + - Writing complex business logic directly in controllers. ++ - Implementing manual retry loops around DB calls; this caused ECONNRESET incidents under load (see MEMORY.md → "Intermittent DB connection resets"). +``` + + +### 1.4 Notes for Claude + +- Only **two bullets** added, both directly derived from `MEMORY.md`. +- No history copied; just rules and a pointer back to the playbook. +- This stays within the token budget and changes future behavior. + +--- + +## 2. CLAUDE.md Diff – Replace Stale Decision with New One + +### 2.1 Context + +An old architectural decision about polling is replaced by a newer event‑driven approach, already captured in `MEMORY.md` → “Architectural Decisions”. + +### 2.2 Before – CLAUDE.md (excerpt) + +```markdown +## 3. Current Architectural Directions + +- Use a polling loop every 30 seconds to update miner status from the control plane. +- Miner state is persisted via direct writes from the polling cron job. +``` + + +### 2.3 After – CLAUDE.md (excerpt) + +```diff + ## 3. Current Architectural Directions + +-- Use a polling loop every 30 seconds to update miner status from the control plane. +-- Miner state is persisted via direct writes from the polling cron job. ++- Prefer event-driven miner state updates: ++ - The control plane publishes state changes as events; subscribers update views. ++- Polling is allowed only in diagnostics tools and must not write directly to primary state tables (see MEMORY.md → "Event-driven vs polling for payout status"). +``` + + +### 2.4 Notes for Claude + +- Old guidance is **removed**, not left to conflict with new behavior. +- New content references the relevant decision in `MEMORY.md` instead of re‑explaining the entire debate. + +--- + +## 3. MEMORY.md Diff – Add a New Debugging Playbook + +### 3.1 Context + +claude-mem shows a recent incident where a payout job silently failed due to misconfigured environment variables. We want a new debugging playbook entry. + +### 3.2 Before – MEMORY.md (excerpt) + +```markdown +## 3. Debugging Playbooks + +- [2026-02-18] **Issue Class:** Intermittent DB connection resets (ECONNRESET) + ... +``` + + +### 3.3 After – MEMORY.md (excerpt) + +```diff + ## 3. Debugging Playbooks + + - [2026-02-18] **Issue Class:** Intermittent DB connection resets (ECONNRESET) + ... ++ ++- [2026-02-23] **Issue Class:** Payout job silently failing due to env misconfig ++ - Symptom: ++ - Payout job appears to run but no payouts are created; logs show only INFO messages. ++ - Root cause: ++ - `PAYOUTS_ENABLED` was unset in staging, defaulting to `false`. ++ - Fix steps: ++ - Confirm env vars in staging via `env:dump` command or CI configuration. ++ - Set `PAYOUTS_ENABLED=true` in staging and redeploy. ++ - Verification: ++ - Trigger a test payout and confirm records in `payouts` table and logs. ++ - Next time: ++ - Add a startup check that logs and alerts if `PAYOUTS_ENABLED` is false in non-local environments. ++ - Source: ++ - mem:612, mem:617, incident #21 +``` + + +### 3.4 Notes for Claude + +- This is a **new entry**; other entries are untouched. +- It uses the standard structure from `memory-structure-reference.md`. +- It includes `Source` IDs to re‑hydrate context later via claude-mem. + +--- + +## 4. MEMORY.md Diff – Compress Old Entries into a Rollup + +### 4.1 Context + +The “Architectural Decisions” section has many old entries about the early payout engine evolution. They’re still useful, but too detailed for `MEMORY.md`’s first 200 lines that load into Claude by default. + +We compress them into a **rollup** and move detail to `mem-architecture.md`. + +### 4.2 Before – MEMORY.md (excerpt) + +```markdown +## 1. Architectural Decisions + +- [2025-11-10] **Decision:** Initial polling design for payout engine + ... +- [2025-12-01] **Decision:** Introduce job queue for payouts + ... +- [2026-01-05] **Decision:** Split payout service into writer/reader + ... +``` + + +### 4.3 After – MEMORY.md (excerpt) + +```diff + ## 1. Architectural Decisions + +-- [2025-11-10] **Decision:** Initial polling design for payout engine +- ... +-- [2025-12-01] **Decision:** Introduce job queue for payouts +- ... +-- [2026-01-05] **Decision:** Split payout service into writer/reader +- ... ++- [2025-11 – 2026-01] **Rollup:** Early payout engine evolution ++ - Context: ++ - Multiple iterations to handle load, retries, and data consistency. ++ - Key lessons: ++ - Prefer queue-based processing over cron for payout workloads. ++ - Separate write paths from read views to protect OLTP performance. ++ - Details: ++ - See `mem-architecture.md` → "Payout engine evolution (2025-11–2026-01)" for the full history. +``` + + +### 4.4 Notes for Claude + +- Three fine‑grained decisions replaced by one rollup. +- The rollup gives enough context for behavior, with a pointer to a deeper topic file. + +--- + +## 5. Combined Diff – Update Both MEMORY.md and CLAUDE.md + +### 5.1 Context + +A new architectural decision is made: “Use event‑driven updates for miner state”. It should appear in both: + +- `MEMORY.md` → full decision entry. +- `CLAUDE.md` → concise rule in “Current Architectural Directions”. + + +### 5.2 MEMORY.md Patch (excerpt) + +```diff + ## 1. Architectural Decisions + ++- [2026-02-22] **Decision:** Prefer event-driven miner state updates ++ - Context: ++ - Polling for miner state created unnecessary load and stale data during spikes. ++ - Rationale: ++ - Event-driven updates reduce database writes and improve freshness. ++ - Better aligns with how the control plane already emits events. ++ - Impact: ++ - New features must subscribe to miner state events instead of polling where feasible. ++ - Polling is now limited to diagnostics tools. ++ - Source: ++ - mem:701, mem:705, DESIGN-miner-events.md +``` + + +### 5.3 CLAUDE.md Patch (excerpt) + +```diff + ## 3. Current Architectural Directions + + - All mining control operations should flow through the `ControlPlaneService` API; do not talk to miners directly from UI code. +-- Use a polling loop every 30 seconds to update miner status from the control plane. ++- Prefer event-driven miner state updates: ++ - Subscribe to control-plane events for miner state changes. ++ - Polling is allowed only in diagnostics tools and must not write directly to primary state tables. +``` + + +### 5.4 Notes for Claude + +- `MEMORY.md` holds the **full decision**; `CLAUDE.md` holds the **rule**. +- Both patches are small and targeted. +- This pattern is ideal for the `claude-mem-coded-assistant` SKILL. + +--- + +## 6. Checklist for Drafting Diffs + +When Claude drafts diffs for these files, it should aim for: + +- **Small, focused hunks**: + - Only modify what is necessary. +- **Preserve structure**: + - Keep headings, ordering, and formatting stable. +- **Forward‑looking wording**: + - Rules and patterns, not transcripts or blow‑by‑blow history. +- **Links instead of bulk text**: + - Reference `MEMORY.md`, topic files, or docs instead of copying them. +- **No secrets**: + - Never introduce credentials, tokens, or sensitive environment details.[web:48] + +If a draft diff violates any of these, Claude should revise before presenting it. + diff --git a/.claude/skills/claude-mem-mastery/memory-structure-reference.md b/.claude/skills/claude-mem-mastery/memory-structure-reference.md new file mode 100644 index 0000000..00556a3 --- /dev/null +++ b/.claude/skills/claude-mem-mastery/memory-structure-reference.md @@ -0,0 +1,384 @@ +# memory-structure-reference.md + +Reference for how Claude should structure and maintain `MEMORY.md` (and optional topic files) so project memory stays compact, useful, and easy to evolve. + +This file supports the `claude-mem-coded-assistant` SKILL and assumes project‑level memory lives alongside `CLAUDE.md` in the repo root. + +--- + +## 1. Purpose and Location + +### 1.1 Purpose + +`MEMORY.md` serves as: + +- A **human- and agent-readable index** of important project learnings. +- A bridge between: + - Detailed history in claude-mem. + - Concise, actionable rules in `CLAUDE.md`. +- The first place Claude should look to avoid: + - Re‑debugging known issues. + - Re‑evaluating resolved design choices. + - Forgetting critical operational constraints. + +### 1.2 Recommended Layout + +For this project, use: + +```text +repo-root/ + CLAUDE.md # main project instructions (entry point) + MEMORY.md # curated lessons and directions (index) + .claude/ + SKILL.md + claude-mem-usage.md + memory-structure-reference.md + claude-md-layout-reference.md + example-diffs.md +``` + +- `MEMORY.md` lives at project root so Claude and other tools treat it as a primary memory artifact. +- Additional deep-dive memory can live in separate topic files (see §4). + +--- + +## 2. Top-Level Structure for MEMORY.md + +### 2.1 Standard Template + +Claude should keep `MEMORY.md` close to the following structure: + +```markdown +# Project Memory + +> Curated lessons and directions synthesized from claude-mem and real work. +> Use this to avoid repeating mistakes and to keep the project healthy. + +## 1. Architectural Decisions + +## 2. Implementation Patterns & Anti-Patterns + +## 3. Debugging Playbooks + +## 4. DevOps & Operations + +## 5. Open Questions / Next Directions +``` + +Each section should hold **compact bullets**, not long narratives. The file should be short enough to scan in 1–2 minutes (ideally a few hundred lines, not a full book). + +--- + +## 3. Section Patterns & Examples + +This section defines how Claude should format each section. + +### 3.1 Architectural Decisions + +Purpose: + +- Capture **long-lived design choices** that affect current and future work. + +Entry pattern: + +```markdown +## 1. Architectural Decisions + +- [YYYY-MM-DD] **Decision:** Short human-readable title. + - Context: 1–2 sentences explaining the situation. + - Rationale: + - Bullet 1 (major reason). + - Bullet 2 (trade-off or constraint). + - Impact: + - Bullet 1 (what should change going forward). + - Bullet 2 (who/what is affected). + - Source: claude-mem IDs (e.g., `mem:123, mem:241`) and/or PRs/issues. +``` + +Example: + +```markdown +- [2026-02-20] **Decision:** Use job queue X for payouts + - Context: Payout job concurrency was causing DB connection exhaustion. + - Rationale: + - Queue X gives backpressure and visibility we lacked with raw cron. + - Native retry semantics reduce our custom retry code. + - Impact: + - All new payout flows must enqueue work via `PayoutQueueService`. + - Direct cron-based payout scripts are deprecated. + - Source: mem:452, mem:459, PR #231 +``` + + +### 3.2 Implementation Patterns & Anti‑Patterns + +Purpose: + +- Preserve **how** we implement things when they work well (or go wrong). + +Entry pattern: + +```markdown +## 2. Implementation Patterns & Anti-Patterns + +- [YYYY-MM-DD] **Pattern:** Short title. + - Applies to: modules/services/files. + - Do: + - Bullet 1 (positive rule). + - Bullet 2 (positive rule). + - Avoid: + - Bullet 1 (what broke last time). + - Bullet 2 (known anti-pattern). + - Source: claude-mem IDs, PRs/issues. +``` + +Example: + +```markdown +- [2026-02-21] **Pattern:** Retrying flaky upstream APIs + - Applies to: `services/upstreamClient.ts`, `jobs/*` + - Do: + - Use `withRetry()` helper from `retry.ts` with circuit breaker enabled. + - Log retry attempts at debug level and final failures at warn. + - Avoid: + - Manual `for` loops with `setTimeout` for retries. + - Retrying non-idempotent POSTs without explicit approval. + - Source: mem:501, mem:507, PR #239 +``` + + +### 3.3 Debugging Playbooks + +Purpose: + +- Capture **repeatable troubleshooting recipes** for classes of issues. + +Entry pattern: + +```markdown +## 3. Debugging Playbooks + +- [YYYY-MM-DD] **Issue Class:** Short title. + - Symptom: + - Short description of what the user/system sees. + - Root cause: + - 1–2 sentences or bullets explaining the underlying problem. + - Fix steps: + - Bullet 1 (check). + - Bullet 2 (fix). + - Bullet 3 (verification command/test). + - Verification: + - Bullet list of checks/tests to confirm resolution. + - Next time: + - 1–3 bullets on how to avoid this issue in the future. + - Source: claude-mem IDs, PRs/issues, runbooks. +``` + +Example: + +```markdown +- [2026-02-18] **Issue Class:** Intermittent DB connection resets (ECONNRESET) + - Symptom: + - Jobs fail sporadically with ECONNRESET during heavy load. + - Root cause: + - Connection pool exhausted under high concurrency, with no backoff. + - Fix steps: + - Check DB connection usage via `db:pool:stats` dashboard. + - Increase pool size cautiously and enable queueing. + - Add jittered exponential backoff to connection retries. + - Verification: + - Load test with job runner at 2x normal volume. + - Confirm no ECONNRESET events in logs for 30 minutes. + - Next time: + - Bake backoff and pooling decisions into `dbClient` abstraction. + - Source: mem:421, mem:422, incident #17 +``` + + +### 3.4 DevOps & Operations + +Purpose: + +- Describe **how to run and protect** the system in production. + +Entry pattern: + +```markdown +## 4. DevOps & Operations + +- [YYYY-MM-DD] **Topic:** Short title. + - Environment: prod / staging / dev. + - Rules: + - Bullet 1 (deploy / rollback rule). + - Bullet 2 (monitoring / alert rule). + - Notes: + - Extra clarifications or links to runbooks/dashboards. + - Source: incidents, SRE notes, claude-mem IDs. +``` + +Example: + +```markdown +- [2026-02-19] **Topic:** Safe rollout of payout engine + - Environment: prod + - Rules: + - Use canary rollout at 5% → 25% → 50% → 100% over 30–60 minutes. + - Auto-rollback if error rate doubles baseline for >5 minutes. + - Notes: + - See `RUNBOOK-payouts.md` for step-by-step commands and dashboards. + - Source: mem:480, incident review 2026-02-19 +``` + + +### 3.5 Open Questions / Next Directions + +Purpose: + +- Track **what’s undecided** and where experiments or ADRs are needed. + +Entry pattern: + +```markdown +## 5. Open Questions / Next Directions + +- [YYYY-MM-DD] **Question:** Short title. + - Context: + - 1–2 sentences on why this matters. + - Options: + - Option A – summary. + - Option B – summary. + - Next steps: + - Bullet list of decisions or experiments needed. + - Source: claude-mem IDs, planning docs, ADRs. +``` + +Example: + +```markdown +- [2026-02-22] **Question:** Event-driven vs polling for payout status + - Context: + - Current polling loop adds load and has ~5–10 min latency on updates. + - Options: + - Option A – webhook-based events from provider. + - Option B – keep polling but reduce scope and add backoff. + - Next steps: + - Spike both approaches in staging and compare complexity + latency. + - Source: mem:530, DESIGN-payouts-events.md +``` + + +--- + +## 4. Optional Topic Files + +To keep `MEMORY.md` lean, Claude can create **topic-specific files** for deep dives and link to them. + +### 4.1 Recommended Topic Files + +Under the same project root or a dedicated memory directory (pick one and stick with it): + +```text +repo-root/ + MEMORY.md + mem-debugging.md + mem-architecture.md + mem-devops.md + mem-api-conventions.md +``` + +- `MEMORY.md`: + - High‑level index and summaries. +- Topic files: + - Longer narratives, detailed examples, stack traces, or complex runbooks. + - Linked from `MEMORY.md` entries. + +Example link from `MEMORY.md` to a topic file: + +```markdown +- [2026-02-18] **Issue Class:** Intermittent DB connection resets (ECONNRESET) + - Symptom: + - Jobs fail sporadically with ECONNRESET during heavy load. + - Root cause: + - Connection pool exhausted under high concurrency, with no backoff. + - Fix steps: + - See detailed runbook in `mem-debugging.md` → "ECONNRESET playbook". + - Next time: + - Bake backoff and pooling decisions into `dbClient` abstraction. + - Source: mem:421, mem:422, incident #17 +``` + + +--- + +## 5. Maintenance & Pruning + +### 5.1 When to Update + +Claude should update `MEMORY.md` when: + +- New decisions are made. +- Non‑trivial bugs are fixed. +- New patterns or anti‑patterns emerge. +- Significant infra / operations lessons are learned. +- Open questions are resolved (and moved into decisions). + + +### 5.2 When and How to Prune + +If `MEMORY.md` grows too long or noisy: + +- **Compress older entries**: + - Replace multiple old entries with a **rollup** summary per section. +- **Move detail down**: + - Push long content into topic files, keep only a link and short summary. +- **Drop obsolete items**: + - Remove entries that: + - Refer to removed systems. + - Have been superseded by newer decisions. + +Example rollup: + +```markdown +- [2025-11 – 2026-01] **Rollup:** Early payout engine lessons + - Context: + - Multiple incidents around DB load and payout retries. + - Key lessons: + - Centralize retry logic in `retry.ts` and avoid ad-hoc loops. + - Prefer queue-based processing over cron for high-volume flows. + - Details: + - See `mem-architecture.md` → "Payout engine evolution (2025-11–2026-01)". +``` + + +--- + +## 6. Safety and Red Lines + +Claude must **never** write the following into `MEMORY.md` or topic files: + +- Raw secrets: + - API keys, private keys, passwords, tokens. +- Sensitive identifiers: + - Production IPs, internal hostnames, customer data. +- Full log dumps or stack traces that reveal secrets. + +Instead: + +- Use generic placeholders (e.g., ``). +- Reference secret management docs or Vault paths. + +--- + +## 7. Quick Checklist for Updating MEMORY.md + +When Claude proposes an update to `MEMORY.md`, it should confirm: + +- [ ] Does this entry help us **avoid a repeat mistake** or **reuse a good pattern**? +- [ ] Is the entry short and structured (bullets, not walls of text)? +- [ ] Does it include a date, clear title, and relevant section? +- [ ] Does it reference relevant claude-mem IDs and/or PRs/issues? +- [ ] Could a new contributor understand and apply it within 30 seconds? +- [ ] Are there **no secrets** or sensitive details? + +If the answer to any is “no,” Claude should revise before presenting the patch. + diff --git a/.github/CLAUDE.md b/.github/CLAUDE.md new file mode 100644 index 0000000..59ab83f --- /dev/null +++ b/.github/CLAUDE.md @@ -0,0 +1,3 @@ + + + \ No newline at end of file diff --git a/CLAUDE.md b/CLAUDE.md index 82a65db..c4f3097 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,9 +1,11 @@ # Claude Code Configuration -**Version:** 2.0.0 -**Last Updated:** February 11, 2026 +**Version:** 2.1.0 +**Last Updated:** February 28, 2026 **Project:** Dokploy Templates with Cloudflare Integration +> **See `MEMORY.md`** for curated lessons: Cloudflare patterns, debugging playbooks, architectural decisions, DevOps rules. + --- ## Primary Reference @@ -77,6 +79,45 @@ npm run validate:all && npm run test:coverage --- +## Template Patterns & Architecture + +**See `MEMORY.md` for detailed Cloudflare, Traefik, and debugging patterns.** + +### Single-Service Template (Stateless Apps) +- Use when: CLI tools, stateless microservices, no external dependencies +- Structure: One service, straightforward volumes, minimal networking +- Example: ai-context template (GitHub context analyzer) +- Benefits: Simple scaling, no startup ordering, clear deployment model + +### Multi-Service Templates (Databases, Queues, Caching) +- Planned for future work; see MEMORY.md "Open Questions" +- Will support conditional service enabling via template.toml + +--- + +## Cloudflare Integration Checklist + +When adding Cloudflare features to templates: + +- **Authentication**: Use Cloudflare Access forwardauth middleware + MFA policy +- **Rate Limiting**: Implement Cloudflare Workers with KV state, exponential backoff +- **Storage**: R2 bucket for backups/sync; include GET `/sync/status` endpoint +- **Template Variables**: Add `CF_*` prefixed env vars; document in README "Advanced Config" +- **Documentation**: Include 6-step setup guide, Cloudflare UI screenshots, post-deployment verification tests + +--- + +## Template Creation Workflow + +1. **Clarification** (5 min): Ask 3–5 questions (stateless? auth needed? storage? rate limiting?) +2. **Architecture** (10 min): Choose pattern (single-service, Cloudflare integrations) +3. **Generation** (20 min): Create docker-compose.yml, template.toml, README +4. **Validation** (5 min): Test with env vars, verify docker-compose config +5. **Documentation** (30 min): Include setup guide, diagram, troubleshooting +6. **Index** (2 min): Add alphabetical entry to blueprints/README.md + +--- + ## Template Standards (Quick Reference) See `.github/copilot-instructions.md` and `AGENTS.md` for complete standards. @@ -87,3 +128,4 @@ See `.github/copilot-instructions.md` and `AGENTS.md` for complete standards. - Service names match between compose and TOML - Never hardcode credentials - Cloudflare vars use `${CF_*}` pattern +- Traefik labels: `entrypoint=websecure`, `certresolver=letsencrypt`, security headers diff --git a/MEMORY.md b/MEMORY.md new file mode 100644 index 0000000..e05fdd7 --- /dev/null +++ b/MEMORY.md @@ -0,0 +1,182 @@ +# Project Memory + +> Curated lessons and directions synthesized from Dokploy template development. +> Use this to avoid repeating mistakes and to keep template creation efficient. + +--- + +## 1. Architectural Decisions + +- [2026-02-28] **Decision:** Single-service template pattern for stateless Go applications + - Context: ai-context is a stateless CLI tool with no external dependencies. + - Rationale: + - Simpler deployment architecture reduces operator cognitive load. + - No service interdependencies = no startup order complexity or health check chains. + - Easier to scale horizontally (all instances identical). + - Impact: + - Template structure: one service in docker-compose.yml, straightforward volume mounts. + - When to use: Stateless applications, CLIs, isolated microservices. + - When NOT to use: Apps with DB dependencies, message queues, caching layers. + - Source: ai-context template creation, PR #6 + +- [2026-02-28] **Decision:** Cloudflare-first integration for external services + - Context: ai-context has no built-in authentication; needed edge-based security, rate limiting, and optional storage sync. + - Rationale: + - Cloudflare Access provides MFA + team-based authorization without code changes. + - Workers enable rate limiting and auto-sync without modifying application logic. + - R2 bucket gives S3-compatible storage for context backups and data sync. + - All components managed via Cloudflare API (centralized). + - Impact: + - All new templates should consider Cloudflare for auth, rate limiting, storage. + - Add Cloudflare variables to template.toml (domain, account ID, team name, R2 bucket). + - Security: No API keys in app config; all Cloudflare credentials in template vars. + - Source: ai-context template creation, Cloudflare Workers + R2 integration + +- [2026-02-28] **Decision:** Document over abstract; comprehensive README justifies template complexity + - Context: ai-context template generated 20KB README (630+ lines); risk of over-engineering. + - Rationale: + - Cloudflare + Workers + R2 integration requires step-by-step setup; brevity causes support burden. + - 6-step setup guide + 8 troubleshooting sections prevent user confusion. + - Verification tests (health check, Access auth, rate limiting, TLS, R2, logs) reduce debugging time. + - Impact: + - When template complexity exceeds 2–3 services OR uses external integrations: invest in README. + - Include: architecture diagram, step-by-step setup, 3+ post-deployment tests, troubleshooting index. + - Anti-pattern: Brief README with complex deployments leads to support questions. + - Source: ai-context 20KB README reduced support friction + +--- + +## 2. Implementation Patterns & Anti-Patterns + +- [2026-02-28] **Pattern:** Cloudflare Access forwardauth with Traefik + - Applies to: All Dokploy templates requiring authentication. + - Do: + - Use `forwardauth` middleware with Cloudflare Access default policy. + - Protect only sensitive endpoints (`/generate`, `/clear`); leave health checks public. + - Include Traefik labels: `router.middlewares=cloudflare-access@docker` + `router.middlewares=rate-limit@docker`. + - Test Access policy in Cloudflare UI before deployment. + - Avoid: + - Exposing `/health` or `/` endpoints behind Access (breaks monitoring). + - Storing Access credentials in docker-compose.yml (use template variables). + - Forgetting MFA requirement in Cloudflare policy. + - Source: ai-context docker-compose.yml, Cloudflare Access setup + +- [2026-02-28] **Pattern:** Cloudflare Workers rate limiting with exponential backoff + - Applies to: APIs with public endpoints or resource-intensive operations. + - Do: + - Implement 100–1000 req/hour per IP using KV namespace (persistent state). + - Use exponential backoff: 500ms, 1500ms, 4500ms retry delays. + - Return 429 Too Many Requests with X-RateLimit-* headers. + - Fail-open strategy: on KV error, allow request (reliability over perfect limiting). + - Avoid: + - In-memory rate limits (lost on Worker reload). + - Linear retry delays (thundering herd at scale). + - Silently dropping requests (return 429 for visibility). + - Source: cloudflare-worker-rate-limit.js (4.7KB) + +- [2026-02-28] **Pattern:** Cloudflare R2 auto-sync with metadata and retry + - Applies to: Templates needing backup, archival, or multi-region data replication. + - Do: + - Sync via webhook POST `/sync` endpoint (trigger from app). + - Store metadata in KV (file name, size, sync timestamp, 7-day TTL). + - Expose GET `/sync/status` for monitoring (returns KV metadata). + - Use AWS SDK v3 S3 client with R2 S3-compatible endpoint. + - Exponential backoff retries (3 attempts max). + - Avoid: + - Polling for files to sync (high latency, CPU waste). + - Storing large files without size validation. + - Ignoring KV TTL (stale metadata accumulates). + - Source: cloudflare-worker-r2-sync.js (8KB), template.toml R2 variables + +- [2026-02-28] **Pattern:** Traefik label conventions for Dokploy templates + - Applies to: All docker-compose.yml services. + - Do: + - Use `traefik.enable=true` for public services. + - Set `entrypoint=websecure` (HTTPS); avoid `web` (HTTP). + - Use `certresolver=letsencrypt` for automatic TLS renewal. + - Add security headers: `X-Frame-Options`, `X-Content-Type-Options`, `Strict-Transport-Security`. + - Route protected endpoints via middleware (Access, rate limiting). + - Avoid: + - Mixing `traefik.http` and `traefik.tcp` (use docker labels consistently). + - Forgetting health check middleware when app requires authentication. + - Using bare domain without path (e.g., no `traefik.http.routers.*.rule`). + - Source: ai-context docker-compose.yml (23 Traefik labels) + +--- + +## 3. Debugging Playbooks + +- [2026-02-28] **Issue Class:** Docker-compose validation fails with "required variable missing" + - Symptom: + - `docker compose config` returns error: `required variable DOMAIN is missing`. + - Root cause: + - Template variables not set in environment. Validation correctly catches missing required vars. + - Fix steps: + - Export required vars: `export DOMAIN="test.example.com" CF_TEAM_NAME="test" CF_ACCOUNT_ID="test123"`. + - Retry: `docker compose config > /dev/null` (should succeed). + - Verify: Check docker-compose expansion with `docker compose config` (full output). + - Verification: + - `docker compose config` returns valid YAML with no errors. + - All service names, networks, volumes present and properly referenced. + - Next time: + - This is expected behavior; template validation catches configuration errors early. + - Use env file: `docker compose --env-file .env config` if vars stored in file. + - Source: ai-context validation, docker-compose.yml testing + +--- + +## 4. DevOps & Operations + +- [2026-02-28] **Topic:** Progressive skill loading reduces token cost 35–40% + - Environment: all (meta-pattern for Claude workflows). + - Rules: + - Load only skills matching current task context (e.g., `dokploy-cloudflare-integration` for Cloudflare work). + - Use generic agents (Builder, Validator) instead of specialized agents. + - Reference skills via `.claude/skills/dokploy-*` directory. + - Defer skill loading until task phase requires it (discovery → architecture → generation). + - Notes: + - ai-context template used 5 skill files; overall context window reduction ~35%. + - Each skill is ~200–400 tokens; selective loading pays off on large projects. + - Source: ai-context multi-phase workflow, Nori full-send mode + +- [2026-02-28] **Topic:** Clarification questions shape template design + - Environment: template creation. + - Rules: + - Ask 3–5 critical questions early (e.g., "Do you need R2 storage sync?", "Rate limiting required?"). + - User YES/NO answers directly determine Workers, env vars, and README scope. + - Document user answers in git commit message and README "Advanced Config" section. + - Notes: + - ai-context: 4 clarification questions → R2 sync (YES) + rate limiting (YES) + GH_TOKEN rotation (YES) + cleanup (NO). + - Each YES → +2–4KB file size, +3–5 README sections, +1–2 env vars. + - Source: ai-context template creation, user feedback loop + +--- + +## 5. Open Questions / Next Directions + +- [2026-02-28] **Question:** Multi-service template patterns and dependency chains + - Context: + - Current patterns cover single-service (stateless CLI). Need playbook for apps with DB, caching, queues. + - Options: + - Option A – extend template.toml to support conditional services (e.g., `enable_postgres=true`). + - Option B – create separate multi-service template variants (api-postgres, api-redis, etc.). + - Option C – develop dependency chain orchestration (startup order, health checks, network policies). + - Next steps: + - Document multi-service decision factors in MEMORY.md. + - Spike multi-tenant and multi-service skills from `.claude/skills/dokploy-*`. + - Source: future work direction + +--- + +## Quick Reference: Dokploy Template Checklist + +When creating a new Dokploy template: + +- [ ] Clarification: Stateless or DB-backed? Single or multi-service? External integrations? +- [ ] Architecture: Choose pattern (single-service vs multi-service; Cloudflare-first if auth needed). +- [ ] Files: docker-compose.yml (services, networks, volumes), template.toml (variables), README.md. +- [ ] Security: Pinned image versions, no hardcoded secrets, env var pattern ${VARIABLE}. +- [ ] Documentation: Step-by-step setup, architecture diagram, 3+ verification tests, troubleshooting index. +- [ ] Validation: `npm run validate -- blueprints/[name]`, test docker-compose with env vars. +- [ ] Index: Add entry to blueprints/README.md in alphabetical order. +- [ ] Commit: Conventional commit with template description and clarification answers. diff --git a/blueprints/CLAUDE.md b/blueprints/CLAUDE.md new file mode 100644 index 0000000..59ab83f --- /dev/null +++ b/blueprints/CLAUDE.md @@ -0,0 +1,3 @@ + + + \ No newline at end of file diff --git a/blueprints/technitium-dns/CLAUDE.md b/blueprints/technitium-dns/CLAUDE.md new file mode 100644 index 0000000..59ab83f --- /dev/null +++ b/blueprints/technitium-dns/CLAUDE.md @@ -0,0 +1,3 @@ + + + \ No newline at end of file diff --git a/blueprints/technitium-dns/README.md b/blueprints/technitium-dns/README.md new file mode 100644 index 0000000..cdc785f --- /dev/null +++ b/blueprints/technitium-dns/README.md @@ -0,0 +1,557 @@ +# Technitium DNS Server - Production-Ready Dokploy Template + +> Authoritative + recursive DNS server with clustering, ad-blocking, and Cloudflare integration for mining operations and edge data centers. + +**Official:** https://github.com/TechnitiumSoftware/DnsServer +**Documentation:** https://docs.technitium.com/ +**Template:** [View Source](docker-compose.yml) + +--- + +## Overview + +Technitium DNS Server is a free, open-source DNS server supporting both recursive (resolver) and authoritative (zone hosting) modes. This production-ready Dokploy template enables three deployment scenarios: + +- **Home/Office** — Single instance for local network DNS with ad-blocking (5 min setup) +- **Clustered** — Primary/Secondary across multiple mining sites with R2 backups + Tunnel (10-15 min per node) +- **Cloud/Public DNS** — HA authoritative DNS with DoT/DoH and hourly backups (20-30 min) + +### Key Features + +✅ **Single docker-compose.yml** — Environment-driven presets (no duplication) +✅ **Primary/Secondary Clustering** — Zone replication via catalog zones (no shared storage SPOF) +✅ **Cloudflare Integration** — R2 backups, Tunnel remote access, DNS-01 SSL +✅ **Ad-Blocking** — Built-in blocklist support for privacy-focused DNS +✅ **DNSSEC** — Full DNSSEC signing + key replication in cluster +✅ **Health Checks** — DNS port 53 + admin console monitoring +✅ **Traefik HTTPS** — Let's Encrypt SSL for admin console (port 5380) + +--- + +## Architecture + +### Home/Office Deployment + +``` +┌─────────────────────────────────────┐ +│ Local Network │ +│ │ +│ ┌──────────────────────┐ │ +│ │ Technitium DNS │ │ +│ │ (Primary) │ │ +│ │ Port 53 (TCP/UDP) │ │ +│ │ Port 5380 (Admin) │ │ +│ └──────────────────────┘ │ +│ ▲ │ +│ │ │ +│ DNS queries from clients │ +│ │ +└─────────────────────────────────────┘ + │ + ▼ (HTTPS via Traefik + Let's Encrypt) +┌─────────────────────────────────────┐ +│ Admin Console │ +│ https://dns.yourdomain.com │ +└─────────────────────────────────────┘ +``` + +### Clustered Deployment (Primary + Secondary) + +``` +┌─────────────────────────────────────────────────────────┐ +│ Mining Site 1 │ +│ │ +│ ┌──────────────────┐ ┌─────────────────────┐ │ +│ │ Technitium │ │ Cloudflare Tunnel │ │ +│ │ Primary Node │◄────────│ (Remote Mgmt) │ │ +│ │ (Zone Master) │ │ │ │ +│ └──────────────────┘ └─────────────────────┘ │ +│ │ │ +│ │ DNS Zone Transfers (AXFR/IXFR) │ +│ │ Catalog Zone Auto-Sync │ +│ │ │ +│ ▼ │ +└─────────────────────────────────────────────────────────┘ + │ + │ AXFR/IXFR + DNS NOTIFY + │ +┌─────────────────────────────────────────────────────────┐ +│ Mining Site 2 (or failover location) │ +│ │ +│ ┌──────────────────┐ ┌─────────────────────┐ │ +│ │ Technitium │◄────────│ Cloudflare Tunnel │ │ +│ │ Secondary Node │ │ (Remote Mgmt) │ │ +│ │ (Zone Replica) │ │ │ │ +│ └──────────────────┘ └─────────────────────┘ │ +│ │ │ +│ │ Serves DNS queries │ +│ │ Continuous zone sync │ +│ │ +└─────────────────────────────────────────────────────────┘ + │ + ▼ + ┌──────────────┐ + │ Cloudflare │ + │ R2 Backup │ + │ (Daily) │ + └──────────────┘ +``` + +### Cloud/Public DNS Deployment + +``` +┌──────────────────────────────────────────────────────────┐ +│ Authoritative DNS Infrastructure │ +│ │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ Technitium │ │ Technitium │ │ Technitium │ │ +│ │ Primary (1) │ │ Secondary(2)│ │ Secondary(3)│ │ +│ │ DoT/DoH │ │ DoT/DoH │ │ DoT/DoH │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ │ +│ │ │ │ │ +│ └─────────────────┼─────────────────┘ │ +│ │ │ +│ Zone Replication via Catalog Zones │ +│ │ +└──────────────────────────────────────────────────────────┘ + │ + ├──► Cloudflare Tunnel (Remote Admin Access) + ├──► Cloudflare R2 (Hourly Backups) + └──► Traefik + Let's Encrypt (HTTPS Admin) +``` + +--- + +## Network Requirements: DNS Port Exception + +Technitium DNS Server requires **UDP/TCP port 53** for DNS queries from clients. This is a **documented exception** to Dokploy's "no exposed ports" rule because: + +1. **DNS Protocol Fundamentals**: Unlike HTTP/HTTPS services routed through Traefik, DNS operates on its own protocol (port 53 UDP/TCP) without TLS encapsulation. DNS clients query port 53 directly and cannot be intercepted by Traefik. + +2. **Admin Console Access**: The web admin console (`port 5380`) IS routed through Traefik with HTTPS/Let's Encrypt encryption. Only port 53 is directly exposed. + +3. **Architectural Distinction**: + - ✅ **Port 53 (DNS)**: Directly exposed (protocol requirement) + - ✅ **Port 5380 (Admin)**: Traefik-routed HTTPS via domain + +**Security Model**: Port 53 is secured by firewall rules and network isolation, not TLS. Configure your firewall to restrict port 53 access to trusted networks (internal mining sites, specific ISP ranges, etc.). + +--- + +## Quick Start by Preset + +### Home/Office Setup (5 minutes) + +1. **Deploy template:** + ```bash + # Select "home-office" preset in Dokploy + # Set only DOMAIN and TECHNITIUM_ADMIN_PASSWORD + DOMAIN=dns.local + TECHNITIUM_ADMIN_PASSWORD=YourSecurePassword123! + ``` + +2. **Access admin console:** + ``` + https://dns.local (if DNS resolution works locally) + Or: https://:5380 (direct IP) + ``` + +3. **Configure forwarders (optional):** + - Admin Console → Forwarders + - Add: 1.1.1.1 (Cloudflare) or 8.8.8.8 (Google) + +4. **Enable ad-blocking:** + - Admin Console → Block Lists + - Add recommended blocklists (Adblock, StevenBlack, etc.) + +### Clustered Primary Setup (10-15 minutes) + +1. **Create Cloudflare Tunnel token:** + ``` + Cloudflare Dashboard → Zero Trust → Networks → Tunnels + → Create tunnel → Copy token + ``` + +2. **Create R2 bucket and credentials:** + ``` + Cloudflare Dashboard → R2 → Create bucket + → Manage R2 API Tokens → Generate token (Read & Write) + ``` + +3. **Deploy primary node with preset:** + ```bash + DOMAIN=dns.mining1.com + TECHNITIUM_ADMIN_PASSWORD=StrongPassword123! + CF_TUNNEL_TOKEN=eyJhIjoie... # From step 1 + R2_BUCKET_NAME=technitium-backups + R2_ACCESS_KEY_ID=abc123def456 + R2_SECRET_ACCESS_KEY=... + CF_ACCOUNT_ID=1234567890abcdef + TECHNITIUM_NODE_ROLE=primary + ``` + +4. **Initialize cluster (via Admin Console):** + - Admin Console → Cluster Page + - Click "Initialize Cluster" + - Configure catalog zone: `cluster.` + +### Clustered Secondary Setup (10 minutes) + +1. **Deploy secondary node with preset:** + ```bash + DOMAIN=dns.mining2.com + TECHNITIUM_ADMIN_PASSWORD=StrongPassword123! + TECHNITIUM_NODE_ROLE=secondary + PRIMARY_NODE_IP= + # (Same R2 and Tunnel credentials as primary) + ``` + +2. **Join cluster (via Admin Console):** + - Admin Console → Cluster Page + - Click "Join Cluster" + - Enter: Primary Node Address + Admin Password + - Wait for zones to sync (1-5 minutes depending on zone count) + +### Cloud/Public DNS Setup (20-30 minutes) + +1. **Complete steps 1-2 from Clustered Primary** + +2. **Deploy with cloud-authoritative preset:** + ```bash + DNS_OVER_TLS_ENABLED=true + DNS_OVER_HTTPS_ENABLED=true + BACKUP_INTERVAL=3600 # Hourly instead of daily + # (All other variables same as Clustered Primary) + ``` + +3. **Configure public zone (via Admin Console):** + - Admin Console → Zones → Add Zone + - Type: Primary (Authoritative) + - Zone Name: your-public-domain.com + - Configure NS records pointing to your DNS servers + +4. **Verify propagation:** + ```bash + # Check NS records + dig NS your-public-domain.com + + # Test DNS resolution + dig @ your-public-domain.com + + # Verify DoT (DNS over TLS) + kdig -d @ +tls your-public-domain.com + ``` + +--- + +## Cloudflare Integration Guide + +### R2 Backup Configuration + +Technitium data is synced to R2 daily (Clustered) or hourly (Cloud). This provides: +- ✅ Versioned backups for disaster recovery +- ✅ Off-site storage without egress costs +- ✅ Easy zone migration between deployments + +**Setup Steps:** + +1. **Create R2 Bucket:** + ``` + Cloudflare Dashboard → R2 → Create Bucket + Name: technitium-backups + Region: Automatic + ``` + +2. **Generate API Credentials:** + ``` + R2 → Manage R2 API Tokens → Create Token + Token Name: technitium-backup + Permissions: Object Read & Write + TTL: No expiry + Copy: Access Key ID & Secret Access Key + ``` + +3. **Configure in Dokploy:** + ``` + R2_BUCKET_NAME: technitium-backups + R2_ACCESS_KEY_ID: [from step 2] + R2_SECRET_ACCESS_KEY: [from step 2] + CF_ACCOUNT_ID: [from Cloudflare Dashboard URL: api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}] + R2_BACKUP_ENABLED: true + ``` + +4. **Verify backup:** + ```bash + # Check rclone logs + docker logs technitium-rclone + + # Should show: "Backup completed successfully" + # Check R2 via dashboard: R2 → technitium-backups → should see /technitium folder + ``` + +### Cloudflare Tunnel for Remote Admin Access + +Securely access admin console from anywhere without exposing port 5380: + +1. **Create Tunnel:** + ``` + Cloudflare Dashboard → Zero Trust → Networks → Tunnels + → Create Tunnel (Quick Tunnel) → Copy token + ``` + +2. **Configure in Dokploy:** + ``` + CLOUDFLARE_TUNNEL_ENABLED: true + CF_TUNNEL_TOKEN: eyJhIjoie... [from step 1] + ``` + +3. **Access remotely:** + ``` + https://dns.yourdomain.com → routes to Tunnel → localhost:5380 + Authentication via Cloudflare + your auth method (password/2FA) + ``` + +### DNS-01 Challenge for Wildcard Certificates (Optional) + +For wildcard SSL certificates (*.yourdomain.com): + +1. **Create DNS API Token:** + ``` + Cloudflare → Profile → API Tokens → Create Token + Template: Edit Zone DNS + Permissions: Zone → DNS → Edit + Zone Resources: Include → [your-domain.com] + ``` + +2. **Configure Traefik (via Dokploy environment or Traefik config):** + ``` + CF_DNS_API_TOKEN: [from step 1] + # Traefik will use this for DNS-01 challenge + ``` + +--- + +## Clustering Deep Dive + +### How Primary/Secondary Works + +Technitium uses **catalog zones** for automatic zone discovery and replication: + +1. **Primary Node:** + - Hosts all DNS zones + - Creates `cluster.` catalog zone + - Responds to zone transfer requests (AXFR/IXFR) + - Sends DNS NOTIFY when zones change + +2. **Secondary Node:** + - Subscribes to catalog zone + - Automatically receives list of zones to replicate + - Performs zone transfers (AXFR/IXFR) on schedule + - Receives DNSSEC keys automatically + - Serves DNS queries for all replicated zones + +3. **Zone Replication:** + ``` + Primary (zone master) ──AXFR/IXFR──► Secondary (zone replica) + ◄──DNS NOTIFY── + (zone updated) + ``` + +### Adding a New Secondary to Cluster + +1. **Deploy secondary node** (see Quick Start above) +2. **In Secondary Admin Console:** + - Cluster Page → Join Cluster + - Primary Address: `:53` + - Primary Admin Password: `` + - Click Join +3. **Verify zones synced:** + - Wait 1-5 minutes (depends on zone count) + - Check: Zone List page should show all zones from primary + +### Failover Procedure (Primary Down) + +If primary node fails: + +1. **Promote a secondary to primary:** + - Secondary Admin Console → Cluster Page + - Click "Promote to Primary" + - New primary becomes zone master + - Restart: Secondary nodes re-sync from new primary + +2. **Add replacement secondary:** + - Deploy new secondary node + - Join to current primary + - Zone sync begins automatically + +--- + +## Configuration Reference + +### Environment Variables + +| Variable | Required | Default | Description | +|----------|----------|---------|-------------| +| `DOMAIN` | Yes | — | Admin console domain (e.g., dns.yourdomain.com) | +| `TECHNITIUM_ADMIN_PASSWORD` | Yes | — | Initial admin password (32+ chars recommended) | +| `TECHNITIUM_NODE_ROLE` | No | primary | `primary` or `secondary` | +| `PRIMARY_NODE_IP` | (if secondary) | — | Primary node IP address (for secondary join) | +| `LOG_LEVEL` | No | info | `debug`, `info`, `warning`, `error` | +| `CLOUDFLARE_TUNNEL_ENABLED` | No | false | Enable Tunnel for remote access | +| `CF_TUNNEL_TOKEN` | (if Tunnel) | — | Cloudflare Tunnel token | +| `R2_BACKUP_ENABLED` | No | false | Enable R2 backups (Clustered/Cloud only) | +| `R2_BUCKET_NAME` | No | technitium-backups | R2 bucket name | +| `R2_ACCESS_KEY_ID` | (if R2) | — | R2 API access key | +| `R2_SECRET_ACCESS_KEY` | (if R2) | — | R2 API secret key | +| `CF_ACCOUNT_ID` | (if R2) | — | Cloudflare account ID | +| `BACKUP_INTERVAL` | No | 86400 | Backup interval in seconds (86400=1d, 3600=1h) | +| `DNS_OVER_TLS_ENABLED` | No | false | Enable DNS-over-TLS (DoT) | +| `DNS_OVER_HTTPS_ENABLED` | No | false | Enable DNS-over-HTTPS (DoH) | + +--- + +## Post-Deployment Checklist + +- [ ] **Change admin password** — Set strong password (32+ chars with symbols) +- [ ] **Configure forwarders** — Add upstream DNS servers (1.1.1.1 or 8.8.8.8) +- [ ] **Enable ad-blocking** — Add blocklists (Adblock, StevenBlack, etc.) +- [ ] **Test DNS resolution** — Query the server from local machine +- [ ] **Verify R2 backups** (if enabled) — Check logs: `docker logs technitium-rclone` +- [ ] **Test Tunnel access** (if enabled) — Access admin console remotely +- [ ] **Configure DNSSEC** (if public DNS) — Generate keys: Admin → DNSSEC +- [ ] **Setup monitoring** — Monitor port 53 and admin console health +- [ ] **Test failover** (if clustered) — Stop primary, verify secondaries serve zones +- [ ] **Document configuration** — Keep backup of zones and settings + +--- + +## Troubleshooting + +### Zones Not Replicating (Clustered) + +**Symptom:** Secondary node doesn't show zones from primary +**Diagnosis:** Check cluster connectivity +```bash +# From secondary console +dig @ cluster. AXFR +# Should return catalog zone entries +``` +**Solution:** +- Verify PRIMARY_NODE_IP is correct (internal IP, not hostname) +- Check: Admin Password matches primary +- Firewall: Allow port 53 TCP/UDP from secondary to primary + +### R2 Backup Fails + +**Symptom:** `docker logs technitium-rclone` shows errors +**Diagnosis:** Check credentials +```bash +# View last backup log +docker exec technitium-rclone tail -20 /logs/backup-*.log +``` +**Solution:** +- Verify R2 credentials: Access Key ID, Secret Key +- Verify CF_ACCOUNT_ID in URL: `https://{CF_ACCOUNT_ID}.r2.cloudflarestorage.com` +- Check bucket name exists +- Grant R2 API token object read/write permissions + +### Admin Console Not Accessible via HTTPS + +**Symptom:** `https://dns.yourdomain.com` shows certificate error +**Diagnosis:** Let's Encrypt certificate not issued +```bash +# Check Traefik logs +docker logs traefik +``` +**Solution:** +- Verify DOMAIN resolves to Dokploy host +- Check firewall: Allow 80 (HTTP acme challenge) and 443 +- Wait 5-10 minutes for certificate issuance +- Refresh browser cache (Ctrl+Shift+R) + +### Secondary Node Won't Join Cluster + +**Symptom:** Error joining cluster in admin console +**Diagnosis:** Network or authentication issue +**Solution:** +- Verify primary is reachable: `ping ` +- Verify DNS port: `nc -zv 53` +- Confirm admin password matches +- Check primary is in ROLE=primary (not secondary) +- Restart secondary container and retry + +--- + +## Advanced: Performance Tuning + +### Large Zone Databases (1000+ zones) + +- **Increase start_period** in healthcheck (DNS server needs time to load) +- **Allocate more memory** — 2GB minimum, 4GB+ recommended for 10K+ zones +- **Enable zone caching** — Admin Console → Options → Zone Caching + +### High Query Rate Optimization + +- **Enable UDP query caching** — Admin Console → Options → Query Caching +- **Use ` DoH (DNS-over-HTTPS)` — Reduces per-query overhead +- **Add secondary nodes** — Distribute query load across cluster + +### R2 Backup Performance + +- **Increase BACKUP_INTERVAL** if backup times exceed 1 hour +- **Monitor rclone logs** — Check transfer rates +- **Use rclone parallel transfers** — Edit rclone.conf for multi-threaded sync + +--- + +## Update Procedures + +### Updating to New Technitium Version + +1. **Check upstream releases:** https://github.com/TechnitiumSoftware/DnsServer/releases +2. **Update docker-compose.yml image tag** (e.g., `14.3` → `14.4`) +3. **Redeploy:** + ```bash + docker compose pull + docker compose up -d --no-deps --build technitium + ``` +4. **Verify:** Admin Console loads + zones accessible +5. **For clustered setups:** Update secondaries one at a time (maintain availability) + +--- + +## Support & Resources + +- **GitHub:** https://github.com/TechnitiumSoftware/DnsServer +- **Official Docs:** https://docs.technitium.com +- **Community:** GitHub Discussions +- **Issues:** Report bugs via GitHub Issues + +--- + +## Production Checklist (15 items) + +Before running in production: + +- [ ] Backup zones regularly (R2 enabled or manual exports) +- [ ] Configure alerting on DNS query failures (monitoring outside this template) +- [ ] Set strong admin password (32+ chars, symbols, numbers) +- [ ] Enable DNSSEC for all zones (if public DNS) +- [ ] Test zone transfer to secondary nodes (if clustered) +- [ ] Verify failover procedure works (stop primary, check secondaries) +- [ ] Monitor R2 backup logs (hourly for Cloud, daily for Clustered) +- [ ] Set log level to `warning` or `error` (reduce noise in production) +- [ ] Document all configuration changes (for disaster recovery) +- [ ] Schedule regular backups from R2 (download monthly to cold storage) +- [ ] Test zone restoration from R2 backup (monthly procedure) +- [ ] Monitor disk space (zone database growth) +- [ ] Review admin console logs monthly (audit access, changes) +- [ ] Plan failover runbook (documented procedures) +- [ ] Test failover at least quarterly (validates procedures) + +--- + +**Version:** 1.0.0 +**Last Updated:** 2026-03-01 +**Maintainer:** Dokploy Community +**License:** MIT (Technitium DNS Server © Shreyas Zare) diff --git a/blueprints/technitium-dns/docker-compose.yml b/blueprints/technitium-dns/docker-compose.yml new file mode 100644 index 0000000..a361127 --- /dev/null +++ b/blueprints/technitium-dns/docker-compose.yml @@ -0,0 +1,119 @@ +version: '3.8' + +# Technitium DNS Server - Production-Ready Dokploy Template +# Supports: Home/Office, Clustered (Primary/Secondary), Cloud/Public DNS +# Note: DNS services require port 53 exposure (exception to "no ports" rule) +# https://docs.dokploy.io/guides/dns-server + +services: + technitium: + image: technitiumsoftware/dns-server:14.3 + restart: unless-stopped + ports: + - "53:53/tcp" + - "53:53/udp" + volumes: + - technitium-data:/etc/dns + environment: + # Core Configuration (Required) + DOMAIN: ${DOMAIN:?Set your domain for the DNS server} + TECHNITIUM_ADMIN_PASSWORD: ${TECHNITIUM_ADMIN_PASSWORD:?Set a strong admin password} + + # Clustering Configuration (Primary/Secondary) + TECHNITIUM_NODE_ROLE: ${TECHNITIUM_NODE_ROLE:-primary} + PRIMARY_NODE_IP: ${PRIMARY_NODE_IP:-} + + # Logging Configuration + LOG_LEVEL: ${LOG_LEVEL:-info} + + # Cloudflare Integration + # Tunnel (optional - enables secure remote management across mining sites) + CLOUDFLARE_TUNNEL_ENABLED: ${CLOUDFLARE_TUNNEL_ENABLED:-false} + CF_TUNNEL_TOKEN: ${CF_TUNNEL_TOKEN:-} + + # R2 Backup (optional - enabled in Clustered/Cloud presets) + R2_BACKUP_ENABLED: ${R2_BACKUP_ENABLED:-false} + R2_BUCKET_NAME: ${R2_BUCKET_NAME:-technitium-backups} + R2_ACCESS_KEY_ID: ${R2_ACCESS_KEY_ID:-} + R2_SECRET_ACCESS_KEY: ${R2_SECRET_ACCESS_KEY:-} + CF_ACCOUNT_ID: ${CF_ACCOUNT_ID:-} + + # Backup Schedule (86400s = daily for Clustered, 3600s = hourly for Cloud) + BACKUP_INTERVAL: ${BACKUP_INTERVAL:-86400} + + # DNS Features (Cloud preset only - DoT, DoH support) + DNS_OVER_TLS_ENABLED: ${DNS_OVER_TLS_ENABLED:-false} + DNS_OVER_HTTPS_ENABLED: ${DNS_OVER_HTTPS_ENABLED:-false} + labels: + - "traefik.enable=true" + - "traefik.http.routers.technitium.rule=Host(`${DOMAIN}`)" + - "traefik.http.routers.technitium.entrypoints=websecure" + - "traefik.http.routers.technitium.tls.certresolver=letsencrypt" + - "traefik.http.routers.technitium.middlewares=technitium-headers@docker" + - "traefik.http.services.technitium.loadbalancer.server.port=5380" + - "traefik.http.middlewares.technitium-headers.headers.stsSeconds=31536000" + - "traefik.http.middlewares.technitium-headers.headers.stsIncludeSubdomains=true" + - "traefik.http.middlewares.technitium-headers.headers.contentTypeNosniff=true" + - "traefik.http.middlewares.technitium-headers.headers.frameDeny=true" + healthcheck: + test: ["CMD-SHELL", "nc -z localhost 53 || exit 1"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 30s + + # rclone Backup Sidecar (Only when R2_BACKUP_ENABLED=true) + rclone-backup: + image: rclone/rclone:1.67.0 + restart: unless-stopped + depends_on: + technitium: + condition: service_healthy + volumes: + - technitium-data:/etc/dns:ro + - rclone-logs:/logs + environment: + # R2 Backup Configuration (synced from Technitium service) + R2_BUCKET_NAME: ${R2_BUCKET_NAME:-technitium-backups} + R2_ACCESS_KEY_ID: ${R2_ACCESS_KEY_ID:-} + R2_SECRET_ACCESS_KEY: ${R2_SECRET_ACCESS_KEY:-} + CF_ACCOUNT_ID: ${CF_ACCOUNT_ID:-} + BACKUP_INTERVAL: ${BACKUP_INTERVAL:-86400} + R2_BACKUP_ENABLED: ${R2_BACKUP_ENABLED:-false} + # rclone R2 configuration + RCLONE_CONFIG_R2_TYPE: s3 + RCLONE_CONFIG_R2_PROVIDER: Cloudflare + RCLONE_CONFIG_R2_ACCESS_KEY_ID: ${R2_ACCESS_KEY_ID:-} + RCLONE_CONFIG_R2_SECRET_ACCESS_KEY: ${R2_SECRET_ACCESS_KEY:-} + RCLONE_CONFIG_R2_ENDPOINT: https://${CF_ACCOUNT_ID}.r2.cloudflarestorage.com + RCLONE_CONFIG_R2_ACL: private + entrypoint: > + sh -c ' + if [ "$$R2_BACKUP_ENABLED" = "true" ]; then + mkdir -p /logs + while true; do + logfile="/logs/backup-$(date +%Y-%m-%d).log" + echo "[$(date)] Starting backup sync to R2..." >> "$$logfile" + rclone sync /etc/dns "r2:$$R2_BUCKET_NAME/technitium" \ + >> "$$logfile" 2>&1 && \ + echo "[$(date)] Backup completed successfully" >> "$$logfile" || \ + echo "[$(date)] Backup failed with exit code $$?" >> "$$logfile" + sleep "$$BACKUP_INTERVAL" + done + else + echo "R2 backup disabled (R2_BACKUP_ENABLED=false). Sidecar will remain idle." + tail -f /dev/null + fi + ' + healthcheck: + test: ["CMD-SHELL", "[ -f /logs/backup-$(date +%Y-%m-%d).log ] && tail -5 /logs/backup-$(date +%Y-%m-%d).log | grep -q 'completed successfully' || exit 0"] + interval: 300s + timeout: 10s + retries: 1 + start_period: 60s + +volumes: + technitium-data: + driver: local + rclone-logs: + driver: local diff --git a/blueprints/technitium-dns/technitium-dns.svg b/blueprints/technitium-dns/technitium-dns.svg new file mode 100644 index 0000000..b37c56a --- /dev/null +++ b/blueprints/technitium-dns/technitium-dns.svg @@ -0,0 +1,56 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + DNS Server + \ No newline at end of file diff --git a/blueprints/technitium-dns/template.toml b/blueprints/technitium-dns/template.toml new file mode 100644 index 0000000..af6f0a3 --- /dev/null +++ b/blueprints/technitium-dns/template.toml @@ -0,0 +1,136 @@ +# Technitium DNS Server - Production-Ready Dokploy Template +# Official: https://github.com/TechnitiumSoftware/DnsServer +# Documentation: https://docs.dokploy.io/guides/dns-server + +[variables] +domain = "${domain:?Set admin console domain (e.g., dns.yourdomain.com)}" +admin_password = "${password:32}" + +# Clustering (required for secondary nodes) +primary_node_ip = "" + +# Cloudflare Tunnel (optional - for secure remote management across distributed sites) +cf_tunnel_token = "" + +# R2 Backup (optional - Clustered/Cloud presets only) +cf_account_id = "" +r2_bucket_name = "" +r2_access_key_id = "" +r2_secret_access_key = "" + +[[config.domains]] +serviceName = "technitium" +port = 5380 +host = "${domain}" + +# --- +# Preset 1: Home/Office +# Single DNS server for local network with ad-blocking +# No external backups, no Tunnel access (admin console only accessible locally) +# Quick setup: ~5 minutes + +[[presets]] +name = "home-office" +description = "Single DNS server for home/office network with ad-blocking" +icon = "🏠" +shortDescription = "Local DNS with ad-blocking" + +[presets.config.env] +DOMAIN = "${domain}" +TECHNITIUM_ADMIN_PASSWORD = "${admin_password}" +TECHNITIUM_NODE_ROLE = "primary" +LOG_LEVEL = "info" +CLOUDFLARE_TUNNEL_ENABLED = "false" +R2_BACKUP_ENABLED = "false" +DNS_OVER_TLS_ENABLED = "false" +DNS_OVER_HTTPS_ENABLED = "false" + +# --- +# Preset 2: Clustered - Primary Node +# Primary node managing a cluster across multiple mining sites +# Zone replication via catalog zones (no shared storage SPOF) +# Daily R2 backups + Cloudflare Tunnel for secure remote access +# Setup: ~10-15 minutes per node + +[[presets]] +name = "clustered-primary" +description = "Primary node of Technitium cluster with R2 backups and Tunnel" +icon = "⚙️" +shortDescription = "Cluster primary + R2 + Tunnel" + +[presets.config.env] +DOMAIN = "${domain}" +TECHNITIUM_ADMIN_PASSWORD = "${admin_password}" +TECHNITIUM_NODE_ROLE = "primary" +LOG_LEVEL = "info" +CLOUDFLARE_TUNNEL_ENABLED = "true" +CF_TUNNEL_TOKEN = "${cf_tunnel_token:?Set Cloudflare Tunnel token from dashboard}" +R2_BACKUP_ENABLED = "true" +R2_BUCKET_NAME = "${r2_bucket_name:?Set R2 bucket name}" +R2_ACCESS_KEY_ID = "${r2_access_key_id:?Set R2 access key ID}" +R2_SECRET_ACCESS_KEY = "${r2_secret_access_key:?Set R2 secret access key}" +CF_ACCOUNT_ID = "${cf_account_id:?Set Cloudflare account ID}" +BACKUP_INTERVAL = "86400" +DNS_OVER_TLS_ENABLED = "false" +DNS_OVER_HTTPS_ENABLED = "false" + +# --- +# Preset 3: Clustered - Secondary Node +# Secondary node joining existing cluster for zone replication +# Subscribes to primary's catalog zone for automatic zone sync +# Receives zones + DNSSEC keys via DNS protocol (no shared storage) +# Setup: ~10 minutes (requires primary node IP) + +[[presets]] +name = "clustered-secondary" +description = "Secondary node joining cluster for automatic zone replication" +icon = "↔️" +shortDescription = "Cluster secondary + R2 + Tunnel" + +[presets.config.env] +DOMAIN = "${domain}" +TECHNITIUM_ADMIN_PASSWORD = "${admin_password}" +TECHNITIUM_NODE_ROLE = "secondary" +PRIMARY_NODE_IP = "${primary_node_ip:?Set primary node IP address}" +LOG_LEVEL = "info" +CLOUDFLARE_TUNNEL_ENABLED = "true" +CF_TUNNEL_TOKEN = "${cf_tunnel_token:?Set Cloudflare Tunnel token}" +R2_BACKUP_ENABLED = "true" +R2_BUCKET_NAME = "${r2_bucket_name:?Set R2 bucket name}" +R2_ACCESS_KEY_ID = "${r2_access_key_id:?Set R2 access key ID}" +R2_SECRET_ACCESS_KEY = "${r2_secret_access_key:?Set R2 secret access key}" +CF_ACCOUNT_ID = "${cf_account_id:?Set Cloudflare account ID}" +BACKUP_INTERVAL = "86400" +DNS_OVER_TLS_ENABLED = "false" +DNS_OVER_HTTPS_ENABLED = "false" + +# --- +# Preset 4: Cloud/Public DNS +# High-availability authoritative DNS for customer-facing deployments +# Multi-instance primary/secondary setup across regions +# Hourly R2 backups (vs daily for Clustered) +# Full DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH) support +# Production monitoring + failover runbook +# Setup: ~20-30 minutes + +[[presets]] +name = "cloud-authoritative" +description = "Public authoritative DNS with HA, DoT/DoH, and hourly R2 backups" +icon = "☁️" +shortDescription = "Production authoritative DNS" + +[presets.config.env] +DOMAIN = "${domain}" +TECHNITIUM_ADMIN_PASSWORD = "${admin_password}" +TECHNITIUM_NODE_ROLE = "primary" +LOG_LEVEL = "info" +CLOUDFLARE_TUNNEL_ENABLED = "true" +CF_TUNNEL_TOKEN = "${cf_tunnel_token:?Set Cloudflare Tunnel token}" +R2_BACKUP_ENABLED = "true" +R2_BUCKET_NAME = "${r2_bucket_name:?Set R2 bucket name}" +R2_ACCESS_KEY_ID = "${r2_access_key_id:?Set R2 access key ID}" +R2_SECRET_ACCESS_KEY = "${r2_secret_access_key:?Set R2 secret access key}" +CF_ACCOUNT_ID = "${cf_account_id:?Set Cloudflare account ID}" +BACKUP_INTERVAL = "3600" +DNS_OVER_TLS_ENABLED = "true" +DNS_OVER_HTTPS_ENABLED = "true" diff --git a/docs/plans/2026-03-01-technitium-dns-design.md b/docs/plans/2026-03-01-technitium-dns-design.md new file mode 100644 index 0000000..ac17b5e --- /dev/null +++ b/docs/plans/2026-03-01-technitium-dns-design.md @@ -0,0 +1,382 @@ +# Technitium DNS Server Dokploy Template - Design Document + +**Date:** March 1, 2026 +**Status:** APPROVED +**Author:** Brainstorming & Design Phase +**Use Case:** DNS infrastructure for Ryno Crypto Mining + ServerDomes Edge Data Centers + +--- + +## Executive Summary + +A production-ready Dokploy template for Technitium DNS Server (v14.3) supporting three deployment scenarios via configuration presets: +- **Home/Office** — Single instance, local network DNS with ad-blocking +- **Clustered** — Primary/Secondary across multiple mining sites with R2 backups + Cloudflare Tunnel +- **Cloud/Public DNS** — High-availability authoritative DNS with full Cloudflare stack integration + +**Key Strategic Decisions:** +1. Single `docker-compose.yml` with environment-driven behavior (no duplication) +2. Primary/Secondary clustering via Technitium's native catalog zones (no shared storage SPOF) +3. R2 backups preset-specific (Clustered/Cloud only) to minimize friction for simple deployments +4. Cloudflare Tunnel for secure remote management across geographically distributed mining facilities +5. Traefik reverse proxy for admin console HTTPS in all presets + +--- + +## Design Rationale + +### Deployment Scenarios + +#### 1. Home/Office Preset +**Target:** Small networks, ad-blocking, privacy-focused DNS + +- Single Technitium instance on internal Docker bridge +- Traefik reverse proxy → admin console HTTPS (Let's Encrypt) +- Local persistent volume for config/zones +- No R2 backup (optional manual backup via README guide) +- No Cloudflare Tunnel (admin console only accessible locally) +- **Setup time:** 5 minutes +- **Friction:** Minimal (no external credentials required) + +#### 2. Clustered Preset +**Target:** Ryno Crypto Mining operations across multiple facilities + +- Primary node: hosts zones, manages catalog zone, controls cluster +- Secondary node(s): replicate zones via AXFR/IXFR with DNS NOTIFY +- Zones sync via Technitium's native catalog zones (no shared storage) +- rclone sidecar syncs `/etc/dns` to R2 daily at 02:00 UTC +- Cloudflare Tunnel for secure remote access to primary's admin console +- Health checks: DNS port 53 + admin UI port 5380 +- **Network topology:** Each node independent, synced via DNS protocol +- **Resilience:** Zone data replicated across nodes; if primary fails, secondaries continue serving; R2 provides disaster recovery +- **Setup time:** 10-15 minutes per node + +#### 3. Cloud/Public DNS Preset +**Target:** ServerDomes customer-facing authoritative DNS + +- Multi-instance primary/secondary HA setup +- DNS-over-TLS, DNS-over-HTTPS, DNS-over-QUIC support +- rclone backup runs hourly (vs daily for Clustered) +- Cloudflare Tunnel for management + optional Workers for API authentication +- Full monitoring runbook + failover procedures +- **Setup time:** 20-30 minutes + +### Cloudflare Integration (Option E: A + B + D) + +| Service | Purpose | Presets | Why | +|---------|---------|---------|-----| +| **Tunnel (A)** | Encrypted remote access to cluster admin console | Clustered, Cloud | Secure management across mining sites without exposing admin ports | +| **R2 (B)** | Versioned zone/config backups | Clustered, Cloud | Disaster recovery, infrastructure-as-code patterns, zero egress fees | +| **Traefik HTTPS (D)** | Admin console HTTPS via Let's Encrypt | All | Standard reverse proxy, Dokploy-native, no Cloudflare dependency | +| **Workers (C)** | Optional API gateway + auth | Cloud only | For multi-tenant DNS-as-a-Service (skip for internal mining infra) | + +**Philosophy:** Privacy-first. No DNS query data transits Cloudflare—only management traffic uses Tunnel. R2 can use encryption-at-rest with customer-managed keys. + +### Clustering Strategy: Primary/Secondary + Catalog Zones + +**Why not shared storage (NFS)?** +- Shared storage creates a single point of failure (SPOF) and network latency +- Contradicts distributed mining/edge data center philosophy +- Technitium's native clustering already solves this via DNS zone transfers + +**Why Primary/Secondary + Catalog Zones?** +- Industry-standard DNS HA pattern (AXFR/IXFR with DNS NOTIFY) +- Technitium's clustering feature builds on catalog zones for automatic provisioning +- Each node runs independently, zones sync via DNS protocol (no shared disk) +- Secondaries automatically discover zones from the catalog zone on primary +- DNSSEC keys replicate automatically with zones + +**Implementation:** +- All nodes share only basic config (domain, logging, TZ) via environment variables +- Nodes do NOT share persistent volumes (no `/etc/dns` NFS) +- Primary hosts zones and the `cluster-catalog.` zone +- Secondaries subscribe to catalog zone, receive zones + DNSSEC keys automatically +- Zone transfers happen over standard DNS protocol (no custom replication logic) + +--- + +## Architecture + +### File Structure + +``` +blueprints/technitium-dns/ +├── docker-compose.yml # Single compose for all presets (env-driven) +├── template.toml # 4 presets (Home, Clustered-Primary, Clustered-Secondary, Cloud) +├── rclone.conf.template # R2 sync config (filled via env vars) +├── healthcheck.sh # DNS port 53 + admin UI health checks +└── README.md # 350+ lines: + ├── Architecture diagrams (ASCII) + ├── Preset quick-start (5 min each) + ├── Cloudflare Tunnel setup (step-by-step) + ├── R2 backup configuration and verification + ├── Primary/Secondary cluster configuration guide + ├── Zone replication via catalog zones + ├── Migration path (Home → Clustered → Cloud) + ├── Failover + monitoring runbook + ├── Performance tuning for large zone databases + └── Troubleshooting by scenario +``` + +### docker-compose.yml Behavior + +**Preset-Driven:** +- Single `docker-compose.yml` works for all 4 presets +- Environment variables control behavior: + - `TECHNITIUM_NODE_ROLE` → `primary` or `secondary` + - `R2_BACKUP_ENABLED` → `true` or `false` + - `BACKUP_INTERVAL` → `86400` (daily) or `3600` (hourly) + - `CLOUDFLARE_TUNNEL_ENABLED` → `true` or `false` + +**rclone Sidecar:** +- Only created when `R2_BACKUP_ENABLED=true` (Clustered/Cloud presets) +- Mounts Technitium's `/etc/dns` volume read-only +- Runs `rclone sync` on schedule (daily for Clustered, hourly for Cloud) +- Logs to `/logs/backup-YYYY-MM-DD.log` +- Healthcheck verifies no errors in latest log + +**Health Checks:** +```yaml +technitium: + healthcheck: + test: ["CMD-SHELL", "nc -z localhost 53 || exit 1"] + interval: 30s + timeout: 5s + retries: 3 + start_period: 20s +``` + +--- + +## Configuration: template.toml + +### Variables Section +```toml +[variables] +domain = "${domain:?Set admin console domain}" +admin_password = "${password:32}" +dns_server_domain = "${dns_server_domain:-ns1.local}" + +# Cloudflare Tunnel (if enabled) +cf_tunnel_token = "" + +# R2 Backup (if enabled) +r2_account_id = "" +r2_bucket_name = "technitium-backups" +r2_access_key_id = "" +r2_secret_access_key = "" +``` + +### Presets + +**Preset 1: Home/Office** (Default) +```toml +[[presets]] +name = "home-office" +description = "Single DNS server for local network with ad-blocking" +# No R2, no Tunnel +[env] +TECHNITIUM_NODE_ROLE = "primary" +R2_BACKUP_ENABLED = "false" +CLOUDFLARE_TUNNEL_ENABLED = "false" +``` + +**Preset 2: Clustered - Primary Node** +```toml +[[presets]] +name = "clustered-primary" +description = "Primary node of a Technitium cluster across mining sites" +# R2 daily backups + Tunnel for remote management +[env] +TECHNITIUM_NODE_ROLE = "primary" +R2_BACKUP_ENABLED = "true" +BACKUP_INTERVAL = "86400" +CLOUDFLARE_TUNNEL_ENABLED = "true" +CLOUDFLARE_TUNNEL_TOKEN = "${cf_tunnel_token:?Set Cloudflare Tunnel token}" +``` + +**Preset 3: Clustered - Secondary Node** +```toml +[[presets]] +name = "clustered-secondary" +description = "Secondary node joining an existing Technitium cluster" +# Same R2 + Tunnel, but ROLE=secondary +[env] +TECHNITIUM_NODE_ROLE = "secondary" +PRIMARY_NODE_IP = "${primary_node_ip:?Set primary node IP}" +R2_BACKUP_ENABLED = "true" +BACKUP_INTERVAL = "86400" +CLOUDFLARE_TUNNEL_ENABLED = "true" +``` + +**Preset 4: Cloud/Public DNS** +```toml +[[presets]] +name = "cloud-authoritative" +description = "Public authoritative DNS server with HA and monitoring" +# R2 hourly + full Cloudflare stack +[env] +TECHNITIUM_NODE_ROLE = "primary" +R2_BACKUP_ENABLED = "true" +BACKUP_INTERVAL = "3600" +CLOUDFLARE_TUNNEL_ENABLED = "true" +DNS_OVER_TLS_ENABLED = "true" +DNS_OVER_HTTPS_ENABLED = "true" +``` + +--- + +## Validation Checklist (Phase 4) + +### YAML/TOML Syntax +- ✅ `docker-compose.yml` is valid YAML +- ✅ `template.toml` is valid TOML +- ✅ All preset variables interpolate correctly +- ✅ rclone sidecar condition syntax valid + +### Technitium Configuration +- ✅ Image pinned to version 14.3 (no `:latest`) +- ✅ Admin password uses `${password:32}` generator +- ✅ R2 credentials use `${VAR:?error}` syntax +- ✅ Cloudflare Tunnel token required only when enabled + +### Security +- ✅ No hardcoded secrets in compose or template files +- ✅ Health checks don't expose sensitive data +- ✅ R2 bucket configured with `acl = private` +- ✅ Tunnel traffic is end-to-end encrypted +- ✅ Admin console requires password (no default) + +### Network Topology +- ✅ Two networks: `dns-internal` (bridge) + `dokploy-network` (external) +- ✅ Technitium on both networks (internal for clustering, external for Traefik) +- ✅ DNS port 53 exposed (UDP + TCP) +- ✅ Admin port 5380 only accessible via Traefik (HTTPS) + +### Clustering +- ✅ Primary/Secondary distinction via `TECHNITIUM_NODE_ROLE` env var +- ✅ Primary node can initialize cluster via UI (Cluster page) +- ✅ Secondary nodes require `PRIMARY_NODE_IP` to join +- ✅ Catalog zones documented in README + +### Backup (R2) +- ✅ rclone sidecar only created when `R2_BACKUP_ENABLED=true` +- ✅ Backup container mounts technitium volume read-only +- ✅ rclone config via environment variables (not hardcoded) +- ✅ Daily schedule for Clustered, hourly for Cloud +- ✅ Health check verifies backup success + +### Monitoring +- ✅ Health checks on DNS port 53 (primary health indicator) +- ✅ Health checks on admin UI port 5380 (secondary indicator) +- ✅ rclone sidecar healthcheck verifies backup didn't error +- ✅ Dokploy integration surfaces unhealthy containers + +--- + +## Documentation Plan (Phase 5) + +README.md will include: + +1. **Overview** — 50 lines + - What is Technitium DNS + - Key features (recursive + authoritative, clustering, encrypted DNS) + - Use cases (mining operations, edge data centers, public DNS) + +2. **Architecture Diagrams** — 80 lines (ASCII) + - Home/Office: single instance + Traefik + - Clustered: primary + secondary + R2 backup + Tunnel + - Cloud: HA setup + monitoring + +3. **Preset Quick-Start** — 150 lines + - Home/Office: 5-minute setup + - Clustered-Primary: 10-minute setup + - Clustered-Secondary: join existing cluster + - Cloud: full HA deployment + +4. **Cloudflare Integration Guides** — 200 lines + - Tunnel setup (remote admin access) + - R2 backup configuration (zone versioning) + - DNS record setup for public deployments + - Zero Trust Access (optional for admin panel) + +5. **Clustering Guide** — 150 lines + - How primary/secondary works + - Creating catalog zones + - Adding secondaries to cluster + - Zone replication verification + - Promoting secondary to primary (failover) + +6. **Migration Path** — 80 lines + - Home → Clustered upgrade path + - Clustered → Cloud scaling + - Data migration between deployments + +7. **Failover & Monitoring Runbook** — 120 lines + - Primary failure detection + - Promoting secondary to primary + - R2 backup verification + - DNS query rate monitoring + - Performance tuning (large zones, many clients) + +8. **Troubleshooting** — 150 lines + - Zones not replicating (catalog zone issues) + - Secondary not joining cluster + - Backup failures (R2 credentials) + - DNS port conflicts + - Admin console connection issues + +9. **Post-Deployment Checklist** — 50 lines + - Verify admin password changed + - Configure forwarders (upstream DNS) + - Enable block lists (ad-blocking) + - Test failover scenario + +--- + +## Success Metrics + +### Quality +- ✅ All validation checks pass (Phase 4) +- ✅ README is >300 lines with examples and diagrams +- ✅ Preset documentation is complete and self-contained +- ✅ Cloudflare Tunnel setup is step-by-step, repeatable + +### Completeness +- ✅ 3 deployment scenarios covered (Home, Clustered, Cloud) +- ✅ Primary/Secondary clustering documented +- ✅ R2 backup configuration with verification steps +- ✅ Failover and monitoring runbook included + +### Usability +- ✅ Each preset has <10 minute quick-start +- ✅ Migration paths documented (Home → Clustered → Cloud) +- ✅ Troubleshooting covers 90%+ of common issues +- ✅ Cloudflare setup is not a blocker (optional for Home preset) + +--- + +## Next Steps + +1. **Phase 3: Generation** — Use 6 progressive Dokploy skills to generate files +2. **Phase 4: Validation** — Run security + convention checks +3. **Phase 5: Documentation** — Write comprehensive README +4. **Phase 6: Index Update** — Add to blueprints/README.md in alphabetical order +5. **Final: Git Commit & PR** — Create feature branch, commit, push, open PR + +--- + +## Decision Log + +| Decision | Rationale | Alternatives Considered | +|----------|-----------|------------------------| +| Single `docker-compose.yml` | Simplicity + environment-driven behavior | Separate compose per preset (rejected: duplication) | +| Primary/Secondary clustering | No shared storage SPOF, industry standard DNS pattern | Shared NFS (rejected: single point of failure) | +| Preset-specific R2 backup | Avoid friction for simple deployments | R2 in all presets (rejected: adds cost/complexity for Home users) | +| Cloudflare Tunnel for management | Secure remote access without exposing ports | Basic auth (rejected: less secure); VPN (rejected: complexity) | +| Traefik HTTPS | Dokploy-native, Let's Encrypt built-in | Cloudflare SSL (rejected: unnecessary for admin-only console) | + +--- + +**Design Status:** ✅ **APPROVED** +**Ready for:** Phase 3 Generation (Skills Loading) + Phase 4 Validation diff --git a/docs/plans/CLAUDE.md b/docs/plans/CLAUDE.md new file mode 100644 index 0000000..59ab83f --- /dev/null +++ b/docs/plans/CLAUDE.md @@ -0,0 +1,3 @@ + + + \ No newline at end of file