daytonaio · zergzorg · May 20, 2026 · May 20, 2026 · May 22, 2026
diff --git a/authors/zergzorg.md b/authors/zergzorg.md
@@ -0,0 +1,5 @@
+Author: zergzorg Title: Open-source Contributor Description: zergzorg contributes
+practical developer workflow documentation and focused open-source improvements,
+with an emphasis on reproducible environments, clear validation steps, and safe
+handling of operational data. Author Image: Author LinkedIn: Author Twitter:
+Company Name: Independent Company Logo Dark: Company Logo White:
diff --git a/definitions/20260520_definition_transcript_redaction_workflow.md b/definitions/20260520_definition_transcript_redaction_workflow.md
@@ -0,0 +1,21 @@
+---
+title: 'Transcript Redaction Workflow'
+description: 'A repeatable process for reviewing and masking sensitive details in transcripts before sharing them.'
+date: 2026-05-20
+author: 'zergzorg'
+---
+
+# Transcript Redaction Workflow
+
+## Definition
+
+A transcript redaction workflow is a repeatable process for finding, reviewing, and masking sensitive details in generated transcripts before those transcripts are shared with another system, team, or customer.
+It usually combines deterministic checks, human review, and an audit trail that records which transcript was reviewed and when it was cleared for handoff.
+
+## Context and Usage
+
+AI transcription tools can turn recordings into useful text quickly, but raw transcripts may include email addresses, phone numbers, private hostnames, customer names, incident identifiers, API tokens, or internal project names.
+A redaction workflow gives teams a small safety gate between "the model produced text" and "the transcript is safe to paste into a ticket, knowledge base, prompt, or retrieval system."
+
+In a cloud development environment such as Daytona, teams can keep this process reproducible by storing sample commands, review checklists, local ignore rules, and redaction scripts alongside the transcription project.
+The key is to treat raw recordings and raw transcripts as temporary working files, then promote only reviewed artifacts into downstream documentation or automation.
diff --git a/guides/20260520_privacy_safe_sapat_transcription_daytona.md b/guides/20260520_privacy_safe_sapat_transcription_daytona.md
@@ -0,0 +1,353 @@
+---
+title: 'Privacy-Safe Sapat Transcription in Daytona'
+description: 'Run Sapat in Daytona, review sensitive transcript data locally, and hand off only cleared transcript artifacts.'
+date: 2026-05-20
+author: 'zergzorg'
+tags: ['daytona', 'sapat', 'ai transcription', 'privacy']
+---
+
+# Privacy-Safe Sapat Transcription in Daytona
+
+## Introduction
+
+AI transcription is useful because it turns recordings into text that teams can search, summarize, and reuse. It also creates a new review problem.
+A raw transcript can contain email addresses, phone numbers, customer names, internal service names, incident identifiers, private hostnames, or copied API tokens that someone mentioned during a demo.
+Before that text moves into a ticket, a knowledge base, a retrieval pipeline, or another model prompt, it needs a small safety gate.
+
+This guide shows how to run [Sapat](https://github.com/nkkko/sapat) inside a Daytona workspace and add a local [transcript redaction workflow](../definitions/20260520_definition_transcript_redaction_workflow.md) around it.
+Sapat converts video files to MP3 with `ffmpeg`, sends the audio to a selected provider, and writes a `.txt` file next to the source video.
+The workflow below keeps credentials and raw recordings out of commits, stages transcripts in predictable folders, applies deterministic redaction checks, and produces a reviewed artifact that is easier to share safely.
+
+![Privacy-safe Sapat transcription workflow](assets/20260520_privacy_safe_sapat_transcription_daytona_img1.svg)
+
+## TL;DR
+
+- Create a Daytona workspace from the Sapat repository so setup, commands, and working folders are repeatable.
+- Configure one supported provider with environment variables, then keep `.env`, recordings, and raw transcript files out of Git.
+- Run Sapat against one recording first, then scale to a directory of `.mp4` files only after the review path works.
+- Add a local review gate that masks common sensitive patterns and forces a human pass before the transcript is used downstream.
+- Hand off only the redacted transcript, a small manifest, and the command notes needed to reproduce the run.
+
+## Scope and non-overlap
+
+This workflow intentionally uses Sapat's existing provider path instead of adding a new transcription provider.
+The current Sapat CLI already routes `--api openai`, `--api groq`, and `--api azure`; this guide wraps that path with workspace isolation, repeatable validation, and transcript review controls.
+
+That makes the guide complementary to provider-specific work.
+If your team later adopts another Sapat provider, the same safety model still applies: keep provider secrets local, stage raw recordings in ignored workspace folders, run one small sample first, redact before handoff, and record the command that produced the reviewed artifact.
+
+## Prerequisites
+
+You need the Daytona CLI installed and authenticated, a GitHub account that can clone public repositories, Python 3.6 or newer, and `ffmpeg` available in the workspace.
+Sapat currently supports Azure OpenAI, Groq Cloud, and OpenAI transcription APIs, so you also need credentials for at least one of those providers.
+
+This guide uses OpenAI in examples because it is the shortest configuration path. The same structure works with `--api groq` or `--api azure` when the matching environment variables are present.
+Do not paste real API keys into documentation, GitHub comments, or shared chat logs. Keep secrets in `.env` and use placeholders when writing notes.
+
+## Step 1: Create the Daytona workspace
+
+Start by creating a workspace directly from the Sapat repository. The `--code` flag opens the workspace in your configured editor after Daytona finishes provisioning it.
+
+```bash
+daytona create https://github.com/nkkko/sapat --code
+```
+
+Inside the workspace, confirm that Python and `ffmpeg` are available, then install the Python dependencies.
+
+```bash
+python --version
+ffmpeg -version
+pip install -r requirements.txt
+pip install -e .
+```
+
+If `ffmpeg` is not available, install it through the package manager available in your workspace image. For a Debian or Ubuntu based image, this is usually:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ffmpeg
+```
+
+The editable install is convenient while you are reading or testing the CLI because it exposes the `sapat` command without requiring a wheel build. If you prefer the package flow from the Sapat README, you can run `python -m build` and install the generated wheel from `dist/`.
+
+Before adding credentials, run source-level checks that do not call any provider API.
+They confirm that the `sapat` entry point runs and that the CLI and provider modules compile in the Daytona workspace.
+
+```bash
+sapat --help
+python -m py_compile \
+  src/sapat/script.py \
+  src/sapat/transcription/base.py \
+  src/sapat/transcription/openai.py \
+  src/sapat/transcription/groq.py \
+  src/sapat/transcription/azure.py
+```
+
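+Add a short provider-upload decision note before the first real transcription run.
+It keeps the review boundary explicit and gives you a local audit trail. The ignore rules added in Step 2 cover only `.txt` files and `SHA256SUMS` under `workspace/review/`, so keep this note unstaged if it names real recordings.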
+
+```bash
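+# Create the review folder first; Step 2 repeats this with mkdir -p, which is harmless.
+mkdir -p workspace/review
+cat > workspace/review/provider-upload-decision.md <<'EOF'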
+# Provider upload decision
+
+- Provider: openai
+- Recording: workspace/recordings/customer-demo.mp4
+- Approved for provider upload: yes
+- Contains customer data: no
+- Contains credentials, payment data, or private identifiers: no
+- Allowed downstream artifact: redacted transcript only
+EOF
+```
+
+## Step 2: Keep secrets and working files out of commits
+
+Create a local `.env` file for one transcription provider. The values below are placeholders. Replace them only inside your private workspace.
+
+```bash
+cat > .env <<'EOF'
+OPENAI_API_KEY=replace_with_real_key_inside_daytona_only
+OPENAI_MODEL=whisper-1
+OPENAI_API_ENDPOINT=https://api.openai.com/v1/audio/transcriptions
+OPENAI_MODEL_NAME_CHAT=gpt-4o
+EOF
+```
+
+Then create working folders and tell Git to ignore local recordings, raw transcripts, review outputs, and `.env`. Using `.git/info/exclude` keeps this safety rule local to your workspace without changing the upstream project.
+
+```bash
+mkdir -p workspace/recordings workspace/transcripts workspace/review
+cat >> .git/info/exclude <<'EOF'
+.env
+workspace/recordings/
+workspace/transcripts/
+workspace/review/*.txt
+workspace/review/SHA256SUMS
+*.mp3
+EOF
+```
+
+This matters because generated transcript files can look harmless during development. They are plain text, easy to diff, and easy to paste. Treat them as sensitive until someone reviews them.
+
+Keep the workspace boundary explicit:
+
+| Artifact | Where it stays | Handoff rule |
+| --- | --- | --- |
+| `.env` | Daytona workspace only | Never copy into tickets, docs, PRs, or chat |
+| Raw `.mp4` files | `workspace/recordings/` | Keep local unless there is a separate approved sharing path |
+| Raw `.txt` transcripts | `workspace/transcripts/` | Use only as redaction input |
+| Redacted transcript | `workspace/review/` | Share only after manual review |
+| `SHA256SUMS` | `workspace/review/` | Share with the reviewed transcript when integrity matters |
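
As a quick cross-check of the table above, the sketch below reports which ignore entries are missing from the text of `.git/info/exclude`. It is a minimal illustration rather than part of Sapat or Daytona: `REQUIRED_PATTERNS` and `missing_patterns` are hypothetical names, and the check only compares lines of text, so `git check-ignore -v <path>` remains the authoritative test of what Git actually ignores.

```python
# Entries this guide appends to .git/info/exclude in Step 2.
REQUIRED_PATTERNS = (
    ".env",
    "workspace/recordings/",
    "workspace/transcripts/",
    "workspace/review/*.txt",
    "workspace/review/SHA256SUMS",
    "*.mp3",
)

def missing_patterns(exclude_text, required=REQUIRED_PATTERNS):
    """Return the required patterns that do not appear as a line in exclude_text."""
    present = {
        line.strip()
        for line in exclude_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    return [pattern for pattern in required if pattern not in present]

# Example with an incomplete exclude file held in memory.
example = ".env\nworkspace/recordings/\n# local notes\n*.mp3\n"
print(missing_patterns(example))
```

To check a real workspace, pass the contents of `.git/info/exclude` as `exclude_text`; an empty list means every entry from Step 2 is present.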
+
+## Step 3: Run a single-file transcription pass
+
+Copy one short sample recording into the workspace, such as a product demo excerpt or an internal test video. A short recording makes the first transcript easier to review by hand.
+
+```bash
+cp ~/Downloads/customer-demo.mp4 workspace/recordings/customer-demo.mp4
+```
+
+Run Sapat with a low temperature and an explicit prompt that tells the model which product terms should be preserved. Sapat accepts a file path or a directory path. For the first pass, use a single file.
+
+```bash
+sapat workspace/recordings/customer-demo.mp4 \
+  --api openai \
+  --quality M \
+  --language en \
+  --prompt "Product names: Daytona, Sapat. Preserve speaker names only when needed for the handoff." \
+  --temperature 0 \
+  --correct
+```
+
+Sapat converts the video to MP3, sends the audio to the selected provider, writes `workspace/recordings/customer-demo.txt`, and removes the temporary MP3 file.
+The `--correct` option asks the provider's chat model to improve the transcript after the transcription pass.
+Use it when readability matters, but still review the output because a correction pass can rewrite phrasing.
+
+Copy the raw transcript into `workspace/transcripts/` before editing. That gives you a stable input for the redaction script, while reviewed output goes to `workspace/review/`.
+
+```bash
+cp workspace/recordings/customer-demo.txt workspace/transcripts/customer-demo.raw.txt
+```
+
+## Step 4: Add a local redaction pass
+
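+The first redaction pass should be deterministic. It will not catch every sensitive detail, but it can remove common patterns before a human reads the transcript. This example masks email addresses, phone-like strings, and several common token prefixes. The phone pattern is deliberately broad, so it also masks other long digit runs such as ISO dates; expect some false positives in the diff.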
+
+```bash
+cat > workspace/review/redact_transcript.py <<'PY'
+from pathlib import Path
+import re
+import sys
+
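+# Each entry is (regex, placeholder). Patterns run in order, case-insensitively.
+patterns = [
+    # Email addresses.
+    (r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", "[EMAIL]"),
+    # Phone-like digit runs; the lookbehind lets a leading "+" be masked too.
+    (r"(?<!\w)\+?\d[\d .().-]{7,}\d\b", "[PHONE]"),
+    # Common token prefixes: sk-, GitHub ghp_, Slack xox*-.
+    (r"\b(?:sk-|ghp_|xox[baprs]-)[A-Za-z0-9_-]{12,}\b", "[TOKEN]"),
+]
+
+def redact_file(src: Path, dst: Path) -> None:
+    """Read src, mask every pattern match, and write the result to dst."""
+    text = src.read_text(encoding="utf-8")
+    for pattern, replacement in patterns:
+        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
+    dst.parent.mkdir(parents=True, exist_ok=True)
+    dst.write_text(text, encoding="utf-8")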
+
+if len(sys.argv) != 3:
+    raise SystemExit("usage: redact_transcript.py <input.txt> <output.txt>")
+
+redact_file(Path(sys.argv[1]), Path(sys.argv[2]))
+PY
+
+python workspace/review/redact_transcript.py \
+  workspace/transcripts/customer-demo.raw.txt \
+  workspace/review/customer-demo.redacted.txt
+```
+
+Add a project-specific sensitive term file for the manual pass. Keep the example file generic and store real terms only in ignored local files.
+
+```bash
+cat > workspace/review/sensitive-terms.example.txt <<'EOF'
+customer_name
+internal_project_code
+private_hostname
+EOF
+```
+
+You can validate the redaction workflow without provider credentials by using a synthetic transcript.
+This checks the local safety gate before any real audio leaves the workspace.
+
+```bash
+cat > workspace/transcripts/redaction-smoke.raw.txt <<'EOF'
+Email alex@example.com or call +1 415 555 0188. The temporary token is sk-live-redactiondemo123.
+EOF
+
+python workspace/review/redact_transcript.py \
+  workspace/transcripts/redaction-smoke.raw.txt \
+  workspace/review/redaction-smoke.redacted.txt
+
+if grep -E 'alex@example.com|415 555 0188|sk-live-redactiondemo123' workspace/review/redaction-smoke.redacted.txt; then
+  echo "redaction smoke test failed"
+  exit 1
+fi
+
+echo "redaction smoke test passed"
+```
+
+Now review `workspace/review/customer-demo.redacted.txt` manually. Search for each real sensitive term you care about, plus common words that hint at private context: `password`, `token`, `secret`, `key`, `customer`, `email`, `phone`, `incident`, `host`, `domain`, and `account`.
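
The manual search can be made repeatable with a small helper. The sketch below is one possible approach, not something Sapat provides: `find_review_hits`, `HINT_WORDS`, and `PLACEHOLDER` are hypothetical names. It uses case-insensitive substring matching, which over-reports (for example `key` inside `keyboard`) but suits a review gate where a missed hit costs more than a false one. It also strips the `[EMAIL]`, `[PHONE]`, and `[TOKEN]` placeholders first so they are not reported as hits.

```python
import re

# Hint words listed in the review step; extend this list locally as needed.
HINT_WORDS = ["password", "token", "secret", "key", "customer", "email",
              "phone", "incident", "host", "domain", "account"]

# Placeholders written by the redaction pass; stripped so they are not reported.
PLACEHOLDER = re.compile(r"\[(?:EMAIL|PHONE|TOKEN)\]")

def find_review_hits(text, terms):
    """Return (line_number, term, line) for each case-insensitive substring hit."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        searchable = PLACEHOLDER.sub("", line).lower()
        for term in terms:
            if term.lower() in searchable:
                hits.append((number, term, line.strip()))
    return hits

# In-memory example; a real run would read the redacted transcript instead.
sample = (
    "Reset the password on host db01.\n"
    "Nothing sensitive here.\n"
    "Ask the customer for the account ID.\n"
    "Contact [EMAIL] or [PHONE]."
)
for number, term, line in find_review_hits(sample, HINT_WORDS):
    print(f"line {number}: '{term}' in: {line}")
```

Pass your real, locally stored sensitive terms as `terms` in the same way. Each reported line still needs a human decision: remove, generalize, or approve.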
+
+Compare the raw and redacted files locally before approving the output.
+The diff should show only expected masking or manual cleanup, not unrelated transcript rewrites.
+
+```bash
+diff -u \
+  workspace/transcripts/customer-demo.raw.txt \
+  workspace/review/customer-demo.redacted.txt | sed -n '1,160p'
+```
+
+## Step 5: Create a review checklist
+
+Use a short checklist before any transcript leaves the workspace. The goal is not to turn transcription into a heavyweight compliance process. The goal is to make the safe path the default path.
+
+| Check | Command or action | Pass criteria |
+| --- | --- | --- |
+| Provider config is local | `git status --short .env workspace` | `.env` and workspace files are not staged |
+| Transcript exists | `test -s workspace/transcripts/customer-demo.raw.txt` | Raw transcript file is present for review |
+| Redaction file exists | `test -s workspace/review/customer-demo.redacted.txt` | Redacted file is non-empty |
+| Redaction script is smoke-tested | Run the synthetic transcript test above | Known sample email, phone, and token strings are masked |
+| Raw/redacted diff is reviewed | Run the local `diff -u` command | Changes are limited to expected masking and cleanup |
+| Common PII patterns are masked | Search the redacted file for email, phone, and token patterns | No obvious raw sensitive pattern remains |
+| Domain terms are reviewed | Search for your local sensitive term list | Private names are removed, generalized, or approved |
+| Artifact is reproducible | Save the Sapat command and provider choice in notes | Another engineer can repeat the run |
+
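+After the manual pass, write a checksum for the reviewed artifact. This is useful when a teammate needs to confirm that the transcript they received is the same version that passed review. If `shasum` is missing from the workspace image, `sha256sum` from GNU coreutils writes the same `hash  filename` format.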
+
+```bash
+shasum -a 256 workspace/review/customer-demo.redacted.txt > workspace/review/SHA256SUMS
+cat workspace/review/SHA256SUMS
+```
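
On the receiving side, the checksum only helps if someone verifies it. `shasum -a 256 -c SHA256SUMS` does this from the shell. The sketch below shows the same check in Python for teams that want it inside a script. `sha256_of` and `verify_manifest` are hypothetical helper names, and the parser assumes the simple `hash  filename` lines written by the command above.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Hash a file in chunks so large transcripts do not load into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path, base_dir="."):
    """Return {listed_name: True or False} for each entry in a SHA256SUMS-style file."""
    results = {}
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        # Text mode separates with two spaces, binary mode with " *".
        name = name.lstrip(" *")
        results[name] = sha256_of(Path(base_dir) / name) == expected.lower()
    return results
```

Because the manifest in this guide records the path `workspace/review/customer-demo.redacted.txt`, run the check with `base_dir` set to the repository root, or regenerate the manifest from inside `workspace/review/` before sharing.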
+
+Only the reviewed file and its checksum should move into the next system, along with sanitized command notes if the recipient needs to reproduce the run. Keep the raw MP4, temporary MP3, raw transcript, `.env`, and sensitive term list inside the Daytona workspace.
+
+## Step 6: Scale from one file to a directory
+
+When the single-file path works, you can use Sapat's directory mode for a folder of `.mp4` recordings.
+
+```bash
+sapat workspace/recordings \
+  --api openai \
+  --quality M \
+  --language en \
+  --prompt "Preserve technical product names. Keep filler words only when they change meaning." \
+  --temperature 0
+```
+
+Directory mode processes `.mp4` files in the selected directory and writes a `.txt` file next to each video. After the run, copy the generated `.txt` files into `workspace/transcripts/`, run the redaction script for each file, and review the outputs one by one.
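
The copy-and-redact step for a batch can be sketched as a single function. This is an illustration under stated assumptions: `stage_and_redact` is a hypothetical helper, the `redact` argument stands in for the pattern loop in `redact_transcript.py`, and only `.txt` files with a matching `.mp4` are treated as Sapat output so that unrelated notes in the folder are left alone.

```python
from pathlib import Path
import shutil

def stage_and_redact(recordings, transcripts, review, redact):
    """Copy each transcript to transcripts/ and write a redacted copy to review/.

    redact is any callable that takes transcript text and returns masked text.
    Returns the list of redacted file paths, sorted by source file name.
    """
    transcripts, review = Path(transcripts), Path(review)
    transcripts.mkdir(parents=True, exist_ok=True)
    review.mkdir(parents=True, exist_ok=True)
    produced = []
    for source in sorted(Path(recordings).glob("*.txt")):
        # Skip text files that do not sit next to a video of the same name.
        if not source.with_suffix(".mp4").exists():
            continue
        raw_copy = transcripts / f"{source.stem}.raw.txt"
        redacted = review / f"{source.stem}.redacted.txt"
        shutil.copyfile(source, raw_copy)
        masked = redact(raw_copy.read_text(encoding="utf-8"))
        redacted.write_text(masked, encoding="utf-8")
        produced.append(redacted)
    return produced
```

The function automates only the mechanical part. Each file in the returned list still needs the manual review and checklist from Step 5 before handoff.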
+
+For larger batches, keep a simple manifest:
+
+```bash
+cat > workspace/review/run-manifest.md <<'EOF'
+# Sapat transcript review run
+
+- Provider: OpenAI
+- Quality: M
+- Language: en
+- Correction pass: no
+- Source folder: workspace/recordings
+- Reviewed output folder: workspace/review
+- Reviewer: add reviewer name locally
+EOF
+```
+
+Do not commit this manifest if it contains real customer names, reviewer names, file names, or incident IDs. If you want a reusable template, commit only a sanitized example.
+
+## Common issues and troubleshooting
+
+**Problem:** `sapat` is not found after installation.
+
+**Solution:** Confirm the editable install completed and that your Python user scripts directory is on `PATH`. In a workspace, `python -m pip install -e .` is often the simplest fix.
+
+**Problem:** Sapat reports an unsupported audio size or provider upload error.
+
+**Solution:** Start with a shorter MP4 and use `--quality M` or `--quality L`. The OpenAI and Groq integrations validate uploaded audio size before sending the request, so a smaller file is easier to debug.
+
+**Problem:** The transcript contains too many product-name errors.
+
+**Solution:** Use `--prompt` to provide a short glossary of product names, acronyms, and domain terms. Keep the prompt focused. A long prompt can become another place where sensitive terms leak into logs or notes.
+
+**Problem:** The correction pass changes wording too aggressively.
+
+**Solution:** Run without `--correct` for the first pass and compare the raw transcript against the corrected transcript on a short recording. Use the corrected version only when readability improves without changing technical meaning.
+Test `--correct` with the selected provider before enabling it on a batch, because the correction step sends transcript context through an additional model call.
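
One way to quantify how much the correction pass changed is a word-level comparison. The sketch below assumes you have saved the uncorrected and corrected outputs as two separate strings or files yourself. The helper names `word_similarity` and `changed_spans` are hypothetical, and the comparison uses Python's standard `difflib`, not anything built into Sapat.

```python
import difflib

def word_similarity(raw, corrected):
    """Return a ratio from 0 to 1; 1.0 means the word sequences are identical."""
    matcher = difflib.SequenceMatcher(None, raw.split(), corrected.split(), autojunk=False)
    return matcher.ratio()

def changed_spans(raw, corrected):
    """List (operation, raw_words, corrected_words) for every non-matching span."""
    raw_words, new_words = raw.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, raw_words, new_words, autojunk=False)
    return [
        (tag, " ".join(raw_words[i1:i2]), " ".join(new_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

raw = "deploy the daytona workspace today"
corrected = "deploy the Daytona workspace today"
print(word_similarity(raw, corrected))
print(changed_spans(raw, corrected))
```

A high ratio with only capitalization or punctuation spans suggests a light correction. Spans that replace commands, version numbers, or product names deserve a closer read before the corrected version is used.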
+
+**Problem:** Directory mode skips files you expected it to transcribe.
+
+**Solution:** Put `.mp4` files directly in the selected directory. The current Sapat directory flow processes `.mp4` files in that folder; convert or move other formats before the batch run.
+
+**Problem:** You changed `--quality`, but the next run still looks like an old audio conversion.
+
+**Solution:** Check for a leftover `.mp3` next to the video after a failed or interrupted run. Sapat normally removes the temporary MP3 after processing, but if one remains, remove it before rerunning with different quality settings.
+
+**Problem:** The redaction script misses a private term.
+
+**Solution:** Add that term to your local review checklist and search for it manually. Deterministic patterns are a guardrail, not a replacement for review. Customer names, internal project names, and private hostnames are usually organization-specific.
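
If the same private terms keep recurring, they can also be masked deterministically from a local term list. The following is a minimal sketch, assuming one term per line in an ignored local file. `load_terms`, `mask_terms`, and the `[REDACTED]` placeholder are hypothetical choices, not part of the redaction script above.

```python
import re

def load_terms(lines):
    """Keep non-empty lines that are not comments, stripped of whitespace."""
    return [
        line.strip()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]

def mask_terms(text, terms, replacement="[REDACTED]"):
    """Replace every case-insensitive occurrence of any term with the placeholder."""
    if not terms:
        return text  # an empty pattern would otherwise match everywhere
    # Longest terms first so "Acme Corp" is masked as a whole before "Acme".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in ordered)
    return re.sub(pattern, replacement, text, flags=re.IGNORECASE)

terms = load_terms(["# local only", "Acme", "Acme Corp", "", "db01.internal"])
print(mask_terms("acme corp runs on db01.internal", terms))
```

Terms are matched as literal substrings, so a short term can mask part of a longer unrelated word. Review the diff after adding new terms, and keep the real term file out of Git.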
+
+## Conclusion
+
+Sapat gives AI engineers a straightforward way to transcribe videos with OpenAI, Groq Cloud, or Azure OpenAI.
+Daytona makes the workflow repeatable by giving the team a consistent workspace for dependencies, credentials, commands, and review artifacts.
+The missing piece is a safety gate between raw model output and downstream use.
+
+With the workflow in this guide, raw recordings stay in ignored workspace folders, provider credentials stay in `.env`, Sapat writes reproducible transcript outputs, and a local redaction pass creates a reviewed artifact for handoff.
+That is enough structure for small teams to move faster without turning every transcript into an accidental data leak.
+
+## References
+
+- [Sapat repository and README](https://github.com/nkkko/sapat)
+- [Sapat CLI source](https://github.com/nkkko/sapat/blob/main/src/sapat/script.py)
+- [Sapat transcription base flow](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/base.py)
+- [Sapat OpenAI provider source](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/openai.py)
+- [Sapat Groq provider source](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/groq.py)
+- [Sapat Azure provider source](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/azure.py)
+- [Daytona documentation](https://www.daytona.io/docs/)
+- [OpenAI audio transcription API reference](https://platform.openai.com/docs/api-reference/audio/createTranscription)
+- [Groq audio transcription documentation](https://console.groq.com/docs/speech-to-text)