diff --git a/authors/zergzorg.md b/authors/zergzorg.md
new file mode 100644
index 00000000..52f94e0d
--- /dev/null
+++ b/authors/zergzorg.md
@@ -0,0 +1,5 @@
+Author: zergzorg Title: Open-source Contributor Description: zergzorg contributes
+practical developer workflow documentation and focused open-source improvements,
+with an emphasis on reproducible environments, clear validation steps, and safe
+handling of operational data. Author Image: Author LinkedIn: Author Twitter:
+Company Name: Independent Company Logo Dark: Company Logo White:
diff --git a/definitions/20260520_definition_transcript_redaction_workflow.md b/definitions/20260520_definition_transcript_redaction_workflow.md
new file mode 100644
index 00000000..765d92e3
--- /dev/null
+++ b/definitions/20260520_definition_transcript_redaction_workflow.md
@@ -0,0 +1,21 @@
+---
+title: 'Transcript Redaction Workflow'
+description: 'A repeatable process for reviewing and masking sensitive details in transcripts before sharing them.'
+date: 2026-05-20
+author: 'zergzorg'
+---
+
+# Transcript Redaction Workflow
+
+## Definition
+
+A transcript redaction workflow is a repeatable process for finding, reviewing, and masking sensitive details in generated transcripts before those transcripts are shared with another system, team, or customer.
+It usually combines deterministic checks, human review, and an audit trail that records which transcript was reviewed and when it was cleared for handoff.
+
+## Context and Usage
+
+AI transcription tools can turn recordings into useful text quickly, but raw transcripts may include email addresses, phone numbers, private hostnames, customer names, incident identifiers, API tokens, or internal project names.
+A redaction workflow gives teams a small safety gate between "the model produced text" and "the transcript is safe to paste into a ticket, knowledge base, prompt, or retrieval system."
+
+In a cloud development environment such as Daytona, teams can keep this process reproducible by storing sample commands, review checklists, local ignore rules, and redaction scripts alongside the transcription project.
+The key is to treat raw recordings and raw transcripts as temporary working files, then promote only reviewed artifacts into downstream documentation or automation.
diff --git a/guides/20260520_privacy_safe_sapat_transcription_daytona.md b/guides/20260520_privacy_safe_sapat_transcription_daytona.md
new file mode 100644
index 00000000..b4bc5614
--- /dev/null
+++ b/guides/20260520_privacy_safe_sapat_transcription_daytona.md
@@ -0,0 +1,353 @@
+---
+title: 'Privacy-Safe Sapat Transcription in Daytona'
+description: 'Run Sapat in Daytona, review sensitive transcript data locally, and hand off only cleared transcript artifacts.'
+date: 2026-05-20
+author: 'zergzorg'
+tags: ['daytona', 'sapat', 'ai transcription', 'privacy']
+---
+
+# Privacy-Safe Sapat Transcription in Daytona
+
+# Introduction
+
+AI transcription is useful because it turns recordings into text that teams can search, summarize, and reuse. It also creates a new review problem.
+A raw transcript can contain email addresses, phone numbers, customer names, internal service names, incident identifiers, private hostnames, or copied API tokens that someone mentioned during a demo.
+Before that text moves into a ticket, a knowledge base, a retrieval pipeline, or another model prompt, it needs a small safety gate.
+
+This guide shows how to run [Sapat](https://github.com/nkkko/sapat) inside a Daytona workspace and add a local [transcript redaction workflow](../definitions/20260520_definition_transcript_redaction_workflow.md) around it.
+Sapat converts video files to MP3 with `ffmpeg`, sends the audio to a selected provider, and writes a `.txt` file next to the source video.
+The workflow below keeps credentials and raw recordings out of commits, stages transcripts in predictable folders, applies deterministic redaction checks, and produces a reviewed artifact that is easier to share safely.
+
+![Privacy-safe Sapat transcription workflow](assets/20260520_privacy_safe_sapat_transcription_daytona_img1.svg)
+
+## TL;DR
+
+- Create a Daytona workspace from the Sapat repository so setup, commands, and working folders are repeatable.
+- Configure one supported provider with environment variables, then keep `.env`, recordings, and raw transcript files out of Git.
+- Run Sapat against one recording first, then scale to a directory of `.mp4` files only after the review path works.
+- Add a local review gate that masks common sensitive patterns and forces a human pass before the transcript is used downstream.
+- Hand off only the redacted transcript, a small manifest, and the command notes needed to reproduce the run.
+
+## Scope and non-overlap
+
+This workflow intentionally uses Sapat's existing provider path instead of adding a new transcription provider.
+The current Sapat CLI already routes `--api openai`, `--api groq`, and `--api azure`; this guide wraps that path with workspace isolation, repeatable validation, and transcript review controls.
+
+That makes the guide complementary to provider-specific work.
+If your team later adopts another Sapat provider, the same safety model still applies: keep provider secrets local, stage raw recordings in ignored workspace folders, run one small sample first, redact before handoff, and record the command that produced the reviewed artifact.
+
+## Prerequisites
+
+You need the Daytona CLI installed and authenticated, a GitHub account that can clone public repositories, Python 3.6 or newer, and `ffmpeg` available in the workspace.
+Sapat currently supports Azure OpenAI, Groq Cloud, and OpenAI transcription APIs, so you also need credentials for at least one of those providers.
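
The workspace-side prerequisites above (Python version and `ffmpeg` on `PATH`) can be checked with a small preflight script. This is a minimal sketch, not part of Sapat or Daytona: the helper name `missing_prerequisites` and its parameters are hypothetical, and it deliberately does not try to verify Daytona CLI authentication or provider credentials.

```python
import shutil
import sys

# Hypothetical helper for illustration; not part of Sapat or Daytona.
MIN_PYTHON = (3, 6)            # minimum version stated in the prerequisites
REQUIRED_TOOLS = ("ffmpeg",)   # Sapat uses ffmpeg for the MP3 conversion


def missing_prerequisites(which=shutil.which, version=None):
    """Return a list of problems; an empty list means the checks passed."""
    version = sys.version_info if version is None else version
    problems = []
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append("Python 3.6 or newer is required")
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(tool + " is not on PATH")
    return problems


# Report findings without exiting, so the snippet is safe to paste into a REPL.
for problem in missing_prerequisites():
    print("missing:", problem)
```

The `which` and `version` parameters exist only so the check can be exercised without changing the real environment; in normal use, call `missing_prerequisites()` with no arguments inside the workspace.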
+
+This guide uses OpenAI in examples because it is the shortest configuration path. The same structure works with `--api groq` or `--api azure` when the matching environment variables are present.
+Do not paste real API keys into documentation, GitHub comments, or shared chat logs. Keep secrets in `.env` and use placeholders when writing notes.
+
+## Step 1: Create the Daytona workspace
+
+Start by creating a workspace directly from the Sapat repository. The `--code` flag opens the workspace in your configured editor after Daytona finishes provisioning it.
+
+```bash
+daytona create https://github.com/nkkko/sapat --code
+```
+
+Inside the workspace, confirm that Python and `ffmpeg` are available, then install the Python dependencies.
+
+```bash
+python --version
+ffmpeg -version
+pip install -r requirements.txt
+pip install -e .
+```
+
+If `ffmpeg` is not available, install it through the package manager available in your workspace image. For a Debian or Ubuntu based image, this is usually:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ffmpeg
+```
+
+The editable install is convenient while you are reading or testing the CLI because it exposes the `sapat` command without requiring a wheel build. If you prefer the package flow from the Sapat README, you can run `python -m build` and install the generated wheel from `dist/`.
+
+Before adding credentials, run source-level checks that do not call any provider API.
+They confirm the CLI entry point and transcription modules are importable in the Daytona workspace.
+
+```bash
+sapat --help
+python -m py_compile \
+  src/sapat/script.py \
+  src/sapat/transcription/base.py \
+  src/sapat/transcription/openai.py \
+  src/sapat/transcription/groq.py \
+  src/sapat/transcription/azure.py
+```
+
+Add a short provider-upload decision note before the first real transcription run.
+This keeps the review boundary explicit and gives you a local audit trail without adding private data to the repository.
+
+```bash
+# Create the review folder first; the other working folders are added in Step 2.
+mkdir -p workspace/review
+cat > workspace/review/provider-upload-decision.md <<'EOF'
+# Provider upload decision
+
+- Provider: openai
+- Recording: workspace/recordings/customer-demo.mp4
+- Approved for provider upload: yes
+- Contains customer data: no
+- Contains credentials, payment data, or private identifiers: no
+- Allowed downstream artifact: redacted transcript only
+EOF
+```
+
+## Step 2: Keep secrets and working files out of commits
+
+Create a local `.env` file for one transcription provider. The values below are placeholders. Replace them only inside your private workspace.
+
+```bash
+cat > .env <<'EOF'
+OPENAI_API_KEY=replace_with_real_key_inside_daytona_only
+OPENAI_MODEL=whisper-1
+OPENAI_API_ENDPOINT=https://api.openai.com/v1/audio/transcriptions
+OPENAI_MODEL_NAME_CHAT=gpt-4o
+EOF
+```
+
+Then create working folders and tell Git to ignore local recordings, raw transcripts, review outputs, and `.env`. Using `.git/info/exclude` keeps this safety rule local to your workspace without changing the upstream project.
+
+```bash
+mkdir -p workspace/recordings workspace/transcripts workspace/review
+cat >> .git/info/exclude <<'EOF'
+.env
+workspace/recordings/
+workspace/transcripts/
+workspace/review/*.txt
+workspace/review/SHA256SUMS
+*.mp3
+EOF
+```
+
+This matters because generated transcript files can look harmless during development. They are plain text, easy to diff, and easy to paste. Treat them as sensitive until someone reviews them.
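
Because these ignore rules are the main guard against committing raw material, it can help to confirm they are actually present before the first transcription run. The sketch below is one way to do that, under stated assumptions: `missing_exclude_rules` and `check_workspace` are hypothetical helper names, the rule list is copied from the exclude block above, and the comparison is a literal line match that does not evaluate Git's pattern semantics.

```python
from pathlib import Path

# Rules copied from the exclude block in this step.
REQUIRED_RULES = (
    ".env",
    "workspace/recordings/",
    "workspace/transcripts/",
    "workspace/review/*.txt",
    "workspace/review/SHA256SUMS",
    "*.mp3",
)


def missing_exclude_rules(exclude_text, required=REQUIRED_RULES):
    """Return the required rules that do not appear as lines in exclude_text."""
    present = {
        line.strip()
        for line in exclude_text.splitlines()
        # Skip blank lines and comment lines, as Git does.
        if line.strip() and not line.lstrip().startswith("#")
    }
    return [rule for rule in required if rule not in present]


def check_workspace(repo_root="."):
    """Read .git/info/exclude under repo_root and report missing rules."""
    exclude_file = Path(repo_root) / ".git" / "info" / "exclude"
    text = exclude_file.read_text(encoding="utf-8") if exclude_file.exists() else ""
    return missing_exclude_rules(text)
```

Calling `check_workspace(".")` from the repository root returns an empty list when every rule is present. For an authoritative answer on a specific path, `git check-ignore -v <path>` reports which rule, if any, ignores it.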
+
+Keep the workspace boundary explicit:
+
+| Artifact | Where it stays | Handoff rule |
+| --- | --- | --- |
+| `.env` | Daytona workspace only | Never copy into tickets, docs, PRs, or chat |
+| Raw `.mp4` files | `workspace/recordings/` | Keep local unless there is a separate approved sharing path |
+| Raw `.txt` transcripts | `workspace/transcripts/` | Use only as redaction input |
+| Redacted transcript | `workspace/review/` | Share only after manual review |
+| `SHA256SUMS` | `workspace/review/` | Share with the reviewed transcript when integrity matters |
+
+## Step 3: Run a single-file transcription pass
+
+Copy one short sample recording into the workspace, such as a product demo excerpt or an internal test video. A short recording makes the first transcript easier to review by hand.
+
+```bash
+cp ~/Downloads/customer-demo.mp4 workspace/recordings/customer-demo.mp4
+```
+
+Run Sapat with a low temperature and an explicit prompt that tells the model which product terms should be preserved. Sapat accepts a file path or a directory path. For the first pass, use a single file.
+
+```bash
+sapat workspace/recordings/customer-demo.mp4 \
+  --api openai \
+  --quality M \
+  --language en \
+  --prompt "Product names: Daytona, Sapat. Preserve speaker names only when needed for the handoff." \
+  --temperature 0 \
+  --correct
+```
+
+Sapat converts the video to MP3, sends the audio to the selected provider, writes `workspace/recordings/customer-demo.txt`, and removes the temporary MP3 file.
+The `--correct` option asks the provider's chat model to improve the transcript after the transcription pass.
+Use it when readability matters, but still review the output because a correction pass can rewrite phrasing.
+
+Copy the raw transcript into `workspace/transcripts/` before editing. That gives you a stable input for the redaction script and keeps it separate from the reviewed output in `workspace/review/`.
+
+```bash
+cp workspace/recordings/customer-demo.txt workspace/transcripts/customer-demo.raw.txt
+```
+
+## Step 4: Add a local redaction pass
+
+The first redaction pass should be deterministic. It will not catch every sensitive detail, but it can remove common patterns before a human reads the transcript. This example masks email addresses, phone-like strings, and several common token prefixes. The phone pattern is deliberately broad, so it can also mask other long digit runs such as ISO dates.
+
+```bash
+cat > workspace/review/redact_transcript.py <<'PY'
+from pathlib import Path
+import re
+import sys
+
+# (pattern, replacement) pairs, applied in order and case-insensitively.
+patterns = [
+    (r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", "[EMAIL]"),
+    # Lookbehind instead of \b so a leading "+" is masked with the number.
+    (r"(?<!\w)\+?\d[\d .()-]{7,}\d\b", "[PHONE]"),
+    (r"\b(?:sk-|ghp_|xox[baprs]-)[A-Za-z0-9_-]{12,}\b", "[TOKEN]"),
+]
+
+def redact_file(src: Path, dst: Path) -> None:
+    text = src.read_text(encoding="utf-8")
+    for pattern, replacement in patterns:
+        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
+    dst.parent.mkdir(parents=True, exist_ok=True)
+    dst.write_text(text, encoding="utf-8")
+
+if len(sys.argv) != 3:
+    raise SystemExit("usage: redact_transcript.py SRC DST")
+
+redact_file(Path(sys.argv[1]), Path(sys.argv[2]))
+PY
+
+python workspace/review/redact_transcript.py \
+  workspace/transcripts/customer-demo.raw.txt \
+  workspace/review/customer-demo.redacted.txt
+```
+
+Add a project-specific sensitive term file for the manual pass. Keep the example file generic and store real terms only in ignored local files.
+
+```bash
+cat > workspace/review/sensitive-terms.example.txt <<'EOF'
+customer_name
+internal_project_code
+private_hostname
+EOF
+```
+
+You can validate the redaction workflow without provider credentials by using a synthetic transcript.
+This checks the local safety gate before any real audio leaves the workspace.
+
+```bash
+cat > workspace/transcripts/redaction-smoke.raw.txt <<'EOF'
+Email alex@example.com or call +1 415 555 0188. The temporary token is sk-live-redactiondemo123.
+EOF
+
+python workspace/review/redact_transcript.py \
+  workspace/transcripts/redaction-smoke.raw.txt \
+  workspace/review/redaction-smoke.redacted.txt
+
+if grep -E 'alex@example.com|415 555 0188|sk-live-redactiondemo123' workspace/review/redaction-smoke.redacted.txt; then
+  echo "redaction smoke test failed"
+  exit 1
+fi
+
+echo "redaction smoke test passed"
+```
+
+Now review `workspace/review/customer-demo.redacted.txt` manually. Search for each real sensitive term you care about, plus common words that hint at private context: `password`, `token`, `secret`, `key`, `customer`, `email`, `phone`, `incident`, `host`, `domain`, and `account`.
+
+Compare the raw and redacted files locally before approving the output.
+The diff should show only expected masking or manual cleanup, not unrelated transcript rewrites.
+
+```bash
+diff -u \
+  workspace/transcripts/customer-demo.raw.txt \
+  workspace/review/customer-demo.redacted.txt | sed -n '1,160p'
+```
+
+## Step 5: Create a review checklist
+
+Use a short checklist before any transcript leaves the workspace. The goal is not to turn transcription into a heavyweight compliance process. The goal is to make the safe path the default path.
+
+| Check | Command or action | Pass criteria |
+| --- | --- | --- |
+| Provider config is local | `git status --short .env workspace` | `.env` and workspace files are not staged |
+| Transcript exists | `test -s workspace/transcripts/customer-demo.raw.txt` | Raw transcript file is present and non-empty |
+| Redaction file exists | `test -s workspace/review/customer-demo.redacted.txt` | Redacted file is non-empty |
+| Redaction script is smoke-tested | Run the synthetic transcript test above | Known sample email, phone, and token strings are masked |
+| Raw/redacted diff is reviewed | Run the local `diff -u` command | Changes are limited to expected masking and cleanup |
+| Common PII patterns are masked | Search the redacted file for email, phone, and token patterns | No obvious raw sensitive pattern remains |
+| Domain terms are reviewed | Search for your local sensitive term list | Private names are removed, generalized, or approved |
+| Artifact is reproducible | Save the Sapat command and provider choice in notes | Another engineer can repeat the run |
+
+After the manual pass, write a checksum for the reviewed artifact. This is useful when a teammate needs to confirm that the transcript they received is the same version that passed review. If `shasum` is missing from your workspace image, `sha256sum` writes the same hash-and-filename format.
+
+```bash
+shasum -a 256 workspace/review/customer-demo.redacted.txt > workspace/review/SHA256SUMS
+cat workspace/review/SHA256SUMS
+```
+
+Only the reviewed file and checksum should move into the next system. Keep the raw MP4, temporary MP3, raw transcript, `.env`, and sensitive term list inside the Daytona workspace.
+
+## Step 6: Scale from one file to a directory
+
+When the single-file path works, you can use Sapat's directory mode for a folder of `.mp4` recordings.
+
+```bash
+sapat workspace/recordings \
+  --api openai \
+  --quality M \
+  --language en \
+  --prompt "Preserve technical product names. Keep filler words only when they change meaning." \
+  --temperature 0
+```
+
+Directory mode processes `.mp4` files in the selected directory and writes a `.txt` file next to each video. After the run, copy the generated `.txt` files into `workspace/transcripts/`, run the redaction script for each file, and review the outputs one by one.
+
+For larger batches, keep a simple manifest:
+
+```bash
+cat > workspace/review/run-manifest.md <<'EOF'
+# Sapat transcript review run
+
+- Provider: OpenAI
+- Quality: M
+- Language: en
+- Correction pass: no
+- Source folder: workspace/recordings
+- Reviewed output folder: workspace/review
+- Reviewer: add reviewer name locally
+EOF
+```
+
+Do not commit this manifest if it contains real customer names, reviewer names, file names, or incident IDs. If you want a reusable template, commit only a sanitized example.
+
+## Common issues and troubleshooting
+
+**Problem:** `sapat` is not found after installation.
+
+**Solution:** Confirm the editable install completed and that your Python user scripts directory is on `PATH`. In a workspace, `python -m pip install -e .` is often the simplest fix.
+
+**Problem:** Sapat reports an unsupported audio size or provider upload error.
+
+**Solution:** Start with a shorter MP4 and use `--quality M` or `--quality L`. The OpenAI and Groq integrations validate uploaded audio size before sending the request, so a smaller file is easier to debug.
+
+**Problem:** The transcript contains too many product-name errors.
+
+**Solution:** Use `--prompt` to provide a short glossary of product names, acronyms, and domain terms. Keep the prompt focused. A long prompt can become another place where sensitive terms leak into logs or notes.
+
+**Problem:** The correction pass changes wording too aggressively.
+
+**Solution:** Run without `--correct` for the first pass and compare the raw transcript against the corrected transcript on a short recording. Use the corrected version only when readability improves without changing technical meaning.
+Test `--correct` with the selected provider before enabling it on a batch, because the correction step sends transcript context through an additional model call.
+
+**Problem:** Directory mode skips files you expected it to transcribe.
+
+**Solution:** Put `.mp4` files directly in the selected directory. The current Sapat directory flow processes `.mp4` files in that folder; convert or move other formats before the batch run.
+
+**Problem:** You changed `--quality`, but the next run still looks like an old audio conversion.
+
+**Solution:** Check for a leftover `.mp3` next to the video after a failed or interrupted run. Sapat normally removes the temporary MP3 after processing, but if one remains, remove it before rerunning with different quality settings.
+
+**Problem:** The redaction script misses a private term.
+
+**Solution:** Add that term to your local review checklist and search for it manually. Deterministic patterns are a guardrail, not a replacement for review. Customer names, internal project names, and private hostnames are usually organization-specific.
+
+## Conclusion
+
+Sapat gives AI engineers a straightforward way to transcribe videos with OpenAI, Groq Cloud, or Azure OpenAI.
+Daytona makes the workflow repeatable by giving the team a consistent workspace for dependencies, credentials, commands, and review artifacts.
+The missing piece is a safety gate between raw model output and downstream use.
+
+With the workflow in this guide, raw recordings stay in ignored workspace folders, provider credentials stay in `.env`, Sapat writes reproducible transcript outputs, and a local redaction pass creates a reviewed artifact for handoff.
+That is enough structure for small teams to move faster without turning every transcript into an accidental data leak.
+
+## References
+
+- [Sapat repository and README](https://github.com/nkkko/sapat)
+- [Sapat CLI source](https://github.com/nkkko/sapat/blob/main/src/sapat/script.py)
+- [Sapat transcription base flow](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/base.py)
+- [Sapat OpenAI provider source](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/openai.py)
+- [Sapat Groq provider source](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/groq.py)
+- [Sapat Azure provider source](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/azure.py)
+- [Daytona documentation](https://www.daytona.io/docs/)
+- [OpenAI audio transcription API reference](https://platform.openai.com/docs/api-reference/audio/createTranscription)
+- [Groq audio transcription documentation](https://console.groq.com/docs/speech-to-text)
diff --git a/guides/assets/20260520_privacy_safe_sapat_transcription_daytona_img1.svg b/guides/assets/20260520_privacy_safe_sapat_transcription_daytona_img1.svg
new file mode 100644
index 00000000..ec4cad92
--- /dev/null
+++ b/guides/assets/20260520_privacy_safe_sapat_transcription_daytona_img1.svg
@@ -0,0 +1,35 @@
+[SVG markup was not preserved; only the text content below is recoverable.]
+Title: Privacy-safe Sapat transcription workflow in Daytona
+Description: A workflow diagram showing recordings, Sapat transcription, local review, redaction, and safe handoff artifacts.
+Heading: Privacy-safe Sapat transcription in Daytona. Keep raw recordings local to the workspace, review sensitive terms, then share only cleared transcript artifacts.
+Stages: Recordings (MP4 files staged inside Daytona); Sapat (ffmpeg conversion, provider transcript); Review Gate (PII patterns and sensitive terms); Handoff (Redacted text plus checksum).
+Footer: Workspace guardrails. .env stays uncommitted, raw recordings stay in ignored working folders, and only reviewed artifacts leave the workspace.