5 changes: 5 additions & 0 deletions authors/zergzorg.md
@@ -0,0 +1,5 @@
Author: zergzorg Title: Open-source Contributor Description: zergzorg contributes
practical developer workflow documentation and focused open-source improvements,
with an emphasis on reproducible environments, clear validation steps, and safe
handling of operational data. Author Image: Author LinkedIn: Author Twitter:
Company Name: Independent Company Logo Dark: Company Logo White:
21 changes: 21 additions & 0 deletions definitions/20260520_definition_transcript_redaction_workflow.md
@@ -0,0 +1,21 @@
---
title: 'Transcript Redaction Workflow'
description: 'A repeatable process for reviewing and masking sensitive details in transcripts before sharing them.'
date: 2026-05-20
author: 'zergzorg'
---

# Transcript Redaction Workflow

## Definition

A transcript redaction workflow is a repeatable process for finding, reviewing, and masking sensitive details in generated transcripts before those transcripts are shared with another system, team, or customer.
It usually combines deterministic checks, human review, and an audit trail that records which transcript was reviewed and when it was cleared for handoff.

## Context and Usage

AI transcription tools can turn recordings into useful text quickly, but raw transcripts may include email addresses, phone numbers, private hostnames, customer names, incident identifiers, API tokens, or internal project names.
A redaction workflow gives teams a small safety gate between "the model produced text" and "the transcript is safe to paste into a ticket, knowledge base, prompt, or retrieval system."

In a cloud development environment such as Daytona, teams can keep this process reproducible by storing sample commands, review checklists, local ignore rules, and redaction scripts alongside the transcription project.
The key is to treat raw recordings and raw transcripts as temporary working files, then promote only reviewed artifacts into downstream documentation or automation.
353 changes: 353 additions & 0 deletions guides/20260520_privacy_safe_sapat_transcription_daytona.md
@@ -0,0 +1,353 @@
---
title: 'Privacy-Safe Sapat Transcription in Daytona'
description: 'Run Sapat in Daytona, review sensitive transcript data locally, and hand off only cleared transcript artifacts.'
date: 2026-05-20
author: 'zergzorg'
tags: ['daytona', 'sapat', 'ai transcription', 'privacy']
---

# Privacy-Safe Sapat Transcription in Daytona

## Introduction

AI transcription is useful because it turns recordings into text that teams can search, summarize, and reuse. It also creates a new review problem.
A raw transcript can contain email addresses, phone numbers, customer names, internal service names, incident identifiers, private hostnames, or copied API tokens that someone mentioned during a demo.
Before that text moves into a ticket, a knowledge base, a retrieval pipeline, or another model prompt, it needs a small safety gate.

This guide shows how to run [Sapat](https://github.com/nkkko/sapat) inside a Daytona workspace and add a local [transcript redaction workflow](../definitions/20260520_definition_transcript_redaction_workflow.md) around it.
Sapat converts video files to MP3 with `ffmpeg`, sends the audio to a selected provider, and writes a `.txt` file next to the source video.
The workflow below keeps credentials and raw recordings out of commits, stages transcripts in predictable folders, applies deterministic redaction checks, and produces a reviewed artifact that is easier to share safely.

![Privacy-safe Sapat transcription workflow](assets/20260520_privacy_safe_sapat_transcription_daytona_img1.svg)

## TL;DR

- Create a Daytona workspace from the Sapat repository so setup, commands, and working folders are repeatable.
- Configure one supported provider with environment variables, then keep `.env`, recordings, and raw transcript files out of Git.
- Run Sapat against one recording first, then scale to a directory of `.mp4` files only after the review path works.
- Add a local review gate that masks common sensitive patterns and forces a human pass before the transcript is used downstream.
- Hand off only the redacted transcript, a small manifest, and the command notes needed to reproduce the run.

## Scope and non-overlap

This workflow intentionally uses Sapat's existing provider path instead of adding a new transcription provider.
The current Sapat CLI already routes `--api openai`, `--api groq`, and `--api azure`; this guide wraps that path with workspace isolation, repeatable validation, and transcript review controls.

That makes the guide complementary to provider-specific work.
If your team later adopts another Sapat provider, the same safety model still applies: keep provider secrets local, stage raw recordings in ignored workspace folders, run one small sample first, redact before handoff, and record the command that produced the reviewed artifact.

## Prerequisites

You need the Daytona CLI installed and authenticated, a GitHub account that can clone public repositories, Python 3.6 or newer, and `ffmpeg` available in the workspace.
Sapat currently supports Azure OpenAI, Groq Cloud, and OpenAI transcription APIs, so you also need credentials for at least one of those providers.

This guide uses OpenAI in examples because it is the shortest configuration path. The same structure works with `--api groq` or `--api azure` when the matching environment variables are present.
Do not paste real API keys into documentation, GitHub comments, or shared chat logs. Keep secrets in `.env` and use placeholders when writing notes.

## Step 1: Create the Daytona workspace

Start by creating a workspace directly from the Sapat repository. The `--code` flag opens the workspace in your configured editor after Daytona finishes provisioning it.

```bash
daytona create https://github.com/nkkko/sapat --code
```

Inside the workspace, check the repository layout and install the Python dependencies.

```bash
python --version
ffmpeg -version
pip install -r requirements.txt
pip install -e .
```

If `ffmpeg` is not available, install it with the package manager in your workspace image. For a Debian- or Ubuntu-based image, this is usually:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

The editable install is convenient while you are reading or testing the CLI because it exposes the `sapat` command without requiring a wheel build. If you prefer the package flow from the Sapat README, you can run `python -m build` and install the generated wheel from `dist/`.

Before adding credentials, run source-level checks that do not call any provider API.
They confirm the CLI entry point and transcription modules are importable in the Daytona workspace.

```bash
sapat --help
python -m py_compile \
  src/sapat/script.py \
  src/sapat/transcription/base.py \
  src/sapat/transcription/openai.py \
  src/sapat/transcription/groq.py \
  src/sapat/transcription/azure.py
```
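The same prerequisite checks can be scripted so they run identically in every new workspace. The following is a minimal sketch, not part of Sapat: the function name `check_prerequisites` is hypothetical, and the tool list simply mirrors the commands this guide expects on `PATH`.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 6), tools=("ffmpeg", "sapat")):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # Tuple comparison checks major and minor versions in order.
    if sys.version_info < tuple(min_python):
        problems.append("Python %d.%d or newer is required" % tuple(min_python[:2]))
    for tool in tools:
        # shutil.which returns None when the command is not on PATH.
        if shutil.which(tool) is None:
            problems.append("%s was not found on PATH" % tool)
    return problems
```

Calling `check_prerequisites()` before the first run and printing the returned list gives a quick readiness report without contacting any provider.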

Add a short provider-upload decision note before the first real transcription run.
This keeps the review boundary explicit and gives you a local audit trail. The note is not covered by the ignore rules added in Step 2, so leave it uncommitted if it names real recordings or customers.

```bash
mkdir -p workspace/review
cat > workspace/review/provider-upload-decision.md <<'EOF'
# Provider upload decision

- Provider: openai
- Recording: workspace/recordings/customer-demo.mp4
- Approved for provider upload: yes
- Contains customer data: no
- Contains credentials, payment data, or private identifiers: no
- Allowed downstream artifact: redacted transcript only
EOF
```
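A decision note is more useful when a script can refuse to continue unless the note approves the upload. The sketch below assumes the exact bullet labels used in the note above; `parse_decision_note` and `upload_allowed` are hypothetical helpers, not Sapat features.

```python
import re


def parse_decision_note(text):
    """Parse '- Key: value' bullet lines into a dict with lowercase keys."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^-\s*([^:]+):\s*(.*)$", line.strip())
        if match:
            fields[match.group(1).strip().lower()] = match.group(2).strip()
    return fields


def upload_allowed(text):
    """Allow the upload only when it is approved and every risk answer is 'no'."""
    fields = parse_decision_note(text)
    approved = fields.get("approved for provider upload", "").lower() == "yes"
    risk_questions = (
        "contains customer data",
        "contains credentials, payment data, or private identifiers",
    )
    # A missing answer is treated the same as a risky one.
    clean = all(fields.get(key, "").lower() == "no" for key in risk_questions)
    return approved and clean
```

Treating a missing answer as a blocker keeps the gate conservative: an incomplete note stops the run instead of silently passing.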

## Step 2: Keep secrets and working files out of commits

Create a local `.env` file for one transcription provider. The values below are placeholders. Replace them only inside your private workspace.

```bash
cat > .env <<'EOF'
OPENAI_API_KEY=replace_with_real_key_inside_daytona_only
OPENAI_MODEL=whisper-1
OPENAI_API_ENDPOINT=https://api.openai.com/v1/audio/transcriptions
OPENAI_MODEL_NAME_CHAT=gpt-4o
EOF
```
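Before the first provider call, it helps to confirm that `.env` no longer holds the placeholder, without ever printing the secret. This is a minimal sketch under two assumptions: `OPENAI_API_KEY` is the only required variable for the OpenAI path shown here, and placeholders start with `replace_with` as in the example above. Both helper names are hypothetical.

```python
def load_env_file(text):
    """Parse simple KEY=value lines; comments and blank lines are skipped."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def env_problems(values, required=("OPENAI_API_KEY",), placeholder_prefix="replace_with"):
    """Report missing or placeholder settings by key name only, never by value."""
    problems = []
    for key in required:
        value = values.get(key, "")
        if not value:
            problems.append(f"{key} is missing or empty")
        elif value.startswith(placeholder_prefix):
            problems.append(f"{key} still holds a placeholder")
    return problems
```

The report names keys only, so its output is safe to paste into notes or a ticket.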

Then create working folders and tell Git to ignore local recordings, raw transcripts, review outputs, and `.env`. Using `.git/info/exclude` keeps this safety rule local to your workspace without changing the upstream project.

```bash
mkdir -p workspace/recordings workspace/transcripts workspace/review
cat >> .git/info/exclude <<'EOF'
.env
workspace/recordings/
workspace/transcripts/
workspace/review/*.txt
workspace/review/SHA256SUMS
*.mp3
EOF
```
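Because the ignore rules live in `.git/info/exclude`, a fresh clone or workspace does not inherit them. The sketch below checks that the expected lines are present. It compares literal lines only and does not evaluate Git's matching rules, so `git check-ignore -v <path>` remains the authoritative test. The helper name is hypothetical.

```python
from pathlib import Path

# The patterns this guide adds to .git/info/exclude.
REQUIRED_PATTERNS = (
    ".env",
    "workspace/recordings/",
    "workspace/transcripts/",
    "workspace/review/*.txt",
    "workspace/review/SHA256SUMS",
    "*.mp3",
)


def missing_exclude_patterns(exclude_file, required=REQUIRED_PATTERNS):
    """Return the required patterns that are absent from the exclude file."""
    path = Path(exclude_file)
    present = set()
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                present.add(line)
    return [pattern for pattern in required if pattern not in present]
```

An empty result from `missing_exclude_patterns(".git/info/exclude")` means every rule from this step is in place.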

This matters because generated transcript files can look harmless during development. They are plain text, easy to diff, and easy to paste. Treat them as sensitive until someone reviews them.

Keep the workspace boundary explicit:

| Artifact | Where it stays | Handoff rule |
| --- | --- | --- |
| `.env` | Daytona workspace only | Never copy into tickets, docs, PRs, or chat |
| Raw `.mp4` files | `workspace/recordings/` | Keep local unless there is a separate approved sharing path |
| Raw `.txt` transcripts | `workspace/transcripts/` | Use only as redaction input |
| Redacted transcript | `workspace/review/` | Share only after manual review |
| `SHA256SUMS` | `workspace/review/` | Share with the reviewed transcript when integrity matters |

## Step 3: Run a single-file transcription pass

Copy one short sample recording into the workspace. Start with a short recording, such as a product demo excerpt or an internal test video, because it is easier to review the first transcript by hand.

```bash
cp ~/Downloads/customer-demo.mp4 workspace/recordings/customer-demo.mp4
```

Run Sapat with a low temperature and an explicit prompt that tells the model which product terms should be preserved. Sapat accepts a file path or a directory path. For the first pass, use a single file.

```bash
sapat workspace/recordings/customer-demo.mp4 \
  --api openai \
  --quality M \
  --language en \
  --prompt "Product names: Daytona, Sapat. Preserve speaker names only when needed for the handoff." \
  --temperature 0 \
  --correct
```

Sapat converts the video to MP3, sends the audio to the selected provider, writes `workspace/recordings/customer-demo.txt`, and removes the temporary MP3 file.
The `--correct` option asks the provider's chat model to improve the transcript after the transcription pass.
Use it when readability matters, but still review the output because a correction pass can rewrite phrasing.

Copy the raw transcript into `workspace/transcripts/` before editing. That gives you a stable input for the redaction script and keeps the reviewed output separate in `workspace/review/`.

```bash
cp workspace/recordings/customer-demo.txt workspace/transcripts/customer-demo.raw.txt
```

## Step 4: Add a local redaction pass

The first redaction pass should be deterministic. It will not catch every sensitive detail, but it can remove common patterns before a human reads the transcript. This example masks email addresses, phone-like strings, and several common token prefixes.

```bash
cat > workspace/review/redact_transcript.py <<'PY'
from pathlib import Path
import re
import sys

# Patterns are applied case-insensitively; over-redaction is preferred to a miss.
patterns = [
    (r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", "[EMAIL]"),
    # (?<!\w) lets the match include a leading "+", which \b would skip.
    (r"(?<!\w)\+?\d[\d .()-]{7,}\d\b", "[PHONE]"),
    (r"\b(?:sk-|ghp_|xox[baprs]-)[A-Za-z0-9_-]{12,}\b", "[TOKEN]"),
]


def redact_file(src: Path, dst: Path) -> None:
    # Read and write as UTF-8 so results do not depend on the workspace locale.
    text = src.read_text(encoding="utf-8")
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(text, encoding="utf-8")


if len(sys.argv) != 3:
    raise SystemExit("usage: redact_transcript.py <input.txt> <output.txt>")

redact_file(Path(sys.argv[1]), Path(sys.argv[2]))
PY

python workspace/review/redact_transcript.py \
  workspace/transcripts/customer-demo.raw.txt \
  workspace/review/customer-demo.redacted.txt
```

Add a project-specific sensitive term file for the manual pass. Keep the example file generic and store real terms only in ignored local files.

```bash
cat > workspace/review/sensitive-terms.example.txt <<'EOF'
customer_name
internal_project_code
private_hostname
EOF
```

You can validate the redaction workflow without provider credentials by using a synthetic transcript.
This checks the local safety gate before any real audio leaves the workspace.

```bash
cat > workspace/transcripts/redaction-smoke.raw.txt <<'EOF'
Email alex@example.com or call +1 415 555 0188. The temporary token is sk-live-redactiondemo123.
EOF

python workspace/review/redact_transcript.py \
  workspace/transcripts/redaction-smoke.raw.txt \
  workspace/review/redaction-smoke.redacted.txt

# grep prints any surviving sample value; a match means the redaction failed.
if grep -nE 'alex@example\.com|415 555 0188|sk-live-redactiondemo123' workspace/review/redaction-smoke.redacted.txt; then
  echo "redaction smoke test failed"
  exit 1
fi

echo "redaction smoke test passed"
```

Now review `workspace/review/customer-demo.redacted.txt` manually. Search for each real sensitive term you care about, plus common words that hint at private context: `password`, `token`, `secret`, `key`, `customer`, `email`, `phone`, `incident`, `host`, `domain`, and `account`.
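The term search can be scripted so every review covers the same list. The sketch below is one possible helper, not part of Sapat: `find_terms` is a hypothetical name, matching is a case-insensitive substring test so that `key` also catches `keys`, and the `[EMAIL]`, `[PHONE]`, and `[TOKEN]` placeholders are stripped first so they do not trigger the hint words.

```python
import re

# Hint words suggested in this guide for the manual pass.
HINT_WORDS = (
    "password", "token", "secret", "key", "customer", "email",
    "phone", "incident", "host", "domain", "account",
)

PLACEHOLDER = re.compile(r"\[(?:EMAIL|PHONE|TOKEN)\]")


def find_terms(text, terms):
    """Return (line_number, term, line) for each case-insensitive hit."""
    hits = []
    wanted = [term.strip() for term in terms if term.strip()]
    for number, line in enumerate(text.splitlines(), start=1):
        # Ignore placeholders written by the redaction script.
        searchable = PLACEHOLDER.sub("", line).lower()
        for term in wanted:
            if term.lower() in searchable:
                hits.append((number, term, line.strip()))
    return hits
```

Run it once with `HINT_WORDS` and once with the lines of your ignored local term file. Every hit still needs a human decision; the script only points at lines worth reading.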

Compare the raw and redacted files locally before approving the output.
The diff should show only expected masking or manual cleanup, not unrelated transcript rewrites.

```bash
diff -u \
  workspace/transcripts/customer-demo.raw.txt \
  workspace/review/customer-demo.redacted.txt | sed -n '1,160p'
```
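For longer transcripts, a script can flag the changed lines that deserve a closer look. The following sketch uses Python's `difflib` to compare the two files line by line and reports any change whose redacted side lacks a placeholder. `unexpected_changes` is a hypothetical helper, and it assumes the three placeholder names used by the redaction script in this guide. A line that contains a placeholder can still hide other rewrites, so this narrows the manual diff review without replacing it.

```python
import difflib
import re

PLACEHOLDER = re.compile(r"\[(?:EMAIL|PHONE|TOKEN)\]")


def unexpected_changes(raw_text, redacted_text):
    """Return (tag, raw_lines, redacted_lines) for changes that are not plain masking."""
    raw_lines = raw_text.splitlines()
    redacted_lines = redacted_text.splitlines()
    matcher = difflib.SequenceMatcher(a=raw_lines, b=redacted_lines, autojunk=False)
    suspicious = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        new_lines = redacted_lines[b1:b2]
        # Deleted lines, or new lines without a placeholder, need a human look.
        if not new_lines or not all(PLACEHOLDER.search(line) for line in new_lines):
            suspicious.append((tag, raw_lines[a1:a2], new_lines))
    return suspicious
```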

## Step 5: Create a review checklist

Use a short checklist before any transcript leaves the workspace. The goal is not to turn transcription into a heavyweight compliance process. The goal is to make the safe path the default path.

| Check | Command or action | Pass criteria |
| --- | --- | --- |
| Provider config is local | `git status --short .env workspace` | Output lists no `.env`, recording, or raw transcript path |
| Transcript exists | `test -s workspace/transcripts/customer-demo.raw.txt` | Raw transcript file is present for review |
| Redaction file exists | `test -s workspace/review/customer-demo.redacted.txt` | Redacted file is non-empty |
| Redaction script is smoke-tested | Run the synthetic transcript test above | Known sample email, phone, and token strings are masked |
| Raw/redacted diff is reviewed | Run the local `diff -u` command | Changes are limited to expected masking and cleanup |
| Common PII patterns are masked | Search the redacted file for email, phone, and token patterns | No obvious raw sensitive pattern remains |
| Domain terms are reviewed | Search for your local sensitive term list | Private names are removed, generalized, or approved |
| Artifact is reproducible | Save the Sapat command and provider choice in notes | Another engineer can repeat the run |

After the manual pass, write a checksum for the reviewed artifact. This is useful when a teammate needs to confirm that the transcript they received is the same version that passed review.

```bash
shasum -a 256 workspace/review/customer-demo.redacted.txt > workspace/review/SHA256SUMS
cat workspace/review/SHA256SUMS
```
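The teammate who receives the transcript can confirm the checksum with a few lines of Python if `shasum` is not installed. This is a minimal sketch: `sha256_of` and `verify_checksums` are hypothetical helpers, and they assume the `<hex digest>  <path>` line format that `shasum -a 256` writes, with paths relative to the directory where the checksum was created.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large transcripts do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(sums_file, base_dir="."):
    """Return the listed paths that are missing or whose digest does not match."""
    failures = []
    for line in Path(sums_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        # Drop the padding space and the optional binary-mode marker.
        name = name.strip().lstrip("*")
        target = Path(base_dir) / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `verify_checksums("workspace/review/SHA256SUMS")`, run from the repository root, means the received file matches the version that passed review.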

Only the reviewed file and its checksum, plus sanitized run notes if you keep them, should move into the next system. Keep the raw MP4, temporary MP3, raw transcript, `.env`, and sensitive term list inside the Daytona workspace.

## Step 6: Scale from one file to a directory

When the single-file path works, you can use Sapat's directory mode for a folder of `.mp4` recordings.

```bash
sapat workspace/recordings \
  --api openai \
  --quality M \
  --language en \
  --prompt "Preserve technical product names. Keep filler words only when they change meaning." \
  --temperature 0
```

Directory mode processes `.mp4` files in the selected directory and writes a `.txt` file next to each video. After the run, copy the generated `.txt` files into `workspace/transcripts/`, run the redaction script for each file, and review the outputs one by one.
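The copy step for a batch can be sketched in Python. The helper below is hypothetical and assumes the folder layout and the `.raw.txt` and `.redacted.txt` naming used earlier in this guide. It only stages files and plans output paths; redaction and review still happen per file.

```python
import shutil
from pathlib import Path


def stage_raw_transcripts(recordings_dir, transcripts_dir, review_dir):
    """Copy each <name>.txt to <name>.raw.txt and plan the redacted output path.

    Returns a list of (raw_path, redacted_path) pairs for the redaction step.
    """
    recordings = Path(recordings_dir)
    transcripts = Path(transcripts_dir)
    review = Path(review_dir)
    transcripts.mkdir(parents=True, exist_ok=True)
    pairs = []
    for source in sorted(recordings.glob("*.txt")):
        raw = transcripts / f"{source.stem}.raw.txt"
        shutil.copy2(source, raw)
        pairs.append((raw, review / f"{source.stem}.redacted.txt"))
    return pairs
```

Each returned pair maps directly onto the two arguments of `workspace/review/redact_transcript.py`, so a short loop can run the redaction script once per transcript before the manual pass.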

For larger batches, keep a simple manifest:

```bash
cat > workspace/review/run-manifest.md <<'EOF'
# Sapat transcript review run

- Provider: OpenAI
- Quality: M
- Language: en
- Correction pass: no
- Source folder: workspace/recordings
- Reviewed output folder: workspace/review
- Reviewer: add reviewer name locally
EOF
```

Do not commit this manifest if it contains real customer names, reviewer names, file names, or incident IDs. If you want a reusable template, commit only a sanitized example.

## Common issues and troubleshooting

**Problem:** `sapat` is not found after installation.

**Solution:** Confirm the editable install completed and that your Python user scripts directory is on `PATH`. In a workspace, `python -m pip install -e .` is often the simplest fix.

**Problem:** Sapat reports an unsupported audio size or provider upload error.

**Solution:** Start with a shorter MP4 and use `--quality M` or `--quality L`. The OpenAI and Groq integrations validate uploaded audio size before sending the request, so a smaller file is easier to debug.

**Problem:** The transcript contains too many product-name errors.

**Solution:** Use `--prompt` to provide a short glossary of product names, acronyms, and domain terms. Keep the prompt focused. A long prompt can become another place where sensitive terms leak into logs or notes.

**Problem:** The correction pass changes wording too aggressively.

**Solution:** Run without `--correct` for the first pass and compare the raw transcript against the corrected transcript on a short recording. Use the corrected version only when readability improves without changing technical meaning.
Test `--correct` with the selected provider before enabling it on a batch, because the correction step sends transcript context through an additional model call.

**Problem:** Directory mode skips files you expected it to transcribe.

**Solution:** Put `.mp4` files directly in the selected directory. The current Sapat directory flow processes `.mp4` files in that folder; convert or move other formats before the batch run.
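A quick preview of the folder shows which files a directory run is likely to pick up. This sketch is based only on the behavior described above, not on Sapat's source: it assumes that a lowercase `.mp4` suffix at the top level is what qualifies, and the list of other media suffixes is illustrative. `preview_directory_run` is a hypothetical helper.

```python
from pathlib import Path

# Illustrative set of media suffixes worth reporting; adjust for your recordings.
MEDIA_SUFFIXES = {".mp4", ".mov", ".mkv", ".webm", ".avi", ".m4a", ".wav"}


def preview_directory_run(directory):
    """Split media files into (picked, skipped) lists of relative paths."""
    root = Path(directory)
    picked, skipped = [], []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in MEDIA_SUFFIXES:
            continue
        relative = path.relative_to(root).as_posix()
        # Assumption: only top-level files with a lowercase .mp4 suffix are processed.
        if path.parent == root and path.suffix == ".mp4":
            picked.append(relative)
        else:
            skipped.append(relative)
    return picked, skipped
```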

**Problem:** You changed `--quality`, but the next run still looks like an old audio conversion.

**Solution:** Check for a leftover `.mp3` next to the video after a failed or interrupted run. Sapat normally removes the temporary MP3 after processing, but if one remains, remove it before rerunning with different quality settings.
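Leftover audio files are easy to spot with a short script. The sketch below assumes the temporary MP3 shares its base name with the source video, which this guide does not confirm, so treat the result as a hint. `leftover_mp3s` is a hypothetical helper.

```python
from pathlib import Path


def leftover_mp3s(directory):
    """Return names of .mp3 files that sit beside an .mp4 with the same stem."""
    root = Path(directory)
    return sorted(
        mp3.name
        for mp3 in root.glob("*.mp3")
        # Assumption: the temporary audio file reuses the video's base name.
        if mp3.with_suffix(".mp4").exists()
    )
```

If `leftover_mp3s("workspace/recordings")` returns any names, delete those files before rerunning with a different `--quality` value.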

**Problem:** The redaction script misses a private term.

**Solution:** Add that term to your local review checklist and search for it manually. Deterministic patterns are a guardrail, not a replacement for review. Customer names, internal project names, and private hostnames are usually organization-specific.

## Conclusion

Sapat gives AI engineers a straightforward way to transcribe videos with OpenAI, Groq Cloud, or Azure OpenAI.
Daytona makes the workflow repeatable by giving the team a consistent workspace for dependencies, credentials, commands, and review artifacts.
The missing piece is a safety gate between raw model output and downstream use.

With the workflow in this guide, raw recordings stay in ignored workspace folders, provider credentials stay in `.env`, Sapat writes reproducible transcript outputs, and a local redaction pass creates a reviewed artifact for handoff.
That is enough structure for small teams to move faster without turning every transcript into an accidental data leak.

## References

- [Sapat repository and README](https://github.com/nkkko/sapat)
- [Sapat CLI source](https://github.com/nkkko/sapat/blob/main/src/sapat/script.py)
- [Sapat transcription base flow](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/base.py)
- [Sapat OpenAI provider source](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/openai.py)
- [Sapat Groq provider source](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/groq.py)
- [Sapat Azure provider source](https://github.com/nkkko/sapat/blob/main/src/sapat/transcription/azure.py)
- [Daytona documentation](https://www.daytona.io/docs/)
- [OpenAI audio transcription API reference](https://platform.openai.com/docs/api-reference/audio/createTranscription)
- [Groq audio transcription documentation](https://console.groq.com/docs/speech-to-text)