diff --git a/articles/20260524_run_whispercpp_transcription_with_sapat_in_daytona.md b/articles/20260524_run_whispercpp_transcription_with_sapat_in_daytona.md
new file mode 100644
index 00000000..8de1af3b
--- /dev/null
+++ b/articles/20260524_run_whispercpp_transcription_with_sapat_in_daytona.md
@@ -0,0 +1,313 @@
+---
+title: 'Run whisper.cpp Transcription With Sapat'
+description:
+  'Build a private, offline transcription workflow by running Sapat and whisper.cpp inside a reproducible Daytona workspace.'
+date: 2026-05-24
+author: 'Jamil Ahmadzai'
+tags: ['daytona', 'sapat', 'transcription', 'whispercpp']
+---
+
+# Run whisper.cpp Transcription With Sapat
+
+# Introduction
+
+Cloud transcription APIs are convenient, but they are not always the right
+default. Product demos, customer calls, security reviews, and internal incident
+recordings can contain details that should stay inside a controlled development
+environment. That is where [offline speech-to-text](../definitions/20260524_definition_offline_speech_to_text.md)
+is useful.
+
+This guide shows how to run a local `whisper.cpp` transcription workflow with
+Sapat inside a Daytona workspace. Sapat handles the repeatable command line
+experience. `whisper.cpp` handles local inference with a GGML Whisper model.
+Daytona gives the workflow a clean workspace boundary so another engineer can
+recreate the same setup instead of guessing what was installed on your laptop.
+
+The companion Sapat implementation for this guide is available in
+[nibzard/sapat#45](https://github.com/nibzard/sapat/pull/45). It adds
+`--api whispercpp`, validates the local binary and model path, converts input
+media to 16 kHz mono WAV for the local CLI, and writes the transcript next to
+the source file.
+
+## TL;DR
+
+- Use Daytona to run the Sapat workspace in a disposable, reproducible
+  environment.
+- Build `whisper.cpp` locally and download a GGML model such as `base.en`.
+- Configure `WHISPERCPP_BINARY` and `WHISPERCPP_MODEL_PATH` in `.env`.
+- Run `sapat demo.mp4 --api whispercpp --language en --quality M`.
+- Keep recordings, model files, prompts, and transcripts inside the workspace
+  unless your team explicitly approves sharing them.
+
+![Offline Sapat transcription workflow with whisper.cpp in Daytona](assets/20260524_run_whispercpp_transcription_with_sapat_in_daytona_workflow.svg)
+
+## Why Use whisper.cpp in a Daytona Workspace?
+
+Sapat already supports hosted APIs such as OpenAI, Groq, and Azure OpenAI.
+Those are good options when you want managed infrastructure and do not mind
+sending audio to a third-party provider. A local `whisper.cpp` path solves a
+different problem: private and repeatable transcription without a cloud API key.
+
+That matters for AI engineers because transcription is often the first step in
+a larger workflow. A support call becomes a bug report. A sales demo becomes
+release-note evidence. An incident review becomes a timeline. If that first
+step is hard to reproduce, every downstream artifact becomes harder to trust.
+
+Daytona helps by turning the setup into a workspace recipe. The `ffmpeg`
+version, Sapat branch, `whisper.cpp` binary, model path, and transcript command
+can all live in the same workspace. When a teammate needs to review the output,
+they can open the same environment and inspect the exact command path.
+
+Here is the division of responsibility:
+
+| Layer | Responsibility |
+| --- | --- |
+| Daytona | Creates the reproducible workspace where the workflow runs. |
+| Sapat | Provides one CLI for file and directory transcription. |
+| ffmpeg | Converts source media into the audio format needed by the provider. |
+| whisper.cpp | Runs local speech recognition against a GGML model. |
+| Transcript review | Checks the output before it is used in summaries or tickets. |
+
+## Prepare the Workspace
+
+Start from the Sapat repository.
+While the companion provider pull request is
+under review, use the provider branch directly. After it is merged, you can use
+the upstream `main` branch instead.
+
+```bash
+daytona create https://github.com/nibzard/sapat --code
+```
+
+Inside the workspace, fetch the provider branch:
+
+```bash
+git remote add jamil https://github.com/jamilahmadzai/sapat.git
+git fetch jamil codex/whispercpp-provider
+git checkout -b whispercpp-provider jamil/codex/whispercpp-provider
+```
+
+Install Sapat in editable mode:
+
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+python -m pip install -e .
+```
+
+Sapat still uses `ffmpeg` for media conversion, so confirm it is available:
+
+```bash
+ffmpeg -version
+```
+
+If the command is missing, install it in the workspace. On Debian or Ubuntu
+base images, this is usually:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ffmpeg cmake build-essential
+```
+
+The `cmake` and compiler packages are needed for the next step, where you build
+`whisper.cpp`.
+
+## Build whisper.cpp and Download a Model
+
+The official `whisper.cpp` quick start builds a `whisper-cli` binary and uses
+GGML-formatted Whisper models. That is exactly what the Sapat provider expects.
+
+Clone and build `whisper.cpp` beside the Sapat checkout:
+
+```bash
+cd ..
+git clone https://github.com/ggml-org/whisper.cpp.git
+cd whisper.cpp
+cmake -B build
+cmake --build build -j --config Release
+```
+
+Download a small English model for your first run:
+
+```bash
+sh ./models/download-ggml-model.sh base.en
+```
+
+You can use larger models after the workflow is proven. Larger models can
+improve transcript quality, but they also need more CPU, memory, and time. For
+most smoke tests, `base.en` is enough to verify the local path.
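
Before configuring Sapat, it can help to confirm that the build and the model download both produced usable files. The function below is a minimal sketch, not part of Sapat or `whisper.cpp`: the name `check_whispercpp_setup` and its two positional arguments are assumptions made for illustration, and the paths you pass should match your own workspace.

```shell
# Hypothetical helper (not shipped with Sapat or whisper.cpp): verify that the
# compiled binary is executable and that the model file exists and is non-empty.
check_whispercpp_setup() {
  binary="$1"   # for example ./build/bin/whisper-cli
  model="$2"    # for example ./models/ggml-base.en.bin
  status=0

  # -x is true only when the file exists and is executable.
  if [ ! -x "$binary" ]; then
    echo "missing or non-executable binary: $binary" >&2
    status=1
  fi

  # -s is true only when the file exists and has a size greater than zero.
  if [ ! -s "$model" ]; then
    echo "missing or empty model file: $model" >&2
    status=1
  fi

  return "$status"
}
```

From the `whisper.cpp` directory, a call such as `check_whispercpp_setup ./build/bin/whisper-cli ./models/ggml-base.en.bin` returns a non-zero status and names the missing piece if either step did not finish. It checks only that the files are present, not that the model is valid.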
+
+Return to Sapat and configure the local provider:
+
+```bash
+cd ../sapat
+cp .env.example .env
+```
+
+Edit `.env`:
+
+```bash
+WHISPERCPP_BINARY=/workspaces/whisper.cpp/build/bin/whisper-cli
+WHISPERCPP_MODEL_PATH=/workspaces/whisper.cpp/models/ggml-base.en.bin
+WHISPERCPP_THREADS=4
+WHISPERCPP_EXTRA_ARGS=
+```
+
+Adjust the paths to match your Daytona workspace. The two required values are
+`WHISPERCPP_BINARY` and `WHISPERCPP_MODEL_PATH`. `WHISPERCPP_THREADS` is
+optional, but setting it makes the command easier to reproduce across machines.
+
+## Run a First Transcription
+
+Place a short `.mp4` file in the Sapat workspace. Keep the first file short.
+You want to test the full chain before spending minutes on a long recording.
+
+```bash
+mkdir -p samples
+cp /path/to/local/demo.mp4 samples/demo.mp4
+```
+
+Run Sapat with the local provider:
+
+```bash
+sapat samples/demo.mp4 --api whispercpp --language en --quality M
+```
+
+The provider performs three steps:
+
+1. Converts `samples/demo.mp4` to a temporary 16 kHz mono WAV file.
+2. Runs `whisper-cli` with the configured GGML model.
+3. Writes `samples/demo.txt` and removes the temporary WAV file.
+
+Open the transcript:
+
+```bash
+sed -n '1,120p' samples/demo.txt
+```
+
+For a directory of videos, point Sapat at the directory:
+
+```bash
+sapat samples --api whispercpp --language en --quality M
+```
+
+Sapat processes `.mp4` files in that directory and writes one `.txt` file per
+video. This is useful for meeting folders, product demos, or training clips that
+need the same model and language settings.
+
+## Keep the Workflow Reviewable
+
+The transcript is not the final artifact. Treat it as evidence that needs a
+small review loop before it feeds a summary, ticket, or retrieval system.
+
+Use this checklist:
+
+- Confirm the transcript was created from the expected source file.
+- Save the exact `sapat` command in a `README.md` or run log.
+- Record the model file name, such as `ggml-base.en.bin`.
+- Review product names, customer names, acronyms, and code terms manually.
+- Mark low-confidence sections with timestamps from the source recording when
+  a human needs to listen again.
+- Keep private audio and transcripts inside the workspace unless your policy
+  allows moving them elsewhere.
+
+If you need domain-specific vocabulary, pass a prompt:
+
+```bash
+sapat samples/demo.mp4 \
+  --api whispercpp \
+  --language en \
+  --prompt "Daytona, Sapat, whisper.cpp, dev container, workspace"
+```
+
+The provider forwards the prompt to the local CLI. This is helpful for product
+names, project codenames, and technical terms that are easy to miss in speech.
+
+Do not use `--correct` with this local provider. Sapat correction is currently
+implemented through hosted chat APIs on the cloud providers. For a private
+workflow, keep correction as a manual review step or run a separate local LLM
+review after you have approved the transcript.
+
+## Operational Notes for Teams
+
+Treat the model file as part of the workflow contract. Two engineers can run
+the same `sapat` command and still get different output if one uses
+`ggml-base.en.bin` and the other uses a larger multilingual model. Write the
+model name into your run log, and keep a small representative clip for smoke
+testing changes to the workspace.
+
+Also decide where generated transcripts should live. For short experiments,
+placing `demo.txt` next to `demo.mp4` is convenient. For team workflows, a
+dedicated `transcripts/` folder with a simple naming convention is easier to
+review:
+
+```bash
+mkdir -p transcripts
+cp samples/demo.txt transcripts/20260524_demo_whispercpp_base_en.txt
+```
+
+Finally, keep provider choice explicit. If a task can use a cloud provider,
+document why. If a task should stay local, document that as well. A short note
+in the pull request, incident packet, or research log prevents accidental
+switching between private local transcription and hosted transcription later.
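
The naming convention and run-log advice in this section can be combined in one small helper. This is a sketch under stated assumptions: `archive_transcript` and the `transcripts/RUN_LOG.tsv` file are hypothetical names introduced here, not something Sapat provides. The sketch derives the `base_en` tag from the model file name so that the archived file matches the `20260524_demo_whispercpp_base_en.txt` pattern used above.

```shell
# Hypothetical helper: copy a transcript into transcripts/ using the
# <date>_<name>_whispercpp_<model>.txt pattern and append a run-log entry.
archive_transcript() {
  src="$1"                      # for example samples/demo.txt
  model="$2"                    # for example ggml-base.en.bin
  day="${3:-$(date +%Y%m%d)}"   # optional date override, defaults to today

  name=$(basename "$src" .txt)
  # Strip the ggml- prefix and .bin suffix, then replace other punctuation
  # with underscores: ggml-base.en.bin becomes base_en.
  tag=$(basename "$model" .bin | sed -e 's/^ggml-//' -e 's/[^A-Za-z0-9]/_/g')
  dest="transcripts/${day}_${name}_whispercpp_${tag}.txt"

  mkdir -p transcripts
  cp "$src" "$dest"
  # RUN_LOG.tsv is an assumed file name; each row records date, model, and path.
  printf '%s\t%s\t%s\n' "$day" "$model" "$dest" >> transcripts/RUN_LOG.tsv
  echo "$dest"
}
```

Calling `archive_transcript samples/demo.txt ggml-base.en.bin` from the Sapat checkout prints the destination path. Because the model name is part of both the file name and the log row, a reviewer can tell which model produced a transcript without rerunning anything.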
+
+## Troubleshooting
+
+**`whisper.cpp binary not found`**
+
+Set `WHISPERCPP_BINARY` to the full path of the compiled binary:
+
+```bash
+export WHISPERCPP_BINARY="$PWD/../whisper.cpp/build/bin/whisper-cli"
+```
+
+If you installed `whisper-cli` globally, make sure it is on `PATH`:
+
+```bash
+which whisper-cli
+```
+
+**`WHISPERCPP_MODEL_PATH must point to a local ggml model file`**
+
+Download a model and point the environment variable at the `.bin` file:
+
+```bash
+cd ../whisper.cpp
+sh ./models/download-ggml-model.sh base.en
+export WHISPERCPP_MODEL_PATH="$PWD/models/ggml-base.en.bin"
+```
+
+**The transcript is empty**
+
+Start with a shorter, clearer sample. Confirm that `ffmpeg` can read the file:
+
+```bash
+ffmpeg -i samples/demo.mp4 -f null -
+```
+
+Then run `whisper-cli` directly against a WAV file to isolate whether the issue
+is in the model, the binary, or the Sapat wrapper.
+
+**Transcription is too slow**
+
+Use a smaller model for drafts, increase `WHISPERCPP_THREADS`, or split long
+recordings into shorter clips before running Sapat. Keep the same model for
+comparisons so you do not mix performance results with model-quality changes.
+
+## Conclusion
+
+Sapat plus `whisper.cpp` gives AI engineers a practical local transcription
+path. Daytona makes that path repeatable. The result is a workflow where source
+recordings, model configuration, transcript commands, and generated text stay
+inside one workspace.
+
+This is not a replacement for every hosted transcription API. It is a strong
+option when privacy, reproducibility, and local control matter more than a
+managed provider. Start with a short sample, document the command, review the
+output, and then scale the same workflow to a folder of recordings.
+
+## References
+
+- [Sapat repository](https://github.com/nibzard/sapat)
+- [Sapat whisper.cpp provider pull request](https://github.com/nibzard/sapat/pull/45)
+- [whisper.cpp repository and quick start](https://github.com/ggml-org/whisper.cpp)
+- [Daytona repository](https://github.com/daytonaio/daytona)
diff --git a/articles/assets/20260524_run_whispercpp_transcription_with_sapat_in_daytona_workflow.svg b/articles/assets/20260524_run_whispercpp_transcription_with_sapat_in_daytona_workflow.svg
new file mode 100644
index 00000000..ff3ecb56
--- /dev/null
+++ b/articles/assets/20260524_run_whispercpp_transcription_with_sapat_in_daytona_workflow.svg
@@ -0,0 +1,42 @@
+
+  Offline Sapat transcription workflow with whisper.cpp in Daytona
+  A workflow diagram showing a Daytona workspace, local audio conversion, whisper.cpp transcription, transcript review, and reusable text artifacts.
+
+  Daytona workspace boundary
+  Audio, model files, commands, and transcripts stay in the same reproducible environment.
+
+  Input media
+  meeting.mp4
+
+  Sapat
+  ffmpeg to WAV
+
+  whisper.cpp
+  local GGML model
+
+  Transcript
+  meeting.txt
+
+  No cloud upload or API key
+
+  Review, summarize, and archive
+
diff --git a/authors/jamil_ahmadzai.md b/authors/jamil_ahmadzai.md
new file mode 100644
index 00000000..fdd3cb41
--- /dev/null
+++ b/authors/jamil_ahmadzai.md
@@ -0,0 +1,8 @@
+Author: Jamil Ahmadzai
+Title: Software Engineer
+Description: Jamil Ahmadzai builds practical developer tooling and integration guides for AI workflows, with a focus on reproducible environments, automation, and shipping examples that engineers can run and adapt.
+Company Name: Independent
+Company Description: Independent software engineering and technical writing.
+Author Image:
+Company Logo Dark:
+Company Logo White:
diff --git a/definitions/20260524_definition_offline_speech_to_text.md b/definitions/20260524_definition_offline_speech_to_text.md
new file mode 100644
index 00000000..ef54ce1b
--- /dev/null
+++ b/definitions/20260524_definition_offline_speech_to_text.md
@@ -0,0 +1,28 @@
+---
+title: 'Offline Speech-to-Text'
+description: 'Speech recognition that transcribes audio locally without sending recordings to a cloud API.'
+date: 2026-05-24
+author: 'Jamil Ahmadzai'
+---
+
+# Offline Speech-to-Text
+
+## Definition
+
+Offline speech-to-text is the process of converting audio into written text on
+the same machine or workspace where the audio file is stored. Instead of sending
+the file to a hosted transcription API, the workflow runs a local speech
+recognition model and writes the transcript back to local storage.
+
+## Context and Usage
+
+Offline speech-to-text is useful when recordings contain sensitive customer
+calls, internal meetings, product demos, or incident reviews that should not
+leave the development environment. It is also useful for repeatable benchmarks,
+because every engineer can run the same binary, model file, prompt, and input
+clip without depending on cloud quota or a provider outage.
+
+In a Daytona workspace, offline speech-to-text can be combined with a pinned
+toolchain and environment variables. The workspace keeps the model path,
+transcription command, source audio, and generated transcript close together,
+which makes the workflow easier to review and reproduce.