From 3c02481ec5b86eb99e95ce42d18d2e32f684ab4c Mon Sep 17 00:00:00 2001
From: BinkyTwin
Date: Thu, 21 May 2026 10:31:36 +0200
Subject: [PATCH] Add NanoGPT Sapat transcription guide

Signed-off-by: BinkyTwin
---
 ...gpt_transcription_with_sapat_in_daytona.md | 305 ++++++++++++++++++
 ...pt_transcription_with_sapat_in_daytona.svg |  34 ++
 authors/binky_twin.md                         |  10 +
 ...nition_openai_compatible_speech_to_text.md |  23 ++
 4 files changed, 372 insertions(+)
 create mode 100644 articles/20260521_run_nanogpt_transcription_with_sapat_in_daytona.md
 create mode 100644 articles/assets/20260521_run_nanogpt_transcription_with_sapat_in_daytona.svg
 create mode 100644 authors/binky_twin.md
 create mode 100644 definitions/20260521_definition_openai_compatible_speech_to_text.md

diff --git a/articles/20260521_run_nanogpt_transcription_with_sapat_in_daytona.md b/articles/20260521_run_nanogpt_transcription_with_sapat_in_daytona.md
new file mode 100644
index 00000000..9bb3b813
--- /dev/null
+++ b/articles/20260521_run_nanogpt_transcription_with_sapat_in_daytona.md
+---
+title: 'Run NanoGPT Transcription With Sapat'
+description: 'Build a reproducible Daytona workspace for Sapat, route audio through NanoGPT STT, and validate transcripts without committing secrets.'
+date: 2026-05-21
+author: 'Binky Twin'
+tags: ['Daytona', 'Sapat', 'Speech-to-Text', 'NanoGPT']
+---
+
+# Run NanoGPT Transcription With Sapat
+
+## Introduction
+
+AI engineers often need a simple way to turn demo recordings, support calls, and
+technical walkthroughs into text that can be searched, summarized, or handed to
+an agent. The hard part is not only calling a speech-to-text provider. It is
+making the workflow repeatable so another developer can open the same project,
+set the same environment variables, and reproduce the transcript without
+guessing which local dependency or credential was missing.
+
+[Sapat](https://github.com/nkkko/sapat) is a small Python CLI for this job. It
+converts video files to MP3 with `ffmpeg`, sends the audio to a selected
+provider, and writes a `.txt` file next to the source video. This guide adds a
+NanoGPT path to that workflow and runs it inside a Daytona workspace so the
+tooling, secrets, and validation steps are explicit.
+
+The companion implementation is available in
+[nibzard/sapat#41](https://github.com/nibzard/sapat/pull/41). It adds
+`--api nanogpt`, documents the `NANOGPT_*` environment variables, and includes
+mocked tests for request wiring and CLI routing.
+
+![NanoGPT transcription workflow in Daytona](assets/20260521_run_nanogpt_transcription_with_sapat_in_daytona.svg)
+
+## TL;DR
+
+- Use Daytona to open a clean Sapat workspace instead of relying on a hand-tuned
+  local machine.
+- Store `NANOGPT_API_KEY`, `NANOGPT_MODEL`, and `NANOGPT_API_ENDPOINT` as
+  environment variables.
+- Run `sapat .mp4 --api nanogpt` to convert the video, send the MP3 to
+  NanoGPT, and save a transcript.
+- Validate the transcript with a short checklist before feeding it into another
+  AI workflow.
+
+## What The NanoGPT Provider Adds
+
+NanoGPT exposes an
+[OpenAI-compatible speech-to-text endpoint](https://docs.nano-gpt.com/api-reference/endpoint/audio-transcriptions)
+at `POST https://nano-gpt.com/api/v1/audio/transcriptions`. The request accepts
+multipart form data with a required audio `file`, a required `model`, and
+optional parameters such as `language`. That shape maps well onto Sapat's
+current provider model because Sapat already prepares an MP3 file and passes
+language, prompt, temperature, and response format options to other providers.
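The request shape described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Sapat's actual provider code: `build_transcription_request` is a hypothetical helper, and forwarding `prompt` to NanoGPT is assumed from the OpenAI-style request shape rather than confirmed by the endpoint description quoted here.

```python
# Endpoint and default model as documented in this guide.
DEFAULT_ENDPOINT = "https://nano-gpt.com/api/v1/audio/transcriptions"
DEFAULT_MODEL = "Whisper-Large-V3"


def build_transcription_request(api_key, model=DEFAULT_MODEL, language=None,
                                prompt=None, endpoint=DEFAULT_ENDPOINT):
    """Return (url, headers, form_fields) for an OpenAI-style STT call.

    Hypothetical helper for illustration; it is not part of Sapat.
    The required audio `file` part is attached separately by the HTTP client.
    """
    if not api_key:
        raise ValueError("an API key is required")
    headers = {"Authorization": f"Bearer {api_key}"}
    fields = {"model": model}  # required form field
    if language:
        fields["language"] = language  # optional form field
    if prompt:
        # Assumption: the provider accepts an OpenAI-style `prompt` field.
        fields["prompt"] = prompt
    return endpoint, headers, fields
```

A client such as `requests` could then send the call with `requests.post(url, headers=headers, data=fields, files={"file": audio_handle})`, where `audio_handle` is the opened MP3. Keeping the file part out of the helper leaves it free of I/O, which makes the request wiring easy to check in mocked tests.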
+
+The NanoGPT Sapat provider keeps that same pattern:
+
+```text
+video.mp4 -> ffmpeg MP3 conversion -> NanoGPT STT request -> video.txt
+```
+
+The provider reads three core environment variables:
+
+Variable | Purpose | Default
+--- | --- | ---
+`NANOGPT_API_KEY` | API key sent as `Authorization: Bearer ...` | Required
+`NANOGPT_MODEL` | Speech-to-text model ID | `Whisper-Large-V3`
+`NANOGPT_API_ENDPOINT` | OpenAI-compatible endpoint URL | `https://nano-gpt.com/api/v1/audio/transcriptions`
+
+This keeps credentials out of code and lets you switch NanoGPT models without
+editing the CLI implementation.
+
+## Prerequisites
+
+Before you start, make sure you have:
+
+- A Daytona installation that can create workspaces.
+- A NanoGPT API key with access to speech-to-text models.
+- A short `.mp4` file for testing.
+- Enough local or workspace disk space for Sapat to create a temporary MP3.
+
+The guide assumes the companion Sapat PR is available on a branch. If the PR has
+not been merged yet, create the workspace from the fork and branch shown below.
+After it is merged, use the upstream Sapat repository directly.
+
+## Create The Daytona Workspace
+
+Create a workspace from the Sapat branch that contains the NanoGPT provider:
+
+```bash
+daytona create https://github.com/BinkyTwin/sapat --branch codex/add-nanogpt-transcription --code
+```
+
+When the editor opens, install the package in editable mode:
+
+```bash
+python -m pip install -e .
+```
+
+Sapat uses `ffmpeg` to convert source video files to MP3. Check that it is
+available in the workspace:
+
+```bash
+ffmpeg -version
+```
+
+If `ffmpeg` is missing, install it in the workspace image or through the package
+manager available in your Daytona environment. The important point is to make
+that setup part of the workspace, not an undocumented local-machine step.
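One way to make the `ffmpeg` check part of the workspace rather than a manual step is a small preflight function. The sketch below is an assumption about how such a check could look; `ffmpeg_version_line` is a hypothetical name and is not taken from Sapat.

```python
import shutil
import subprocess


def ffmpeg_version_line(binary="ffmpeg"):
    """Return the first line of `<binary> -version`, or None if it is not on PATH.

    Hypothetical preflight helper for illustration; not part of Sapat.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    # check=False: a non-zero exit code is reported through the output, not an exception.
    result = subprocess.run([path, "-version"], capture_output=True,
                            text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""
```

A setup script could call `ffmpeg_version_line()` and stop with a clear message when it returns `None`, so a missing binary is reported before any video is processed.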
+
+## Configure NanoGPT Without Committing Secrets
+
+Create a local `.env` file in the workspace root:
+
+```bash
+NANOGPT_API_KEY=replace_with_your_key
+NANOGPT_MODEL=Whisper-Large-V3
+NANOGPT_API_ENDPOINT=https://nano-gpt.com/api/v1/audio/transcriptions
+NANOGPT_CHAT_MODEL=replace_with_your_chat_model_for_correction
+NANOGPT_CHAT_ENDPOINT=https://nano-gpt.com/api/v1/chat/completions
+```
+
+Do not commit `.env`. For team workflows, store those values through your
+workspace secret process and keep only an `.env.example` in source control. A
+safe example file can look like this:
+
+```bash
+NANOGPT_API_KEY=
+NANOGPT_MODEL=Whisper-Large-V3
+NANOGPT_API_ENDPOINT=https://nano-gpt.com/api/v1/audio/transcriptions
+NANOGPT_CHAT_MODEL=
+NANOGPT_CHAT_ENDPOINT=https://nano-gpt.com/api/v1/chat/completions
+```
+
+This is also where Daytona helps: the workspace can be rebuilt, but the secret
+boundary stays clear. The code knows which variables it needs, and the
+credentials remain outside the repository.
+
+## Run A First Transcription
+
+Copy a short test video into the workspace, then run:
+
+```bash
+sapat demo.mp4 --api nanogpt --quality M --language en
+```
+
+Sapat will:
+
+1. Convert `demo.mp4` to `demo.mp3`.
+2. Send `demo.mp3` to NanoGPT with the configured model.
+3. Save the returned transcript as `demo.txt`.
+4. Remove the temporary MP3 file after the transcript is written.
+
+For a higher bitrate MP3, use:
+
+```bash
+sapat demo.mp4 --api nanogpt --quality H --language en
+```
+
+For domain-specific words, pass a prompt:
+
+```bash
+sapat demo.mp4 --api nanogpt --language en --prompt "Product names: Daytona, Sapat, NanoGPT"
+```
+
+The prompt is useful for product names, speaker names, acronyms, and internal
+tool names that would otherwise be easy for a speech model to misspell.
+
+## Process A Small Recording Folder
+
+Sapat can also process every `.mp4` file in a directory. This is useful when you
+have a handful of short demos from the same feature review or a sequence of
+screen recordings from one debugging session.
+
+Create a folder for the recordings:
+
+```bash
+mkdir recordings
+```
+
+Copy the videos into that folder, then run:
+
+```bash
+sapat recordings --api nanogpt --quality M --language en --prompt "Product names: Daytona, Sapat, NanoGPT"
+```
+
+Sapat will create one `.txt` file for each `.mp4` file. Keep the file names
+descriptive before you run the command. A transcript named
+`checkout_error_reproduction.txt` is much easier to reuse than
+`screen-recording-4.txt`.
+
+For batch runs, start with two or three recordings before sending a larger
+folder. That gives you a quick check on cost, file size, and transcript quality.
+If the first pass looks good, scale the same command to the rest of the folder.
+If it looks weak, fix the prompt or audio quality before spending credits on the
+full batch.
+
+## Compare Provider Behavior
+
+One reason to add NanoGPT to Sapat is provider comparison. A Daytona workspace
+lets you run a repeatable test without changing machines or hidden shell state.
+Keep one short sample file and run the same source through two providers:
+
+```bash
+sapat demo.mp4 --api nanogpt --quality M --language en
+mv demo.txt demo.nanogpt.txt
+sapat demo.mp4 --api openai --quality M --language en
+mv demo.txt demo.openai.txt
+```
+
+Then compare the transcripts:
+
+```bash
+diff -u demo.nanogpt.txt demo.openai.txt
+```
+
+The goal is not to declare a universal winner from one file. The goal is to
+spot the practical differences that matter for your team: acronyms, noisy
+audio, punctuation, code terms, cost, latency, and failure modes.
+
+## Validate The Transcript
+
+Do not hand a raw transcript straight to another agent. Run a small validation
+pass first:
+
+Check | What To Look For
+--- | ---
+Completeness | The transcript covers the full video, not just the first segment.
+Names | Product, speaker, and company names match the prompt vocabulary.
+Numbers | Dates, amounts, version numbers, and ports are accurate.
+Boundaries | Private customer data or secrets are removed before sharing.
+Follow-up readiness | The transcript is clear enough for summarization or issue creation.
+
+For quick review, open the generated text:
+
+```bash
+sed -n '1,160p' demo.txt
+```
+
+If the transcript will feed a planning agent, add a short header manually:
+
+```text
+Source: demo.mp4
+Provider: NanoGPT / Whisper-Large-V3
+Reviewed: yes
+Notes: Speaker names corrected, timestamps not included
+```
+
+That small bit of provenance prevents confusion later when the transcript moves
+between tools.
+
+## Troubleshooting
+
+If the CLI fails before sending the request, check the local setup first:
+
+- `ffmpeg` is installed and reachable from the workspace shell.
+- The input path points to a real `.mp4` file.
+- The workspace has permission to write the `.mp3` and `.txt` sidecar files.
+
+If the request reaches NanoGPT but fails, check the provider configuration:
+
+- `NANOGPT_API_KEY` is set in the workspace session.
+- `NANOGPT_MODEL` names a speech-to-text model available to your account.
+- `NANOGPT_API_ENDPOINT` points to the OpenAI-compatible transcription endpoint.
+- The file is small enough for the selected provider and model.
+
+If the transcript quality is weak, improve the input instead of only changing
+models. Trim silence, avoid background music, use `--quality H` for important
+recordings, and pass a prompt with product names or vocabulary that appears in
+the audio.
+
+## How To Keep This Reproducible
+
+A one-off transcript is easy. A repeatable transcription workflow needs a little
+discipline:
+
+1. Keep provider selection in the command: `--api nanogpt`.
+2. Keep provider configuration in environment variables.
+3. Keep a tiny sample file for smoke testing.
+4. Keep validation notes next to the generated transcript.
+5. Keep the workspace setup in Daytona so other contributors can reproduce it.
+
+That makes Sapat useful beyond a single video. You can run the same command for
+release demos, support calls, design reviews, incident walkthroughs, or training
+material, then compare provider behavior by changing only `--api` and the
+environment variables.
+
+## Conclusion
+
+The NanoGPT provider gives Sapat another practical transcription route while
+keeping the CLI shape familiar: convert the video, call a provider, save a text
+file. Running the workflow in Daytona makes the surrounding details visible:
+where credentials live, how `ffmpeg` is provided, which command produced the
+transcript, and how the output was checked before reuse.
+
+For AI engineering teams, that reproducibility matters as much as the transcript
+itself. Clean transcripts become prompts, test fixtures, issue notes, release
+summaries, and knowledge-base entries. A workspace-backed Sapat flow keeps that
+pipeline simple enough to trust.
+
+## References
+
+- [Sapat repository](https://github.com/nkkko/sapat)
+- [NanoGPT OpenAI-compatible STT endpoint](https://docs.nano-gpt.com/api-reference/endpoint/audio-transcriptions)
+- [NanoGPT Speech-to-Text overview](https://docs.nano-gpt.com/api-reference/speech-to-text)
+- [Daytona documentation](https://www.daytona.io/docs/)
+- [Companion NanoGPT provider PR](https://github.com/nibzard/sapat/pull/41)
diff --git a/articles/assets/20260521_run_nanogpt_transcription_with_sapat_in_daytona.svg b/articles/assets/20260521_run_nanogpt_transcription_with_sapat_in_daytona.svg
new file mode 100644
index 00000000..c8b56945
--- /dev/null
+++ b/articles/assets/20260521_run_nanogpt_transcription_with_sapat_in_daytona.svg
[SVG markup not recoverable (34 added lines); surviving text labels only]
Title: NanoGPT transcription workflow in Daytona
Description: A diagram showing Daytona workspace setup, Sapat conversion, NanoGPT transcription, and transcript review.
Heading: Sapat + NanoGPT in a Daytona Workspace
Subheading: A reproducible path from video input to review-ready transcript without committing API keys.
Boxes: Daytona (create workspace, set env secrets); Sapat CLI (convert to MP3, route --api nanogpt); NanoGPT (STT endpoint, OpenAI-compatible); Review (check text, save .txt)
Command: sapat demo.mp4 --api nanogpt --quality M --language en
diff --git a/authors/binky_twin.md b/authors/binky_twin.md
new file mode 100644
index 00000000..8b77c849
--- /dev/null
+++ b/authors/binky_twin.md
+Author: Binky Twin
+Title:
+Description: Binky Twin contributes practical developer workflow guides focused
+on AI tooling, reproducible workspaces, and automation that can be verified from
+source code.
+Company Name:
+Company Description:
+Author Image:
+Company Logo Dark:
+Company Logo White:
diff --git a/definitions/20260521_definition_openai_compatible_speech_to_text.md b/definitions/20260521_definition_openai_compatible_speech_to_text.md
new file mode 100644
index 00000000..61de0c57
--- /dev/null
+++ b/definitions/20260521_definition_openai_compatible_speech_to_text.md
+---
+title: 'OpenAI-Compatible Speech-to-Text'
+description: 'A speech transcription API shape that accepts OpenAI-style audio transcription requests while routing them to another provider.'
+date: 2026-05-21
+author: 'Binky Twin'
+---
+
+# OpenAI-Compatible Speech-to-Text
+
+## Definition
+
+OpenAI-compatible speech-to-text is an API pattern where a provider accepts
+requests shaped like OpenAI audio transcription calls, usually multipart uploads
+with a `file`, `model`, optional `language`, and optional prompt or output
+format parameters.
+
+## Context and Usage
+
+This compatibility layer lets developers switch transcription providers without
+rewriting their entire workflow. A tool can keep one request-building path, then
+change credentials, endpoint URL, and model name through environment variables.
+It is especially useful in reproducible workspaces where the same command should
+run against several providers for cost, latency, or availability comparisons.
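The idea of one request-building path with provider details supplied by environment variables can be sketched as a small configuration loader. This is an illustration under assumptions: `load_provider_config` is a hypothetical helper, and the `<PREFIX>_API_KEY` / `<PREFIX>_MODEL` / `<PREFIX>_API_ENDPOINT` naming is generalized from the `NANOGPT_*` variables in the guide; other providers may use different names.

```python
import os


def load_provider_config(prefix, env=None, default_model=None,
                         default_endpoint=None):
    """Read <PREFIX>_API_KEY, <PREFIX>_MODEL, and <PREFIX>_API_ENDPOINT.

    Hypothetical helper for illustration. `env` defaults to os.environ and
    can be replaced with a plain dict in tests.
    """
    env = os.environ if env is None else env
    api_key = env.get(f"{prefix}_API_KEY")
    if not api_key:
        raise RuntimeError(f"{prefix}_API_KEY is not set")
    return {
        "api_key": api_key,
        # Fall back to the supplied defaults when a variable is unset or empty.
        "model": env.get(f"{prefix}_MODEL") or default_model,
        "endpoint": env.get(f"{prefix}_API_ENDPOINT") or default_endpoint,
    }
```

With a loader like this, switching providers means changing only the prefix and the environment, while the code that builds and sends the multipart request stays the same.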