280 changes: 280 additions & 0 deletions articles/20260525_run_oracle_cloud_ai_speech_with_sapat_in_daytona.md
@@ -0,0 +1,280 @@
---
title: "Run Oracle Cloud AI Speech with Sapat in Daytona"
description: "Build a Daytona workspace that transcribes video with Sapat, OCI Object Storage, and Oracle Cloud AI Speech."
date: 2026-05-25
author: "Sebastien Andersson"
tags: ["daytona", "sapat", "oracle", "oci", "speech-to-text"]
---

# Run Oracle Cloud AI Speech with Sapat in Daytona

## Introduction

[Sapat](https://github.com/nkkko/sapat) is a small Python command-line tool for turning video files into text transcripts. It converts video to MP3 with `ffmpeg`, sends the audio to a selected transcription provider, and writes a `.txt` transcript next to the original media file.

That makes it a good fit for AI engineers who want a repeatable way to turn demos, customer calls, lectures, or product walkthroughs into searchable project material.

This guide shows how to run Sapat in a Daytona workspace with [Oracle Cloud AI Speech](/definitions/20260525_definition_oracle_cloud_ai_speech.md) as the transcription backend.

The companion Sapat provider branch adds `--api oracle`, uploads the generated MP3 to OCI Object Storage, creates an Oracle Cloud AI Speech transcription job, polls the task, and reads the transcript JSON back from Object Storage.

The result is a workflow you can rebuild in a clean Daytona workspace instead of keeping provider setup hidden on one developer laptop.

## TL;DR

- Use Daytona to create a clean workspace for the Sapat repository.
- Install Sapat with the Oracle optional dependency: `pip install "sapat[oracle]"`.
- Configure OCI credentials, compartment ID, namespace, and Object Storage buckets in `.env`.
- Run Sapat with `--api oracle` and an explicit locale such as `en-US`.
- Keep OCI keys, `.env` files, audio uploads, and transcripts out of git.

![Oracle Cloud AI Speech with Sapat workflow](assets/20260525_run_oracle_cloud_ai_speech_with_sapat_in_daytona_workflow.svg)

## What the Oracle provider adds to Sapat

The current Sapat workflow is intentionally direct. A user points the CLI at a video file or a directory of `.mp4` files, chooses a quality level, and picks an API with `--api`. The existing providers follow an OpenAI-style request flow: Sapat sends the audio in one request and receives the transcript in the response.

Oracle Cloud AI Speech is different because the service is built around asynchronous jobs and Object Storage locations.

The Oracle provider keeps that difference behind the same Sapat interface. After Sapat creates an MP3 from your video, the provider uploads that MP3 to an input bucket, creates a speech transcription job, and waits for the transcription task to succeed.

It then downloads the transcript JSON from the output bucket, extracts the text, and lets Sapat save the `.txt` result as usual.

That means users still run a simple CLI command:

```bash
sapat meeting.mp4 --quality M --language en-US --api oracle
```

Behind the scenes, the command uses the OCI Python SDK. The provider reads its settings from environment variables so the same code works in a Daytona workspace, a CI validation job, or a local shell without hard-coding any tenancy details.

## Prerequisites

Before starting, make sure you have these pieces ready:

- A Daytona installation and a configured Daytona target.
- Python 3.10 or later in the workspace.
- `ffmpeg`, because Sapat converts video files to MP3 before transcription.
- An Oracle Cloud Infrastructure tenancy with Object Storage and AI Speech access.
- An OCI config file or equivalent authentication setup available inside the workspace.
- Two Object Storage buckets, or one bucket used for both input and output.

This guide assumes you are using the standard OCI config file at `~/.oci/config`. If your team uses instance principals, resource principals, or a setup backed by a secret manager, keep the Sapat environment variable names the same but adapt the authentication layer to your platform policy.

## Step 1: Create a Daytona workspace

Create a workspace from the Sapat repository:

```bash
daytona create https://github.com/nkkko/sapat --code
```

If your Daytona CLI opens the repository in your editor, use the integrated terminal for the rest of the commands. Otherwise, enter the workspace shell from your Daytona dashboard or CLI and move into the repository directory:

```bash
cd sapat
```

Confirm that Python and `ffmpeg` are available:

```bash
python --version
ffmpeg -version
```

If `ffmpeg` is missing, install it with the package manager used by your workspace image. For Debian-based or Ubuntu-based images, that is usually:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

## Step 2: Install Sapat with Oracle support

Oracle support uses the OCI Python SDK, so install Sapat with its Oracle extra:

```bash
python -m pip install --upgrade pip
python -m pip install "sapat[oracle]"
```

If the Oracle provider has not been released yet, or you are testing it from a checked-out branch, install it from the repository root in editable mode instead:

```bash
python -m pip install -e ".[oracle]"
```

Check that the CLI exposes the Oracle provider:

```bash
sapat --help
```

The API option should include `oracle` alongside the existing providers:

```text
--api [openai|groq|azure|oracle]
```

The companion implementation used for this guide is tracked in the [Sapat Oracle provider PR](https://github.com/nibzard/sapat/pull/48).

## Step 3: Prepare OCI Object Storage

Oracle Cloud AI Speech reads media from Object Storage and writes transcription output back to Object Storage. Create or choose:

- An input bucket for temporary MP3 uploads.
- An output bucket for transcript JSON files.
- A prefix for Sapat transcript job output, such as `sapat-transcripts`.

The IAM policy needs to let the identity used in your workspace read and write the chosen buckets and create AI Speech transcription jobs in the compartment. Keep the policy scoped to the compartment and buckets used for this workflow where possible.

You also need the Object Storage namespace. You can find it in the OCI Console under Object Storage settings, or with the OCI CLI if it is configured:

```bash
oci os ns get
```

Do not put tenancy OCIDs, private keys, or generated transcripts into the repository. Treat recorded audio and transcripts as sensitive project data unless you have a clear retention policy.

## Step 4: Configure the Sapat Oracle environment

Create a local `.env` file in the workspace:

```bash
touch .env
```

Add the Oracle settings:

```bash
OCI_CONFIG_FILE=~/.oci/config
OCI_PROFILE=DEFAULT
OCI_COMPARTMENT_ID=ocid1.compartment.oc1..example
OCI_OBJECT_STORAGE_NAMESPACE=your_namespace
OCI_SPEECH_INPUT_BUCKET=sapat-input
OCI_SPEECH_OUTPUT_BUCKET=sapat-output
OCI_SPEECH_OUTPUT_PREFIX=sapat-transcripts
OCI_SPEECH_MODEL_TYPE=ORACLE
OCI_SPEECH_LANGUAGE_CODE=en-US
OCI_SPEECH_WAIT_SECONDS=900
OCI_SPEECH_POLL_INTERVAL_SECONDS=5
OCI_SPEECH_CLEANUP_INPUT=true
```

The `OCI_SPEECH_MODEL_TYPE` value defaults to `ORACLE`. With that model, pass an explicit locale like `en-US`, `es-ES`, or `pt-BR`.

If you configure an OCI Whisper model such as `WHISPER_MEDIUM` or `WHISPER_LARGE_V2`, you can adapt the language behavior to that model, but keep the guide's first run simple: use `ORACLE` and `en-US`.
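A quick format check can catch a short code such as `en` before a job is submitted. The sketch below is only an illustration: the function name `is_explicit_locale` is hypothetical, it is not part of Sapat, and it checks the shape of the string rather than whether Oracle Cloud AI Speech actually supports that locale.

```python
import re

# Two lowercase letters, a hyphen, two uppercase letters, e.g. "en-US".
# Assumption: this only validates the shape of the locale string; the
# service's own list of supported locales remains authoritative.
_LOCALE_PATTERN = re.compile(r"[a-z]{2}-[A-Z]{2}")


def is_explicit_locale(code: str) -> bool:
    """Return True when code looks like an explicit locale such as en-US."""
    return _LOCALE_PATTERN.fullmatch(code) is not None


if __name__ == "__main__":
    for candidate in ("en-US", "pt-BR", "en"):
        print(candidate, is_explicit_locale(candidate))
```

A check like this is cheap to run before the upload step, so a malformed locale fails locally instead of after the MP3 has already been sent to Object Storage.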

The cleanup flag controls whether Sapat deletes the uploaded input object after the transcription attempt finishes. Leaving `OCI_SPEECH_CLEANUP_INPUT=true` is a good default for development workspaces because it prevents old media files from piling up in the input bucket.

The generated transcript JSON remains in the output bucket so you can inspect the raw response if something looks wrong.
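To make the role of each variable concrete, here is a minimal sketch of how a provider could read these settings from the environment. It is not Sapat's actual code: `OracleSettings` and `load_settings` are hypothetical names, and the fallback values for the wait, poll, and cleanup settings are assumptions copied from the sample `.env` above. Only the `ORACLE` model default is stated by this guide.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# The four variables this guide treats as required.
REQUIRED = (
    "OCI_COMPARTMENT_ID",
    "OCI_OBJECT_STORAGE_NAMESPACE",
    "OCI_SPEECH_INPUT_BUCKET",
    "OCI_SPEECH_OUTPUT_BUCKET",
)


@dataclass(frozen=True)
class OracleSettings:
    compartment_id: str
    namespace: str
    input_bucket: str
    output_bucket: str
    model_type: str
    wait_seconds: int
    poll_interval_seconds: int
    cleanup_input: bool


def load_settings(env: Mapping[str, str] = os.environ) -> OracleSettings:
    """Build settings from a mapping, failing fast on missing values."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing Oracle settings: " + ", ".join(missing))
    cleanup = env.get("OCI_SPEECH_CLEANUP_INPUT", "true").strip().lower()
    return OracleSettings(
        compartment_id=env["OCI_COMPARTMENT_ID"],
        namespace=env["OCI_OBJECT_STORAGE_NAMESPACE"],
        input_bucket=env["OCI_SPEECH_INPUT_BUCKET"],
        output_bucket=env["OCI_SPEECH_OUTPUT_BUCKET"],
        model_type=env.get("OCI_SPEECH_MODEL_TYPE", "ORACLE"),
        # Assumed fallbacks, taken from the sample .env in this guide.
        wait_seconds=int(env.get("OCI_SPEECH_WAIT_SECONDS", "900")),
        poll_interval_seconds=int(env.get("OCI_SPEECH_POLL_INTERVAL_SECONDS", "5")),
        cleanup_input=cleanup in {"1", "true", "yes"},
    )
```

Passing the environment in as a mapping keeps the function easy to test with a plain dictionary, and reporting every missing name at once avoids fixing the `.env` file one variable at a time.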

## Step 5: Add a sample video

Copy a short `.mp4` file into the workspace. A small file is better for the first run because it keeps upload, transcription, and polling time low:

```bash
mkdir -p samples
cp ~/Downloads/team-demo.mp4 samples/team-demo.mp4
```

If you do not have a sample video, create a tiny test clip with `ffmpeg`:

```bash
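# Make sure the target directory exists before writing the clip.
mkdir -p samples
# Five seconds of test-pattern video with a 1 kHz tone as the audio track.
ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 \
  -f lavfi -i sine=frequency=1000:duration=5 \
  -pix_fmt yuv420p -shortest samples/test-video.mp4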
```

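That generated clip is enough to verify the media conversion and job flow, but it contains only a test tone, so it will not produce a meaningful transcript. If you use it, substitute `samples/test-video.mp4` for `samples/team-demo.mp4` in the commands that follow. Use a real spoken sample before treating the workflow as production-ready.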

## Step 6: Run the transcription

Run Sapat with the Oracle provider:

```bash
sapat samples/team-demo.mp4 --quality M --language en-US --api oracle
```

Sapat will create a temporary MP3 from the video file, upload it to the configured input bucket, create an Oracle Cloud AI Speech job, poll the transcription task, and read the transcript object from the output bucket.
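The order of those steps, and where cleanup fits, can be sketched as follows. This is an illustration under stated assumptions, not the provider's real implementation: `transcribe_with_cleanup` and its four callables are hypothetical stand-ins for the OCI SDK calls, and the `finally` block reflects this guide's description that the input object is deleted after the transcription attempt finishes, whether or not it succeeded.

```python
from typing import Callable


def transcribe_with_cleanup(
    upload: Callable[[], str],
    run_job: Callable[[str], str],
    download: Callable[[str], str],
    delete: Callable[[str], None],
    cleanup_input: bool = True,
) -> str:
    """Upload audio, run the job, fetch the transcript, then clean up.

    All four callables are hypothetical placeholders:
    - upload() stores the MP3 and returns its object name
    - run_job(name) creates the job, waits, and returns the output object name
    - download(name) returns the transcript text
    - delete(name) removes the uploaded input object
    """
    object_name = upload()
    try:
        output_object = run_job(object_name)
        return download(output_object)
    finally:
        # Runs on success and on failure, so input media does not pile up.
        if cleanup_input:
            delete(object_name)
```

Injecting the steps as callables keeps the control flow testable without cloud credentials, which is useful when validating the workflow in a fresh workspace.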

When the command succeeds, check for a `.txt` file next to the input video:

```bash
ls samples
cat samples/team-demo.txt
```

For a directory of videos, pass the directory instead of one file:

```bash
sapat samples --quality M --language en-US --api oracle
```

Use the directory mode when you want the same provider settings applied to a batch of meeting recordings, lecture segments, or product walkthroughs.
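After a batch run, it helps to know which videos did not end up with a transcript. The helper below is a small sketch with a hypothetical name, `videos_missing_transcripts`. It assumes what this guide describes, that each `.mp4` gets a `.txt` file with the same base name in the same directory, and it works on a list of file names so it can be checked without touching the file system.

```python
from pathlib import PurePath
from typing import Iterable


def videos_missing_transcripts(names: Iterable[str]) -> list[str]:
    """Return .mp4 names that have no .txt file with the same stem."""
    names = list(names)
    transcripts = {
        PurePath(name).stem for name in names if PurePath(name).suffix == ".txt"
    }
    return sorted(
        name
        for name in names
        if PurePath(name).suffix == ".mp4"
        and PurePath(name).stem not in transcripts
    )
```

In a workspace you could pass it `os.listdir("samples")` and rerun Sapat only for the names it returns.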

## Step 7: Confirm the workflow from OCI

After the first run, confirm both sides of the workflow:

1. In the Sapat workspace, confirm the `.txt` transcript exists and contains text from the recording.
2. In OCI Object Storage, confirm the output bucket contains a transcript JSON file under `sapat-transcripts`.
3. In OCI AI Speech, confirm the transcription job reached a succeeded state.
4. If cleanup is enabled, confirm the temporary input MP3 object was removed from the input bucket.

This check matters because cloud transcription problems often happen at the edges: the media uploaded successfully but the job could not read it, the job succeeded but output went to a different prefix, or the transcript exists but the local workflow is reading the wrong object.

Verifying both the Daytona workspace and OCI resources catches those issues early.
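When the local `.txt` file looks wrong, comparing it with the raw JSON in the output bucket is a useful check. The sketch below pulls the text out of a downloaded result file. It assumes the JSON has a top-level `transcriptions` list whose items carry a `transcription` string; treat that shape as an assumption and confirm it against an actual output object from your bucket. The function name `extract_text` is hypothetical.

```python
import json


def extract_text(raw_json: str) -> str:
    """Join the transcription strings found in a result document.

    Assumed shape (verify against your own output object):
    {"transcriptions": [{"transcription": "some text", ...}, ...]}
    """
    data = json.loads(raw_json)
    parts = [
        str(item.get("transcription", "")).strip()
        for item in data.get("transcriptions", [])
    ]
    # Skip empty entries so blank segments do not add empty lines.
    return "\n".join(part for part in parts if part)
```

If the function returns an empty string for a file that clearly contains speech, inspect the JSON by hand: the job may have written a different structure, or the workflow may be reading an object from a previous run.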

## Troubleshooting

**Problem:** Sapat says an Oracle setting is missing.

**Solution:** Check `.env` and confirm the required variables are set: `OCI_COMPARTMENT_ID`, `OCI_OBJECT_STORAGE_NAMESPACE`, `OCI_SPEECH_INPUT_BUCKET`, and `OCI_SPEECH_OUTPUT_BUCKET`. Restart the shell or rerun the command from the directory where `.env` exists.

**Problem:** OCI authentication fails.

**Solution:** Confirm `OCI_CONFIG_FILE` points to a readable config file inside the Daytona workspace and that `OCI_PROFILE` exists in that file. If the private key is mounted into the workspace, verify the path in the config file matches the mounted location.

**Problem:** The transcription job times out.

**Solution:** Increase `OCI_SPEECH_WAIT_SECONDS` for longer media files. For example, set it to `1800` for a 30-minute wait window. Also check the AI Speech job in the OCI Console to see whether it is queued, running, or failed.
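The interaction between the wait window and the poll interval is easier to see in code. The loop below is a hedged sketch, not Sapat's implementation: `wait_for_success` is a hypothetical name, `get_state` stands in for whatever call fetches the job or task status, and the state names `SUCCEEDED`, `FAILED`, and `CANCELED` are assumptions to check against the OCI documentation.

```python
import time
from typing import Callable

# Assumed terminal states that should stop polling with an error.
TERMINAL_FAILURES = {"FAILED", "CANCELED"}


def wait_for_success(
    get_state: Callable[[], str],
    wait_seconds: float = 900,
    poll_interval: float = 5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state until success, a terminal failure, or the deadline."""
    deadline = clock() + wait_seconds
    while True:
        state = get_state()
        if state == "SUCCEEDED":
            return state
        if state in TERMINAL_FAILURES:
            raise RuntimeError(f"Transcription ended in state {state}")
        if clock() >= deadline:
            raise TimeoutError(f"Still {state} after {wait_seconds} seconds")
        sleep(poll_interval)
```

With the sample values of `900` and `5`, a loop like this makes roughly 180 status checks before giving up. Raising the wait window extends the deadline without changing how often the service is queried.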

**Problem:** The command works but the transcript is empty or low quality.

**Solution:** Start with a short, clear audio sample and `--quality H`. Confirm the source file actually contains speech. For noisy recordings, preprocess the audio before running Sapat or use a clearer sample to verify the integration first.

**Problem:** Output objects accumulate in the bucket.

**Solution:** Keep the output prefix dedicated to Sapat, then apply an Object Storage lifecycle policy that expires old transcript JSON files after your retention window. Do not enable automatic deletion until you have confirmed how long the transcripts must be kept.

## Security and cleanup notes

A transcription workflow handles sensitive material by default. Even a short meeting clip can include customer names, internal roadmap details, or private credentials spoken during a demo. Keep these rules in place:

- Never commit `.env`, OCI config files, private keys, media files, MP3 conversions, or generated transcripts.
- Use least-privilege IAM policies for the buckets and compartment.
- Keep input and output buckets separate when your retention rules differ.
- Use short-lived workspace secrets or a secrets manager when you move beyond local development.
- Clean up sample media from the workspace after validation.

In a Daytona workflow, the cleanest operating model is to keep the repository reproducible and the secrets external. The repo should explain the variables and commands, while the workspace receives credentials through your approved secret path.

## Conclusion

Running Sapat with Oracle Cloud AI Speech gives AI engineers a repeatable batch transcription workflow that fits OCI-heavy environments. Daytona provides the clean workspace, Sapat handles video conversion and provider routing, and Object Storage gives AI Speech stable input and output locations.

The Oracle provider turns the asynchronous job flow into a normal `sapat ... --api oracle` command.

The important part is not just adding another provider name. The useful workflow is the complete loop: create a clean workspace, configure credentials without committing secrets, run a real sample, verify both local output and OCI job state, and document how to troubleshoot the moving pieces.

Once that loop works, the same pattern can support larger transcription batches, cleaner handoff notes, and auditable transcript generation for AI projects.

## References

- [Sapat repository](https://github.com/nkkko/sapat)
- [Daytona documentation](https://www.daytona.io/docs)
- [Oracle Cloud AI Speech documentation](https://docs.oracle.com/en-us/iaas/Content/speech/using/speech.htm)
- [OCI Python SDK AI Speech client](https://docs.oracle.com/en-us/iaas/tools/python/latest/api/ai_speech/client/oci.ai_speech.AIServiceSpeechClient.html)
- [OCI Object Storage documentation](https://docs.oracle.com/en-us/iaas/Content/Object/home.htm)
- [ffmpeg documentation](https://ffmpeg.org/documentation.html)
2 changes: 2 additions & 0 deletions authors/sebastien_andersson.md
@@ -0,0 +1,2 @@
Author: Sebastien Andersson Title: Software Engineer Description: Sebastien builds practical automation, developer tooling, and AI-assisted software workflows with a focus on reproducible systems, clear tests, and maintainable integrations.
Author Image: <https://github.com/LubuSeb.png> Author GitHub: <https://github.com/LubuSeb>
20 changes: 20 additions & 0 deletions definitions/20260525_definition_oracle_cloud_ai_speech.md
@@ -0,0 +1,20 @@
---
title: "Oracle Cloud AI Speech"
description: "Oracle Cloud AI Speech is a managed service for converting audio and video speech into text through asynchronous transcription jobs."
date: 2026-05-25
author: "Sebastien Andersson"
---

# Oracle Cloud AI Speech

## Definition

Oracle Cloud AI Speech is a managed speech-to-text service in Oracle Cloud Infrastructure. It accepts audio or video input from Object Storage, creates an asynchronous transcription job, and writes the transcription output back to Object Storage as a JSON result.

## Context and Usage

AI engineers use Oracle Cloud AI Speech when they need transcription in an OCI-based environment, want cloud-hosted batch processing, or already keep media files in Object Storage.

In an application pipeline, the usual flow is to upload an audio object, create a speech transcription job with a compartment OCID and output location, poll the job task until it succeeds, then read the generated transcript JSON from the configured bucket.

The model configuration controls whether the job uses Oracle's standard speech model or a Whisper model variant exposed by OCI. For the standard Oracle model, use explicit locale codes such as `en-US`, `es-ES`, or `pt-BR` rather than short language codes.