
Whisper #2

Open
uditmanav17 wants to merge 23 commits into main from whisper

Conversation


uditmanav17 (Owner) commented Apr 23, 2026

Summary by Sourcery

Introduce a Streamlit-based speech-to-text application using OpenAI Whisper with optional YouTube audio input and Dockerized deployment.

New Features:

  • Add Streamlit UI to transcribe uploaded audio files or YouTube videos using the Whisper base model with optional timestamps and language selection.
  • Add SRT-like postprocessed transcription output with in-app viewing and basic error handling for invalid or long YouTube inputs.

Enhancements:

  • Provide Streamlit configuration for dark-themed UI and basic app settings.
  • Add Dockerfile and docker-compose setup to containerize and run the Streamlit transcription app locally or in cloud environments.

Build:

  • Define Python dependencies and system packages (including ffmpeg and Whisper-related libraries) required to run the transcription app.

Documentation:

  • Add README documenting app usage, deployment via Docker and Docker Playground, and planned future improvements.


sourcery-ai Bot commented Apr 23, 2026

Reviewer's Guide

Adds a new Streamlit-based Whisper speech-to-text application (with YouTube/audio upload support) plus containerization and deployment scaffolding for local, Docker Playground, and Streamlit Cloud deployments.

Sequence diagram for Whisper transcription flow in Streamlit app

sequenceDiagram
    actor User
    participant Browser
    participant StreamlitApp
    participant WhisperModel
    participant YTDLP
    participant YouTube
    participant FileSystem

    User->>Browser: Open_app_url
    Browser->>StreamlitApp: HTTP_GET_app
    StreamlitApp-->>Browser: Render_UI(title_settings_inputs)

    User->>Browser: Enter_youtube_url_or_upload_audio
    User->>Browser: Click_Transcribe_button
    Browser->>StreamlitApp: Submit_form(text_input,audio,with_timestamps,language)

    StreamlitApp->>StreamlitApp: load_model_cached()
    StreamlitApp->>WhisperModel: Initialize_if_not_cached
    WhisperModel-->>StreamlitApp: Model_instance

    StreamlitApp->>FileSystem: Delete_existing_audio_m4a_if_exists

    alt Youtube_URL_provided
        StreamlitApp->>YTDLP: download_yt_audio(url)
        YTDLP->>YouTube: Fetch_video_audio_stream
        YouTube-->>YTDLP: Audio_stream
        YTDLP->>FileSystem: Write_audio_m4a
        YTDLP-->>StreamlitApp: Return_status
    else Uploaded_audio_file
        StreamlitApp->>FileSystem: Write_uploaded_bytes_to_audio_m4a
    end

    StreamlitApp->>FileSystem: Check_audio_m4a_exists
    alt Audio_exists
        StreamlitApp-->>Browser: Display_audio_player
        StreamlitApp->>WhisperModel: transcribe(audio_m4a,language,word_timestamps)
        WhisperModel-->>StreamlitApp: Predictions_dict

        StreamlitApp->>StreamlitApp: postprocess_transcription(predictions,include_timestamps)
        StreamlitApp-->>Browser: Show_transcription_in_expander

        StreamlitApp->>FileSystem: Delete_audio_m4a
    else Audio_missing
        StreamlitApp-->>Browser: Show_error_audio_generation_failed
    end

    User->>Browser: Click_Refresh_App_button
    Browser->>StreamlitApp: Refresh_request
    StreamlitApp->>FileSystem: Delete_audio_m4a_if_exists
    StreamlitApp->>WhisperModel: Reload_model_cached

Class-style diagram for functional components in Whisper app.py

classDiagram
    class AppModule {
        +load_model() torch_module
        +duration_check(info, incomplete) str
        +download_yt_audio(yt_url) None
        +postprocess_transcription(predictions, include_timestamps) str
        +main() None
    }

    class WhisperModelRuntime {
        +transcribe(audio_path, verbose, word_timestamps, language) dict
    }

    class YTDLWrapper {
        +download(yt_url) int
    }

    class StreamlitUI {
        +set_page_config()
        +title()
        +sidebar_settings()
        +text_input()
        +file_uploader()
        +button()
        +toast()
        +spinner()
        +audio()
        +expander()
        +write()
        +error()
        +info()
        +success()
    }

    class FileSystemHelper {
        +write_audio_file(bytes_data)
        +delete_audio_file()
        +exists_audio_file() bool
    }

    AppModule --> WhisperModelRuntime : uses
    AppModule --> YTDLWrapper : uses
    AppModule --> StreamlitUI : uses
    AppModule --> FileSystemHelper : uses

    WhisperModelRuntime <.. AppModule : model_instance
    YTDLWrapper <.. AppModule : youtube_download
    FileSystemHelper <.. AppModule : audio_m4a_management

Flow diagram for main transcription logic in app.py

flowchart TD
    A["Start_app_main"] --> B["Render_title_and_sidebar_settings"]
    B --> C["Get_youtube_url_text_input"]
    C --> D["Get_audio_file_upload"]
    D --> E["User_clicks_Transcribe_button?"]

    E -->|No| F["Show_info_request_input"]
    E -->|Yes| G["Have_youtube_url_or_audio?"]

    G -->|No| F
    G -->|Yes| H["Delete_existing_audio_m4a_if_any"]

    H --> I["Show_running_toast"]
    I --> J{"Youtube_URL_provided?"}

    J -->|Yes| K["Call_download_yt_audio_with_yt_dlp"]
    J -->|No| L["Write_uploaded_audio_bytes_to_audio_m4a"]

    K --> M["Check_audio_m4a_exists"]
    L --> M

    M -->|No| N["Show_error_audio_generation_failed"]
    M -->|Yes| O["Display_audio_player_for_audio_m4a"]

    O --> P["Call_model_transcribe_with_language_and_word_timestamps"]
    P --> Q["Call_postprocess_transcription_with_timestamp_option"]

    Q --> R{"Transcription_non_empty?"}
    R -->|No| S["No_text_to_display"]
    R -->|Yes| T["Store_in_session_state_and_show_in_expander"]

    T --> U["Delete_audio_m4a"]
    S --> U

    U --> V["End_request_wait_for_next_interaction"]

    F --> V

    subgraph Refresh_flow
        W["User_clicks_Refresh_App_button"] --> X["Delete_audio_m4a_if_exists"]
        X --> Y["Reload_model_via_load_model_cache"]
    end

File-Level Changes

Change | Details | Files
Implement Streamlit Whisper transcription app supporting YouTube downloads and file uploads with optional timestamped output.
  • Configure Streamlit page settings and sidebar options for timestamp inclusion and language selection.
  • Add cached Whisper model loader using the base model variant.
  • Implement YouTube audio download helper using yt-dlp with a 10-minute duration filter and ffmpeg-based audio extraction.
  • Implement transcription post-processing to optionally render SRT-style timestamped segments.
  • Wire UI interactions to download or save audio, invoke Whisper transcription, display audio player and results, and handle error/toast messaging and refresh logic.
speech-to-text/app.py
Document the app’s purpose and provide Docker-based deployment instructions (local and Docker Playground/cloud).
  • Describe project objective and public Streamlit deployment URL.
  • Explain directory structure and components (app, packages, docker-compose).
  • Provide step-by-step local Docker and Docker Playground deployment commands.
  • List future improvement ideas such as SRT export and streaming transcription.
speech-to-text/README.md
Add Streamlit configuration and theming for the new app.
  • Define Streamlit server config placeholders and file watcher settings (mostly commented out).
  • Set a dark theme base and font family in config.
speech-to-text/.streamlit/config.toml
Containerize the app and define a docker-compose setup for local development.
  • Create a Python 3.11-slim based Dockerfile that installs ffmpeg and Python dependencies, copies app code, exposes port 8501, and runs Streamlit.
  • Create docker-compose service that builds the image, maps port 8501, mounts the source directory, and assigns an app profile and bridge network.
speech-to-text/Dockerfile
speech-to-text/docker-compose.yml
Declare application Python dependencies and ancillary project files.
  • Add requirements for streamlit, pyperclip, yt-dlp, and openai-whisper (with commented torch/ffmpeg-python).
  • Add empty or placeholder support files like packages.txt and .gitignore for the speech-to-text app.
speech-to-text/requirements.txt
packages.txt
speech-to-text/.gitignore
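As a rough illustration of the 10-minute duration filter described above (the function name `duration_check` is taken from the class diagram; yt-dlp accepts a `match_filter` callable that returns a rejection message to skip a video, or `None` to accept it — this is a sketch, not the PR's actual code):

```python
def duration_check(info, *, incomplete=False):
    """yt-dlp match_filter callable: reject videos longer than 10 minutes.

    Returns a rejection message (string) to skip the download,
    or None to let yt-dlp proceed.
    """
    duration = info.get("duration")
    if duration is not None and duration > 600:
        return "Video is longer than 10 minutes; skipping download."
    return None


# The filter would then be wired into the downloader options, e.g.:
# ydl_opts = {"match_filter": duration_check, "format": "m4a/bestaudio"}
```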

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help


sourcery-ai bot left a comment


Hey - I've found 6 issues and left some high-level feedback:

  • In docker-compose.yml, the command overrides the Dockerfile ENTRYPOINT and tries to execute /app/app.py directly; consider either removing command or invoking streamlit run app.py --server.port 8501 so the container starts correctly.
  • In postprocess_transcription, predictions.get("segments", {}) should default to a list (e.g. []) instead of a dict, and the timestamp formatting via str(0) + str(timedelta(...)) + ",000" is a bit opaque—using an explicit SRT-style formatter will be clearer and less error-prone.
  • There are unused imports/dependencies (e.g. pyperclip, torch and the commented torch requirement) which could be removed to keep the runtime image smaller and the code easier to maintain.
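Putting the first and third compose-related points together, a corrected service definition might look like the sketch below (service and port taken from the PR description; since the Dockerfile's exact ENTRYPOINT is not shown in this thread, this assumes the image has no conflicting ENTRYPOINT):

```yaml
services:
  streamlit:
    build:
      context: .
      dockerfile: Dockerfile
    # Invoke Streamlit explicitly rather than executing /app/app.py directly
    command: streamlit run app.py --server.port 8501
    ports:
      - "8501:8501"
    networks:
      - app   # attach the service to the declared network

networks:
  app:
    driver: bridge
```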
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `docker-compose.yml`, the `command` overrides the Dockerfile `ENTRYPOINT` and tries to execute `/app/app.py` directly; consider either removing `command` or invoking `streamlit run app.py --server.port 8501` so the container starts correctly.
- In `postprocess_transcription`, `predictions.get("segments", {})` should default to a list (e.g. `[]`) instead of a dict, and the timestamp formatting via `str(0) + str(timedelta(...)) + ",000"` is a bit opaque—using an explicit SRT-style formatter will be clearer and less error-prone.
- There are unused imports/dependencies (e.g. `pyperclip`, `torch` and the commented `torch` requirement) which could be removed to keep the runtime image smaller and the code easier to maintain.

## Individual Comments

### Comment 1
<location path="speech-to-text/app.py" line_range="71" />
<code_context>
+    if not include_timestamps:
+        return predictions.get("text")
+    result = []
+    for segment in predictions.get("segments", {}):
+        startTime = str(0) + str(timedelta(seconds=int(segment["start"]))) + ",000"
+        endTime = str(0) + str(timedelta(seconds=int(segment["end"]))) + ",000"
</code_context>
<issue_to_address>
**issue (bug_risk):** Use a list as the default for `segments` to avoid iterating over a dict's keys.

If `segments` is missing, `predictions.get("segments", {})` returns a dict, so the loop iterates over its keys (strings) and `segment["start"]` will raise at runtime. Using an empty list as the default keeps the type consistent with Whisper’s output and avoids this failure:

```python
for segment in predictions.get("segments", []):
    ...
```
</issue_to_address>
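For reference, an explicit SRT-style formatter (a hypothetical helper, not code from this PR) could replace the `str(0) + str(timedelta(...)) + ",000"` concatenation and preserve millisecond precision:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a duration in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)   # ms per hour
    minutes, rem = divmod(rem, 60_000)         # ms per minute
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"
```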

### Comment 2
<location path="speech-to-text/docker-compose.yml" line_range="8-9" />
<code_context>
+    build:
+      context: .
+      dockerfile: Dockerfile
+    command: >
+      /app/app.py --server.port 8501
+    ports:
+      - "8501:8501"
</code_context>
<issue_to_address>
**issue (bug_risk):** The custom `command` conflicts with the Dockerfile ENTRYPOINT and likely breaks `streamlit` startup.

In the Dockerfile, `ENTRYPOINT` is `[
</issue_to_address>

### Comment 3
<location path="speech-to-text/docker-compose.yml" line_range="17-19" />
<code_context>
+    profiles:
+      - app
+
+networks:
+  app:
+    driver: bridge
\ No newline at end of file
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The declared `app` network is not attached to the `streamlit` service.

As written, `streamlit` will still use the default network because it never references `app`. If this network is needed, attach it explicitly under the `streamlit` service:

```yaml
  streamlit:
    ...
    networks:
      - app

networks:
  app:
    driver: bridge
```

If you don’t need a custom network, you can remove the `networks` section entirely.
</issue_to_address>

### Comment 4
<location path="speech-to-text/README.md" line_range="17" />
<code_context>
+
+## Code Structure / Services
+- `app` - Complete application code built in streamlit.
+- `packages` - List of linux dependencies required to deploy code on streamlit cloud.
+- `docker-compose` - Compose file which starts application.
+
</code_context>
<issue_to_address>
**suggestion (typo):** Capitalize proper nouns like "Linux" and brand names consistently.

For example, you could rephrase this bullet as: "List of Linux dependencies required to deploy the code on Streamlit Cloud."

Suggested implementation:

```
Application is deployed on Streamlit Cloud [here](https://transcribe-whisper.streamlit.app/).

```

```
- `app` - Complete application code built in Streamlit.
- `packages` - List of Linux dependencies required to deploy the code on Streamlit Cloud.
- `docker-compose` - Compose file which starts the application.

```
</issue_to_address>

### Comment 5
<location path="speech-to-text/README.md" line_range="24" />
<code_context>
+## Deployment
+- Local deployment
+    - Install Docker. Instructions available [here](https://docs.docker.com/engine/install/). Make sure docker is up and running before proceeding.
+    - Install Git. Instruction [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
+    - Clone repo and run compose
+    ```
</code_context>
<issue_to_address>
**nitpick (typo):** Use plural "Instructions" to match the linked content.

Change the text to: "Install Git. Instructions [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)."

```suggestion
    - Install Git. Instructions [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
```
</issue_to_address>

### Comment 6
<location path="speech-to-text/README.md" line_range="31" />
<code_context>
+    git switch whisper && cd ./speech-to-text
+    docker compose --profile app up
+    ```
+    - `--profile app` will start on `localhost:8501` and `localhost:8501` ports.
+
+- Docker Playground Cloud Deployment
</code_context>
<issue_to_address>
**question (typo):** Duplicated port number and plural "ports" may be confusing.

This bullet lists `localhost:8501` twice but refers to "ports". If only one port is exposed, list it once and use "port". If multiple ports are intended, update the second value to the correct port number.

```suggestion
    - `--profile app` will start on `localhost:8501` port.
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

uditmanav17 and others added 2 commits April 23, 2026 23:19
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant