@coleleavitt

Summary

Add support for NVIDIA's Canary speech-to-text models via NeMo toolkit.

Models Added

| Model | ID | WER | Speed | Notes |
|---|---|---|---|---|
| Canary 1B v2 | multilang_canary_1b_v2 | 4.89% | 630x RTF | Default, 5x faster than Whisper |
| Canary Qwen 2.5B | multilang_canary_qwen | Better | Slower | Higher accuracy variant |

Features

  • GPU acceleration (CUDA/ROCm) via NeMo
  • Automatic model download from HuggingFace
  • Translation support (s2t_translation task)
  • Punctuation restoration
  • Follows existing fasterwhisper_engine patterns

Files Changed

  • New: src/canary_engine.hpp, src/canary_engine.cpp
  • Modified: models_manager.h/cpp, speech_service.cpp, CMakeLists.txt, config/models.json

Requirements

pip install "nemo_toolkit[asr]"

Why Canary?

Per the Open ASR Leaderboard:

  • Canary 1B v2 achieves 4.89% WER (better than Whisper Large V3's 4.91%)
  • 5x faster inference than Whisper (630x vs 126x real-time factor, i.e. seconds of audio processed per second of compute)
  • Native NVIDIA optimization for modern GPUs
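
The 5x figure follows directly from the two throughput numbers quoted above. A minimal sketch of the arithmetic, assuming "630x" and "126x" mean seconds of audio processed per second of wall-clock time. The function and constant names are illustrative only and are not part of dsnote or NeMo:

```python
def speedup(rtfx_a: float, rtfx_b: float) -> float:
    """How many times faster model A is than model B, given throughput in x real time."""
    return rtfx_a / rtfx_b


def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed to transcribe `audio_seconds` of audio."""
    return audio_seconds / rtfx


CANARY_RTFX = 630.0   # figure quoted in this PR
WHISPER_RTFX = 126.0  # figure quoted in this PR

if __name__ == "__main__":
    print(f"speedup: {speedup(CANARY_RTFX, WHISPER_RTFX):.1f}x")
    hour = 3600.0
    print(f"1 h of audio: {processing_seconds(hour, CANARY_RTFX):.1f} s vs "
          f"{processing_seconds(hour, WHISPER_RTFX):.1f} s")
```

At these rates, one hour of audio would take roughly 6 seconds with Canary and roughly 29 seconds with Whisper, if the quoted throughput holds on the target hardware.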

Testing

  • Build tested on Linux with Qt dev tools
  • Runtime tested with NeMo toolkit installed
  • GPU acceleration verified

Add support for NVIDIA's Canary speech-to-text models via NeMo toolkit:

- Canary 1B v2: 4.89% WER, 630x RTF (5x faster than Whisper)
- Canary Qwen 2.5B: Higher accuracy variant for demanding use cases

Both models use NeMo's EncDecMultiTaskModel architecture with automatic
model download via HuggingFace. Supports GPU acceleration (CUDA/ROCm),
translation (s2t_translation), and punctuation restoration.

New files:
- src/canary_engine.hpp: Engine class definition
- src/canary_engine.cpp: NeMo Python integration via py_executor

Modified:
- models_manager.h/cpp: Add stt_canary engine type and feature flags
- speech_service.cpp: Engine instantiation and type checking
- CMakeLists.txt: Add canary_engine source files
- config/models.json: Add both Canary model entries

Requires: pip install "nemo_toolkit[asr]"

Check for nemo.collections.asr module availability at startup.
This enables dsnote to automatically detect if NeMo is installed
and show/hide Canary models accordingly in the UI.
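
The availability check described here can be sketched in Python as follows. This is only an illustration under stated assumptions. It uses `importlib.util.find_spec` rather than a real import, which may differ from what py_tools.cpp does. `module_available`, `ENGINE_REQUIREMENTS`, and `available_engines` are hypothetical names, not dsnote APIs:

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located (parent packages are imported, the leaf is not)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError subclass).
        return False


# Hypothetical mapping from an engine type to the Python module it depends on.
ENGINE_REQUIREMENTS = {"stt_canary": "nemo.collections.asr"}


def available_engines(requirements: dict[str, str]) -> set[str]:
    """Engines whose required Python module is present; the rest would be hidden."""
    return {engine for engine, module in requirements.items() if module_available(module)}


if __name__ == "__main__":
    print(available_engines(ENGINE_REQUIREMENTS))
```

Locating the module without importing the leaf package keeps the startup check cheap, at the cost of not catching a NeMo install that is present but broken.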

- py_tools.hpp: Add nemo_asr to libs_availability_t
- py_tools.cpp: Add nemo.collections.asr import check
- speech_service.cpp: Map nemo_asr availability to stt_canary

- Update CMakeLists.txt to use Qt6 instead of Qt5
- Update cmake/*.cmake files for Qt6 compatibility
- Replace deprecated Qt5 APIs with Qt6 equivalents:
  - QRegExp -> QRegularExpression
  - QX11Info -> QNativeInterface::QX11Application
  - QMediaPlayer::State -> QMediaPlayer::PlaybackState
  - QMediaPlayer::stateChanged -> playbackStateChanged
  - setMedia(QMediaContent) -> setSource(QUrl)
  - QAudioInput (recording) -> QAudioSource
  - QAudioDeviceInfo -> QAudioDevice + QMediaDevices
  - QAudioFormat::setSampleSize/setCodec -> setSampleFormat
  - QNetworkRequest::FollowRedirectsAttribute -> RedirectPolicyAttribute
  - Remove Qt::AA_EnableHighDpiScaling (default in Qt6)
  - Remove QTextCodec usage
  - Remove QQuickStyle::availableStyles() (not in Qt6)
- Fix GCC 15 type strictness (std::clamp/max int vs qsizetype)
- Update qhotkey external project to build with Qt6