WFL-ASR is a configurable deep learning model designed for automatic phoneme segmentation using frame-level BIO tagging. It supports both Whisper and WavLM as audio encoders, and is structured for flexible and efficient training on phoneme-aligned datasets.
This model performs frame-level phoneme labeling using the BIO tag format (B-, I-, O).
- `.lab` files define phoneme segments using the HTK format.
- Each segment is converted into BIO tags aligned to time frames based on `frame_duration` (hardcoded to 20 ms for Whisper compatibility).
- Tags are stored along with the audio path in a training JSON.
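The segment-to-tag conversion above can be sketched as follows. This is a minimal, illustrative version, not the project's actual preprocessing code; the function name and the `(start, end, phoneme)` tuple format are assumptions.

```python
FRAME_DURATION = 0.02  # seconds; matches Whisper's fixed 20 ms frame stride

def lab_to_bio(segments, num_frames):
    """segments: list of (start_sec, end_sec, phoneme). Returns one BIO tag per frame."""
    tags = ["O"] * num_frames  # frames not covered by any segment stay "O"
    for start, end, phoneme in segments:
        first = int(start / FRAME_DURATION)
        last = min(int(end / FRAME_DURATION), num_frames)
        for i in range(first, last):
            # first frame of a segment gets "B-", the rest get "I-"
            tags[i] = ("B-" if i == first else "I-") + phoneme
    return tags

# Example: two segments over a 0.1 s clip (5 frames at 20 ms)
print(lab_to_bio([(0.0, 0.04, "ah"), (0.04, 0.1, "k")], 5))
# → ['B-ah', 'I-ah', 'B-k', 'I-k', 'I-k']
```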
- Whisper or WavLM encoders process the audio waveform into frame-wise feature vectors.
- Whisper uses a fixed 20 ms frame stride.
- WavLM offers flexible windowing via HuBERT-style encoding.
The encoded features go through a stack of optional, configurable layers:
- BiLSTM: sequential modeling (optional)
- Conformer blocks: long- and short-term feature modeling
- Dilated conv stack: local context enhancement (optional)
- A linear layer maps each time step to a BIO tag.
- Predict BIO tags from audio.
- Optional smoothing (median filtering) and merging for better boundary clarity.
- Convert tags back to `.lab` segments.
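The post-processing steps above can be sketched like this. It is a hedged illustration, not the repository's actual code: a majority-vote filter stands in for median filtering (a true median is not defined on string labels), and the function names are assumptions.

```python
from collections import Counter

FRAME_DURATION = 0.02  # seconds; assumed 20 ms frame stride

def smooth_tags(tags, window=3):
    """Majority-vote smoothing over a sliding window (a stand-in for median filtering)."""
    half = window // 2
    return [Counter(tags[max(0, i - half):i + half + 1]).most_common(1)[0][0]
            for i in range(len(tags))]

def tags_to_segments(tags):
    """Collapse frame-level BIO tags back into (start_sec, end_sec, phoneme) segments."""
    segments = []  # each entry: [start_frame, end_frame, phoneme]
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        phoneme = tag[2:]  # drop the "B-"/"I-" prefix
        if (tag.startswith("I-") and segments
                and segments[-1][2] == phoneme and segments[-1][1] == i):
            segments[-1][1] = i + 1  # extend the running segment
        else:
            segments.append([i, i + 1, phoneme])  # "B-" or a gap: start a new segment
    return [(s * FRAME_DURATION, e * FRAME_DURATION, p) for s, e, p in segments]

print(tags_to_segments(["B-ah", "I-ah", "O", "B-k", "I-k"]))
```

Writing the resulting segments out as HTK-style `.lab` lines is then a matter of printing start, end, and phoneme per line (HTK conventionally uses 100 ns time units).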
- Whisper/WavLM encoder support
- Frame-level BIO tag training
- Configurable architecture (BiLSTM, Conformer, Conv)
- HTK-compatible `.lab` output format
- Optional waveform augmentation via the `augmentation` config section
The `config.yaml` file includes an optional `augmentation` section used during training. When enabled, it randomly applies volume scaling and Gaussian noise:
```yaml
augmentation:
  enable: true
  noise_std: 0.005          # standard deviation of Gaussian noise
  prob: 0.5                 # probability of augmenting a sample
  volume_range: [0.9, 1.1]  # random scaling of the audio volume
```

Disable augmentation by setting `enable: false`.
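As a rough sketch of what these settings do (illustrative only; the project applies augmentation internally, and the function below is not its API), the waveform transform could look like:

```python
import random

def augment(waveform, noise_std=0.005, prob=0.5, volume_range=(0.9, 1.1)):
    """waveform: list of float samples. Randomly scales volume and adds Gaussian noise."""
    if random.random() >= prob:
        return waveform  # this sample is left unaugmented
    gain = random.uniform(*volume_range)  # random volume scaling
    return [s * gain + random.gauss(0.0, noise_std) for s in waveform]
```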
Phonemes can be merged across languages by defining `merged_phoneme_groups` in `config.yaml`. Each group starts with a merge label such as `merged_1` (the name can be anything), followed by language-specific phonemes:
```yaml
training:
  # define groups of phonemes that share the same sound (like-phonemes)
  # across the dataset's labeling systems
  merged_phoneme_groups:
    - ["merged_1", "en/ah", "ja/a"]
    - ["merged_2", "en/ih", "ja/i"]
    - ["custom_var", "en/AP", "ja/AP"]
    - ["CustomVar", "en/SP", "ja/SP"]
```

During preprocessing these phonemes are replaced with the merged label. For TensorBoard visualisation and inference, the labels are mapped back to the original phoneme for the sample's language.
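The merge and unmerge lookups implied above can be sketched with two small dictionaries. Variable names are illustrative, not the project's internals; this assumes phonemes are written as `language/phoneme` as in the config example.

```python
merged_phoneme_groups = [
    ["merged_1", "en/ah", "ja/a"],
    ["merged_2", "en/ih", "ja/i"],
]

# preprocessing: language-specific phoneme -> merged label
to_merged = {p: group[0] for group in merged_phoneme_groups for p in group[1:]}

# inference/visualisation: (merged label, language) -> original phoneme
from_merged = {(group[0], p.split("/")[0]): p
               for group in merged_phoneme_groups for p in group[1:]}

print(to_merged["ja/a"])                # → merged_1
print(from_merged[("merged_1", "en")])  # → en/ah
```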