This repository contains the official implementation of our CHiME-9 MCoRec AV-TS-ASR system, based on AV-HuBERT and NVIDIA Parakeet 0.6B v2.
- Create a new conda environment: `conda create -n av_parakeet python=3.11 -y` and activate it using `conda activate av_parakeet`.
- Install ffmpeg: `conda install -c conda-forge "ffmpeg<8" -y`.
- Install the Python dependencies: `pip install -r requirements.txt`.
- Download the AV-HuBERT model finetuned on MCoRec and unzip it:
  `wget https://huggingface.co/MCoRecChallenge/MCoRec-baseline/resolve/main/model-bin.zip; unzip model-bin.zip`
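As a quick sanity check of the environment (a sketch; it assumes `requirements.txt` installs PyTorch and that the archive unzips to a `model-bin/` directory, which may differ in your setup):

```bash
conda activate av_parakeet
ffmpeg -version | head -n 1                         # should report an ffmpeg < 8 build
python -c "import torch; print(torch.__version__)"  # assumes PyTorch came with requirements.txt
ls model-bin                                        # directory name assumed from the archive
```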
The following data setup is only required if you want to train our models. If you only need inference, you can skip this section and continue to the Inference section below.

Our training codebase uses Lhotse manifests. For inference, you can run our model on a single video file, a directory of video files, or the MCoRec data.
To prepare the filled-in speaker tracks and Lhotse manifests, run:

```bash
./scripts/data_prep/prepare_mcorec.sh {path_to_mcorec_dataset}
```

The path should point to a directory with `train` and `dev` subdirectories (i.e., the MCoRec dataset root).
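For example, with the dataset extracted to a hypothetical `/data/mcorec` root:

```bash
ls /data/mcorec   # should list: train  dev
./scripts/data_prep/prepare_mcorec.sh /data/mcorec
```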
To prepare datasets like LRS2, LRS3, AVYT, etc., use `scripts/data_prep/create_lrs_lhotse_manifests.py`. This script expects a single input directory containing subdirectories, one per data part. Each subdirectory (data part) must contain `{fname}.video`, `{fname}.label`, and `{fname}.sample_id` files:

- The `.video` extension does not denote a video format; it is just a convenience that supports multiple container formats by default. LRS2 uses this convention, so we adopted it.
- The `.label` file contains a single line with the transcript.
- The `.sample_id` file contains a single line with the id of the sample, which should be unique across all files in the particular subset.
Here is an example of such a file structure:

```
LRS2/
├── train/
│   ├── 0000000001.label
│   ├── 0000000001.sample_id
│   ├── 0000000001.video
│   └── ...
├── valid/
│   ├── 0000000002.label
│   ├── 0000000002.sample_id
│   ├── 0000000002.video
│   └── ...
└── test/
    ├── 0000000003.label
    ├── 0000000003.sample_id
    ├── 0000000003.video
    └── ...
```
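As a minimal illustration, the snippet below creates a single toy sample in this layout (the file names and contents are hypothetical; `clip.mp4` stands for any ffmpeg-readable recording):

```bash
mkdir -p LRS2/train
cp clip.mp4 LRS2/train/0000000001.video                    # the .video extension hides the real container format
echo "HELLO WORLD" > LRS2/train/0000000001.label           # single-line transcript
echo "train_0000000001" > LRS2/train/0000000001.sample_id  # id unique within the subset
```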
To prepare the Lhotse manifests, run:

```bash
# Larger number of workers = faster processing.
python scripts/data_prep/create_lrs_lhotse_manifests.py \
    --data_dir {path_to_data_root} \
    --output_manifest_dir ./manifests \
    --num_workers 4
```

We currently support two inference modes: MCoRec (CHiME-9) and standard per-video inference.
- (Optional) Download the MCoRec data from HuggingFace.
- Make sure you have access to `BUT-FIT/AV-Parakeet_v0.1`.
- To run inference on the MCoRec data, use:

  ```bash
  python infer_mcorec.py \
      +session_dir={path_to_mcorec_data}/dev/ \
      +output_dir=predictions \
      +timestamps=true \
      +mode=full
  ```
- To run inference on an arbitrary video, or on a directory full of videos, use:

  ```bash
  python infer.py --input {path_to_dir}/{video}.mp4 --output-dir output_transcripts
  ```

  or

  ```bash
  python infer.py --input "{path_to_dir}" --output-dir output_transcripts
  ```
The output of `infer_mcorec.py` is in the CHiME-9 MCoRec task format.
The output of `infer.py` is a directory with a single CTM file per video (`{output_directory}/{video_name}.ctm`).
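For reference, CTM is a plain-text format with one word per line: `<recording_id> <channel> <start_time> <duration> <word>`. An illustrative (made-up) excerpt:

```
my_video 1 0.48 0.22 hello
my_video 1 0.74 0.35 world
```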
The training is built on top of the NVIDIA NeMo toolkit. We recommend getting familiar with its basics, although this is not strictly required.
We use WandB for logging by default; make sure you are logged in locally, or switch to TensorBoard by setting `create_tensorboard_logger: true` and `create_wandb_logger: false` in `conf/av_parakeet.yaml`.
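As an illustration, the two logger flags would sit in the experiment-manager section of `conf/av_parakeet.yaml` roughly like this (a sketch assuming NeMo's usual `exp_manager` layout; the actual nesting in this repo may differ):

```yaml
exp_manager:
  create_tensorboard_logger: true   # log to TensorBoard instead of WandB
  create_wandb_logger: false
```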
If you have changed any paths, open `conf/av_parakeet.yaml` and update the corresponding values. Otherwise, you can leave it unchanged.
To run the training with the default settings, run:

```bash
python train.py +exp_dir="exps/"
```

It will automatically create an `./exps/av_parakeet` directory with checkpoints.
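Since the config is Hydra-based, you can also tweak values straight from the command line instead of editing the YAML. The keys below are hypothetical examples following the usual NeMo config layout, so check `conf/av_parakeet.yaml` for the real names:

```bash
# Hypothetical override keys; verify them against conf/av_parakeet.yaml.
python train.py +exp_dir="exps/" trainer.max_epochs=50 model.optim.lr=1e-4
```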
If you use our models or code, please cite the following work:

```bibtex
@misc{klement2026descriptionchime9mcorecchallenge,
      title={BUT System Description for CHiME-9 MCoRec Challenge},
      author={Dominik Klement and Alexander Polok and Nguyen Hai Phong and Prachi Singh and Lukáš Burget},
      year={2026},
      eprint={2604.27436},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2604.27436},
}
```
Contributions are welcome. If you’d like to improve the code, add new features, or extend the training pipeline, please open an issue or submit a pull request.
For questions or collaboration, please contact: iklement@fit.vut.cz