This repository was archived by the owner on May 19, 2025. It is now read-only.

tensorsense/gemamba


Gemamba

This repository contains training code for the Gemamba multimodal language model.

Gemamba is the first multimodal LLM to combine a Mamba-based video encoder with the performant and flexible Gemma transformer LLM in a LLaVA-style architecture.

Getting started

We recommend using Dev Containers to create the environment from the pre-made configuration.

  1. Install PyTorch.

  2. Install Python dependencies.

pip3 install -r requirements.txt

Install VideoMamba dependencies:

pip3 install -e llava/model/multimodal_encoder/videomamba/causal-conv1d
pip3 install -e llava/model/multimodal_encoder/videomamba/mamba

[optional] Update transformers to get Phi-3 support:

pip3 install git+https://github.com/huggingface/transformers
  3. Download pretrained weights for VideoMamba:

wget https://huggingface.co/OpenGVLab/VideoMamba/resolve/main/videomamba_m16_25M_f8_res224.pth

  4. Refer to run_finetune.ipynb to learn how to load a checkpoint and run inference.
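The VideoMamba checkpoint name encodes its input format: f8 means the encoder consumes 8 frames, and res224 means each frame is 224×224. As a minimal sketch of the preprocessing such an encoder expects, here is a uniform frame sampler in plain Python; the function name and exact midpoint-sampling strategy are illustrative, not taken from the Gemamba code.

```python
def sample_frame_indices(num_video_frames: int, num_samples: int = 8) -> list[int]:
    """Uniformly sample `num_samples` frame indices across a video.

    Picks the midpoint of each of `num_samples` equal segments, so the
    sampled frames cover the whole clip without clustering at the start.
    """
    if num_video_frames < num_samples:
        raise ValueError("video has fewer frames than requested samples")
    step = num_video_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

# A 32-frame clip reduced to the 8 frames an f8 encoder takes:
print(sample_frame_indices(32))  # → [2, 6, 10, 14, 18, 22, 26, 30]
```

Each sampled frame would then be resized to 224×224 before being stacked into the encoder's input tensor.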

Pretrained checkpoints

A pretrained checkpoint for the model is available on Hugging Face (HF 🤗).

  • The model's projector has been pretrained for 1 epoch on the Valley dataset.
  • The LLM and the projector have been jointly fine-tuned on the Video-ChatGPT dataset.
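In a LLaVA-style architecture, the projector mentioned above is typically a small learned map from the video encoder's feature space into the LLM's token-embedding space. The sketch below shows the idea as a plain linear layer in pure Python; the dimensions and the `project` helper are illustrative assumptions, not the actual Gemamba projector (which may use an MLP).

```python
def project(features, weight, bias):
    """Apply y = W x + b to each visual-token feature vector (pure-Python matmul)."""
    out = []
    for x in features:  # one feature vector per visual token from the encoder
        row = [sum(w_ij * x_j for w_ij, x_j in zip(w_row, x)) + b_i
               for w_row, b_i in zip(weight, bias)]
        out.append(row)
    return out

# Toy example: 2 visual tokens with 3-dim encoder features -> 4-dim LLM embeddings
feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]  # 4x3 projection matrix
b = [0.0, 0.0, 0.0, 0.0]
print(project(feats, W, b))  # → [[1.0, 0.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0]]
```

Pretraining the projector alone (stage 1) aligns encoder features with the frozen LLM; the joint fine-tune (stage 2) then adapts both together.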

Training

We inherit most of the training workflow from the original LLaVA. Refer to scripts/train for the configurations used to train the model, and to scripts/eval for the scripts used to compute benchmark scores.
