…nd save BLEU scores with timestamps. Add new entries to .gitignore for output files.
Scratchpad Reasoning + Benchmark
…ctionality for clearing GPU memory and saving BLEU scores has been integrated into the main workflow.
Update notebook: Clear GPU RAM and save image of results
This commit introduces a comprehensive set of files and scripts for fine-tuning the Cobra VLM on the LLaVA-CoT-100k dataset. Key additions include:

- **Dataset Preparation**: A script to download, validate, and prepare the dataset for training.
- **Custom Dataset Loader**: A new loader that supports JSONL format and integrates with existing training infrastructure.
- **Fine-Tuning Script**: A dedicated script for fine-tuning the model using the prepared dataset.
- **Documentation**: Detailed guides and summaries for setup and usage.

These changes enhance the model's reasoning capabilities by leveraging structured reasoning annotations in the dataset.
Merge branch 'main' of https://github.com/AndrasFerenczy/thinking_cobra
This commit introduces the foundational structure for the Cobra Evaluation System, including:

- **Main Module**: The entry point for running evaluations with command-line argument parsing.
- **Configuration Management**: A dedicated module for handling CLI arguments and settings.
- **Registry System**: A registry for managing generators and metrics, allowing for extensibility.
- **Generators**: Implementations for baseline, scratchpad, and external generation methods.
- **Metrics**: Initial implementations for BLEU and BERTScore metrics.
- **Utilities**: Functions for GPU management, JSON I/O, and visualization of results.

These changes establish a modular framework for evaluating visual language models with various inference strategies and metrics, enhancing the system's extensibility and usability.
This commit introduces significant improvements to the evaluation process, including:

- **Method Comparison**: Added functionality to run and compare results from multiple methods (baseline and scratchpad) within the same evaluation session.
- **Visualization Enhancements**: Implemented a new comparison visualization that displays results side-by-side for easier analysis of generated captions and metrics.
- **BERTScore Metric Updates**: Enhanced the BERTScore metric to store per-sample scores, allowing for detailed performance analysis.
- **Code Refactoring**: Cleaned up the main evaluation logic for better readability and maintainability.

These changes improve the usability and analytical capabilities of the Cobra Evaluation System, facilitating more comprehensive evaluations of visual language models.
This commit introduces several improvements to the evaluation process, including:

- **Dynamic Output Directories**: Results are now saved in timestamped directories for better organization, allowing users to easily manage multiple runs.
- **Comparison Statistics**: Added functionality to compute and save comparison statistics between baseline and scratchpad methods, including win rates and metric differences.
- **Visualization Updates**: Enhanced the comparison visualization to include detailed metrics and reasoning traces, improving the clarity of results.

These changes enhance the usability and analytical capabilities of the Cobra Evaluation System, facilitating more comprehensive evaluations of visual language models.
This commit modifies the .gitignore file to ensure that shell scripts are ignored and removes the output.png file, which is no longer needed. These changes help streamline the project by keeping unnecessary files out of version control.
This commit modifies the .gitignore file to include the __pycache__ directory, ensuring that Python bytecode files are not tracked in version control. This helps maintain a cleaner project structure by excluding unnecessary files.
This commit introduces several new files and enhancements to the Cobra Evaluation System, including:

- **New Scripts**: Added `analyze_significance.py`, `compare_scratchpad_passes.py`, and `visualize_scratchpad_passes.py` for analyzing and visualizing scratchpad performance across multiple passes.
- **Checkpointing Guide**: Introduced `CHECKPOINTING_GUIDE.md` to document the new automatic checkpointing feature for long-running evaluations.
- **Improved Documentation**: Added `SCRATCHPAD_COMPARE_MODE.md`, `SCRATCHPAD_DEGRADATION_ANALYSIS.md`, and `SCRATCHPAD_IMPROVEMENTS.md` to provide insights into scratchpad methods and their performance.
- **New Data Files**: Included various JSON and PNG files for results and visualizations from recent evaluations.

These changes enhance the analytical capabilities and usability of the evaluation system, facilitating better understanding and comparison of different methods in visual language models.
This commit introduces support for external model API clients, allowing users to run evaluations using models such as GPT-5, Gemini, Claude, and Llama. Key changes include:

- **New Inference Methods**: Added options for external models in the evaluation workflow.
- **API Key Management**: Introduced command-line arguments for specifying API keys and model configurations.
- **Conditional Model Loading**: Updated the main evaluation logic to skip local model loading when using external models.
- **Checkpointing Improvements**: Enhanced checkpointing functionality to support overwriting the latest checkpoint file.

These updates significantly expand the evaluation options and flexibility of the Cobra Evaluation System, facilitating integration with various external AI models.
…ith 10,100,1000,10000 examples
…ing scripts as well as my install requirements script.
…r_it MMStar and accuracy eval added (& benchmarked for 1000 images on both COCO and MMStar)
… into feature/qlora-finetune
# Add LoRA fine-tuning pipeline for Cobra VLM (`qlora_finetune`)

## Summary
This PR introduces a self-contained LoRA fine-tuning pipeline for Cobra VLM, including training, inference, and reproducible environment setup. It enables quick fine-tuning of Cobra (`cobra+3b`, etc.) on LLaVA-CoT-100k (or local JSONL data) and saves LoRA adapters under timestamped `qlora_outputs_*` folders for downstream use.

Note: many files are still labeled `qlora` because the original plan was to use QLoRA, which adds quantization on top of LoRA. Because we are using a Mamba-based model this was not feasible, so we went with plain LoRA.
## What’s Included
### 1. Environment & Dependencies
`qlora_finetune/requirements.txt` pins the full dependency stack:

- `torch==2.1.0`, `torchvision==0.16.0`, `torchaudio==2.1.0`, `triton==2.1.0`
- `transformers==4.34.1`, `tokenizers>=0.14,<0.15`, `accelerate==0.26.1`
- `peft==0.7.1`, `bitsandbytes>=0.41.0,<0.43.0`, `trl==0.7.4`, `datasets>=2.14.0,<2.18.0`
- `einops`, `timm==0.9.10`, `wandb`, `jsonlines`, `rich`, `tqdm`, etc.

The pins avoid known failure modes: `huggingface_hub` vs `datasets` conflicts, `transformers` vs `trl` API breakages, and `torch` vs `bitsandbytes` binary mismatches.

`qlora_finetune/install_requirements.sh`:

- Creates a `./env` venv (Python 3.10).
- Upgrades `pip`, `setuptools`, `wheel`, `packaging`.
- Installs `mamba-ssm<2.0.0` with `--no-build-isolation` so it can see the already-installed `torch`.
### 2. Config & Dataset Handling

`qlora_finetune/config.py` defines a `QLoRAConfig` dataclass with:

- `model_id`, `pretrained_checkpoint`, `hf_token`
- `dataset_name`, `dataset_root`, `dataset_proportion`, `dataset_max_samples`, `dataset_seed`
- `output_dir`, `per_device_train_batch_size`, `gradient_accumulation_steps`, `learning_rate`, `num_train_epochs`, `max_steps`, etc.
- `fp16`/`bf16`, `logging_steps`, `save_steps`, `eval_steps`, `save_total_limit`
- `report_to` (default `["wandb"]`), `wandb_project`, `wandb_entity`

Its `__post_init__` enforces sane defaults and value ranges, and resolves `.hf_token` files into actual tokens.

`qlora_finetune/dataset_loader.py`:

- `load_llava_cot_dataset(...)`:
  - Loads the HF dataset (`Xkev/LLaVA-CoT-100k`) via `load_dataset(dataset_name, split=...)` (no `trust_remote_code`, to stay compatible with `datasets 2.14.x`), or falls back to a local `dataset_root / "train.jsonl"`.
  - Supports sampling via `dataset_max_samples` (absolute cap, takes precedence) or `dataset_proportion` (fraction of the dataset), seeded by `dataset_seed`.
- `format_for_sft(...)`: flattens `conversations` into a single `"text"` field with `USER:`/`ASSISTANT:` prefixes and injects an `<image>` token for the first user turn.
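The sampling rules and the `format_for_sft(...)` flattening described above can be sketched roughly as follows (an illustrative sketch, not the actual `dataset_loader.py`; the turn keys follow the usual LLaVA `from`/`value` convention, which is an assumption here):

```python
import random

def sample_dataset(examples, max_samples=None, proportion=None, seed=0):
    """Sketch of the sampling rules: max_samples is an absolute cap and
    takes precedence over proportion (a fraction of the dataset)."""
    rng = random.Random(seed)  # seeded for reproducibility (dataset_seed)
    examples = list(examples)
    rng.shuffle(examples)
    if max_samples is not None:
        return examples[:max_samples]
    if proportion is not None:
        return examples[: int(len(examples) * proportion)]
    return examples

def format_for_sft(example):
    """Sketch: flatten `conversations` into a single "text" field with
    USER:/ASSISTANT: prefixes, injecting <image> on the first user turn."""
    parts, first_user = [], True
    for turn in example["conversations"]:
        if turn["from"] == "human":
            content = turn["value"]
            if first_user:
                content = "<image>\n" + content
                first_user = False
            parts.append(f"USER: {content}")
        else:
            parts.append(f"ASSISTANT: {turn['value']}")
    return {"text": "\n".join(parts)}
```
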
### 3. Model Loading & QLoRA Preparation

`qlora_finetune/model_loader.py`:

- `load_cobra_for_qlora(model_id, pretrained_checkpoint, hf_token, freeze_vision_encoder)`:
  - Loads Cobra either from the HF Hub (`model_id`) or from a local checkpoint (`pretrained_checkpoint`), using the core `cobra.models.load.load`.
  - Disables `fused_add_norm` and swaps RMSNorm for a safe LayerNorm to avoid Triton issues.
- `prepare_model_for_qlora(model, target_modules, lora_r, lora_alpha, lora_dropout, lora_bias)`: attaches LoRA adapters to the Mamba projection layers (`in_proj`, `out_proj`, `x_proj`, `dt_proj`, etc.).
- `load_and_prepare_model(...)`: convenience wrapper returning `(vlm, llm_backbone_with_lora)` ready for training.
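The LoRA attachment step looks roughly like this with PEFT (a sketch only; the `r`/`alpha`/`dropout` defaults shown are placeholders rather than the values in `model_loader.py`, and the `peft` import is deferred so the snippet stands alone):

```python
def prepare_model_for_qlora(model,
                            target_modules=("in_proj", "out_proj", "x_proj", "dt_proj"),
                            lora_r=16, lora_alpha=32, lora_dropout=0.05,
                            lora_bias="none"):
    """Sketch: wrap the Mamba backbone's projection layers with LoRA adapters."""
    # Deferred import: only needed when the function is actually called.
    from peft import LoraConfig, get_peft_model

    config = LoraConfig(
        r=lora_r,                           # LoRA rank
        lora_alpha=lora_alpha,              # scaling factor
        lora_dropout=lora_dropout,
        bias=lora_bias,
        target_modules=list(target_modules),  # Mamba projections, per the PR
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, config)
```
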
### 4. Training Script

`qlora_finetune/train_qlora.py` is the main training entrypoint, with:
- Path setup so the `qlora_finetune` + `cobra` modules work when run from different roots.
- `main(config: QLoRAConfig)`:
  - Loads the model and LoRA adapters via `load_and_prepare_model`.
  - Uses `vlm.llm_backbone.tokenizer` if available, or a Mamba tokenizer (`xiuyul/mamba-2.8b-zephyr` / `state-spaces/mamba-2.8b`) as a fallback.
  - Loads data via `load_llava_cot_dataset`, applies sampling (`dataset_max_samples` / `dataset_proportion`), and maps to the `"text"` format.
  - Builds `TrainingArguments` with:
    - `output_dir`, LR, epochs, warmup, weight decay, grad norm, fp16/bf16, logging/save cadence, seed, dataloader workers, `remove_unused_columns`, `report_to`, `run_name`.
    - Optional `max_steps`, `eval_steps`.
  - Before building `TrainingArguments`, checks WandB availability:
    - If `WANDB_API_KEY` is not set and no stored login is found, automatically strips `"wandb"` from `report_to` and falls back to `["none"]`, printing a warning.
    - Otherwise proceeds with `"wandb"` still present in `report_to`.
  - Creates a `trl.SFTTrainer` with:
    - `dataset_text_field="text"`.
    - `DataCollatorForLanguageModeling` (causal LM, `pad_to_multiple_of=8`).
    - `max_seq_length=512` (explicitly set to reduce memory usage with big Mamba models).
  - Runs `trainer.train()`, saves the final adapter + tokenizer to `config.output_dir`, and prints PEFT loading instructions.
- CLI (`if __name__ == "__main__":`):
  - Flags: `--config`, `--model_id`, `--dataset_proportion`, `--dataset_max_samples`, `--output_dir`, `--hf_token`, `--per_device_train_batch_size`, `--gradient_accumulation_steps`.
  - If a `--config` JSON is provided and exists, initializes `QLoRAConfig` from it; otherwise uses defaults plus CLI overrides.
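The WandB availability check amounts to logic like the following (a minimal sketch, assuming a stored login is detected via `~/.netrc`; the real detection in `train_qlora.py` may differ):

```python
import os
from pathlib import Path

def resolve_report_to(report_to):
    """If WandB is requested but no API key or stored login is available,
    fall back to no reporting instead of failing mid-run."""
    if "wandb" not in report_to:
        return report_to
    has_key = bool(os.environ.get("WANDB_API_KEY"))
    # `wandb login` typically stores credentials in ~/.netrc (an assumption here)
    has_login = Path.home().joinpath(".netrc").exists()
    if not (has_key or has_login):
        print("WARNING: wandb requested but no credentials found; disabling logging.")
        report_to = [r for r in report_to if r != "wandb"] or ["none"]
    return report_to
```
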
### 5. Inference Utilities

`qlora_finetune/inference.py`:

- `load_qlora_model(base_model_id, lora_adapter_path, hf_token=None, merge_weights=False)`:
  - Loads the base model via `cobra.models.load.load`.
  - Loads the adapter from `lora_adapter_path` into the LLM backbone via `PeftModel.from_pretrained`.
  - Optionally merges the LoRA weights into the base model (`merge_and_unload`).
- `generate_with_qlora(model, vlm, image, prompt, max_new_tokens=512, temperature=0.7, do_sample=True)`: runs generation through Cobra's `generate` API with the fine-tuned weights.
- `save_merged_model(model, output_path, tokenizer=None)`: saves a fully merged model for standalone use.
### 6. Run Scripts

`qlora_finetune/run_100samples.sh`:

- Activates `./env` if present.
- Creates a timestamped output dir `./qlora_outputs_100samples_<timestamp>`.
- Runs:

```shell
python train_qlora.py \
  --model_id cobra+3b \
  --dataset_max_samples 100 \
  --output_dir "${OUTPUT_DIR}" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 1
```

(You’ve also derived `run_1000samples.sh` and `run_10000samples.sh` in your workspace following the same pattern.)
### 7. Git Ignore

`qlora_finetune/.gitignore` now ignores QLoRA output dirs and other local artifacts:
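The exact entries are not reproduced in this summary, but they presumably look something like this (illustrative only; consult `qlora_finetune/.gitignore` for the actual patterns):

```
qlora_outputs_*/
wandb/
env/
__pycache__/
```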
## How to Use
### One-time setup
```shell
cd qlora_finetune
bash install_requirements.sh
```

### Fine-tune with 100 examples

Run `bash run_100samples.sh` (or one of the derived scripts) from `qlora_finetune/`.
Outputs go to a folder like `./qlora_outputs_100samples_<timestamp>/`.
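If you need to reproduce the timestamped naming elsewhere (e.g. in analysis tooling), it amounts to the following sketch (the exact format string used by the run scripts is an assumption):

```python
from datetime import datetime

def make_output_dir(prefix="qlora_outputs_100samples"):
    """Sketch: build a timestamped output dir name like the run scripts do,
    e.g. ./qlora_outputs_100samples_20240101_120000 (format is assumed)."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"./{prefix}_{stamp}"
```
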
### Evaluate / Inference
Use `qlora_finetune/inference.py` as documented in `qlora_finetune/README.md`.
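As a hedged sketch of the documented loading flow (mirroring the described `load_qlora_model`; the `cobra.models.load.load` call signature and the generate call in the usage comment are assumptions, and imports are deferred so the snippet stands alone):

```python
def load_qlora_model(base_model_id, lora_adapter_path,
                     hf_token=None, merge_weights=False):
    """Sketch: load base Cobra, attach the LoRA adapter, optionally merge."""
    # Deferred imports: only needed when the function is actually called.
    from cobra.models.load import load   # core Cobra loader (signature assumed)
    from peft import PeftModel           # PEFT adapter loading

    vlm = load(base_model_id, hf_token=hf_token)
    vlm.llm_backbone = PeftModel.from_pretrained(vlm.llm_backbone,
                                                 lora_adapter_path)
    if merge_weights:
        # Fold the LoRA deltas into the base weights for standalone use
        vlm.llm_backbone = vlm.llm_backbone.merge_and_unload()
    return vlm

# Usage (paths are examples only):
# vlm = load_qlora_model("cobra+3b", "./qlora_outputs_100samples_<timestamp>")
# caption = vlm.generate(image, "USER: <image>\nDescribe the scene. ASSISTANT:")
```
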
## Notes / Trade-offs

- Dependency versions are pinned tightly to avoid known incompatibilities (`mamba-ssm`, `datasets`, `huggingface_hub`, `trl`, `bitsandbytes`, `torch`).
- To log to WandB, run `wandb login` or set `WANDB_API_KEY`; otherwise `report_to` falls back to `["none"]`.
- `max_seq_length=512`, `batch_size=1`, and gradient checkpointing (where supported) are chosen to fit the 3B Mamba-based Cobra model on a single L4‑class GPU.
## Checklist

- LoRA fine-tuning pipeline (`train_qlora.py`).
- Reproducible environment (`requirements.txt` + `install_requirements.sh`).
- Local artifacts (`qlora_outputs_*`, wandb logs) excluded from Git.