Pipeline:
- Stage 1: Train Q-Former for image-text alignment (contrastive objective)
- Stage 2: Train VLM with LoRA on top of small LLM using Q-Former visual tokens
- Evaluate on COCO val subset:
  - Caption metrics (BLEU/ROUGE)
  - Retrieval metrics (I2T/T2I Recall@K)

Dataset:
- Training/eval subsets were built locally from COCO `train2017` and `val2017`.
- Local dataset paths used during runs:
  - `dataset/coco_subsets/train2017_50k.jsonl`
  - `dataset/coco_subsets/train2017_50k_images`
  - `dataset/coco_subsets/val2017_1k.jsonl`
  - `dataset/coco_subsets/val2017_1k_images`
- `dataset/` is excluded from git by design.
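
The manifest schema is not documented in this section, so the following is only a minimal sketch for inspecting a `.jsonl` manifest. It assumes nothing beyond one JSON object per line; `load_manifest` is a hypothetical helper and not part of `vlm_train`.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Read a JSONL manifest into a list of dicts (one JSON object per line)."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
    return records
```

Calling `len(load_manifest("dataset/coco_subsets/val2017_1k.jsonl"))` is a quick way to confirm that a locally built subset has the expected number of records before starting a run.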

Training hardware:
- GPU: RTX A5000 (24 GB VRAM), 9 vCPU (RunPod)

Stage 1 (Q-Former) training:
- Epochs: 10
- Batch size: 8
- Best val loss: `0.0622` (epoch 4)

Final stage-1 artifact used: `models/from_pod/trained_qformer_50k_unimodal_fresh/best`
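
Stage 1 aligns images and text with a contrastive objective. The exact loss used by `vlm_train.qformer_train` is not shown here, so the following is only a sketch of a common choice, a symmetric InfoNCE loss over an image-text similarity matrix. The function name and the temperature of `0.07` are illustrative assumptions, not values taken from this project.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss over a square similarity matrix.

    sim[i][j] is the similarity of image i and text j. Matching pairs
    are assumed to lie on the diagonal (an assumption of this sketch).
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text: each row should put its probability mass on the diagonal.
    i2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same computation over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)
```

With an uninformative similarity matrix (all entries equal) the loss equals `log(n)` for a batch of `n` pairs, and it approaches zero as the diagonal dominates.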

Stage 2 (VLM with LoRA) training:
- Base LLM: `HuggingFaceTB/SmolLM-135M-Instruct`
- Epochs: 5
- Batch size: 8
- Grad accumulation: 4
- Best val loss: `2.0672` (step 7020)
- Final epoch val loss: `2.0957`

Final stage-2 artifact used: `models/from_pod/vlm_peft/best`
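
Batch size and gradient accumulation together determine how many samples contribute to each optimizer update. The arithmetic can be sketched as below. The helper names are hypothetical, a single GPU is assumed, and the steps-per-epoch helper assumes every sample is used for training with the last partial batch kept, which may not match the actual training loop.

```python
import math


def effective_batch_size(batch_size, grad_accum_steps, num_devices=1):
    """Samples contributing to one optimizer update."""
    return batch_size * grad_accum_steps * num_devices


def optimizer_steps_per_epoch(num_samples, batch_size, grad_accum_steps):
    """Optimizer updates per epoch, counting a final partial accumulation window."""
    micro_batches = math.ceil(num_samples / batch_size)
    return math.ceil(micro_batches / grad_accum_steps)


# Stage-2 settings listed above: batch size 8, grad accumulation 4.
print(effective_batch_size(8, 4))  # 32
```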

Caption metrics (BLEU/ROUGE):

Source:
- Computed from a 500-sample local subset of COCO `val2017` with multiple references per image.
- Committed artifact: `inference_results/val2017_500_metrics.json`

Results:
- BLEU: `22.4538`
- ROUGE-1: `0.4084`
- ROUGE-2: `0.1549`
- ROUGE-L: `0.3691`
- ROUGE-Lsum: `0.3690`
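
To make the ROUGE numbers easier to interpret, here is a simplified sketch of a ROUGE-1 F1 score with multiple references per image. It is not the implementation behind `vlm_train.eval_captions`: it uses plain whitespace tokenization with no stemming, and taking the best score over the references is an assumption. Note also that the BLEU value above appears to be on a 0-100 scale, while the ROUGE values are on a 0-1 scale.

```python
from collections import Counter


def rouge1_f1(prediction, references):
    """Best unigram-overlap F1 of `prediction` against any of `references`."""
    pred_counts = Counter(prediction.lower().split())
    best = 0.0
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        # Clipped matches: each token counts at most as often as it appears in both.
        overlap = sum((pred_counts & ref_counts).values())
        if overlap == 0:
            continue
        precision = overlap / sum(pred_counts.values())
        recall = overlap / sum(ref_counts.values())
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For example, scoring `"a cat sits on the mat"` against the single reference `"a cat is on the mat"` matches five of six unigrams on each side, giving an F1 of 5/6.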

Retrieval metrics (I2T/T2I Recall@K):

Source:
- Computed on the same 500-sample local COCO `val2017` subset.
- Committed artifacts:
  - `inference_results/retrieval_val2017_500_metrics.json`
  - `inference_results/similarity_grid.jpg`

Results:
- I2T R@1: `0.3860`
- I2T R@5: `0.8100`
- I2T R@10: `0.9300`
- T2I R@1: `0.4040`
- T2I R@5: `0.7960`
- T2I R@10: `0.9340`
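
Recall@K is the fraction of queries whose correct match appears among the top K retrieved candidates. The sketch below shows one way to compute it from an image-text similarity matrix. It assumes exactly one caption per image, with the correct pair on the diagonal; `vlm_train.retrieval_eval` may handle multiple captions per image differently, and the function names here are hypothetical.

```python
def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for queries (rows) against candidates (columns).

    Assumes the correct candidate for query i is candidate i.
    """
    n = len(sim)
    hits = {k: 0 for k in ks}
    for i, row in enumerate(sim):
        # Zero-based rank of the correct candidate: how many score strictly higher.
        rank = sum(1 for s in row if s > row[i])
        for k in ks:
            if rank < k:
                hits[k] += 1
    return {k: hits[k] / n for k in ks}


def transpose(sim):
    """Swap queries and candidates (rows and columns)."""
    return [list(col) for col in zip(*sim)]
```

If `sim[i][j]` holds the similarity of image `i` and text `j`, then `recall_at_k(sim)` gives the I2T scores and `recall_at_k(transpose(sim))` gives the T2I scores.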

Commands:

Stage 1 (Q-Former) training:

    uv run -m vlm_train.qformer_train --manifest-path dataset/coco_subsets/train2017_50k.jsonl --images-dir dataset/coco_subsets/train2017_50k_images --model-id trained_qformer_50k_unimodal_fresh --epochs 10 --batch-size 8

Stage 2 (VLM with LoRA) training:

    uv run -m vlm_train.lm_train --qformer-model-path models/trained_qformer_50k_unimodal_fresh/best --manifest-path dataset/coco_subsets/train2017_50k.jsonl --images-dir dataset/coco_subsets/train2017_50k_images --model-id vlm_peft --epochs 5 --batch-size 8

Single-image inference:

    uv run -m vlm_train.basic_inf --image "dataset/coco_subsets/train2017_50k_images/<image>.jpg" --checkpoint-dir "models/from_pod/vlm_peft/best" --qformer-model-path "models/from_pod/trained_qformer_50k_unimodal_fresh/best"

Batch inference:

    uv run -m vlm_train.batch_inf --num-samples 500 --manifest-path "dataset/coco_subsets/val2017_1k.jsonl" --images-dir "dataset/coco_subsets/val2017_1k_images" --checkpoint-dir "models/from_pod/vlm_peft/best" --qformer-model-path "models/from_pod/trained_qformer_50k_unimodal_fresh/best" --out-path "inference_results/val2017_500_preds.jsonl"

Caption evaluation:

    uv run -m vlm_train.eval_captions --preds-jsonl "inference_results/val2017_500_preds.jsonl" --out-json "inference_results/val2017_500_metrics.json" --out-csv "inference_results/val2017_500_metrics_per_sample.csv" --skip-bertscore

Retrieval evaluation:

    uv run -m vlm_train.retrieval_eval --num-samples 500 --manifest-path "dataset/coco_subsets/val2017_1k.jsonl" --images-dir "dataset/coco_subsets/val2017_1k_images" --qformer-path "models/from_pod/trained_qformer_50k_unimodal_fresh/best" --out-json "inference_results/retrieval_val2017_500_metrics.json" --save-grid --grid-path "inference_results/similarity_grid.jpg"

Two-stage vision-language training project inspired by https://github.com/avbiswas/vlm/tree/main


