The goal of this project is to analyze how different fine-tuning strategies affect the performance of pretrained multimodal models. Specifically, we investigate whether performance gains in zero-shot classification stem from fine-tuning pretrained representations, the choice of encoder architecture, or hyperparameter ablations.
We explore the transition from contrastive learning (CLIP) to generative multimodal models (Qwen2-VL) to understand the impact of model scale and training complexity on classification accuracy.
We use the Oxford Flowers 102 dataset, consisting of 102 flower categories common in the UK.
- Images: 102 categories.
- Labels & Splits: Using the official
labels.npzandcat_to_name.jsonfor category mapping. - Custom Split Logic: In accordance with the challenge requirements, we swapped the standard splits:
- Training Set: Original
valid+tstidindices. - Test Set: Original
trnid(training) indices.
- Training Set: Original
train.py: Main script for fine-tuning CLIP using a contrastive loss (InfoNCE) with AdamW.baseline_eval.py: Evaluates a vanilla, pretrained CLIP model (zero-shot) to establish a performance floor.test_models.py: Utility script to compare parameter counts across CLIP architectures (e.g., ViT-B/32 vs ViT-L/14).
dataset_Description.py: Generates a bar chart showing the class distribution across the 102 categories and prints image counts per split.class_distribution.py: Computes and plots the Kernel Density Estimation (KDE) of Hue, Saturation, and Value (HSV) for a specific flower class.isomap.py: Performs Isomap dimensionality reduction on class centroids (HSV features) and visualizes the manifold using representative images.
web.py: A Streamlit-based web application to explore the dataset, filter by class name, and view images in grid or single-focus modes.
We compare different CLIP backbones to see if larger vision encoders (e.g., ViT-L) outperform smaller ones even without extensive fine-tuning.
- Command:
python test_models.py
Establishing the baseline with vanilla CLIP vs. fine-tuning on the specialized prompt: "an image of the {} flower".
- Baseline:
python baseline_eval.py --architecture ViT-B/32 - Fine-tuned:
python train.py --lr 5e-6 --epochs 5 --architecture ViT-B/32
The training objective is based on the InfoNCE loss, which aligns image and text representations:
-
Alignment: The loss maximizes the similarity of the
$N$ matching pairs in a batch while minimizing the similarity of the$N^2 - N$ incorrect pairs. -
Temperature (
$\tau$ ): A scalar scaling factor that controls the sharpness of the similarity distribution. Lower$\tau$ values force the model to concentrate on the hardest negative samples.
- Monitor with WandB: All runs track Top-1 accuracy and Top-3/5 retrieval.
- Launch Web Explorer:
streamlit run web.py
- Run Distribution Analysis:
python dataset_exploration/dataset_Description.py