Vision-Language Model for Flowers Classification

🎯 Challenge Objective

The goal of this project is to analyze how different fine-tuning strategies affect the performance of pretrained multimodal models. Specifically, we investigate whether gains in classification accuracy over the zero-shot baseline stem from fine-tuning the pretrained representations, from the choice of encoder architecture, or from hyperparameter choices.

We explore the transition from contrastive learning (CLIP) to generative multimodal models (Qwen2-VL) to understand the impact of model scale and training complexity on classification accuracy.

📊 Dataset & Data Split

We use the Oxford Flowers 102 dataset, consisting of 102 flower categories common in the UK.

Images: 8,189 images across 102 categories.
Labels & Splits: We use the official labels.npz and cat_to_name.json for category mapping.
Custom Split Logic: In accordance with the challenge requirements, we swapped the standard splits:
- Training Set: Original valid + tstid indices (1,020 + 6,149 = 7,169 images).
- Test Set: Original trnid (training) indices (1,020 images).
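
The swapped split can be sketched as follows. This is a minimal illustration, not the project's own loader: the function name `swap_splits` and the toy index arrays are hypothetical, and the real indices would come from the dataset's trnid, valid, and tstid arrays.

```python
import numpy as np

def swap_splits(trnid, valid, tstid):
    """Return (train_ids, test_ids) under the swapped protocol."""
    # Train on the original validation + test images.
    train_ids = np.concatenate([valid, tstid])
    # Evaluate on the original (small) training images.
    test_ids = np.asarray(trnid)
    # The two sets must not share any image.
    assert np.intersect1d(train_ids, test_ids).size == 0
    return np.sort(train_ids), np.sort(test_ids)

# Toy 1-based image ids standing in for the real index arrays (hypothetical).
trnid = np.array([1, 4, 7])
valid = np.array([2, 5])
tstid = np.array([3, 6, 8, 9, 10])
train_ids, test_ids = swap_splits(trnid, valid, tstid)
```

The overlap check is cheap and guards against accidentally evaluating on images seen during fine-tuning.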

🛠️ Project Structure

1. CLIP Fine-Tuning (Contrastive)

train.py: Main script for fine-tuning CLIP using a contrastive loss (InfoNCE) with AdamW.
baseline_eval.py: Evaluates a vanilla, pretrained CLIP model (zero-shot) to establish a performance floor.
test_models.py: Utility script to compare parameter counts across CLIP architectures (e.g., ViT-B/32 vs ViT-L/14).

2. Dataset Exploration & Visualization (`Data_Exploration/`)

dataset_Description.py: Generates a bar chart showing the class distribution across the 102 categories and prints image counts per split.
class_distribution.py: Computes and plots the Kernel Density Estimation (KDE) of Hue, Saturation, and Value (HSV) for a specific flower class.
isomap.py: Performs Isomap dimensionality reduction on class centroids (HSV features) and visualizes the manifold using representative images.
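
To make the HSV analysis concrete, here is a minimal sketch of a kernel density estimate over hue values. It is an assumption-laden illustration rather than the code in class_distribution.py: the helper names, the bandwidth of 0.05, and the synthetic "petal" pixels are all hypothetical, and the estimator treats hue as a linear quantity even though it is circular.

```python
import colorsys
import numpy as np

def hue_values(rgb_pixels):
    """Convert an (N, 3) array of RGB values in [0, 1] to hues in [0, 1]."""
    return np.array([colorsys.rgb_to_hsv(r, g, b)[0] for r, g, b in rgb_pixels])

def gaussian_kde(samples, grid, bandwidth=0.05):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels over samples; dividing by the bandwidth
    # rescales the standard kernel to the chosen width.
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
# Synthetic yellow-dominant pixels with a little noise (hypothetical data).
pixels = np.clip(rng.normal([0.85, 0.75, 0.2], 0.05, size=(500, 3)), 0.0, 1.0)
grid = np.linspace(0.0, 1.0, 201)
density = gaussian_kde(hue_values(pixels), grid)
peak_hue = grid[np.argmax(density)]
```

The same function could be applied to the saturation and value channels by taking indices 1 and 2 of the `colorsys.rgb_to_hsv` result.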

3. Interactive Tools

web.py: A Streamlit-based web application to explore the dataset, filter by class name, and view images in grid or single-focus modes.

🚀 Key Experiments

A. Architecture Ablation

We compare different CLIP backbones to see if larger vision encoders (e.g., ViT-L) outperform smaller ones even without extensive fine-tuning.

Command: python test_models.py
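
For a rough sense of the scale gap between backbones, the vision-encoder parameter count can be estimated analytically from the architecture hyperparameters. The sketch below is not what test_models.py does (that script is not shown here). The function name `vit_param_count` is hypothetical, the formula assumes a standard CLIP-style ViT layout, and the text encoder is ignored.

```python
def vit_param_count(width, layers, patch, out_dim, image_size=224):
    """Approximate parameter count of a CLIP-style vision transformer."""
    # Per block: attention (4*d^2 + 4*d), MLP with 4x expansion (8*d^2 + 5*d),
    # and two LayerNorms (4*d).
    per_block = 12 * width**2 + 13 * width
    tokens = (image_size // patch) ** 2 + 1   # patches + class token
    patch_embed = 3 * patch**2 * width        # bias-free patch convolution
    embeddings = width + tokens * width       # class + positional embeddings
    norms = 4 * width                         # pre- and post-LayerNorm
    projection = width * out_dim              # map into the joint space
    return layers * per_block + patch_embed + embeddings + norms + projection

# Assumed configurations of the OpenAI CLIP vision towers.
configs = {
    "ViT-B/32": dict(width=768, layers=12, patch=32, out_dim=512),
    "ViT-L/14": dict(width=1024, layers=24, patch=14, out_dim=768),
}
counts = {name: vit_param_count(**cfg) for name, cfg in configs.items()}
for name, n in counts.items():
    print(f"{name}: ~{n / 1e6:.1f}M vision-encoder parameters")
```

In practice, summing `p.numel()` over a loaded model's parameters is the more reliable check, since it also covers the text tower.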

B. Fine-Tuning vs. Zero-Shot

We establish a baseline with vanilla (zero-shot) CLIP and compare it against a model fine-tuned with the specialized prompt "an image of the {} flower".

Baseline: python baseline_eval.py --architecture ViT-B/32
Fine-tuned: python train.py --lr 5e-6 --epochs 5 --architecture ViT-B/32
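
The zero-shot decision rule behind the baseline can be sketched as below. This is an illustration under stated assumptions: random vectors stand in for the image and text encoder outputs, the helper names are hypothetical, and the three class names are only examples.

```python
import numpy as np

PROMPT = "an image of the {} flower"

def build_prompts(class_names):
    """Fill the prompt template once per class name."""
    return [PROMPT.format(name) for name in class_names]

def zero_shot_predict(image_emb, text_emb):
    """Pick, for each image, the class whose prompt embedding is most similar."""
    # L2-normalise so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return (image_emb @ text_emb.T).argmax(axis=1)

class_names = ["pink primrose", "fire lily", "sunflower"]
prompts = build_prompts(class_names)

# Random vectors stand in for the encoders' outputs (hypothetical values).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 16))
# Two "images" that sit close to the sunflower and pink primrose prompts.
image_emb = text_emb[[2, 0]] + 0.01 * rng.normal(size=(2, 16))
predictions = zero_shot_predict(image_emb, text_emb)
```

Fine-tuning leaves this decision rule unchanged and only moves the embeddings, which is why the same evaluation applies to both the baseline and the fine-tuned model.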

📚 Theoretical Component: InfoNCE Loss

The training objective is based on the InfoNCE loss, which aligns image and text representations:

Alignment: The loss maximizes the similarity of the $N$ matching pairs in a batch while minimizing the similarity of the $N^2 - N$ incorrect pairs.
Temperature ($\tau$): A scalar scaling factor that controls the sharpness of the similarity distribution. Lower $\tau$ values force the model to concentrate on the hardest negative samples.
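
Writing $s_{ij}$ for the cosine similarity between image $i$ and text $j$, the image-to-text term of the loss is commonly written as

$$\mathcal{L}_{\text{img}\to\text{txt}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)}$$

and CLIP-style training averages it with the analogous text-to-image term. The sketch below implements that symmetric form in NumPy. It assumes the symmetric variant and a fixed $\tau = 0.07$, neither of which is confirmed by train.py, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(image_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of N aligned image/text embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / tau   # (N, N); diagonal = matching pairs
    i2t = -np.diag(log_softmax(logits, axis=1)).mean()  # image -> text
    t2i = -np.diag(log_softmax(logits, axis=0)).mean()  # text -> image
    return 0.5 * (i2t + t2i)

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 32))
# Images close to their own captions should give a low loss ...
aligned = info_nce(text + 0.05 * rng.normal(size=(8, 32)), text)
# ... while pairing each image with the wrong caption should give a high one.
shuffled = info_nce(text[::-1] + 0.05 * rng.normal(size=(8, 32)), text)
```

Dividing the logits by a smaller `tau` widens the gap between the largest and the remaining similarities before the softmax, which is the sharpening effect described above.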

📈 Monitoring & Usage

Monitor with WandB: All runs track Top-1 accuracy and Top-3/5 retrieval.
Launch Web Explorer:
```
streamlit run web.py
```

Run Distribution Analysis:
```
python Data_Exploration/dataset_Description.py
```
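
The Top-1 and Top-3/5 metrics tracked during runs can be computed from a matrix of image-to-class scores. The following is a minimal sketch with a hypothetical helper `top_k_accuracy` and toy scores. How the project logs these values to WandB is not shown here.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of rows whose true label is among the k highest scores."""
    # Indices of the k best-scoring classes per row, best first.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy similarity scores for 3 images over 4 classes (hypothetical values).
scores = np.array([
    [0.1, 0.7, 0.2, 0.0],
    [0.5, 0.1, 0.3, 0.1],
    [0.2, 0.3, 0.1, 0.4],
])
labels = np.array([1, 2, 0])
top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
```

Here only the first image is ranked correctly at Top-1, while all three true labels appear within the Top-3.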
