
# 02456_Deep_Learning_Project

## Automated Key Point Description for Vision Transformers using Vision-Language Models

Recent work has shown that features extracted by Vision Transformers (ViTs) trained with self-supervised learning can perform unsupervised key point matching between two images with high precision (https://arxiv.org/abs/2112.05814). However, since the key points are identified in an unsupervised manner, human evaluation is needed to describe the key points that are discovered. The goal of this project is to automatically generate textual descriptions of the key points using recent advances in vision-language modeling (https://proceedings.mlr.press/v139/radford21a/radford21a.pdf). The project focuses on fine-grained classification and has direct links to ongoing research on explainability.
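The core idea can be sketched as a zero-shot retrieval step: embed a crop around each discovered key point with a vision-language model's image encoder, embed a set of candidate textual descriptions with its text encoder, and pick the description with the highest cosine similarity, as in CLIP-style zero-shot classification. A minimal NumPy sketch of that matching step, using small hypothetical embeddings in place of real CLIP encoders (the function name, dimensions, and candidate descriptions are illustrative assumptions, not part of the project spec):

```python
import numpy as np

def describe_keypoint(patch_embedding, text_embeddings, descriptions):
    """Return the candidate description whose (hypothetical) text embedding
    is most cosine-similar to the key-point patch embedding."""
    patch = patch_embedding / np.linalg.norm(patch_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = texts @ patch  # cosine similarity of each description to the patch
    return descriptions[int(np.argmax(sims))], sims

# Toy 4-d embeddings; a real pipeline would run CLIP's image encoder on a
# crop around each DINO-ViT key point and its text encoder on part names
# (e.g. CUB-200-2011 bird parts).
descriptions = ["the bird's beak", "the bird's wing", "the bird's tail"]
text_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
patch_emb = np.array([0.9, 0.1, 0.0, 0.0])

best, sims = describe_keypoint(patch_emb, text_emb, descriptions)
print(best)  # → the bird's beak
```

In the actual project the candidate descriptions could come from part annotations in the dataset or from prompt templates, and the argmax could be replaced with a softmax over similarities to get calibrated scores.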

Data source: https://www.kaggle.com/datasets/wenewone/cub2002011

Important paper links:

- https://proceedings.neurips.cc/paper_files/paper/2019/file/adf7ee2dcf142b0e11888e72b43fcb75-Paper.pdf
- https://openaccess.thecvf.com/content/CVPR2023/papers/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.pdf
- https://arxiv.org/abs/2105.02968
- https://dino-vit-features.github.io/
- https://arxiv.org/abs/2103.00020