
# 02456_Deep_Learning_Project

## Automated Key Point Description for Vision Transformers using Vision-Language Models

Recent work has shown that features extracted by Vision Transformers (ViTs) trained with self-supervised learning can perform unsupervised key point matching between two images with high precision (https://arxiv.org/abs/2112.05814). However, since the key points are identified in an unsupervised manner, human evaluation is needed to describe the key points that are discovered. The goal of this project is to automatically generate textual descriptions of the key points using recent advances in vision-language modeling (https://proceedings.mlr.press/v139/radford21a/radford21a.pdf). The project focuses on fine-grained classification and has direct links to ongoing research on explainability.
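The core idea can be sketched as a zero-shot retrieval step: embed a crop around each discovered key point with a vision-language model's image encoder, embed a set of candidate textual descriptions with its text encoder, and pick the description with the highest cosine similarity, as in CLIP-style zero-shot classification. A minimal NumPy sketch of that matching step, using small hypothetical embeddings in place of real CLIP encoders (the function name, dimensions, and candidate descriptions are illustrative assumptions, not part of the project spec):

```python
import numpy as np

def describe_keypoint(patch_embedding, text_embeddings, descriptions):
    """Return the candidate description whose (hypothetical) text embedding
    is most cosine-similar to the key-point patch embedding."""
    patch = patch_embedding / np.linalg.norm(patch_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = texts @ patch  # cosine similarity of each description to the patch
    return descriptions[int(np.argmax(sims))], sims

# Toy 4-d embeddings; a real pipeline would run CLIP's image encoder on a
# crop around each DINO-ViT key point and its text encoder on part names
# (e.g. CUB-200-2011 bird parts).
descriptions = ["the bird's beak", "the bird's wing", "the bird's tail"]
text_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
patch_emb = np.array([0.9, 0.1, 0.0, 0.0])

best, sims = describe_keypoint(patch_emb, text_emb, descriptions)
print(best)  # → the bird's beak
```

In the actual project the candidate descriptions could come from part annotations in the dataset or from prompt templates, and the argmax could be replaced with a softmax over similarities to get calibrated scores.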

Data source: https://www.kaggle.com/datasets/wenewone/cub2002011

Important paper links:

- https://proceedings.neurips.cc/paper_files/paper/2019/file/adf7ee2dcf142b0e11888e72b43fcb75-Paper.pdf
- https://openaccess.thecvf.com/content/CVPR2023/papers/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.pdf
- https://arxiv.org/abs/2105.02968
- https://dino-vit-features.github.io/
- https://arxiv.org/abs/2103.00020