
# Vision Transformer on the MNIST Dataset

A custom implementation of the Vision Transformer (ViT) for digit classification, extended to an encoder-decoder model for digit-sequence recognition.


## 🚀 Overview

This project reimplements the Vision Transformer architecture from the original ViT paper, trained on the MNIST dataset for handwritten digit recognition. In addition to the standard ViT encoder, the model is extended into an encoder-decoder architecture inspired by the Transformer from Attention Is All You Need: the encoder-decoder processes a grid of digit images and predicts the corresponding digit sequence. Both implementations are coded from scratch (no AI — you will have to trust me) following the original papers.
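The core ViT idea described above — splitting each image into patches, linearly projecting them, and prepending a class token with positional embeddings — can be sketched as follows. This is a minimal illustration in PyTorch; the patch size, embedding dimension, and layer names here are assumptions for demonstration, not necessarily the values used in this repo.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of ViT patch embedding for MNIST (28x28 grayscale).

    Hyperparameters (patch_size=7, embed_dim=64) are illustrative
    assumptions, not the repo's actual settings.
    """

    def __init__(self, img_size=28, patch_size=7, in_chans=1, embed_dim=64):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 16 for 28/7
        # A strided conv splits the image into non-overlapping patches
        # and linearly projects each one, as in the ViT paper.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):
        B = x.shape[0]
        # (B, 1, 28, 28) -> (B, embed_dim, 4, 4) -> (B, 16, embed_dim)
        x = self.proj(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(B, -1, -1)      # one class token per image
        x = torch.cat([cls, x], dim=1)              # (B, 17, embed_dim)
        return x + self.pos_embed                   # learned positions

tokens = PatchEmbedding()(torch.randn(2, 1, 28, 28))
```

The resulting token sequence (here shape `(2, 17, 64)`: 16 patches plus a class token) is what the Transformer encoder blocks then operate on.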

## 🔧 Architecture Notes

- **Encoder**: Based on the original ViT paper, using pre-layer norm with skip connections.
- **Decoder**: Follows post-layer norm, as in the original Transformer paper.
- The encoder-decoder model therefore mixes both normalization schemes. Feel free to change either.
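The pre-LN vs. post-LN distinction above comes down to where layer norm sits relative to the residual connection. A minimal side-by-side sketch (dimensions and sublayer choices are illustrative, not the repo's actual configuration):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Encoder style (ViT): normalize BEFORE each sublayer,
    residual wraps (norm -> sublayer)."""

    def __init__(self, d=64, heads=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                 nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # x + Attn(LN(x))
        return x + self.mlp(self.norm2(x))   # x + MLP(LN(x))

class PostLNBlock(nn.Module):
    """Decoder style (original Transformer): normalize AFTER
    the residual add."""

    def __init__(self, d=64, heads=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                 nn.Linear(4 * d, d))

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])  # LN(x + Attn(x))
        return self.norm2(x + self.mlp(x))         # LN(x + MLP(x))

x = torch.randn(2, 17, 64)
out_pre = PreLNBlock()(x)
out_post = PostLNBlock()(x)
```

Both blocks are shape-preserving, so swapping one scheme for the other is a local change; pre-LN is generally the easier of the two to train without a warmup schedule.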

## Training

Create the validation/test data:

```bash
python models/utils.py
```

To train the models:

```bash
# For the standard ViT encoder on MNIST
python models/vit_enc.py

# For the encoder-decoder on digit grids
python models/vit_enc_dec.py
```

## Streamlit

To run the Streamlit app:

```bash
streamlit run app.py
```

Alternatively, the Dockerfile is standalone:

```bash
docker build -t vit-mnist-app .
docker run -p 8501:8501 vit-mnist-app
```

