A custom implementation of the Vision Transformer (ViT) for digit classification, extended to an encoder-decoder model for digit sequence recognition.
This project reimplements the Vision Transformer architecture from the original ViT paper, trained on the MNIST dataset for handwritten digit recognition. In addition to the standard ViT encoder, the model is extended into an encoder-decoder architecture inspired by the Transformer from Attention Is All You Need; this variant processes a grid of digit images and predicts the corresponding digit sequence. Both implementations are written from scratch (no AI; you will have to trust me) following the original papers.
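A ViT turns each image into a sequence of flattened patches before the encoder sees it. As a rough illustration (not code from this repo), here is how a 28x28 MNIST digit splits into 7x7 patches; the function name `patchify` and the patch size are just illustrative choices:

```python
import numpy as np

def patchify(img, patch=7):
    """Split a square image into non-overlapping flattened patches,
    as in ViT's input embedding. A 28x28 MNIST digit with patch=7
    yields 16 patches of 49 pixels each."""
    h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    return (img.reshape(h // patch, patch, w // patch, patch)
               .transpose(0, 2, 1, 3)       # group the two patch-grid axes
               .reshape(-1, patch * patch)) # one flattened row per patch

img = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)
tokens = patchify(img)  # shape (16, 49): 16 patch tokens of 49 pixels
```

Each row of `tokens` would then be linearly projected to the model dimension and combined with a positional embedding.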
- Encoder: Based on the original ViT paper, using pre-layer norm with skip connections.
- Decoder: Follows post-layer norm as in the original Transformer paper.
- The encoder-decoder model therefore mixes both normalization schemes. Feel free to change either.
Create the val/test data:

```sh
python models/utils.py
```

To train the models:

```sh
# For standard ViT encoder on MNIST
python models/vit_enc.py

# For encoder-decoder on digit grids
python models/vit_enc_dec.py
```

To run the streamlit app:

```sh
streamlit run app.py
```

Alternatively, the Dockerfile is standalone:

```sh
docker build -t vit-mnist-app .
docker run -p 8501:8501 vit-mnist-app
```