Impact of Activation Functions on Vision Transformers for Image Classification

Capstone Project | VIT Chennai | 2026

Authors

  • Jesseman Devamirtham N (22BRS1112) - Lead
  • R. Ritish Reddy (22BRS1291)
  • Preeti Chanda Patra (22BRS1066)

Guide: Prof. Prasad M

Abstract

A systematic study of how the choice of activation function (GELU, ReLU, SiLU/Swish, Tanh, Mish) affects Vision Transformer architectures (ViT, CaiT, Swin Transformer) on image classification. The study uses a two-phase training pipeline: DINO self-supervised pretraining followed by supervised fine-tuning with advanced data augmentation.
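
As a rough illustration (not the repo's actual code), swapping the activation amounts to changing a single layer inside each transformer block's feed-forward MLP. The MLP class and act_layer argument below follow the common timm-style convention and are assumptions for this sketch:

import torch.nn as nn

class MLP(nn.Module):
    """Feed-forward block of a transformer encoder layer (sketch)."""
    def __init__(self, dim, mlp_ratio=2, act_layer=nn.GELU):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.fc1 = nn.Linear(dim, hidden)
        self.act = act_layer()              # the only component that changes
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

# Each experiment swaps just this one component:
variants = {name: MLP(192, act_layer=act)
            for name, act in [("gelu", nn.GELU), ("relu", nn.ReLU),
                              ("silu", nn.SiLU), ("tanh", nn.Tanh),
                              ("mish", nn.Mish)]}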

Key Findings (ViT on CIFAR-10)

Activation     Best Val. Accuracy   Rank
------------   ------------------   ----
GELU           93.68%               1st
SiLU (Swish)   93.12%               2nd
ReLU           92.71%               3rd
Mish           91.19%               4th
Tanh           88.50%               5th

  • Non-monotonic activations (GELU, SiLU) outperform monotonic ones (ReLU, Tanh) by 1-5 percentage points, as illustrated in the sketch after this list
  • DINO self-supervised pretraining yields a +1.44 percentage-point average accuracy gain across all activations
  • Tanh benefits most from SSL pretraining (+2.00 points)
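
The non-monotonicity claim is easy to verify numerically (illustrative snippet, not repo code): GELU, SiLU, and Mish all dip below zero for small negative inputs before rising back toward zero, whereas ReLU and Tanh never decrease as their input grows.

import torch
import torch.nn.functional as F

x = torch.linspace(-4.0, 0.0, 401)
for name, f in [("GELU", F.gelu), ("SiLU", F.silu), ("Mish", F.mish)]:
    # a strictly negative minimum over x < 0 is the non-monotonic "dip"
    print(f"{name}: min {f(x).min().item():.3f}")     # GELU dips to about -0.17
print(f"ReLU: min {F.relu(x).min().item():.3f}")      # 0.000 everywhere on x <= 0
print(f"Tanh: min {torch.tanh(x).min().item():.3f}")  # negative, but monotonic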

Project Structure

.
├── ViT/                          # Vision Transformer experiments
│   ├── models/                   # Model architectures with different activations
│   │   ├── vit.py               # Base ViT (GELU)
│   │   ├── vitReLU.py           # ViT with ReLU
│   │   ├── vitELU.py            # ViT with ELU
│   │   ├── vitSiLU.py           # ViT with SiLU/Swish
│   │   ├── vitTanh.py           # ViT with Tanh
│   │   ├── cait.py              # CaiT architecture
│   │   ├── caitELU.py           # CaiT with ELU
│   │   ├── caitSiLU.py          # CaiT with SiLU
│   │   ├── swin.py              # Swin Transformer
│   │   ├── swinELU.py           # Swin with ELU
│   │   ├── swinSiLU.py          # Swin with SiLU
│   │   └── build_model.py       # Unified model factory
│   ├── utils/                    # Training utilities
│   │   ├── autoaug.py           # AutoAugment (CIFAR10 policy)
│   │   ├── cosine_annealing_with_warmup.py
│   │   ├── dataloader.py        # Data loading & transforms
│   │   ├── drop_path.py         # Stochastic depth
│   │   ├── load_checkpoint.py   # Checkpoint loading
│   │   ├── logger_dict.py       # Metric logging
│   │   ├── losses.py            # Label smoothing CE loss
│   │   ├── mix.py               # CutMix & Mixup
│   │   ├── print_progress.py    # Training progress display
│   │   ├── projection_head.py   # DINO projection head
│   │   ├── random_erasing.py    # Random erasing augmentation
│   │   ├── sampler.py           # RASampler (repeated augmentation)
│   │   ├── scheduler.py         # LR scheduler
│   │   ├── training_functions.py # Train/eval loops
│   │   ├── transforms.py        # Data transforms
│   │   └── utils_ssl.py         # SSL utilities (DINO)
│   ├── pretraining_vit.ipynb    # Phase 1: DINO SSL pretraining
│   ├── finetune.ipynb           # Phase 2: Supervised fine-tuning
│   └── mix.py                   # CutMix/Mixup utilities
│
├── CaiT/                         # CaiT-specific experiments
│   ├── models/                   # Extended activation variants
│   │   ├── cait.py, vit.py, swin.py
│   │   ├── vitReLU.py, vitELU.py, vitTanh.py
│   │   ├── vitLeakyReLU.py, vitParametricReLU.py
│   │   ├── vitSigmoid.py, vitSwish.py
│   │   └── build_model.py
│   ├── utils/                    # Same utility suite
│   ├── pretraining_vit.ipynb
│   └── finetune.ipynb
│
└── Swin/                         # Swin Transformer experiments (WIP)
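
build_model.py itself is not reproduced in this README; the sketch below shows what such a unified factory typically looks like. The function name matches the file, but the signature and the use of timm's generic VisionTransformer as the backbone are assumptions for illustration.

import torch.nn as nn
from timm.models.vision_transformer import VisionTransformer

# One entry per activation variant; each models/*.py file fixes one of these.
ACTIVATIONS = {"gelu": nn.GELU, "relu": nn.ReLU, "elu": nn.ELU,
               "silu": nn.SiLU, "tanh": nn.Tanh, "mish": nn.Mish}

def build_model(arch="vit", activation="gelu", num_classes=10):
    """Return the requested backbone with the requested activation (sketch)."""
    act_layer = ACTIVATIONS[activation]
    if arch == "vit":
        # hyperparameters taken from the Architecture Configurations table below
        return VisionTransformer(img_size=32, patch_size=4, embed_dim=192,
                                 depth=9, num_heads=12, mlp_ratio=2,
                                 num_classes=num_classes, act_layer=act_layer)
    raise ValueError(f"unknown architecture: {arch}")  # CaiT/Swin dispatch omitted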

Training Pipeline

Phase 1: Self-Supervised Pretraining (DINO)

  • Teacher-student framework with an EMA teacher (momentum ramped from 0.996 to 1.0; sketched after this list)
  • Multi-crop strategy: 2 global + 8 local crops
  • 200 epochs, AdamW (lr=0.0001), cosine schedule with 30-epoch warmup
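
A hedged sketch of the teacher EMA update behind the momentum ramp above (utils_ssl.py is not reproduced here; the cosine shape of the ramp is the standard DINO choice and is assumed):

import math
import torch

@torch.no_grad()
def update_teacher(student, teacher, step, total_steps, base_m=0.996):
    """EMA update: teacher <- m * teacher + (1 - m) * student."""
    # cosine ramp: m goes from 0.996 at step 0 to 1.0 at the final step
    m = 1.0 - (1.0 - base_m) * (math.cos(math.pi * step / total_steps) + 1) / 2
    # assumes student and teacher share the same architecture
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)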

Phase 2: Supervised Fine-tuning

  • Initialization from the SSL-pretrained weights
  • Label-smoothing cross-entropy loss; its combination with Mixup is sketched after this list
  • Advanced augmentation: AutoAugment + CutMix + Mixup + Random Erasing
  • 100 epochs, AdamW (lr=0.001), CosineAnnealingWarmupRestarts
  • Regularization: Stochastic Depth (0.1), Weight Decay (0.04), Gradient Clipping (3.0)
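
For illustration, this is how a label-smoothing loss and Mixup typically combine in such a fine-tuning loop (sketch only; the smoothing value 0.1 and Beta parameter 0.8 are assumptions, not values from the repo):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing value assumed

def mixup_batch(x, y, alpha=0.8):                     # alpha value assumed
    """Blend each image with a randomly chosen partner from the batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
mixed, y_a, y_b, lam = mixup_batch(x, y)
logits = torch.randn(8, 10)                           # stand-in for model(mixed)
# the loss mixes in the same proportion as the inputs
loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)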

Architecture Configurations

Property          ViT     CaiT    Swin
---------------   -----   -----   ----------
Embedding dim     192     192     96
Depth             9       24      [2, 6, 4]
Attention heads   12      4       [3, 6, 12]
MLP ratio         2       2       2
Patch size        4x4     4x4     2x2
Parameters        ~2.7M   ~2.8M   ~3.1M
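
As a sanity check on the ViT column, the configuration above lands in the stated parameter range. This uses timm's generic VisionTransformer as a stand-in for the repo's models/vit.py:

from timm.models.vision_transformer import VisionTransformer

# ViT column: 32x32 inputs, 4x4 patches, dim 192, depth 9, 12 heads, MLP ratio 2
vit = VisionTransformer(img_size=32, patch_size=4, embed_dim=192, depth=9,
                        num_heads=12, mlp_ratio=2, num_classes=10)
n_params = sum(p.numel() for p in vit.parameters())
print(f"{n_params / 1e6:.2f}M parameters")  # approximately 2.7M, matching the table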

Datasets

  • CIFAR-10 (completed)
  • CIFAR-100 (planned)
  • Tiny-ImageNet (planned)
  • Biomedical dataset (planned)

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • torchvision
  • timm
  • tensorboard

References

  1. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ICLR 2021)
  2. Touvron et al., "Going Deeper with Image Transformers" (CaiT) (ICCV 2021)
  3. Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (ICCV 2021)
  4. Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" (DINO) (ICCV 2021)
  5. Ramachandran et al., "Searching for Activation Functions" (arXiv 2017)
  6. Misra, "Mish: A Self Regularized Non-Monotonic Activation Function" (BMVC 2020)
  7. Yun et al., "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features" (ICCV 2019)
  8. Zhang et al., "mixup: Beyond Empirical Risk Minimization" (ICLR 2018)
