The Cyclic Attention Transformer (CAT) is a transformer architecture that uses cyclic attention, combining cyclic shifts with gating, to improve contextual learning. The model is trained from scratch, without pretraining, and reaches 91.00% accuracy on AG News and 98.05% accuracy on DBpedia text classification.
- 🔄 Cyclic Attention Mechanism: Advanced contextual modeling through cyclic shifts and gating
- 🚀 No Pretraining Required: Efficient training from scratch
- 📊 Strong Benchmark Results:
  - AG News Dataset: 91.00% accuracy
  - DBpedia Dataset: 98.05% accuracy
- 🎯 Efficient Training: Optimized for both speed and performance

Cyclic Attention Block:
- Innovative cyclic shift mechanism
- Adaptive gating for attention filtering
- Enhanced global dependency capture

Multi-Head Attention System:
- Hierarchical attention layers
- Intermediate normalization
- Advanced feedforward networks

Processing Pipeline:
- Custom n-gram tokenization
- Global pooling for sequence aggregation
- Specialized classification head
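
As a rough illustration of the pooling and classification stages listed above, the following PyTorch sketch averages token representations while ignoring padding and feeds the result to a linear layer. The class name `PooledClassifierHead`, the use of mean pooling, the dropout rate, and the padding mask are assumptions made for this example; the notebooks may implement these stages differently.

```python
import torch
import torch.nn as nn


class PooledClassifierHead(nn.Module):
    """Hypothetical head: masked mean pooling followed by a linear classifier."""

    def __init__(self, embed_dim: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, embed_dim); mask: (batch, seq_len), 1 for real tokens.
        mask = mask.unsqueeze(-1).to(hidden.dtype)
        summed = (hidden * mask).sum(dim=1)      # padded positions contribute zero
        counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
        pooled = summed / counts                 # (batch, embed_dim)
        return self.classifier(self.dropout(pooled))


head = PooledClassifierHead(embed_dim=1024, num_classes=4)
hidden = torch.randn(2, 6, 1024)
mask = torch.tensor([[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]])
print(head(hidden, mask).shape)  # torch.Size([2, 4])
```

Masking before averaging keeps padded positions from diluting the pooled vector for short inputs.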

Default Hyperparameters:

| Parameter | Default Value | Description |
|---|---|---|
| embed_dim | 1024 | Embedding dimension |
| num_heads | 8 | Number of attention heads |
| ff_dim | 2048 | Feedforward layer dimension |
| num_layers | 3 | Number of transformer layers |
| batch_size | 128 | Training batch size |
| epochs | 5 | Training epochs |
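
The defaults above can be collected in a small configuration object. This is only a sketch: `CATConfig`, `head_dim`, and `steps_per_epoch` are hypothetical names that do not come from the repository, and the divisibility check reflects how standard multi-head attention splits the embedding across heads.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CATConfig:
    """Hypothetical container for the default hyperparameters in the table."""

    embed_dim: int = 1024
    num_heads: int = 8
    ff_dim: int = 2048
    num_layers: int = 3
    batch_size: int = 128
    epochs: int = 5

    def __post_init__(self) -> None:
        # Standard multi-head attention splits embed_dim evenly across heads.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.embed_dim // self.num_heads

    def steps_per_epoch(self, num_samples: int) -> int:
        # Assumes the final, smaller batch is kept rather than dropped.
        return math.ceil(num_samples / self.batch_size)


config = CATConfig()
print(config.head_dim)                  # 128
print(config.steps_per_epoch(120_000))  # 938
```

With the listed defaults, each attention head works on 1024 / 8 = 128 dimensions.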

AG News Results:

| Metric | Score |
|---|---|
| Accuracy | 91.00% |
| F1 Score | 90.99% |
| Precision | 91.02% |
| Recall | 91.00% |

Dataset Details:
- Vocabulary Size: 50,002
- Training Samples: 120,000

DBpedia Results:

| Metric | Score |
|---|---|
| Accuracy | 98.05% |
| F1 Score | 98.05% |
| Precision | 98.06% |
| Recall | 98.05% |

Dataset Details:
- Training Set: 560,000 samples
- Test Set: 70,000 samples
- Vocabulary Size: 50,002
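
The four metrics reported for both datasets can be computed from predicted and true labels as sketched below. The README does not say how precision, recall, and F1 are averaged across classes, so macro averaging here is an assumption, and `classification_metrics` is a hypothetical helper rather than code from the notebooks.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    n = len(labels) or 1  # guard against empty input
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


metrics = classification_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print({name: round(value, 4) for name, value in metrics.items()})
```

When every class has the same number of test examples, macro recall equals accuracy, which is consistent with the matching accuracy and recall values in the result tables.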

Training Loss per Epoch:

| Epoch | Loss |
|---|---|
| 1 | 0.1299 |
| 2 | 0.0681 |
| 3 | 0.0520 |
| 4 | 0.0416 |
| 5 | 0.0344 |

Tokenization:
- Custom tokenizer supporting unigram and bigram tokenization
- Vocabulary size: 50,002 (including special tokens)
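
One way to build a unigram-plus-bigram vocabulary of this size is sketched below. It assumes whitespace splitting, lowercasing, two special tokens named `<pad>` and `<unk>`, and a cap of 50,000 n-grams chosen by frequency, which together would give the 50,002 entries reported. None of these details are confirmed by the README, and the function names are illustrative.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens; the README does not name them


def ngrams(text):
    """Lowercase, split on whitespace, and emit unigrams followed by bigrams."""
    words = text.lower().split()
    # Bigrams are joined with a space, so they can never collide with a unigram.
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def build_vocab(texts, max_tokens=50_000):
    """Map the most frequent n-grams to ids, reserving 0 and 1 for PAD and UNK."""
    counts = Counter(tok for text in texts for tok in ngrams(text))
    vocab = {PAD: 0, UNK: 1}
    for token, _ in counts.most_common(max_tokens):
        vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_len=None):
    """Convert text to ids, optionally truncating or padding to max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in ngrams(text)]
    if max_len is not None:
        ids = ids[:max_len] + [vocab[PAD]] * (max_len - len(ids))
    return ids


vocab = build_vocab(["the cat sat", "the cat ran"])
print(len(vocab))  # 9: two special tokens, four unigrams, three bigrams
print(encode("the dog", vocab, max_len=5))
```

Unseen unigrams and bigrams both fall back to the `<unk>` id, so the encoder never fails on new text.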

Attention Implementation:
- Cyclic shift attention mechanism
- Gated attention filtering
- Multi-head attention processing
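
The sketch below shows one possible way to combine these three ingredients in PyTorch. It is an assumption-laden illustration, not the repository's implementation: the class name `CyclicAttentionBlock`, the `shift` parameter, the choice to roll the attention output along the sequence axis, and the sigmoid gate computed from the input are all made up for this example.

```python
import torch
import torch.nn as nn


class CyclicAttentionBlock(nn.Module):
    """Illustrative block: self-attention, cyclic shift, gate, residual, norm."""

    def __init__(self, embed_dim: int, num_heads: int, shift: int = 1):
        super().__init__()
        self.shift = shift
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.gate = nn.Linear(embed_dim, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        attended, _ = self.attn(x, x, x, need_weights=False)
        # Cyclic shift along the sequence axis: position i receives the
        # attention output of position (i - shift) mod seq_len.
        shifted = torch.roll(attended, shifts=self.shift, dims=1)
        # Gate values in (0, 1), computed from the input, scale each feature
        # of the shifted attention output before the residual connection.
        gate = torch.sigmoid(self.gate(x))
        return self.norm(x + gate * shifted)


block = CyclicAttentionBlock(embed_dim=64, num_heads=8)
x = torch.randn(2, 10, 64)
print(block(x).shape)  # torch.Size([2, 10, 64])
```

Rolling the attention output rather than the keys and values is deliberate in this sketch: without masks or positional terms, softmax attention is unaffected by the order of its key-value pairs, so a shift applied there would have no effect.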

Training Configuration:
- Optimizer: AdamW
- Loss Function: Cross-entropy
- Learning Rate: 5e-5
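
A minimal training step with these settings might look like the following. The stand-in `nn.EmbeddingBag` model, the random batch, the reduced embedding size, and the four-class output are placeholders chosen so the snippet runs on its own; only AdamW, cross-entropy loss, the 5e-5 learning rate, the batch size of 128, and the five epochs come from this README.

```python
import torch
import torch.nn as nn

# Stand-in classifier so the loop runs on its own; the CAT model would go here.
vocab_size, embed_dim, num_classes = 50_002, 32, 4
model = nn.Sequential(
    nn.EmbeddingBag(vocab_size, embed_dim, mode="mean"),
    nn.Linear(embed_dim, num_classes),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
criterion = nn.CrossEntropyLoss()

# Random token ids and labels in place of a real batch of 128 examples.
tokens = torch.randint(0, vocab_size, (128, 20))
labels = torch.randint(0, num_classes, (128,))

model.train()
for epoch in range(5):
    # Each "epoch" here is a single batch; a real run would loop over a DataLoader.
    optimizer.zero_grad()
    logits = model(tokens)            # (128, num_classes)
    loss = criterion(logits, labels)  # mean cross-entropy over the batch
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch + 1}: loss {loss.item():.4f}")
```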

Requirements:
- Python 3.x
- PyTorch
- Transformers library
- Jupyter Notebook

Clone the repository:

    git clone https://github.com/VijayendraDwari/CAT.git
    cd CAT

Install dependencies:

    pip install -r requirements.txt

Run Jupyter Notebook:

    jupyter notebook

📚 Documentation

For detailed information about:
- Model architecture
- Training procedures
- Dataset preparation
- Evaluation metrics

please refer to the notebooks in the repository.

If you use this implementation in your research, please cite:

    @misc{dwari2025cat,
      title={Cyclic Attention Transformer (CAT)},
      author={Vijayendra Dwari},
      year={2025},
      publisher={GitHub},
      journal={GitHub repository},
      howpublished={\url{https://github.com/VijayendraDwari/CAT}}
    }